This is part 2 of the blog entry "SDTM in non-submission Research - some Thoughts on Best Practices". Part 1 concentrated on
- No need for SAS-XPT format
- Avoid Supplemental Qualifier Datasets
- Use of variable names
- Use of Test Codes
- Use CDISC Controlled Terminology
Units
CDISC developed its own list of "units" to be used in SDTM, even though there is an international notation for units, named UCUM, which is widely used in healthcare (among other fields). This too has historical reasons: "The CDISC Controlled Terminology Team" at that time decided against UCUM because "researchers are not familiar with UCUM notation".
I cite: "UCUM expressions, in order to support computability, represent familiar units in unfamiliar ways, with curly brackets and other symbols. This is off-putting to some users".
That was, however, at a time when most studies were paper studies. Times have changed. Especially in academic research and observational studies, a lot of data comes from electronic health records (EHRs) and other electronic sources that usually use UCUM notation. Modern lab apparatus also use UCUM notation. I haven't seen a single lab apparatus that consistently uses CDISC units.
Unfortunately, SDTM does not allow UCUM notation for units yet, but requires a unit from the CDISC list (though there is some overlap). CDISC units do not support automated unit conversion, whereas enabling automated unit conversion was one of the design goals of UCUM. There are even free RESTful web services for UCUM unit conversions that could also be used in FDA/PMDA tools.
A "mapping list" between some of the CDISC units and the UCUM notation has been published by CDISC, but (contrary to what some at CDISC state) this mapping was never intended to automate the transformation of UCUM to CDISC units: it was meant by the developers (including myself) to make it possible to transform "CDISC units" to UCUM notation.
So, if you already obtain results with units in UCUM notation (EHRs, lab apparatus, …), I would recommend using these, and NOT trying to convert them to "CDISC units", when you are not going to submit to FDA or PMDA. In case of a regulatory submission, you will (unfortunately still) need to transform your UCUM units to CDISC units, which will not only be tedious, but also error-prone.
When comparing with other studies that do use CDISC units, automated conversion from CDISC to UCUM can be accomplished very easily for these other studies, so that all the studies to be compared use UCUM notation. This has the additional advantage of being able to perform automated unit conversions to compare results between the studies, which is not possible when sticking to CDISC units.
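The two steps described above (translating CDISC units to UCUM notation, then converting between UCUM units) can be sketched in a few lines of Python. This is only a minimal sketch: the mapping entries and the conversion factors below are a small, hand-picked subset assumed for illustration. They are not the mapping list published by CDISC, and the function names `to_ucum` and `convert` are hypothetical. A real implementation would more likely rely on a UCUM library or web service than on a hard-coded table.

```python
# Illustrative subset only (an assumption for this sketch), NOT the
# official CDISC-to-UCUM mapping list.
CDISC_TO_UCUM = {
    "mg/dL": "mg/dL",             # identical in both notations
    "10^9/L": "10*9/L",           # UCUM writes powers of ten with '*'
    "beats/min": "{beats}/min",   # UCUM annotation in curly brackets
}

# Hand-written factors to a common reference unit (g/L) for mass
# concentrations; a UCUM service would derive these from the unit itself.
TO_GRAM_PER_LITRE = {
    "g/L": 1.0,
    "g/dL": 10.0,
    "mg/dL": 0.01,
    "mg/L": 0.001,
}


def to_ucum(cdisc_unit: str) -> str:
    """Translate a CDISC unit to UCUM notation (KeyError if not mapped)."""
    return CDISC_TO_UCUM[cdisc_unit]


def convert(value: float, from_unit: str, to_unit: str) -> float:
    """Convert a mass concentration between two UCUM units via g/L."""
    return value * TO_GRAM_PER_LITRE[from_unit] / TO_GRAM_PER_LITRE[to_unit]


if __name__ == "__main__":
    # A glucose result from a study that used the CDISC unit "mg/dL"
    ucum_unit = to_ucum("mg/dL")
    print(convert(90.0, ucum_unit, "g/L"))
```

The point of the sketch is the direction of the mapping: CDISC unit in, UCUM notation out, after which conversions become a purely computational matter.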
Data redundancy and other peculiarities
SDTM has a lot of data redundancy. This may at first sound strange to (data) scientists, but we should not forget that SDTM is not a database: it is a "view" on a database, but with the (source) database no longer visible, i.e. tracing back to the original data is not easily accomplished. The regulatory authorities such as the FDA are however not interested in the source data. This is also why the process of generating SDTM is often designated as an ETL ("extract-transform-load") process, as is used for data warehouses.
Furthermore, SDTM tables are mostly "entity-attribute-value" tables (hypervertical tables), which often require "transposing" the source data when generating the tables. Data redundancy in SDTM is mostly a historical result of repeated requests by FDA reviewers to add new, but derived, variables, making it easier for them to review the submission without additional programming ("for ease of review"). So SDTM is not suitable for operational purposes.
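To make the "transposing" step more concrete, here is a minimal Python sketch that turns one "wide" source row per subject into vertical records with one row per test. The source column names (`SUBJECT`, `GLUC`, `HGB`), the values and the helper `wide_to_vertical` are assumptions made up for the example; only the target variable names (USUBJID, LBTESTCD, LBORRES, LBORRESU) are taken from SDTM, and a real LB dataset needs many more variables.

```python
def wide_to_vertical(rows, id_column, test_units):
    """Transpose wide rows (one column per test) into vertical records.

    rows       -- list of dicts, one per subject (hypothetical source layout)
    id_column  -- name of the source column holding the subject identifier
    test_units -- dict mapping each test code to the unit of its result
    """
    records = []
    for row in rows:
        for test_code, unit in test_units.items():
            result = row.get(test_code)
            if result is None:
                continue  # no result collected, so no record is generated
            records.append({
                "USUBJID": row[id_column],
                "LBTESTCD": test_code,
                "LBORRES": str(result),  # original result as character
                "LBORRESU": unit,
            })
    return records


if __name__ == "__main__":
    source = [
        {"SUBJECT": "001", "GLUC": 95, "HGB": 13.9},
        {"SUBJECT": "002", "GLUC": None, "HGB": 12.1},
    ]
    for record in wide_to_vertical(source, "SUBJECT",
                                   {"GLUC": "mg/dL", "HGB": "g/dL"}):
        print(record)
```

Each attribute (the test) becomes a value in LBTESTCD instead of a column name, which is what makes the table "hypervertical".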
When using SDTM for non-submission purposes, one can use these "features" for "ease of review" too. Data redundancy however has inherent dangers, as every data scientist knows, so good quality assurance is very important. As we all know: "garbage in, garbage out".
Sometimes SDTM has its own peculiarities. For example, when deriving the "study day", one must take into account that there is no day "zero" in SDTM: the first day of the study is day "1" (usually the day of informed consent or of the screening, that's your own choice – another peculiarity), and the day before day "1" is … day "-1" (and not day "0"). The math I learned at primary school looks different ;-).
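The "no day zero" rule can be written down in a few lines of Python. This is a minimal sketch: the function name `study_day` is hypothetical, and it assumes complete dates (partial dates, which do occur in SDTM, are not handled).

```python
from datetime import date


def study_day(event_date: date, reference_date: date) -> int:
    """Derive the SDTM study day relative to the reference start date.

    On or after the reference date: difference in days + 1, so the
    reference date itself is day 1. Before the reference date: the plain
    (negative) difference, so the day before is -1. Day 0 never occurs.
    """
    delta = (event_date - reference_date).days
    return delta + 1 if delta >= 0 else delta


if __name__ == "__main__":
    reference = date(2024, 1, 10)
    print(study_day(date(2024, 1, 10), reference))  # the reference day itself
    print(study_day(date(2024, 1, 9), reference))   # the day before
    print(study_day(date(2024, 1, 12), reference))  # two days later
```

Note the consequence for anyone doing arithmetic on study days: the distance between day "-1" and day "1" is one calendar day, not two.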
Validation
Unfortunately, CDISC was late in publishing validation rules for its standards. This gap was filled by a commercial company, which made validation software available that implements the company's own interpretation of the SDTM-IG. When the FDA later started using the software, it became the industry "gold standard", even though it was completely out of the control of CDISC.
A recent publication by CDISC staff members lists the challenges of using this software for observational studies (which it was never designed for). But wouldn't it be better not to use this software at all for non-submission studies, not only because it is very buggy (many false positive errors), but also because it is heavily influenced by FDA/PMDA requirements and by SDTM requirements that are in origin FDA requirements?
Such a distinction between regulatory and non-regulatory conformance rules could easily be made as part of the "Open Rules for CDISC Standards" initiative, where all the (human AND machine-readable) rules are under the control of CDISC.
FDA requirements
Very often (and unfortunately), the FDA has additional requirements which sometimes even conflict with the SDTM-IG. These additional requirements are usually described in the "Study Data Technical Conformance Guide".
If you are sure that your datasets will never be submitted to the FDA, you can of course completely ignore the "Study Data Technical Conformance Guide". Even if you believe that one day or another your SDTM datasets may need to be submitted to the FDA, you need to know that these additional FDA requirements change regularly and that it is hard to guess whether a current requirement will still be there in the future.
Conclusions and further considerations
Using SDTM for non-submission studies certainly has great advantages: SDTM brings order to chaos. It allows consistent categorization, which makes it possible to compare different studies from different sources. For this reason, SDTM is also used more and more in observational studies.
However, SDTM has always been developed with the primary purpose of regulatory submissions, and is strongly influenced by FDA requirements in particular ("ease of review"). It was not developed for observational studies, and as the FDA has so far not been very interested in observational studies, stakeholders from the world of observational studies have had little to no influence on it.
This brings us to the following proposition: In the SDTM Model and SDTM Implementation Guide, wouldn't it be a good idea to indicate which variables and requirements are "regulatory" (FDA and PMDA), and indicate how these variables and requirements can/must be handled (or omitted) in the case of non-regulatory studies?
For example, shouldn't the SDTM-IG state that the requirement of SAS-XPT and of 8-character names and codes is solely a regulatory requirement and does not apply to non-regulatory use?
Or shouldn't it explicitly state that UCUM notation is allowed for the non-submission case, and that units must be converted to CDISC-CT units only for a regulatory submission?
This would bring more clarity, also for non-submission and observational studies, and is in accordance with the new CDISC motto: "CDISC creates clarity".
Of course it would be better if regulatory authorities modernized their acceptance and review methods, so that many of these "regulatory limitations" disappear. I am however afraid that this will still take some time.