Saturday, April 6, 2019

SDTM in non-submission Research (Part 2): Some Thoughts on Best Practices

Disclaimer: this is Jozef's personal opinion. It is not necessarily the opinion of CDISC. Some SDTM gurus may completely disagree with what I write here.

This is part 2 of the blog entry "SDTM in non-submission Research - some Thoughts on Best Practices". Part 1 concentrated on
  • No need for SAS-XPT format
  • Avoid Supplemental Qualifier Datasets
  • Use of variable names
  • Use of Test Codes
  • Use CDISC Controlled Terminology 
The second part will concentrate on other important topics such as use of units, data redundancy and other SDTM peculiarities, validation, and additional FDA requirements.

Units


CDISC developed its own list of "units" for use in SDTM, even though there is an international notation for units, UCUM, which is widely used in healthcare (among other fields). This, too, has historical reasons. "The CDISC Controlled Terminology Team" at that time decided against UCUM as "researchers are not familiar with UCUM notation". I cite: "UCUM expressions, in order to support computability, represent familiar units in unfamiliar ways, with curly brackets and other symbols. This is off-putting to some users".
That was, however, at a time when most studies were paper studies. Times have changed. Especially for academic research and observational studies, a lot of data will come from electronic health records (EHRs) and other electronic sources that usually use UCUM notation. Modern lab apparatus also use UCUM notation. I haven't seen a single lab apparatus that consistently uses CDISC units.


Unfortunately, SDTM does not yet allow UCUM notation for units, but requires a unit from the CDISC list (though there is some overlap). CDISC units do not support automated unit conversion, whereas enabling automated unit conversion was one of the design goals of UCUM. There are even free RESTful web services for UCUM unit conversions that could also be used in FDA/PMDA tools.
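To illustrate why a computable notation makes automated conversion possible, here is a minimal sketch in Python. It is a toy that only understands a metric prefix followed by a base unit, with at most one "/", and is in no way a full UCUM parser. The function names and the small prefix and unit tables are my own assumptions for this example.

```python
# Toy converter for a tiny subset of UCUM-style expressions:
# metric prefix + base unit, with at most one "/" (e.g. "mg/dL").
# Assumption: only the prefixes and base units listed here are supported.
PREFIXES = {"k": 1e3, "d": 1e-1, "c": 1e-2, "m": 1e-3, "u": 1e-6, "n": 1e-9}
BASE_UNITS = {"g", "L", "mol", "m"}


def parse_atom(atom):
    """Return (scale factor, base unit) for an atom such as 'mg' or 'dL'."""
    if atom in BASE_UNITS:
        return 1.0, atom
    if atom[:1] in PREFIXES and atom[1:] in BASE_UNITS:
        return PREFIXES[atom[0]], atom[1:]
    raise ValueError(f"unsupported unit atom: {atom!r}")


def parse_unit(expr):
    """Return (scale factor, (numerator base, denominator base or None))."""
    numerator, _, denominator = expr.partition("/")
    num_scale, num_base = parse_atom(numerator)
    if not denominator:
        return num_scale, (num_base, None)
    den_scale, den_base = parse_atom(denominator)
    return num_scale / den_scale, (num_base, den_base)


def convert(value, from_unit, to_unit):
    """Convert a value between two units of the same dimension."""
    from_scale, from_dim = parse_unit(from_unit)
    to_scale, to_dim = parse_unit(to_unit)
    if from_dim != to_dim:
        raise ValueError(f"cannot convert {from_unit} to {to_unit}")
    return value * from_scale / to_scale
```

With this sketch, `convert(100, "mg/dL", "g/L")` gives approximately 1, and `convert(1, "mg", "mL")` is rejected because mass and volume are not commensurable. A real implementation would rely on a complete UCUM library or one of the web services mentioned above.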

A "mapping list" between some of the CDISC units and the UCUM notation has been published by CDISC but (contrary to what some at CDISC state) this mapping was never intended to automate the transformation of UCUM to CDISC units: it was meant by the developers (including myself) to make it possible to transform "CDISC units" to UCUM notation.

So, if you already obtain results with units in UCUM notation (from EHRs, lab apparatus, and similar sources), I would recommend using these, and NOT trying to convert them to "CDISC units", as long as you are not going to submit to FDA or PMDA. In case of a regulatory submission, you will (unfortunately still) need to transform your UCUM units to CDISC units, which is not only tedious but also error-prone. When comparing with other studies that do use CDISC units, automated conversion from CDISC to UCUM can be accomplished very easily for these other studies, so that all the studies being compared use UCUM notation. This has the additional advantage that automated unit conversions can be performed to compare results between the studies, which is not possible when sticking to CDISC units.
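The CDISC-to-UCUM direction can be sketched as a simple lookup. The handful of pairs below is only an illustrative excerpt written from memory, so please verify each pair against the published mapping before relying on it. The function names and the choice of VSORRESU as the unit variable are assumptions for this example.

```python
# Illustrative excerpt only (assumption): check every pair against the
# published CDISC-to-UCUM mapping before using it on real data.
CDISC_TO_UCUM = {
    "mg/dL": "mg/dL",
    "mmol/L": "mmol/L",
    "10^9/L": "10*9/L",
    "mmHg": "mm[Hg]",
    "beats/min": "{beats}/min",
    "C": "Cel",
}


def to_ucum(cdisc_unit):
    """Look up the UCUM notation for a CDISC unit, failing loudly if unknown."""
    try:
        return CDISC_TO_UCUM[cdisc_unit]
    except KeyError:
        raise KeyError(
            f"no UCUM mapping known for CDISC unit {cdisc_unit!r}"
        ) from None


def harmonize_units(records, unit_var="VSORRESU"):
    """Return copies of the records with the unit variable in UCUM notation."""
    return [{**rec, unit_var: to_ucum(rec[unit_var])} for rec in records]
```

Failing on an unmapped unit, rather than passing it through silently, is a deliberate choice here: an unknown unit in a cross-study comparison is better caught early than discovered in the results.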


Data redundancy and other peculiarities
 

SDTM has a lot of data redundancy. This may at first sound strange to (data) scientists, but we should not forget that SDTM is not a database: it is a "view" on a database, with the (source) database no longer visible, i.e. tracing back to the original data is not easily accomplished. The regulatory authorities such as the FDA are, however, not interested in the source data. This is also why the process of generating SDTM is often described as an ETL ("extract-transform-load") process, as used for data warehouses.

Furthermore, SDTM tables are mostly "entity-attribute-value" tables (hypervertical tables), which often require "transposing" when generating the tables. Data redundancy in SDTM is mostly historical, due to repeated requests of FDA reviewers to add new, but derived, variables, making it easier for them to review the submission without additional programming ("for ease of review"). So SDTM is not suitable for operational purposes.
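As an illustration of such "transposing", here is a minimal sketch that turns one-row-per-visit records into one-row-per-test records in the style of the VS domain. The layout of the collected records (one column per test code) and the function name are assumptions for this example; real source data will look different.

```python
def to_vertical(wide_rows, test_columns, id_vars=("USUBJID", "VISITNUM")):
    """Turn one-row-per-visit records into one-row-per-test records."""
    long_rows = []
    seq_by_subject = {}  # running sequence number per subject
    for row in wide_rows:
        for testcd in test_columns:
            value = row.get(testcd)
            if value is None:
                continue  # nothing collected for this test at this visit
            seq = seq_by_subject.get(row["USUBJID"], 0) + 1
            seq_by_subject[row["USUBJID"]] = seq
            record = {var: row[var] for var in id_vars}
            record.update({"VSSEQ": seq, "VSTESTCD": testcd, "VSORRES": value})
            long_rows.append(record)
    return long_rows


# Hypothetical collected data: one row per subject visit, one column per test.
collected = [
    {"USUBJID": "S1-001", "VISITNUM": 1, "SYSBP": "120", "DIABP": "80"},
    {"USUBJID": "S1-001", "VISITNUM": 2, "SYSBP": "118", "DIABP": None},
]
vertical = to_vertical(collected, ["SYSBP", "DIABP"])
```

Here the two collected rows become three vertical rows, because the missing diastolic value at the second visit produces no record.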
When using SDTM for non-submission purposes, one can use these "features" for "ease of review" too. Data redundancy however has inherent dangers, as every data scientist knows, so good quality assurance is very important. As we all know: "garbage in, garbage out".
Sometimes SDTM has its own peculiarities. For example, when deriving the "study day", one must take into account that there is no day "zero" in SDTM: the first day of the study is day "1" (usually the day of informed consent or of the screening; that's your own choice, another peculiarity), and the day before day "1" is day "-1" (and not day "0"). The math I learned at primary school looks different ;-).
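The "no day zero" rule can be sketched in a few lines, together with a quality check that recomputes a redundant study-day variable from its date. This is a minimal sketch that assumes complete ISO 8601 dates (YYYY-MM-DD); partial dates are not handled. The function names and the use of VSDTC/VSDY are my own choices for illustration.

```python
from datetime import date


def study_day(event_date, reference_start):
    """Study day without a day 0: the reference date is day 1, the day before is -1."""
    delta = (event_date - reference_start).days
    return delta + 1 if delta >= 0 else delta


def check_study_days(rows, reference_start, dtc_var="VSDTC", dy_var="VSDY"):
    """Return (row, expected day) pairs where the stored study day is wrong."""
    mismatches = []
    for row in rows:
        expected = study_day(date.fromisoformat(row[dtc_var]), reference_start)
        if row[dy_var] != expected:
            mismatches.append((row, expected))
    return mismatches


reference = date(2019, 4, 6)
rows = [
    {"VSDTC": "2019-04-06", "VSDY": 1},   # correct: reference date is day 1
    {"VSDTC": "2019-04-05", "VSDY": 0},   # wrong: should be -1
]
problems = check_study_days(rows, reference)
```

In this example `problems` contains only the second row, with an expected study day of -1. Recomputing derived variables in this way is one simple form of the quality assurance that data redundancy calls for.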

Validation

Unfortunately, CDISC was late in publishing validation rules for its standards. This shortcoming was picked up by a commercial company, which made validation software available that implements the company's own interpretation of the SDTM-IG. When the FDA later started using the software, it became the industry "gold standard", even though it was completely out of the control of CDISC.
A recent publication by CDISC staff members lists the challenges of using this software for observational studies (which it was never designed for). But wouldn't it be better not to use this software at all for non-submission studies, not only because it is very buggy (many false positive errors), but also because it is heavily influenced by FDA/PMDA requirements and by SDTM requirements that are in origin FDA requirements?
Such a distinction between regulatory and non-regulatory conformance rules could easily be made as part of the "Open Rules for CDISC Standards" initiative, where all the (human AND machine-readable) rules are under the control of CDISC.
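To make the idea concrete, here is a minimal sketch of rules that carry a "regulatory only" flag. Everything in it is hypothetical: the rule identifiers, the Rule structure and the validate function are my own inventions for illustration and do not reflect the actual format of any CDISC rule set.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    rule_id: str                    # hypothetical identifier, not a real rule ID
    description: str
    regulatory_only: bool           # True: applies to FDA/PMDA submissions only
    check: Callable[[dict], bool]   # returns True when the record passes


RULES = [
    Rule("EX-001", "Variable names are at most 8 characters", True,
         lambda rec: all(len(name) <= 8 for name in rec)),
    Rule("EX-002", "Study day is never 0", False,
         lambda rec: rec.get("VSDY") != 0),
]


def validate(record, regulatory):
    """Return the IDs of the applicable rules that the record violates."""
    return [rule.rule_id for rule in RULES
            if (regulatory or not rule.regulatory_only)
            and not rule.check(record)]


record = {"USUBJID": "S1-001", "VSDY": 0, "VSPOSITION_DETAIL": "standing"}
research_findings = validate(record, regulatory=False)
submission_findings = validate(record, regulatory=True)
```

For a non-submission study only the study-day violation is reported, whereas in submission mode the long variable name is flagged as well.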


FDA requirements

Very often (and unfortunately), the FDA has additional requirements which sometimes even conflict with the SDTM-IG. These additional requirements are usually described in the "Study Data Technical Conformance Guide". If you are sure that your datasets will never be submitted to the FDA, you can of course completely ignore the "Study Data Technical Conformance Guide". Even if you believe that one day your SDTM datasets may need to be submitted to the FDA, you need to know that these additional FDA requirements change regularly and that it is hard to guess whether a current requirement will still be there in the future.

Conclusions and further considerations

Using SDTM for non-submission studies certainly has great advantages: SDTM brings order to chaos - it allows consistent categorization, making it possible to compare different studies from different sources. For this reason, SDTM is also used more and more in observational studies.
However, SDTM has always been developed with the primary purpose of regulatory submissions, and is strongly influenced by FDA requirements in particular ("ease of review"). It was not developed for observational studies, and as the FDA has so far not been very interested in observational studies, stakeholders from the world of observational studies have had little to no influence on it.

This brings us to the following proposition: In the SDTM Model and SDTM Implementation Guide, wouldn't it be a good idea to indicate which variables and requirements are "regulatory" (FDA and PMDA), and indicate how these variables and requirements can/must be handled (or omitted) in the case of non-regulatory studies?
For example, shouldn't the SDTM-IG state that the requirement of SAS-XPT and of 8-character names and codes is solely a regulatory requirement and does not apply to non-regulatory use?
Or shouldn't it explicitly state that UCUM notation is allowed for the non-submission case, and that units must be converted to CDISC-CT units only for a regulatory submission?
This would bring more clarity, also for non-submission and observational studies, and is in accordance with the new CDISC motto: "CDISC creates clarity".
 
Of course it would be better if regulatory authorities modernized their acceptance and review methods, so that many of these "regulatory limitations" disappear. I am, however, afraid that this will still take some time.