Saturday, March 30, 2019

SDTM in non-submission Research: Some Thoughts on Best Practices


Disclaimer: this is Jozef's personal opinion. It is not necessarily the opinion of CDISC. Some SDTM gurus may completely disagree with what I write down here.

Background


CDISC SDTM is increasingly used in academic research and in other clinical studies whose results are not submitted to regulatory authorities such as the FDA and PMDA. A very good reason for using SDTM in non-submission studies is that SDTM allows clinical research data to be categorized, so that data from different studies can be compared much better than when working from the source data directly. This interest in SDTM for non-submission studies is also reflected in two sessions at the upcoming CDISC European Interchange in Amsterdam on observational studies and "real world evidence".
"Normal" SDTM studies (to be submitted to FDA or PMDA) of course have a number of requirements that do not apply to non-submission studies. This blog highlights some of them and provides some "best practices" (in my own personal opinion), which should make it easier for SDTM implementers who do not need (or want) to submit to regulatory authorities.
Part 1 will concentrate on SAS-XPT format, Supplemental Qualifier datasets, use of variable names and non-standard variables, use of test codes and CDISC controlled terminology.
Part 2 (published next week) will concentrate on units, data redundancy and other peculiarities of SDTM, and special FDA requirements.


Observational and Interventional studies

CDISC SDTM can be used for interventional studies (the classic use case) as well as for observational studies. In observational studies, some domains such as IE, EX, AE, DV and TV may be absent, the DV domain, where present, will be used differently, and custom domains will probably be necessary. "Visits" (reflected in the SDTM variables VISITNUM and VISIT) may not apply, and unplanned visits will outnumber planned visits.
An interesting presentation about experiences with using SDTM in observational studies was given by Jon Neville and Bess LeRoy of CDISC at the last PHUSE US Connect. It shows how SDTM was used to order and categorize observational data for a number of observational studies. The corresponding excellent paper can be found here.

In this blog, we will however concentrate on the more technical aspects of non-submission studies, which can be interventional as well as observational.


No need for SAS-XPT format

SDTM submissions to regulatory authorities must use the SAS Transport 5 ("SAS XPT") format, a 30-year-old, completely outdated binary format. Although better alternatives have been proposed by CDISC, the FDA still mandates the use of XPT. If, however, no submission is intended, there is no logical reason why SDTM implementers should use the XPT format at all. XPT has a large number of limitations, such as the 8-character limit for variable names, the 40-character limit for variable labels, and the 200-character limit for values. Furthermore, SAS-XPT only supports US-ASCII characters, so that it can hardly, if at all, be used with languages other than English. In my opinion, the FDA requirement for XPT even discriminates against Spanish-speaking citizens of the USA, as there is no support for non-ASCII Spanish characters such as the inverted question mark ("¿"). This means that, for example, questionnaire questions in Spanish cannot be represented correctly in SAS XPT. And then we have not even mentioned "Scandinavian" characters, let alone Japanese, Chinese or Korean.
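To get a feeling for how quickly real-world data runs into these limits, here is a minimal Python sketch that checks a single variable against the limits quoted above. The function name, the variable name "QUESTION_TEXT" and its label are made up for illustration; nothing here is part of a CDISC or SAS tool.

```python
# Limits as quoted in the text above for SAS XPT (version 5).
MAX_NAME_LENGTH = 8
MAX_LABEL_LENGTH = 40
MAX_VALUE_LENGTH = 200


def xpt_violations(name, label, values):
    """Return a list of reasons why a variable would not fit into SAS XPT."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append(f"name {name!r} exceeds {MAX_NAME_LENGTH} characters")
    if len(label) > MAX_LABEL_LENGTH:
        problems.append(f"label {label!r} exceeds {MAX_LABEL_LENGTH} characters")
    for row, value in enumerate(values, start=1):
        if len(value) > MAX_VALUE_LENGTH:
            problems.append(f"value in row {row} exceeds {MAX_VALUE_LENGTH} characters")
        if not value.isascii():
            problems.append(f"value in row {row} contains non-US-ASCII characters")
    return problems


# Hypothetical variable holding a Spanish questionnaire question.
for problem in xpt_violations(
    "QUESTION_TEXT",
    "Text of the question as shown to the subject",
    ["¿Cómo se siente hoy?"],
):
    print(problem)
```

A check like this is only worth running if you ever intend to go back to the "regulatory submission case"; for pure non-submission work the point is precisely that you can ignore these limits.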

These limitations are also described in the SDTM Model and Implementation Guides ("IGs"). Having such limitations in the model is in my opinion very wrong, as SDTM is in principle a "semantic standard" which should be independent of the "transport format". The reason for merging "semantic" and "transport" is of course a historical one. Other organizations in healthcare have resolved this in a much better way. For example, HL7 FHIR gives you the choice between 3 transport formats: JSON, XML and RDF, but when reading the FHIR specification, you do not see anything of this, you only notice it when examining the examples.

So, in non-submission studies, I would recommend not using the XPT format, but choosing a modern format without these limitations, such as an XML format (CDISC Dataset-XML may be a very good choice), JSON, or any of the RDF serializations. Even better is not to use a transport format for storage at all (as the FDA is doing), but to keep the SDTM data in a database, in which the variable names, labels and values should of course not be limited to 8, 40 and 200 characters. For example, we have already seen SDTM datasets stored as CDISC Dataset-XML in native XML databases such as eXist and BaseX, but relational databases are of course also very common.

Avoid Supplemental Qualifier Datasets

SDTM "Supplemental Qualifier datasets" were invented in part to overcome the limitations of the SAS XPT format. For example, if you have a value longer than 200 characters, the SDTM-IG (SDTM Implementation Guide) forces you to split the text into chunks of at most 200 characters. The first chunk stays in the "parent" dataset; each further chunk becomes a record in a separate SUPPxx dataset (where "xx" is the code of the "parent" domain), which is "linked" back to the parent record through "xxSEQ", the sequence number of that parent record. All this is due to the 200-character limitation of SAS-XPT.
A practical problem arises here: the rules around "xxSEQ" mean that it is calculated at the end of the dataset generation process, so any change or addition to an SDTM dataset requires "xxSEQ" to be recalculated, and thus the SUPPxx dataset to be completely regenerated. Such a split into 200-character chunks is of course counterproductive when using 21st century IT, especially as SUPPxx datasets are extremely inefficient and very difficult to handle when analyzing the data.
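To make the mechanics of such a split concrete, here is a rough Python sketch. It is a simplification under several assumptions: the chunks are cut at fixed character positions, the SUPPxx record contains only a handful of columns, and the QNAM numbering scheme ("AETERM1", "AETERM2", ...) is chosen for illustration. The function name is hypothetical as well.

```python
def split_long_value(domain, usubjid, seq, variable, value, limit=200):
    """Split a long value into a parent value and simplified SUPPxx records."""
    # Cut the text into pieces of at most `limit` characters.
    chunks = [value[i:i + limit] for i in range(0, len(value), limit)] or [""]
    parent_value = chunks[0]  # the first chunk stays in the parent dataset
    supp_records = [
        {
            "RDOMAIN": domain,
            "USUBJID": usubjid,
            "IDVAR": f"{domain}SEQ",   # link back via the sequence number
            "IDVARVAL": str(seq),
            "QNAM": f"{variable}{n}",  # illustrative naming of the extra chunks
            "QVAL": chunk,
        }
        for n, chunk in enumerate(chunks[1:], start=1)
    ]
    return parent_value, supp_records


long_term = "Patient reported a persistent headache. " * 12  # 480 characters
parent_value, suppae = split_long_value("AE", "SUBJ-001", 3, "AETERM", long_term)
print(len(parent_value), [len(r["QVAL"]) for r in suppae])
```

Note that every SUPPxx record carries the sequence number of its parent record, which is exactly why a recalculated "xxSEQ" forces the whole SUPPxx dataset to be regenerated.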

Therefore, most SDTM mapping software, such as our own SDTM-ETL, allows variable values longer than 200 characters, and only in the very last step, when the "FDA checkbox" is checked, performs these splits, calculates the "xxSEQ" and generates the SUPPxx datasets. If no FDA submission is envisaged, the datasets are generated in CDISC Dataset-XML. Mapping tools from other vendors use a similar strategy.

The same applies to so-called "Non-Standard Variables" (NSVs). The SDTM model and Implementation Guide define a number of variables (the "standard variables") but allow other variables to be added (sometimes also named "sponsor-defined variables"); these are the NSVs. The values of these NSVs are however not allowed to reside in the usual datasets, but (again) need to go into a "supplemental qualifier" dataset SUPPxx and be linked back to the "parent" record by the xxSEQ sequence number. The background of this is probably that FDA reviewers are unable to distinguish between "standard" and "non-standard" variables due to lack of knowledge of the SDTM standard.
The FDA even has a (in my opinion ridiculous) rule that requires submitting an NSV "AETRTEM" (labeled "Adverse Event Treatment Emergent"), which MUST go into the SUPPAE dataset.

In a non-submission case, there is of course no reason to banish NSV data to separate datasets, as any modern SDTM software can immediately recognize whether a variable is "standard" or "non-standard". There are even RESTful web services for this purpose, and very soon the RESTful web services of the "CDISC Library" will also allow one to quickly find out whether a variable is "standard" or an NSV. Furthermore, one can easily "mark" NSVs as such in the define.xml, the XML file containing the metadata of the SDTM datasets. If the SDTM data is stored in a database, there is no reason at all to use supplemental qualifiers in the non-submission case. Even in the submission case, supplemental qualifiers are very often simply stored as normal variables in the "parent" dataset or database table, and "split off" at the very last moment, when the XPT files need to be generated and the "FDA checkbox" is checked.
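The "split off at the very last moment" approach can be sketched in a few lines of Python. This is only an illustration: the set of standard AE variables is a small hard-coded subset (a real tool would take it from the SDTM-IG metadata or a web service), the SUPPxx record is reduced to a few columns, and the function name and sample values are hypothetical.

```python
# Illustrative subset only, not the full list of standard AE variables.
STANDARD_AE_VARIABLES = {"STUDYID", "DOMAIN", "USUBJID", "AESEQ", "AETERM", "AESEV"}


def split_off_nsvs(record, standard_variables, domain):
    """Separate one record into its standard part and simplified SUPPxx records."""
    parent = {k: v for k, v in record.items() if k in standard_variables}
    supp_records = [
        {
            "STUDYID": record["STUDYID"],
            "RDOMAIN": domain,
            "USUBJID": record["USUBJID"],
            "IDVAR": f"{domain}SEQ",
            "IDVARVAL": str(record[f"{domain}SEQ"]),
            "QNAM": name,
            "QVAL": value,
        }
        for name, value in record.items()
        if name not in standard_variables
    ]
    return parent, supp_records


ae_record = {
    "STUDYID": "STUDY1", "DOMAIN": "AE", "USUBJID": "SUBJ-001", "AESEQ": 1,
    "AETERM": "Headache", "AESEV": "MILD",
    "AETRTEM": "Y",  # non-standard variable, kept inline until the last step
}
parent, suppae = split_off_nsvs(ae_record, STANDARD_AE_VARIABLES, "AE")
print(parent)
print(suppae)
```

In the non-submission case you would simply never call such a function and keep "AETRTEM" where it is, next to the other variables of the record.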


Use of variable names

In SDTM, variable names are limited to 8 characters, may not start with a number, and may only contain alphanumeric (US-ASCII) characters. When categorizing non-submission data into SDTM, one should of course try to use the existing SDTM variables (and stick to their names), though without forcing the data into them. It will however often be necessary to define additional variables (NSVs). In such a case, one can of course stick to the 8-character rule, but in the non-submission case there is no absolute necessity for this. Due to this 8-character rule, more and more SDTM variables unfortunately have a name that is not mnemonic at all, such as "TUACPTFL" (as a researcher in academia, do you know what it means?). So, for non-standard variables, feel free to use your own, meaningful variable name, such as "DATE_OF_FIRST_DIAGNOSIS"; such a name may be longer than 8 characters, and may even start with a number or contain non-alphanumeric or non-US-ASCII characters. The same of course applies to variable labels: as you won't be using SAS-XPT, there is no reason not to go beyond the 40-character limitation. Be aware however that in such a case, transforming your datasets back to the "regulatory submission case", with all its rules, may be very time-consuming.
But once again, if your data fits an existing SDTM variable, use that, and leave the name and label of that variable unchanged, even when it is not very mnemonic.

Use of Test Codes

A similar rule applies to values of test codes (xxTESTCD): they may not be longer than 8 characters, may not start with a number, and may only contain alphanumeric (US-ASCII) characters. This immediately excludes SNOMED-CT and LOINC codes as test codes. For example, if the unique identifier for your test is the LOINC code "1751-7", describing a quantitative albumin test in serum/plasma, you are essentially not allowed to use it as a value in "LBTESTCD", as it starts with a number. You are not even allowed to use "LN1751-7", as it contains a dash. Please also take into account that in many of the SDTM domains, due to its post-coordinated nature, the xxTESTCD does NOT uniquely describe the test, but only part of it. In LB ("Laboratory") for example, it only defines the analyte (e.g. "GLUC" for "Glucose"). If you want to use a LOINC code as identifier, you should use the "LBLOINC" variable instead. If you want to use a SNOMED-CT code (e.g. because that is how the test is identified in the source electronic health record (EHR)), there is however nothing like an "LBSNOMED" or "VSSNOMED" variable. In such a case, you will normally need to map your SNOMED-CT code to the combination of xxTESTCD, xxTEST, xxSPEC and xxMETHOD (where xx is the SDTM "domain", like "LB" or "VS" or any other of the "Findings" domains). CDISC is currently developing such a mapping from the most used LOINC codes to LB (laboratory) test codes, but if your identifier is e.g. a SNOMED-CT code, you are on your own. Creating a "non-standard variable" without banishing it to a "SUPPQUAL" dataset may then be a good idea.
Let us take an example: you get the data from an EHR, and the provided SNOMED-CT code is "105723007", named "body temperature". As all your data comes from the same EHR system (which can even be a national EHR system), you might be tempted to put "105723007" in VSTESTCD, as it is clear to you that this is the "test code". According to SDTM, this is however not allowed: first, VSTESTCD is governed by CDISC controlled terminology, and second, "105723007" cannot be a valid test code according to SDTM, as it starts with a number and is longer than 8 characters. So, you will need to create a "non-standard variable" like "VSSNOMEDCT" and put "105723007" in there (and document in the define.xml that the "external" codelist SNOMED-CT was used). You will then need to put the equivalent CDISC term "TEMP" in VSTESTCD and "Temperature" in VSTEST. As there is no mapping available from SNOMED-CT to CDISC test codes, this may be extremely hard to automate. Does this make sense when you are not going to submit to FDA or PMDA? Do you want to develop this mapping from SNOMED-CT to CDISC-CT just for the sake of SDTM compliance? Or would you just ignore the SDTM requirements about test codes, put "105723007" in VSTESTCD, and state in the define.xml that VSTESTCD values come from SNOMED-CT? The choice is up to you.
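Whichever choice you make, the naming rule for test codes as described above is easy to check automatically. A minimal Python sketch follows; it implements the rule exactly as worded in this post (the function and pattern names are my own) and does not check whether a code actually exists in CDISC controlled terminology.

```python
import re

# Rule as described in the text: at most 8 characters, only US-ASCII letters
# and digits, and the first character must not be a digit.
TESTCD_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]{0,7}")


def is_valid_testcd(code):
    """Return True if `code` satisfies the naming rule for xxTESTCD values."""
    return TESTCD_PATTERN.fullmatch(code) is not None


for candidate in ["GLUC", "TEMP", "1751-7", "LN1751-7", "105723007"]:
    verdict = "allowed" if is_valid_testcd(candidate) else "not allowed"
    print(f"{candidate}: {verdict}")
```

The LOINC and SNOMED-CT codes from the examples fail this check, while the CDISC test codes "GLUC" and "TEMP" pass it.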

Use CDISC Controlled Terminology


CDISC has developed a lot of controlled terminology, not only for submissions but also for use at study design time, for example with CDISC-CDASH. Using CDISC controlled terminology already at study design is of utmost importance, as it will later make it much easier to generate SDTM. For example, for "severity of adverse event", CDISC has "mild", "moderate" and "severe". If you use four grades of severity in your study design, it may become difficult to map this to SDTM in such a way that your results remain comparable to other studies that follow CDISC controlled terminology.

And if you do not follow CDISC controlled terminology, e.g. because it does not fit for scientific reasons, please be sure to add and describe your own codelist in the define.xml. Usually this is a simple copy-paste from the CDISC-ODM file that contains your study design.

Part 2 will be published next week.