Disclaimer: this is Jozef's personal opinion. It is not
necessarily the opinion of CDISC. Some SDTM gurus may completely disagree with
what I write down here …
Background
CDISC SDTM is increasingly used in academic research
and in other clinical studies whose results are not submitted to regulatory
authorities such as the FDA and PMDA. A very good reason for using SDTM in
non-submission studies is that SDTM allows clinical research data to be categorized,
so that data from different studies can be compared much better than when
working directly from the source data. This interest in SDTM for non-submission
studies is also reflected in two sessions at the upcoming CDISC European Interchange in Amsterdam, on observational studies and "real world evidence".
"Normal" SDTM studies (to be submitted to FDA or PMDA) of course have
a number of requirements that do not apply to non-submission studies. This blog
highlights some of them and provides some "best practices" (in my own
personal opinion), which should make it easier for SDTM implementers who do not
need (or want) to submit to regulatory authorities.
Part 1 will concentrate on SAS-XPT format, Supplemental Qualifier datasets, use of variable names and non-standard variables, use of test codes and CDISC controlled terminology.
Part 2 (published next week) will concentrate on units, data redundancy and
other peculiarities of SDTM, and special FDA requirements.
Observational and Interventional studies
CDISC SDTM can be used for interventional studies (the classic use case) as well as for observational studies. In observational studies, some domains such as IE, EX, AE, DV and TV may be absent, the DV domain may be used differently, and custom domains will probably be necessary. "Visits" (reflected in the SDTM variables VISITNUM and VISIT) may not apply, and unplanned visits will outnumber planned visits.
An interesting presentation about experiences of using SDTM in observational
studies was given by Jon Neville and Bess LeRoy of CDISC at the last Phuse US Connect. It shows how
SDTM was used to order and categorize observational data for a number of
observational studies. The corresponding excellent paper can be found here.
In this blog, we will however concentrate on the more technical
aspects of non-submission studies, which can be interventional as well as
observational.
No need for SAS-XPT format
SDTM submissions to regulatory authorities need to be done using the SAS Transport 5 ("SAS XPT") format, a 30-year-old, completely outdated binary format. Although better alternatives have been proposed by CDISC, the FDA still mandates the use of this outdated XPT format. If, however, no submission is intended, there is no logical reason why SDTM implementers should use the XPT format at all. XPT has a large number of limitations, such as the 8-character limitation for variable names, the 40-character limitation for variable labels, and the 200-character limitation for values. Furthermore, SAS-XPT only supports US-ASCII characters, so that it cannot easily (or cannot at all) be used with languages other than English. In my opinion, the FDA requirement for XPT even discriminates against Spanish-speaking citizens of the USA, as there is no support for non-ASCII Spanish characters such as the "inverted question mark" (¿). This means that e.g. questionnaire questions in the Spanish language cannot be correctly represented in SAS XPT. And then we are not even talking about "Scandinavian" characters, and surely not about Japanese, Chinese, Korean …
These limitations are also described in the SDTM Model
and Implementation Guides
("IGs"). Having such limitations in the model is in my opinion very wrong,
as SDTM is in principle a "semantic standard" which should be
independent of the "transport format". The reason for merging
"semantic" and "transport" is of course a historical one. Other
organizations in healthcare have resolved this in a much better way. For
example, HL7 FHIR gives you the choice between 3 transport formats: JSON, XML
and RDF, but when reading the FHIR specification,
you do not see anything of this, you only notice it when examining the
examples.
So, in non-submission studies, I would recommend not using
the XPT format, but choosing a modern format without these limitations, such
as an XML format (CDISC Dataset-XML may be a very good choice),
JSON, or any of the RDF implementations.
Even better is not to use a transport format for storage at all (as the FDA is
doing), but to keep the SDTM data in a database, in which the variable names,
labels and values should of course not be limited to 8, 40 and 200
characters. For example, we have already seen SDTM datasets being stored as
CDISC Dataset-XML in a native XML database such as eXist or BaseX, but relational databases are of course also very
common.
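To make the character-set point concrete, here is a minimal Python sketch. The record layout is purely hypothetical (it is not CDISC Dataset-XML or any official JSON format), and the Spanish question text is an invented example. It only shows that a value containing "¿" cannot be encoded as US-ASCII, while a JSON round trip keeps it intact, together with a variable name longer than 8 characters.

```python
import json

# Hypothetical record: names and values are illustrative, not an official layout.
record = {
    "QSTESTCD": "Q01",
    "QSTEST": "¿Cómo se siente hoy?",          # non-ASCII Spanish question text
    "DATE_OF_FIRST_DIAGNOSIS": "2018-11-05",   # name longer than 8 characters
}


def fits_us_ascii(text: str) -> bool:
    """Return True if the text can be encoded using US-ASCII only."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# ensure_ascii=False keeps the characters readable in the serialized text.
serialized = json.dumps(record, ensure_ascii=False)
restored = json.loads(serialized)

print(fits_us_ascii(record["QSTEST"]))
print(restored == record)
```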
Avoid Supplemental Qualifier Datasets
SDTM "Supplemental Qualifier datasets" have,
in part, been invented to overcome the limitations of the SAS XPT format.
For example, if you have a value longer than 200 characters, the SDTM-IG (SDTM Implementation Guide) forces
you to split the text of the value into chunks of at most 200 characters, and to
generate a record in a separate SUPPxx dataset (where "xx" is the name
of the "parent" domain) for each of the chunks (except for the first
one). In the SUPPxx dataset, these records are then "linked" back to the
"parent" dataset with the help of "xxSEQ", which is the
"sequence number" of the parent record. This is due to the 200-character
limitation of SAS-XPT.
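The splitting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are invented, the cut is made at fixed positions rather than at word boundaries, the way the QNAM is numbered (MHTERM1, MHTERM2, …) is an assumption, and only a subset of the SUPPxx columns is shown.

```python
def split_into_chunks(value: str, limit: int = 200) -> list[str]:
    """Cut a text value into consecutive pieces of at most `limit` characters."""
    if not value:
        return [""]
    return [value[i:i + limit] for i in range(0, len(value), limit)]


def split_for_xpt(domain, usubjid, seq, variable, value, limit=200):
    """Return the value kept in the parent record plus the SUPPxx records.

    The first chunk stays in the parent dataset; every further chunk becomes
    one SUPPxx record that points back to the parent through xxSEQ.
    """
    chunks = split_into_chunks(value, limit)
    supp_records = [
        {
            "RDOMAIN": domain,
            "USUBJID": usubjid,
            "IDVAR": f"{domain}SEQ",
            "IDVARVAL": str(seq),
            "QNAM": f"{variable}{n}",  # assumed numbering scheme
            "QVAL": chunk,
        }
        for n, chunk in enumerate(chunks[1:], start=1)
    ]
    return chunks[0], supp_records


# A 450-character value gives one parent value and two SUPPMH records.
long_text = "x" * 450
parent_value, supp = split_for_xpt("MH", "SUBJ-001", 3, "MHTERM", long_text)
print(len(parent_value), [len(r["QVAL"]) for r in supp])
```

Note that every SUPPxx record carries the xxSEQ value of its parent, so the records can only be written once the sequence numbers are final.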
A practical problem arises here, as the rules around the
"xxSEQ" have the consequence that it is calculated at the end of the
dataset generation process, with the implication that any change or addition to
an SDTM dataset requires the recalculation of the "xxSEQ", and thus
also a complete re-generation of the SUPPxx dataset. Such a "split"
into 200-character chunks is of course counterproductive when using 21st
century IT, especially as SUPPxx datasets are extremely inefficient and very difficult to handle when analyzing the data.
Therefore, most vendors of SDTM mapping software, including our
own SDTM-ETL software, allow
variable values longer than 200 characters and then, in the very last step, when
the "FDA checkbox" is checked, do these splits, calculate the
"xxSEQ" and generate the SUPPxx datasets. In case no FDA submission
is envisaged, the datasets are generated in CDISC Dataset-XML. Mapping tools
from other vendors use a similar strategy.
The same applies to so-called "Non-Standard
Variables" (NSVs). The SDTM model and Implementation Guide define a number
of variables (the "standard variables") but allow other variables
to be added (sometimes also named "sponsor-defined variables"); these
are the NSVs. The values for these NSVs are however not allowed to reside in
the usual datasets, but (again) need to go into a "supplemental
qualifier" dataset SUPPxx, and be linked back to the "parent"
record by the xxSEQ sequence number. The background of this is probably that
FDA reviewers are unable to distinguish between "standard" and
"non-standard" variables due to lack of knowledge of the SDTM
standard.
The FDA even has a rule (in my opinion a ridiculous one) that requires submitting an NSV "AETRTEM" (labeled "Adverse Event Treatment Emergent"), which MUST go into the SUPPAE dataset.
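As an illustration of what moving NSVs into a SUPPxx dataset amounts to, here is a minimal Python sketch. It is not how any particular mapping tool works: the function name is invented, the set of "standard" AE variables is only a small illustrative subset, the record values are made up, and only some of the SUPPxx columns are generated.

```python
# Illustrative subset only; a real list would come from the SDTM-IG or the define.xml.
STANDARD_AE_VARIABLES = {"STUDYID", "DOMAIN", "USUBJID", "AESEQ", "AETERM", "AESEV"}


def split_off_nsvs(record: dict, standard_variables: set) -> tuple:
    """Split one record into the parent record and its SUPPxx records.

    Standard variables stay in the parent; every other variable becomes one
    SUPPxx record linked back to the parent through xxSEQ.
    """
    domain = record["DOMAIN"]
    parent = {k: v for k, v in record.items() if k in standard_variables}
    supp = [
        {
            "STUDYID": record["STUDYID"],
            "RDOMAIN": domain,
            "USUBJID": record["USUBJID"],
            "IDVAR": f"{domain}SEQ",
            "IDVARVAL": str(record[f"{domain}SEQ"]),
            "QNAM": name,
            "QVAL": str(value),
        }
        for name, value in record.items()
        if name not in standard_variables
    ]
    return parent, supp


# Hypothetical AE record in which AETRTEM is kept next to the standard variables.
ae_record = {
    "STUDYID": "STUDY1", "DOMAIN": "AE", "USUBJID": "SUBJ-001",
    "AESEQ": 1, "AETERM": "Headache", "AESEV": "MILD", "AETRTEM": "Y",
}
parent, suppae = split_off_nsvs(ae_record, STANDARD_AE_VARIABLES)
print(parent)
print(suppae)
```

Keeping the NSV in the record itself and deriving the SUPPxx records from it means the split can be postponed until a submission is really needed, or be skipped altogether.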
In a non-submission case, there is of course no reason to
"banish" NSV data to separate datasets, as any modern SDTM software
can immediately recognize whether a variable is a "standard" or
"non-standard" variable. There are even RESTful web services for this purpose,
and very soon, the RESTful web services of the "CDISC Library"
will also allow one to quickly find out whether a variable is "standard" or an
NSV. Furthermore, one can easily "mark" NSVs as such in the define.xml, which is
an XML file containing the metadata of the SDTM datasets. In case the SDTM data
is stored in a database, there is of course not a single reason at all to use
supplemental qualifiers in the non-submission case. Even in the submission
case, supplemental qualifiers are very often simply stored as normal variables
in the "parent" dataset or database tables, and "split off"
at the very last moment when the XPT files need to be generated and the
"FDA checkbox" is checked.
Use of variable names
In SDTM, variable names are limited to 8 characters, may not
start with a number, and may only contain alphanumeric (US-ASCII) characters.
When categorizing non-submission data into SDTM, one should of course try to
categorize the data using the existing SDTM variables (and stick to their
name), however without "brute force". It will however often be
necessary to define additional variables (NSVs). In such a case, one can of
course stick to the 8-character rule, but in the non-submission case, there is
no absolute necessity for this. Due to this 8-character rule, more and more
SDTM variables unfortunately have a name that is not mnemonic at all, such as
"TUACPTFL" (as a researcher in academia, do you know what it means?).
So, for non-standard variables, feel free to use your own meaningful variable
name (which may start with a number, contain non-US-ASCII or non-alphanumeric
characters, and be longer than 8 characters), such as
"DATE_OF_FIRST_DIAGNOSIS". The same of course applies to variable labels:
as you won't be using SAS-XPT, there is no reason not to go beyond the
40-character limitation. Be aware however that in such a case, transforming
your datasets back to the "regulatory submission case", with all its
rules, may be very time-consuming.
But once again, if your data fits an existing SDTM variable, use that, and leave the name and label of that variable unchanged, even when it is not very mnemonic.
Use of Test Codes
A similar rule applies to values of test codes
(xxTESTCD): they may not be longer than 8 characters, not start with a number,
and only contain alphanumeric (US-ASCII) characters. This immediately excludes
SNOMED-CT and LOINC test codes. For example, if your unique identifier for your
test is the LOINC code "1751-7",
describing a quantitative albumin test in serum/plasma, you are essentially not
allowed to use it as a value in "LBTESTCD", as it starts with a
number. You are not even allowed to use "LN1751-7" as it contains a
dash. Please also take into account that in many of the SDTM domains, the
xxTESTCD does NOT uniquely describe the test due to its post-coordinated nature,
but only part of it. In LB ("Laboratory") for example, it only
defines the analyte (e.g. "GLUC" for "Glucose"). If you
want to use a LOINC code as identifier, you should use the "LBLOINC"
variable instead. If you want to use a SNOMED-CT code (e.g. because that is the
way it is used as the test identifier in the source electronic health record
(EHR)), there is however nothing like a "LBSNOMED" or
"VSSNOMED" variable. In such a
case, you will normally need to map your SNOMED-CT code to the combination of
xxTESTCD, xxTEST, xxSPEC, xxMETHOD (where xx is the SDTM "domain",
like "LB" or "VS" or any other of the "Findings"
domains). CDISC is currently developing such a mapping from the most used LOINC
codes to LB (laboratory) test codes, but if your identifier is e.g. a SNOMED-CT
code, you are on your own. Creating a "non-standard variable" without
banishing it to a "SUPPQUAL" dataset may then be a good idea.
Let us take an example: you get the data from an EHR, and
the provided SNOMED-CT code is "105723007",
named "body temperature". As all your data comes from the same EHR
system (which can even be a national EHR system), you might be tempted to put
"105723007" in VSTESTCD, as it is clear to you that this is the
"test code". According to SDTM, this is however not allowed: first
of all, VSTESTCD is governed by CDISC controlled terminology, and secondly,
"105723007" cannot be a valid test code according to SDTM as it starts
with a number and is longer than 8 characters. So, you will need to
create a "non-standard variable" like "VSSNOMEDCT" and put
"105723007" in there (and document in the define.xml that the "external" codelist SNOMED-CT was used).
You will then need to put the equivalent CDISC term "TEMP" in
VSTESTCD and "Temperature" into VSTEST. As there is no mapping
available from SNOMED-CT to CDISC test codes, this may be extremely hard to
automate. Does this make sense when you are not going to submit to FDA or PMDA?
Do you want to develop this mapping from SNOMED-CT to CDISC-CT just for the
sake of SDTM compliance? Or would you just ignore the SDTM requirements about
test codes and put "105723007" in VSTESTCD and state in the
define.xml that VSTESTCD values come from SNOMED-CT? The choice is up to you.
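The format rules for test codes as stated in this post (at most 8 characters, not starting with a number, only US-ASCII alphanumeric characters) are easy to check in code. The following Python sketch encodes exactly those three rules and nothing more; the function name is invented, and it does not check whether a code is actually present in CDISC controlled terminology.

```python
import re

# One US-ASCII letter, followed by at most 7 US-ASCII letters or digits.
_TESTCD_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]{0,7}")


def is_valid_testcd_format(code: str) -> bool:
    """Check only the three format rules described in the text above."""
    return _TESTCD_PATTERN.fullmatch(code) is not None


# The examples used in this section.
for code in ["GLUC", "TEMP", "1751-7", "LN1751-7", "105723007"]:
    print(code, is_valid_testcd_format(code))
```

With these rules, "GLUC" and "TEMP" pass, whereas the LOINC code "1751-7", the prefixed "LN1751-7" and the SNOMED-CT code "105723007" are all rejected.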
Use CDISC Controlled Terminology
CDISC has developed a lot of controlled terminology, not only for submissions, but also for use at study design time, for example with CDISC-CDASH. Using CDISC controlled terminology already at study design is of utmost importance, as it will later allow SDTM to be generated much more easily. For example, for "severity of adverse event", CDISC has "mild", "moderate" and "severe". If you use four grades of severity in your study design, it may become difficult to map this to SDTM in a way that keeps your results comparable to other studies that follow CDISC controlled terminology.
And if you do not follow CDISC
controlled terminology, e.g. because it does not fit for scientific reasons, please
be sure to add and describe your own codelist in the define.xml. Usually this
is a simple copy-paste from the CDISC-ODM that contains your study design.
Part 2 will be published next week.