Saturday, February 2, 2013

SDTM and non-standard variables

The new draft SDTM-IG 1.4 package 2 contains an interesting statement in the "SDS Proposal for alternate Handling of Supplemental Qualifiers" (file: SDS_Proposal-Alternate_Handling_of_Supplemental_Qualifiers.pdf) document. I quote:

"The Supplemental Qualifiers structure was created to address the need to represent NSVs in the SDTM. It consists of a normalized data structure designed to allow for standard representation of what often is a wide variety of sponsor-specific variables.
The vision at the time the Supplemental Qualifiers structure was created was that standard review tools used at the FDA would automatically display the NSVs together with the parent data in tabular views. The reality, however, is the representation of NSVs in separate SUPP-- datasets has resulted in increased effort by reviewers to prepare for and perform reviews of SDTM datasets.
The end result is that a data structure that was created to provide a standard method for representing NSVs has not been the best structure for the viewing and analysis of SDTM data by FDA reviewers".

Where "NSV" means "Non-standard variables".

Essentially the above statement says that reviewers are not able to combine variables in SUPP-- domains with the parent domain and bring them "back" to the original domain.

I am not surprised.

In conversations with FDA representatives, I heard that SDTM data is viewed by reviewers "as is". Usually there is no attempt to generate a database from the submission datasets. Without that, it is not easy to recombine SUPP-- datasets with the parent domain.
Obviously, there is not much database knowledge at the FDA. I have always regarded their ambition to build a data warehouse as "a bridge too far": if they are not able to generate a database, how would they be able to generate a data warehouse?
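To illustrate why this recombination is non-trivial without database tooling, here is a minimal sketch in Python (with invented example data, not any FDA tool) of what merging a SUPP-- dataset back into its parent domain amounts to: each SUPPQUAL row points to the parent record via IDVAR/IDVARVAL and contributes one extra column (QNAM) with its value (QVAL):

```python
# Sketch only: re-attaching SUPP-- records to their parent domain.
# Variable names follow the SDTM SUPPQUAL structure (USUBJID, IDVAR,
# IDVARVAL, QNAM, QVAL); the data values are invented for illustration.

# A few parent-domain (AE) records, keyed by USUBJID and AESEQ
ae = [
    {"USUBJID": "001", "AESEQ": "1", "AETERM": "HEADACHE"},
    {"USUBJID": "001", "AESEQ": "2", "AETERM": "NAUSEA"},
]

# Corresponding SUPPAE records: one row per non-standard variable value
suppae = [
    {"USUBJID": "001", "IDVAR": "AESEQ", "IDVARVAL": "1",
     "QNAM": "AETRTEM", "QVAL": "Y"},
]

# Merge: for each SUPP-- row, find the parent record whose IDVAR column
# equals IDVARVAL, and add QNAM as a new column holding QVAL
for supp in suppae:
    for rec in ae:
        if (rec["USUBJID"] == supp["USUBJID"]
                and rec.get(supp["IDVAR"]) == supp["IDVARVAL"]):
            rec[supp["QNAM"]] = supp["QVAL"]

print(ae[0])  # the first AE record now carries AETRTEM = "Y"
```

A database (or a viewer that understands the SUPPQUAL convention) does this join automatically; a reviewer looking at two flat SAS Transport files has to do it by hand.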

The proposed solution in the above-mentioned document is not new: the ODM team already proposed something very similar five years ago, but the team's ideas for an ODM-based XML replacement of SAS Transport 5 were vetoed by the FDA.
I already wrote about that in the blog entry "SDTM Supplemental Qualifiers".

Generating a database from the submission datasets is not made easy by the structure of the SDTM. I wrote an earlier blog entry on this topic, titled "Is SDTM a database and if so - is it a good one".

At the beginning, SDTM was developed to contain "collected data only", with only a minimal amount of "derived data"; the idea was that derived data would go into ADaM.
It was however soon found that derived data would need to go into the datasets themselves. A simple example is "AGE" in the Demographics (DM) dataset.
AGE is in most cases not collected - it is derived (calculated) from BRTHDTC (birth date) and RFSTDTC (reference start date). Both of these are present and required in the DM dataset. So why is AGE then still necessary?
If SDTM were a database representation, "AGE" would not even be allowed, as it violates the "third normal form", which states that "transitive dependency" is not allowed.
However, I suspect that, as it was soon found that the FDA could not, or did not want to, generate a database from each submission, "AGE" was included in the SDTM as a variable of the DM domain.
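If AGE were dropped from DM, a viewer could derive it on the fly. A minimal sketch in Python (the function name and the example dates are my own, for illustration only):

```python
# Sketch: deriving AGE "on the fly" from BRTHDTC and RFSTDTC instead of
# storing it as a redundant DM variable. Dates are ISO 8601, as in SDTM.
from datetime import date

def derive_age(brthdtc: str, rfstdtc: str) -> int:
    """Age in completed years at the reference start date."""
    birth = date.fromisoformat(brthdtc[:10])  # date part of the ISO value
    ref = date.fromisoformat(rfstdtc[:10])
    # Subtract one year if the birthday has not yet occurred in the
    # reference year
    return ref.year - birth.year - ((ref.month, ref.day) < (birth.month, birth.day))

print(derive_age("1970-06-15", "2013-02-01"))  # → 42
```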

This turned SDTM into a "view" on a database.
Regenerating a real database from a view on it is, however, not an easy task.

In the last few years, we have seen over and over again that new variables are added to the SDTM. In many cases these are derived variables.
For example, SDTM 1.3 / SDTM-IG 3.1.3 adds the following variables to the DM domain:
  • RFXSTDTC: date/time of first study treatment [can however also be obtained from the EX dataset]
  • RFXENDTC: date/time of last study treatment [can however also be obtained from the EX dataset]
  • RFICDTC: date/time of informed consent
  • RFPENDTC: date/time of end of participation [can very probably be obtained from SE (subject elements) and/or from DS (disposition)]
  • DTHDTC: date/time of death: this is clearly derived - I cannot imagine this is a standard field on the Demographics form. I haven't seen it in CDASH either
  • DTHFL: death flag - similar
  • ACTARMCD: actual arm code - this really does not belong here. It should be a variable in the SE (subject elements) domain. But it looks as if it was put in DM because reviewers either do not inspect the SE datasets and/or cannot combine them with the DM dataset.
The most obvious example of a derived variable is "--TEST". There is a 1:1 correspondence between "--TESTCD" and "--TEST", clearly violating the third normal form of database design. The somewhat crazy thing is that in define.xml this is not taken into account: there are separate codelists for "--TESTCD" and for "--TEST". As "--TEST" is a synonym qualifier for "--TESTCD", the correct construct should be something like:
<CodeListItem CodedValue="SYSBP">
    <Decode><TranslatedText>Systolic blood pressure</TranslatedText></Decode>
</CodeListItem>

and "--TEST" should not even be in the SDTM dataset.
In a tool that can look up both the metadata and the data itself, the column "--TEST" would either be generated "on the fly", or the test name (the value of --TEST) could e.g. appear as a tooltip when the mouse is positioned over the cell with the test code. With the ancient SASViewer this is of course not possible (it does not even understand define.xml).
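As a sketch of what such a tool would do, the following Python fragment builds the --TESTCD to --TEST lookup from a codelist of that kind (simplified: a real define.xml declares the ODM namespace, which I omit here for brevity):

```python
# Sketch: generating the --TEST column "on the fly" from the --TESTCD
# codelist in define.xml, so --TEST need not be stored in the dataset.
# The fragment is simplified; a real define.xml uses the ODM namespace.
import xml.etree.ElementTree as ET

DEFINE_FRAGMENT = """
<CodeList OID="CL.VSTESTCD" Name="VSTESTCD" DataType="text">
  <CodeListItem CodedValue="SYSBP">
    <Decode><TranslatedText>Systolic blood pressure</TranslatedText></Decode>
  </CodeListItem>
  <CodeListItem CodedValue="DIABP">
    <Decode><TranslatedText>Diastolic blood pressure</TranslatedText></Decode>
  </CodeListItem>
</CodeList>
"""

root = ET.fromstring(DEFINE_FRAGMENT)
# Map each coded value (--TESTCD) to its decode (--TEST)
decode = {item.get("CodedValue"): item.findtext("Decode/TranslatedText")
          for item in root.iter("CodeListItem")}

print(decode["SYSBP"])  # → Systolic blood pressure
```

A viewer would run this lookup when loading the dataset and display the test name as an extra column or as a tooltip, without --TEST ever being transported.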

I have also always been puzzled about why we have both --STRESC and --STRESN. For --STRESN the IG says: "copied in numeric format from --STRESC". If the value in --STRESN is a copy of the value in --STRESC, why do we need it at all? Maybe to tell the reviewer that the value is a numeric one? But that is already stated in the define.xml (ValueList / ItemDef), isn't it?
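If --STRESN were dropped, a viewer could likewise derive it on the fly from --STRESC. A minimal sketch (the function name is my own):

```python
# Sketch: --STRESN as an on-the-fly numeric view of --STRESC rather
# than a stored copy. Non-numeric results simply yield no --STRESN.
def derive_stresn(stresc: str):
    try:
        return float(stresc)
    except ValueError:
        return None  # e.g. "TRACE", "NEGATIVE": no numeric counterpart

print(derive_stresn("120"))    # → 120.0
print(derive_stresn("TRACE"))  # → None
```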

So why all these derived variables?

I suspect that the reason is very simple: the FDA is not able to do anything with the submitted datasets other than inspect them with the ancient SASViewer. They are not able to combine datasets with each other (generate views on views), and they are not able to generate databases from them.
So each time they need a new "view", they just ask CDISC to define a new (derived) variable that is then added to the standard. The more information the FDA wants to retrieve from the datasets, the more new (derived) variables they will ask to be included in the SDTM standard.
As can be expected, this will lead to even more "huge" datasets (as the number of variables is continuously increasing), and then we are surprised that the FDA complains about file sizes!

Is there a way out of this?

I think there is, once we have SDTM in an XML (ODM-based) format. When both define.xml and the SDTM datasets use the same model/format (ODM), we can build software (e.g. viewers) that allows users to "look up" information, or "calculate it on the fly".
For example, for the DM domain, suppose we leave "AGE" out of the dataset. The viewer can then take care that "AGE" is calculated "on the fly" and displayed either as an additional column (although it was not in the dataset), or as a tooltip for one of the cells (e.g. tooltip on USUBJID or on BRTHDTC).
Similarly, for the Findings domains, suppose we leave "--TEST" (test name) out of the dataset and only provide "--TESTCD". As both are connected to each other through the "code / decode" in the metadata, the column "--TEST" can automatically be generated when loading the data from the information in the define.xml. Or the value for --TEST can be supplied as a tooltip on the "--TESTCD" cell in the table.

Once we have such an XML-based format, we can start thinking about deprecating a number of SDTM variables, as it will essentially be the tool (e.g. the viewer) that takes care of deriving them, either from within the same dataset (e.g. AGE), from other datasets (e.g. RFXSTDTC, RFXENDTC), or from the metadata in define.xml (e.g. --TEST - test name).
This will bring us to a much cleaner SDTM with much less redundant information, and considerably smaller files (I estimate a 50% reduction).

The ultimate solution is however SHARE. Once we can allocate each data point as being a SHARE object, things will become much easier.
There has been a lot of discussion about a format for SHARE. I am however more and more convinced that the format is completely unimportant, and that SHARE objects can be expressed in RDF, in define.xml and ODM, or even in Excel (if you like that).

But that's a topic for another blog entry ...


  1. Jozef, I very much like the practical and clear examples in your postings. But help me understand what you expect from SHARE objects in the same practical and clear way. I don't think it's helpful to just say SHARE is the ultimate solution (sometime in the future)

  2. This comment has been removed by the author.

  3. Thanks Kerstin,

    A good tip!

    At the US Interchange, Dave IH gave a presentation showing how starting from a worksheet with SHARE information, the SDTM information (similar to the tables in the IG) can be generated automatically.
    My remark on that was that this can also be used to automatically generate define.xml templates. This would also help implementers to get rid of the copy-and-paste frustration they now have when implementing SDTM in software.

    SDTM is currently 2-dimensional. The tables have things like "grouping qualifiers", "synonym qualifiers", "variable qualifiers". Those having some knowledge of database design understand that the latter two should NOT be part of the tables, as they violate the third normal form.
    In XML, the "grouping qualifier" would just be a parent element, the "synonym qualifier" would not be part of the data but of the metadata (i.e. the define.xml), and the "variable qualifier" would be an attribute of the result qualifier.
    So why don't we just do that! For example, something (but don't pin me down on the format!) like:

    <ItemData ItemOID="IT.VSORRES.SYSBP" Value="96" MeasurementUnitOID="MU.MMHG" IsBaselineFlag="Yes"/>


    <ItemData ItemOID="IT.VSORRES.DIABP" IsNull="Yes" ReasonNotDone="test was forgotten"/>

    We need no "Label" or "synonym" or so here as the meta-information is in the corresponding variable definition in the define.xml already.

    Starting to look like SHARE objects, doesn't it?

    1. Nice. I would like to take your good, practical example one step further and think beyond exchange of data and envision how clinical data can be instantiated from the beginning.

      When you have what is common for the Types of Observations (or, Research Concepts as I think the SHARE folks call them) described in a machine processable and linkable way you can use them to typify instances of clinical observations. For example an observation of Systolic Blood pressure in a study (D9999C99999) from a 'pharma' company would have a unique URI ("clickable data"). And it would be an instance (object) of a type of finding defined in SHARE ("clickable metadata").

      A simple RDF triple would express that relationship: rdf:type

      I started some exploratory work on this for my first CDISC presentation back in 2011, where I used a patient data record ontology for my experiments (see slide 40-41).

      When we now see HL7 RIM and BRIDG models being published as ontologies, we can explore whether these can be used not only as the basis for exchange standards but also to instantiate data. However, what I have seen so far of the quite complicated extensions the GSK folks have proposed for SHARE, to cope with how things have been modelled in BRIDG, does make me a bit concerned.