Friday, June 16, 2017

SDTM, XPT and the constitution

Imagine that the constitution of your country stated "cars must be powered by gasoline".
Would you find that acceptable?
Now, we all know that most cars are powered by gasoline, but such a statement in the constitution would give electric cars no chance at all, even when these are friendlier to the environment.

Something very similar happens at CDISC: the new SDTM Model v.1.6 (so not the Implementation Guide) has been written with only 1 implementation in mind: SAS-XPT format.

Standards models should be developed and published independently of the transport format. A very good example is HL7-FHIR, for which there are three technical implementations: XML, JSON and RDF. The documentation has been published in a transport-format-independent way. It is only when you go to the examples (which you can consider as an Implementation Guide) that you will see something about the transport format.

So, as part of the public review of the SDTM Model v.1.6, I asked the SDTM team to change the text of the model in such a way that it is transport-format neutral. This would then allow other transport formats such as XML (e.g. Dataset-XML), JSON and RDF to be used for transporting SDTM data in the future.

My request was turned down.
Here is the justification of the SDTM team (snapshot from the JIRA site):

"Considered for future" is the usual expression of the team for "refused".

This answer is indeed a "doom loop": it gives the FDA a reason for further refusing to allow a modern format. When asked about it, they can then say "we can't do that, it is not allowed by the SDTM model".

Over the last 5-10 years, I have observed that the SDTM model and standard have evolved in such a way that all first principles have been thrown away, such as avoidance of data redundancy, no derived data, and separation between model and implementation. This makes the standard more and more difficult to implement and ruins data quality. Essentially, one can say that it has been steered into a "dead end".

How can this be changed?
I must honestly say that I do not know the answer. "The train has left the station, but is it on the right track?" is a question that is not even posed within CDISC, and especially not within the SDTM team. Maybe the team needs some strong guidance itself, or responsibilities must be reassigned. There are some bright, progressive people within CDISC, but they are not involved in SDTM. Maybe it is time to give them the lead in SDTM development.

Friday, April 7, 2017

--LOBXFL can seriously damage your health

The addition of the new variable --LOBXFL (Last Observation before Exposure Flag) in SDTM 1.5 remains a controversial topic (as discussed here and here). According to the definition, --LOBXFL is "operationally derived", but the SDTM 1.5 specification does not say "how" it should be derived. There have been several complaints about this during the review period, but they were waved away with the argument that they "should be addressed in any implementation guide". I am curious ...
My own request to "please provide guidance" was answered by:

 which I don't understand...

Now you may ask why I am so concerned about the addition of this new "derived" variable. Here are some issues:
  • derived variables should not appear in SDTM. Again, the SDTM team has given in to a request from the FDA caused by primitive and immature review tools used by some FDA reviewers
  • baseline flags should not appear in SDTM - they belong to ADaM
  • sponsors should not be asked to do the work of FDA reviewers - the latter have to make their own decisions about which of the data points is "the" baseline data point.
  • Assigning --LOBXFL is used to "camouflage" bad data quality. SDTM datasets with bad data quality should not be used in submissions and should not be accepted by the FDA.
Let me give an example.

The following is a snapshot of a VS dataset with measurements done on the date "2014-01-02" which is also the date of first exposure (according to EX, and to RFXSTDTC in DM - another unnecessary derived variable) using the open source "Smart Dataset-XML Viewer":

(remark that some columns have been swapped for better visibility)

According to the protocol, all vital signs measurements during this visit must be done before first drug intake. So the sponsor assigned VSLOBXFL to the diastolic blood pressure measurement with the value "76". What the sponsor does not know, however, is that the researcher did the measurement immediately after the intake of the medication. As is too often the case, only the date (for the measurement as well as for the drug intake) was recorded, not the exact time.
Of course, the sponsor could also have assigned VSLOBXFL to all three measurements on the date 2014-01-02, but as the standard does not specify "how" the derivation should be made ...
The same applies to the "PULSE" and "SYSBP" (systolic blood pressure) measurements:


If one inspects the data carefully, one will see that each of the "VSLOBXFL" records shows an increased value for that specific measurement. This increased value may have been caused by the intake of the study drug. However, this is neither visible nor detectable, as no times have been collected (as is very usual) for either the measurement or the drug intake. Even worse, the increased value is marked as the baseline value, which may mean that the reviewer, when looking at later data points, comes to the conclusion that the drug is lowering blood pressure and pulse, whereas it is exactly the opposite...

How does the "Smart Dataset-XML Viewer" deal with such a situation? 

One of the options of the "Smart Dataset-XML Viewer" is:

When using it on a data point that is undoubtedly (as it is on another day) the last measurement (for a specific test code) before first exposure, the viewer highlights the record:

However, when the measurement is on the same day as the first exposure, and either the time part of the measurement or of the first exposure is not provided, the "Smart Dataset-XML Viewer" will highlight the record or records and provide a warning:

 I pointed the SDTM team to all this in an additional review comment, which was answered as:

to which I responded:

FDA reviewers also have free access to the "Smart Dataset-XML Viewer", so they could use it too. On the other hand, the algorithm can also easily be implemented in SAS or any other modern review software.
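To illustrate how little is needed, here is a minimal sketch in Python of such an algorithm. It is not the viewer's actual code: the names Observation and last_before_exposure are made up for this illustration, and the sketch assumes that all date/times are ISO 8601 strings that, when a time part is present, use the same precision.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Observation:
    testcd: str   # test code, e.g. "DIABP"
    dtc: str      # ISO 8601 date or date/time, e.g. "2014-01-02" or "2014-01-02T08:30"
    value: float


def has_time(dtc: str) -> bool:
    """True when the ISO 8601 string carries a time part."""
    return "T" in dtc


def last_before_exposure(
    observations: List[Observation], first_exposure_dtc: str, testcd: str
) -> Tuple[List[Observation], Optional[str]]:
    """Return the candidate record(s) for one test code, plus a warning if undecidable."""
    exposure_date = first_exposure_dtc[:10]
    same_test = [o for o in observations if o.testcd == testcd]

    # Same-day records cannot be ordered relative to dosing unless BOTH
    # the measurement and the first exposure have a time part.
    ambiguous = [
        o for o in same_test
        if o.dtc[:10] == exposure_date
        and not (has_time(o.dtc) and has_time(first_exposure_dtc))
    ]
    if ambiguous:
        return ambiguous, (
            "measurement on the day of first exposure, but the time of the "
            "measurement or of the first exposure is missing"
        )

    # Otherwise a plain string comparison of ISO 8601 values is sufficient.
    before = [o for o in same_test if o.dtc < first_exposure_dtc]
    if not before:
        return [], None
    return [max(before, key=lambda o: o.dtc)], None
```

Applied to the case above (measurement and first exposure both on 2014-01-02, no times collected), the function returns the same-day record together with the warning instead of silently declaring it the baseline.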

In conclusion, --LOBXFL is not only unnecessary, it also camouflages bad data quality. For reviewers, it is even potentially dangerous to rely on it, as demonstrated above.
With --LOBXFL, it is only a matter of time before the first patient has his/her health seriously damaged ...

Sunday, March 5, 2017

Validating SDTM labels using RESTful web services

About a month ago, I reported on my first experiences with implementing the new CDISC SDTM-IG Conformance Rules. I have now made considerable progress, having >60% of the rules implemented. These implementations are available for download and usage from here.

Today I want to elaborate a bit on how I implemented rule CG0303 "Variable Label = IG Label", using RESTful web services. Earlier implementations from others were based on copying/pasting the labels from the SDTM-IG and then hard-coding them in software. This not only means a lot of work, it is also error-prone, with the disadvantage that a software update is needed each time an error in the implementation is found. For example, if you search the forum of the validation software that the FDA is using for the wording "label mismatch", you will find many hits, especially about false positive errors. In some cases, one even gets an error on a label that looks 100% correct, but the software does not tell you what text it expects for the label. "Let the guessing begin"!
So we definitely need something better. Wouldn't it be better to use the SHARE content, load it into a central database, and query that database using a modern (easy-to-implement) RESTful web service?

That is exactly what we did. All SDTM-IG information (from different IG versions) and all CDISC controlled terminology that is electronically available was loaded into a database, and RESTful web services were developed to make them available to anyone, and to any application. These RESTful web services (over 30 of them) are described here. Adding a new service usually takes 1-2 hours, sometimes even less.

One of these services allows one to retrieve all necessary information for a given variable in a given domain for a given SDTM-IG version. The RESTful query string description is: {sdtmigversion}/{domain}/{varname}

which is pretty self-explanatory. For example, to get all the information about the variable ECPORTOT in the domain EC for SDTM-IG 3.2, the query string is:

This service can now easily be used to validate labels in submissions, like in implementations of rule CG0303. Let's do so for a sample SDTM submission.
In our case, the SDTM submission resides in a native XML database (something the FDA SHOULD also do instead of messing around with SAS-XPT datasets). Here is the implementation of rule CG0303 in XQuery, an easy-to-learn language that is both human-readable and machine-executable (so the rules are 100% transparent):

In the first part, the XML namespaces are declared and the location of the define.xml for this submission is set (usually this will be done by passing these as parameters from within the calling application). The base of the RESTful web service is also declared.

Here is the second part:

For each dataset in the submission (by iterating over all the dataset definitions "ItemGroupDef"), we get the domain name either from the dataset name or from the "Domain" attribute in the define.xml (goes into $domain), and then start iterating over all the variables declared for the current dataset:

The variable name is obtained, and the label taken from the define.xml (remark that when using SDTM in XML, the label is in the define.xml and NOT in the dataset itself - which follows the good practice of separating data from metadata). The web service is then triggered returning the expected label from the database (can be SHARE in future), and the actual and expected label are compared. Remark that for some variables, there will not be a label from the SDTM-IG, as the variable is just not mentioned in the SDTM-IG, although it is allowed for that domain. In that case, there is nothing to compare.

If the two labels do not match, an error (in XML) is returned. An example is:

showing both the actual and the expected label.
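For readers who do not use XQuery, the same logic can be sketched in Python. This is only an illustration, not the implementation discussed above: BASE_URL is a hypothetical placeholder (the real service address is not repeated here), the expected label is obtained through a caller-supplied function so that the sketch does not depend on the exact response format of the web service, and the define.xml is assumed to follow Define-XML 2.0 (ODM 1.3 namespace, label in ItemDef/Description/TranslatedText).

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Optional

ODM = "http://www.cdisc.org/ns/odm/v1.3"
NS = {"odm": ODM}

# Hypothetical placeholder, NOT the address of the real web service.
BASE_URL = "https://example.org/sdtmig-variable-info"


def query_url(sdtmig_version: str, domain: str, varname: str) -> str:
    """Fill in the {sdtmigversion}/{domain}/{varname} template."""
    return f"{BASE_URL}/{sdtmig_version}/{domain}/{varname}"


def check_labels(
    define_xml: str,
    sdtmig_version: str,
    expected_label: Callable[[str], Optional[str]],
) -> List[Dict[str, str]]:
    """Compare each variable label in the define.xml with the expected IG label.

    expected_label receives the query URL and returns the IG label, or None
    when the SDTM-IG does not mention the variable (nothing to compare).
    """
    root = ET.fromstring(define_xml)
    item_defs = {item.get("OID"): item for item in root.iter(f"{{{ODM}}}ItemDef")}
    findings = []
    for group in root.iter(f"{{{ODM}}}ItemGroupDef"):
        # Domain attribute if present, otherwise the dataset name
        domain = group.get("Domain") or group.get("Name")
        for ref in group.findall("odm:ItemRef", NS):
            item = item_defs.get(ref.get("ItemOID"))
            if item is None:
                continue
            name = item.get("Name")
            actual = item.findtext(
                "odm:Description/odm:TranslatedText", default="", namespaces=NS
            ).strip()
            expected = expected_label(query_url(sdtmig_version, domain, name))
            if expected is not None and actual != expected:
                findings.append(
                    {"rule": "CG0303", "dataset": group.get("Name"),
                     "variable": name, "actual": actual, "expected": expected}
                )
    return findings
```

In a real setting, expected_label would call the RESTful web service and extract the label from its response; for testing, a simple dictionary lookup can be passed instead.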

As the validation errors ("deviation" or "discrepancy" would in fact be a better word) come in XML, they can (unlike Excel or CSV) be used in many ways, and even ... stored in a native XML database ;-).

Sunday, January 8, 2017

Why rule FDAC036 is hypocritical

We have all encountered the message "Variable length is too long for actual data" when validating our SDTM, SEND or ADaM submissions using the "Pinnacle21 Validator". This error message appears when we generate a SAS file in which a variable has been assigned a length that is (1 byte or more) larger than the length of the longest value for that variable.
For example, if your longest AETERM has 123 characters, and you assigned a (SAS) length of 124, then this error will appear in the validation report - usually causing a lot of panic at the sponsor, as it might lead to a rejection of the submission.

In my prior blog entry, I already showed that SAS Transport 5 (SAS-XPT) is a very inefficient transport format. At the time it was developed, it was meant to enable exchange of data between different SAS systems, one on an IBM mainframe, the other being a VAX computer. Do you still own or use one of these? I don't. The format was quite OK for this purpose in those days, but it is still unclear why the FDA selected that format, especially as I showed that CSV (comma separated values) is about 7 times as efficient on average.
The FDA is mandated by law to be vendor-neutral. Although the specification for the SAS-XPT format is public (the famous TS-140 document), it is rather difficult to implement in non-SAS software, so one cannot say it is really vendor-neutral. So why did the FDA select this format in favor of CSV? Or was CSV not there yet, or couldn't it be read by the programs the FDA was using? If you know it, or can point me to any public literature, please let me know.

The FDA is always complaining about too large files, and that is why they came with the famous rule FDAC036. But isn't that the result of their own choice for the inefficient SAS-XPT format?

But let us also have a look at the SDTM standard itself. Those who have followed the evolution of the standard in the last 15 years or so know that with each new release, the number of variables has increased, thus leading to larger file sizes (also because in SAS-XPT a NULL value takes the same number of bytes as a non-NULL value). Most of these new variables have been added ... on request of the FDA. Even worse is that most of the variables added on request of the FDA contain redundant information. A typical example is the --DY (study day) variable appearing in almost every domain. Its value can easily be calculated (also "on the fly") from the --DTC (date/time of collection) and the reference start date/time (RFSTDTC) in the DM (Demographics) dataset.

So why do we need to add --DY to most of the datasets (with the danger that it is incorrect) whereas it can be calculated "on the fly"? The FDA answer is "in order to facilitate the review process". Does this mean that the review tools of the FDA cannot even do the simplest derivations? It can't be that hard - I added this feature to the open source "Smart Dataset-XML Viewer" in just one evening!
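To show how simple this derivation is, here is a minimal sketch in Python (not the viewer's actual code; the function name study_day is chosen for this illustration). It follows the usual SDTM convention that the reference start date itself is day 1 and that there is no day 0, and it returns nothing when one of the dates is incomplete.

```python
from datetime import date
from typing import Optional


def study_day(dtc: str, rfstdtc: str) -> Optional[int]:
    """Derive --DY from --DTC and RFSTDTC (both ISO 8601 strings)."""
    # A complete date part (YYYY-MM-DD) is needed on both sides.
    if len(dtc) < 10 or len(rfstdtc) < 10:
        return None
    observed = date.fromisoformat(dtc[:10])
    reference = date.fromisoformat(rfstdtc[:10])
    delta = (observed - reference).days
    # On or after the reference date: add 1 (reference date = day 1).
    # Before the reference date: plain difference (day before = day -1).
    return delta + 1 if delta >= 0 else delta
```

For example, study_day("2014-01-02", "2014-01-02") gives 1, and study_day("2014-01-01", "2014-01-02") gives -1.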

Another famous example is the "EPOCH" variable (rule FDAC021) which can normally (in a well designed study) be derived "on the fly" from the --DTC and/or the visit number. But it looks as if the FDA prefers to add an extra variable to account for badly designed studies instead of requiring that studies are well designed.

There are very many variables in SDTM that are unnecessary, and could easily be removed from the standard, as they contain redundant information. Even the --TEST (test name) variable could easily be removed, as it can simply be looked up (again "on the fly") in the define.xml.

In this example, LBTEST has been removed from the dataset, but the tool simply looks it up in the define.xml from the value of LBTESTCD

I estimate that about 20% of the SDTM variables are redundant, accounting for about 30% of the file size! So even when using the inefficient SAS-XPT format, file sizes could be reduced by about 30% by removing these redundant variables, with the additional advantage of considerably improved data quality (redundancy is a killer for data quality).

Did you ever count how many times the same value for "STUDYID" appears in your submission SAS-XPT datasets? Well, it is in every record, isn't it? The SAS-XPT format requires you to store it millions of times with the same value. Is that efficient? The reason for this is that essentially, the SDTM tables represent a "View" on an SDTM-database, rather than a database itself. In a real database, STUDYID would be stored once in a table with all studies (e.g. for the submission), and all other tables would reference it using a "foreign key", meaning that the other tables do not contain the STUDYID value itself, but a pointer to the value in the "studies" table. Now a pointer uses considerably fewer bytes than a (string) value itself.
The same applies to USUBJID: they are defined once (in DM) and should then be referenced (foreign key) from all other tables (using a pointer). Instead, SAS-XPT requires you to "hardcode" the value of each USUBJID as a string (not as a pointer) in the datasets.
For example, the well-known "LZZT 2013 pilot submission" has 121,749 records in the QS dataset for 306 subjects (an average of 398 records per subject). This QS dataset contains 121,749 times the same STUDYID value (12 bytes) and, on average, 398 times the same value for USUBJID per subject. So on average, the same value for USUBJID (11 bytes) is hard-coded 398 times in the dataset, instead of using record pointers to DM. What a waste!

Remark that in our "Smart Dataset-XML Viewer", we do use pointers in such a case, in order to save memory, using the principle of "string interning".
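The numbers for the QS example can be checked with a few lines of Python. The record, subject and byte counts are the ones given above; the 4-byte pointer size is an assumption made only for this back-of-the-envelope comparison.

```python
RECORDS = 121_749      # records in QS
SUBJECTS = 306         # subjects in the study
STUDYID_BYTES = 12     # length of the STUDYID value
USUBJID_BYTES = 11     # length of each USUBJID value
POINTER_BYTES = 4      # ASSUMPTION: size of one foreign-key pointer

# Average number of QS records per subject
records_per_subject = RECORDS / SUBJECTS

# Flat layout: STUDYID and USUBJID are repeated as strings in every record
flat_bytes = RECORDS * (STUDYID_BYTES + USUBJID_BYTES)

# Normalized layout: STUDYID stored once, each USUBJID stored once,
# and every record only carries a pointer to its subject
normalized_bytes = (
    STUDYID_BYTES + SUBJECTS * USUBJID_BYTES + RECORDS * POINTER_BYTES
)

print(f"records per subject: {records_per_subject:.0f}")
print(f"flat layout:         {flat_bytes:,} bytes")
print(f"normalized layout:   {normalized_bytes:,} bytes")
```

Under these assumptions, the two identifiers alone take about 2.8 million bytes in the flat layout versus about 0.49 million bytes when stored once and referenced by pointer.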

But what if we could organize our datasets hierarchically? For example, order by subject and then by visit? So that in each dataset, the value of USUBJID would only appear once? And doesn't the "def:leaf" element in the define.xml already connect the STUDYID with the dataset itself, so that it is unnecessary in the dataset itself? That would be considerably more efficient, wouldn't it?

The former (organization of the data per subject per visit) is exactly what the ODM standard is doing! The new Dataset-XML (based on ODM) doesn't do this: the CDISC development team decided to keep the old "2-dimensional" (but inefficient) representation in order to make it easier for the FDA to make the transition. Organizing the SDTM/SEND/ADaM data in the way ODM does it originally would further make the transport (file) more efficient.
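The idea of the hierarchical organization can be sketched in Python as follows. This is only an illustration of the principle, not ODM itself: the function name and the use of nested dictionaries are choices made for this example, and the records are assumed to be simple dictionaries carrying USUBJID and VISITNUM.

```python
from collections import defaultdict
from typing import Any, Dict, List


def nest_by_subject_and_visit(
    records: List[Dict[str, Any]]
) -> Dict[str, Dict[Any, List[Dict[str, Any]]]]:
    """Turn flat records into subject -> visit -> list of remaining fields."""
    tree: Dict[str, Dict[Any, List[Dict[str, Any]]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for record in records:
        # USUBJID and VISITNUM become keys of the hierarchy, so they are
        # no longer repeated inside every single record.
        rest = {
            key: value for key, value in record.items()
            if key not in ("USUBJID", "VISITNUM")
        }
        tree[record["USUBJID"]][record["VISITNUM"]].append(rest)
    return {subject: dict(visits) for subject, visits in tree.items()}
```

In the resulting structure, each USUBJID value is stored exactly once, however many records the subject has.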

But should all that matter? My colleagues in bioinformatics laugh at me when I tell them about the FDAC036 rule. In their business, the amount of information is much much higher, and they are able to exchange it efficiently, e.g. by using RESTful web services to exactly retrieve what is necessary.
As I already stated in the past, large amounts of data belong in databases, not in files. The file can only be a way of transport of data between applications. Essentially, when a submission arrives at the FDA, it should be immediately stored in a database (could e.g. also be a native XML database), and the reviewers should only be allowed to query such databases - they should not be allowed to mess around with files (XPT or any others). But we are still far from such a "best practice" situation, unfortunately.


Rule FDAC036 forces us to "save on every possible byte" when generating our SAS-XPT datasets, in order to avoid that their sizes become too large (for the gateway?). However, the SAS-XPT format itself is highly inefficient, and file sizes have grown considerably due to ever new requirements of the FDA, adding new redundant (SDTM) variables. Also we are forced to stay working with the highly inefficient two-dimensional representation, with lots of unnecessary repeats of the same information.

And I have not even mentioned the FDA's prohibition on submitting compressed (zipped) datasets, which would reduce the file sizes by a factor of 20 and more.

It's up to you to decide whether FDA rule 036 is hypocritical or not ...