Wednesday, August 9, 2017

Implementing SDTM 1.5 in software: first impressions

Last Monday, thanks to a short break in my vacation (thunderstorms and mountaineering do not go well together), I started implementing SDTM 1.5 in our popular SDTM-ETL mapping software.
The reason is that some of our customers want to start working with SEND 3.1, which is the first implementation of SDTM 1.5. Note that there is no SDTM-IG based on SDTM 1.5 yet, only a SEND-IG.

What were the difficulties encountered? What were my first impressions on how easy or difficult it is to implement SDTM 1.5 in software?

First of all, there is no really good machine-readable version of SDTM 1.5 (i.e. a define-XML template). There is an Excel file available from SHARE with a list of the variables, and an Excel list of differences with version 1.4. From the former, I could generate an XML file with the variables, which I could use for the automated generation of the "CDISC Notes" in the software, and which also helped me somewhat in generating a SEND 3.1 define.xml template for SDTM-ETL.
All the other things had to be done using the "good old" methods, i.e. copy-and-paste from the PDF documents. As I am not paid by the hour, you can guess that I didn't like this too much.

Once the SEND 3.1 define.xml template was generated, I could start on the nitty-gritty details. They require carefully reading the specification or IG, interpreting what is written there, and programming it in the software. Interpreting a specification is always dangerous, as can be seen from the very many false positives generated by the validation software used by the FDA (no, not generated by our company).

The first problem I encountered is that the list of SDTM variables that are "never used in SEND" (see SEND-IG 3.1) does not come as machine-readable information. So, copy-and-paste was necessary.
SDTM-IG 3.2 (based on SDTM 1.4) provided a list of "not generally used" variables. As there is no SDTM-IG based on SDTM 1.5 yet, I did not implement this yet, and simply copied the list from 1.4 for the moment.

The new "--LOBXFL" ("Last Observation Before Exposure Flag"), which I already criticized in the past as it is essentially a derived variable (derived variables do not belong in SDTM), is something I already implemented in the software a few months ago, as I realized it has a major impact on the software. The user can now choose between writing a mapping script himself, or having the values auto-generated while the SDTM datasets are being generated. The latter requires an extra step, as the generated SDTM data need to be ordered by subject and test, and compared with RFXSTDTC, which is in another dataset. It must also be said that the text in the SDTM 1.5 specification lacks detail. It says "Operationally-derived indicator used to identify the last non-missing value prior to RFXSTDTC. Should be Y or null." It doesn't state whether this is "per unique test". I presume it is (opening the discussion again about what a "unique test" is). I strongly believe standards specifications should be exact and precise. The definition of "--LOBXFL" in SDTM 1.5 isn't.
Also note that "--LOBXFL" is not even mentioned in the SEND-IG 3.1.
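
To make the extra step more concrete, here is a minimal sketch in Python (not the actual SDTM-ETL code). It rests on assumptions that the specification does not confirm: "last" is taken per subject and per --TESTCD, "prior to" means strictly earlier than RFXSTDTC, and all date/time strings are ISO 8601 values of equal precision, so that plain string comparison orders them chronologically. The function name and the dictionary-based record layout are hypothetical.

```python
from collections import defaultdict


def assign_lobxfl(records, rfxstdtc_by_subject, prefix="VS"):
    """Set <prefix>LOBXFL to "Y" on the last non-missing record before RFXSTDTC.

    Assumptions (not stated in SDTM 1.5): the flag is assigned per subject
    and per --TESTCD, and "prior to" means strictly earlier than RFXSTDTC.
    All --DTC and RFXSTDTC values must have the same ISO 8601 precision.
    """
    testcd, orres, dtc, flag = (prefix + s for s in ("TESTCD", "ORRES", "DTC", "LOBXFL"))

    # Group the records by subject and test code.
    groups = defaultdict(list)
    for rec in records:
        rec[flag] = ""  # null unless the record turns out to be the last one
        groups[(rec["USUBJID"], rec[testcd])].append(rec)

    for (usubjid, _), recs in groups.items():
        rfxstdtc = rfxstdtc_by_subject.get(usubjid)  # comes from DM
        if not rfxstdtc:
            continue  # subject was never exposed: no flag at all
        candidates = [
            r for r in recs
            if r.get(orres) and r.get(dtc) and r[dtc] < rfxstdtc
        ]
        if candidates:
            max(candidates, key=lambda r: r[dtc])[flag] = "Y"
    return records
```

Note that under the "strictly earlier" assumption, a measurement taken on the day of first exposure with only a date recorded is never flagged. Another reader of the specification could just as well decide otherwise, which is exactly the problem.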

New in SDTM 1.5 is also the "Domain-Specific Variables for the General Observation Class" (see p. 23 of the specification). Although I understand the reasons for these, SDTM was always sold to us as containing "generic" variables, applicable to all kinds of clinical research data. I never believed in that concept. One of the reasons is surely that SDTM still wants to represent everything as 2-dimensional tables, although we all know that "the world is not flat and neither is clinical data".
As this (fortunately) short list of variables is only in the PDF, it required some extra programming with another copy-and-paste activity.

Unfortunately, the list also contains an error, or at least a serious ambiguity. It states that "EXMETHOD" is such a domain-specific variable, stating "these variables are for use only in a specific domain ...". If we take this literally, this would mean that e.g. EGMETHOD and LBMETHOD are not allowed anymore. Really?
Or was it meant that "--METHOD" may only be used in combination with "EX" in the "Interventions" class? That's my interpretation so far. But specifications shouldn't be open to different interpretations, should they?

A lesser problem is that the SDTM 1.5 specification also contains two new domains which i.m.o. should only appear in the SDTM-IG: "Subject Disease Milestones" and "Trial Disease Milestones" (TM). For the former, I couldn't even find the two-character domain abbreviation, so how could I implement this? I didn't care too much for now, as these two domains do not appear in the SEND-IG 3.1, so I need to wait until the new SDTM-IG is published.

Friday, June 16, 2017

SDTM, XPT and the constitution

Imagine that the constitution of your country would state "cars must be powered by gasoline".
Would you find that acceptable?
Now, we all know that most cars are powered by gasoline, but such a statement in the constitution would give electrical cars no chance at all, even when these are more friendly to the environment.

Something very similar happens at CDISC: the new SDTM Model v.1.6 (so not the Implementation Guide) has been written with only 1 implementation in mind: SAS-XPT format.

Standards models should be developed and published independent of the transport format. A very good example is HL7-FHIR for which there are three technical implementations: XML, JSON and RDF. The documentation has been published in a transport format independent way. It is only when you go to the examples (which you can consider as an Implementation Guide) that you will see something about the transport format.

So, as part of the public review of the SDTM Model v.1.6, I asked the SDTM team to change the text of the model in such a way that it is transport format neutral. This would then allow other transport formats such as XML (e.g. Dataset-XML), JSON and RDF for porting SDTM data in the future.

My request was turned down.
Here is the justification of the SDTM team (snapshot from the JIRA site):

"Considered for future" is the usual expression of the team for "refused".

This answer is indeed a "doom loop": it gives the FDA a reason for further refusing to allow a modern format. When asked about it, they can then say "we can't do that, it is not allowed by the SDTM model".

I have been observing in the last 5-10 years that the SDTM model and standard have evolved in such a way that all first principles have been thrown away, such as avoidance of data redundancy, no derived data, and separation between model and implementation. This makes the standard more and more difficult to implement and ruins data quality. Essentially, one can say that it has been steered into a "dead end".

How can this be changed?
I must honestly say that I do not know the answer. "The train has left the station, but is it on the right track?" is a question that is not even posed within CDISC, and especially not within the SDTM team. Maybe the team needs some strong guidance itself, or responsibilities must be reassigned. There are some bright progressive people within CDISC but these are not involved in SDTM. Maybe it is time to give them the lead in SDTM development.

Friday, April 7, 2017

--LOBXFL can seriously damage your health

The addition of the new variable --LOBXFL (Last Observation before Exposure Flag) in SDTM 1.5 remains a controversial topic (as discussed here and here). According to the definition, --LOBXFL is "operationally derived", but the SDTM 1.5 specification does not say "how" it should be derived. There have been several complaints about this during the review period, but they were waved away with the argument that they "should be addressed in any implementation guide". I am curious ...
My own request to "please provide guidance" was answered by:

 which I don't understand...

Now you may ask why I am so concerned about the addition of this new "derived" variable. Here are some issues:
  • derived variables should not appear in SDTM. Again, the SDTM team has given in on a request from the FDA caused by primitive and immature review tools used by some FDA reviewers
  • baseline flags should not appear in SDTM - they belong to ADaM
  • sponsors should not be asked to do the work of FDA reviewers - the latter have to make their own decisions of which of the data points is "the" baseline data point.
  • Assigning --LOBXFL is used to "camouflage" bad data quality. SDTM datasets with bad data quality should not be used in submissions and should not be accepted by the FDA.
Let me give an example.

The following is a snapshot of a VS dataset with measurements done on the date "2014-01-02" which is also the date of first exposure (according to EX, and to RFXSTDTC in DM - another unnecessary derived variable) using the open source "Smart Dataset-XML Viewer":

(remark that some columns have been swapped for better visibility)

According to the protocol, all vital signs measurements during this visit must be done before first drug intake. So the sponsor assigned the VSLOBXFL to the diastolic blood pressure measurement with the value "76". What the sponsor however doesn't know is that the researcher did the measurement immediately after the intake of the medication. As happens all too often, only the date (for the measurement as well as for the drug intake) was recorded, not the exact time.
Of course, the sponsor could also have assigned VSLOBXFL to all three measurements on the date 2014-01-02, but as the standard does not specify "how" the derivation should be made ...
The same applies to the "PULSE" and "SYSBP" (systolic blood pressure) measurements:


If one inspects the data carefully, one will see that each of the "VSLOBXFL" records shows an increased value for that specific measurement. This increased value may have been caused by the intake of the study drug. However, this is neither visible nor detectable, as no times have been collected (as is very usual) for either the measurement or the drug intake. Even worse, the increased value is marked as the baseline value, which may mean that the reviewer, when looking at later data points, comes to the conclusion that the drug is lowering blood pressure and pulse, whereas it is exactly the opposite...

How does the "Smart Dataset-XML Viewer" deal with such a situation? 

One of the options of the "Smart Dataset-XML Viewer" is:

When using it on a data point that is undoubtedly (as it is on another day) the last measurement (for a specific test code) before first exposure, the viewer highlights the record:

When however the measurement is on the same day as the first exposure, and either the time part of the measurement or of the first exposure is not provided, the "Smart Dataset-XML Viewer" will highlight the record or records and provide a warning:

 I pointed the SDTM team to all this in an additional review comment, which was answered as:

to which I responded:

Also the FDA reviewers have free access to the "Smart Dataset-XML Viewer", so they could use it too. On the other hand, the algorithm can also easily be implemented in SAS or any other modern review software.
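
To illustrate how little is needed, here is a minimal sketch of such a check in Python. It is not the viewer's actual code: the function name is hypothetical, and it assumes that both values carry at least a complete date (YYYY-MM-DD) and that any time parts have the same precision.

```python
def classify_against_first_exposure(dtc, rfxstdtc):
    """Classify a measurement relative to the first exposure.

    Both arguments are ISO 8601 strings such as "2014-01-02" or
    "2014-01-02T08:30". Returns "before", "after" (meaning: at or after)
    or "ambiguous" when the order cannot be decided from the data.
    """
    m_date, _, m_time = dtc.partition("T")
    x_date, _, x_time = rfxstdtc.partition("T")

    if m_date < x_date:
        return "before"
    if m_date > x_date:
        return "after"
    # Same calendar day: only decidable when both time parts were collected.
    if not m_time or not x_time:
        return "ambiguous"
    return "before" if m_time < x_time else "after"
```

A review tool can then highlight the last "before" record per test code, and raise a warning for every "ambiguous" record instead of silently trusting a flag.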

In conclusion, --LOBXFL is not only unnecessary, it also camouflages bad data quality. For reviewers, it is even potentially dangerous to rely on it, as demonstrated above.
With --LOBXFL, it is just waiting for the first patient having his/her health seriously damaged ...

Sunday, March 5, 2017

Validating SDTM labels using RESTful web services

About a month ago, I reported about my first experiences with implementing the new CDISC SDTM-IG Conformance Rules. I now made considerable progress, having >60% of the rules implemented. These implementations are available for download and usage from here.

Today I want to elaborate a bit on how I implemented rule CG0303 "Variable Label = IG Label" using RESTful web services. Earlier implementations by others were based on copying and pasting the labels from the SDTM-IG and then hard-coding them in software. This not only means a lot of work, it is also error-prone, with the disadvantage that a software update is needed each time an error in the implementation is found. For example, if you search the forum of the validation software that the FDA is using for the wording "label mismatch", you will find many hits, especially about false positive errors. In some cases, one even gets an error on a label that looks 100% correct, but the software does not tell you what label text it expects. "Let the guessing begin"!
So we definitely need something better. Wouldn't it be better to use the SHARE content, load it into a central database, and query that database using a modern (easy-to-implement) RESTful web service?

That is exactly what we did. All SDTM-IG information (from different IG versions) and all CDISC controlled terminology that is electronically available was loaded into a database, and RESTful web services were developed to make them available to anyone, and to any application. These RESTful web services (over 30 of them) are described here. Adding a new service usually takes 1-2 hours, sometimes even less.

One of these services allows retrieving all necessary information for a given variable in a given domain for a given SDTM-IG version. The RESTful query string description is: {sdtmigversion}/{domain}/{varname}

which is pretty self-explanatory. For example, to get all the information about the variable ECPORTOT in the domain EC for SDTM-IG 3.2, the query string is:

This service can now easily be used to validate labels in submissions, like in implementations of rule CG0303. Let's do so for a sample SDTM submission.
In our case, the SDTM submission resides in a native XML database (something the FDA SHOULD also do instead of messing around with SAS-XPT datasets). Here is the implementation of rule CG0303 in XQuery, an easy-to-learn language that is both human-readable and machine-executable (so the rules are 100% transparent):

In the first part, the XML namespaces are declared and the location of the define.xml for this submission is set (usually this will be done by passing these as parameters from within the calling application). The base of the RESTful web service is also declared.

Here is the second part:

For each dataset in the submission (by iterating over all the dataset definitions "ItemGroupDef"), we get the domain name either from the dataset name or from the "Domain" attribute in the define.xml (goes into $domain), and then start iterating over all the variables declared for the current dataset:

The variable name is obtained, and the label is taken from the define.xml (note that when using SDTM in XML, the label is in the define.xml and NOT in the dataset itself - which follows the good practice of separating data from metadata). The web service is then called, returning the expected label from the database (which can be SHARE in the future), and the actual and expected labels are compared. Note that for some variables, there will not be a label from the SDTM-IG, as the variable is just not mentioned in the SDTM-IG, although it is allowed for that domain. In that case, there is nothing to compare.

If the two labels do not match, an error (in XML) is returned. An example is:

showing both the actual and the expected label.

As the validation errors ("deviation" or "discrepancy" would in fact be a better word) come in XML, they can (unlike Excel or CSV) be used in many ways, and even ... stored in a native XML database ;-).
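
For readers who do not use XQuery, the same comparison logic can be sketched in Python. This is an illustration only, not a translation of the implementation described above: the function names are hypothetical, the web service call is replaced by a callable passed in by the caller (a small stub is used in the example), and the findings are returned as dictionaries rather than as XML.

```python
def check_labels(domain, define_labels, expected_label_for):
    """Compare define.xml labels with the labels expected by the SDTM-IG.

    define_labels: dict of variable name -> label taken from the define.xml.
    expected_label_for: callable (domain, varname) -> label or None. In a
    real setup this would wrap the RESTful web service call (assumption).
    """
    findings = []
    for varname, actual in define_labels.items():
        expected = expected_label_for(domain, varname)
        if expected is None:
            continue  # variable not mentioned in the IG: nothing to compare
        if actual != expected:
            findings.append({
                "rule": "CG0303",
                "domain": domain,
                "variable": varname,
                "actual": actual,
                "expected": expected,
            })
    return findings


# Stub standing in for the web service, for illustration only.
_IG_LABELS = {
    ("DM", "STUDYID"): "Study Identifier",
    ("DM", "USUBJID"): "Unique Subject Identifier",
}


def stub_lookup(domain, varname):
    return _IG_LABELS.get((domain, varname))
```

Each finding carries both the actual and the expected label, so nobody has to guess what the software wanted to see.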

Sunday, January 8, 2017

Why rule FDAC036 is hypocritical

We all have encountered the message "Variable length is too long for actual data" when validating our SDTM, SEND or ADaM submissions using the "Pinnacle21 Validator". This error message appears when we have generated a SAS file in which a variable has been assigned a length that is (1 byte or more) larger than the length of the longest value for that variable.
For example, if your longest AETERM has 123 characters, and you assigned a (SAS) length of 124, then this error will appear in the validation report - usually causing a lot of panic at the sponsor, as it might lead to a rejection of the submission.

In my prior blog entry, I already showed that SAS Transport 5 (SAS-XPT) is a very inefficient transport format. At the time it was developed, it was meant to enable exchange of data between different SAS systems, one on an IBM mainframe, the other being a VAX computer. Do you still own or use one of these? I don't. The format was quite OK for this purpose those days, but it is still unclear why the FDA selected that format, especially as I showed that CSV (comma separated values) is about 7 times as efficient on the average.
The FDA is mandated by law to be vendor-neutral. Although the specification for the SAS-XPT format is public (the famous TS-140 document), it is rather difficult to implement in non-SAS software, so one cannot say it is really vendor-neutral. So why did the FDA select this format in favor of CSV? Or was CSV not there yet, or couldn't it be read by the programs the FDA was using? If you know it, or can point me to any public literature, please let me know.

The FDA is always complaining about too large files, and that is why they came with the famous rule FDAC036. But isn't that the result of their own choice for the inefficient SAS-XPT format?

But let us also have a look at the SDTM standard itself. Those who have followed the evolution of the standard in the last 15 years or so know that with each new release, the number of variables has increased, thus leading to larger file sizes (also because in SAS-XPT a NULL value takes the same number of bytes as a non-NULL value). Most of these new variables have been added ... on request of the FDA. Even worse is that most of the variables added on request of the FDA contain redundant information. A typical example is the --DY (study day) variable appearing in almost every domain. Its value can easily be calculated (also "on the fly") from the --DTC (date/time of collection) and the reference start date/time (RFSTDTC) in the DM (Demographics) dataset.

So why do we need to add --DY to most of the datasets (with the danger that it is incorrect) whereas it can be calculated "on the fly"? The FDA answer is "in order to facilitate the review process". Does this mean that the review tools of the FDA cannot even do the simplest derivations? It can't be that hard - I added this feature to the open source "Smart Dataset-XML Viewer" in just one evening!
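
As an illustration, the derivation fits in a few lines of Python. This is a sketch, not the viewer's code; the function name is hypothetical. It follows the usual SDTM convention that there is no study day 0: dates on or after the reference start date get the difference in days plus 1, earlier dates get the (negative) difference.

```python
from datetime import date


def study_day(dtc, rfstdtc):
    """Return --DY for a --DTC value, or None if a date is missing or partial.

    Only the date part (first 10 characters) of the ISO 8601 values is used.
    """
    try:
        d = date.fromisoformat(dtc[:10])
        ref = date.fromisoformat(rfstdtc[:10])
    except (TypeError, ValueError):
        return None  # missing value or incomplete date such as "2014-01"
    delta = (d - ref).days
    # No day 0: the reference start date itself is study day 1.
    return delta + 1 if delta >= 0 else delta
```

For example, `study_day("2014-01-16T10:00", "2014-01-02")` gives 15, and the day before the reference start date gives -1.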

Another famous example is the "EPOCH" variable (rule FDAC021) which can normally (in a well designed study) be derived "on the fly" from the --DTC and/or the visit number. But it looks as if the FDA prefers to add an extra variable to account for badly designed studies instead of requiring that studies are well designed.

There are very many variables in SDTM that are unnecessary, and could easily be removed from the standard, as they contain redundant information. Even the --TEST (test name) variable could easily be removed, as it can simply be looked up (again "on the fly") in the define.xml.

In this example, LBTEST has been removed from the dataset, but the tool simply looks it up in the define.xml from the value of LBTESTCD
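
Such a lookup can be sketched as follows in Python, using only the standard library. This is not the tool's actual code: the function name and the codelist OID "CL.LBTESTCD" in the usage example are hypothetical, and it is assumed that the define.xml codelist for the test codes carries the test names as decoded values.

```python
import xml.etree.ElementTree as ET

ODM = "http://www.cdisc.org/ns/odm/v1.3"  # ODM 1.3 namespace used by define.xml
NS = {"odm": ODM}


def decode_map(define_xml_text, codelist_oid):
    """Return a dict mapping coded values to their decodes for one codelist."""
    root = ET.fromstring(define_xml_text)
    mapping = {}
    for codelist in root.iter("{%s}CodeList" % ODM):
        if codelist.get("OID") != codelist_oid:
            continue
        for item in codelist.findall("odm:CodeListItem", NS):
            text = item.find("odm:Decode/odm:TranslatedText", NS)
            if text is not None:
                mapping[item.get("CodedValue")] = (text.text or "").strip()
    return mapping
```

With the mapping in memory, showing the test name for each record is a simple dictionary lookup on LBTESTCD, e.g. `decode_map(xml_text, "CL.LBTESTCD").get("ALB")`.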

I estimate that about 20% of the SDTM variables are redundant, accounting for about 30% of the file size! So even when using the inefficient SAS-XPT format, file sizes could be reduced by about 30% by removing these redundant variables, with the additional advantage of considerably improved data quality (redundancy is a killer for data quality).

Did you ever count how many times the same value for "STUDYID" appears in your submission SAS-XPT datasets? Well, it is in every record, isn't it? The SAS-XPT format requires you to store it millions of times with the same value. Is that efficient? The reason for this is that essentially, the SDTM tables represent a "View" on an SDTM-database, rather than a database itself. In a real database, STUDYID would be stored once in a table with all studies (e.g. for the submission), and all other tables would reference it using a "foreign key", meaning that the other tables do not contain the STUDYID value itself, but a pointer to the value in the "studies" table. Now a pointer uses considerably fewer bytes than a (string) value itself.
The same applies to USUBJID: they are defined once (in DM) and should then be referenced (foreign key) from all other tables (using a pointer). Instead, SAS-XPT requires you to "hardcode" the value of each USUBJID as a string (not as a pointer) in the datasets.
For example, the well-known "LZZT 2013 pilot submission" has 121,749 records in the QS dataset for 306 subjects (an average of 398 records per subject). This QS dataset contains 121,749 times the same STUDYID value (12 bytes) and on an average, 398 times the same value for USUBJID per subject. So on the average, the same value for USUBJID (11 bytes) is hard-coded 398 times in the dataset, instead of using record pointers to DM. What a waste!
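
The waste can be put into numbers with a small calculation, sketched here in Python using only the figures quoted above. The function name is hypothetical, and the size of whatever pointer or key would replace the repeated strings is deliberately left out, as that depends on the storage technology.

```python
def identifier_overhead(records, subjects, studyid_len, usubjid_len):
    """Bytes spent on STUDYID and USUBJID when repeated in every record,
    compared with storing each distinct value exactly once."""
    repeated = records * (studyid_len + usubjid_len)
    stored_once = studyid_len + subjects * usubjid_len
    return {
        "records_per_subject": round(records / subjects),
        "repeated_bytes": repeated,
        "stored_once_bytes": stored_once,
    }


qs = identifier_overhead(records=121_749, subjects=306, studyid_len=12, usubjid_len=11)
print(qs)
```

For the QS dataset this gives 2,800,227 bytes for the repeated identifiers, against 3,378 bytes when each distinct STUDYID and USUBJID value is stored only once.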

Remark that in our "Smart Dataset-XML Viewer", we do use pointers in such a case, in order to save memory, using the principle of "string interning".

But what if we could organize our datasets hierarchically? For example, order by subject and then by visit? So that in each dataset, the value of USUBJID would only appear once? And doesn't the "def:leaf" element in the define.xml already connect the STUDYID with the dataset itself, so that it is unnecessary in the dataset itself? That would be considerably more efficient, wouldn't it?

The former (organization of the data per subject per visit) is exactly what the ODM standard is doing! The new Dataset-XML (based on ODM) doesn't do this: the CDISC development team decided to keep the old "2-dimensional" (but inefficient) representation in order to make it easier for the FDA to make the transition. Organizing the SDTM/SEND/ADaM data in the way ODM does it originally would further make the transport (file) more efficient.

But should all that matter? My colleagues in bioinformatics laugh at me when I tell them about the FDAC036 rule. In their business, the amount of information is much much higher, and they are able to exchange it efficiently, e.g. by using RESTful web services to exactly retrieve what is necessary.
As I already stated in the past, large amounts of data belong in databases, not in files. The file can only be a way of transport of data between applications. Essentially, when a submission arrives at the FDA, it should be immediately stored in a database (could e.g. also be a native XML database), and the reviewers should only be allowed to query such databases - they should not be allowed to mess around with files (XPT or any others). But we are still far from such a "best practice" situation, unfortunately.


Rule FDAC036 forces us to "save on every possible byte" when generating our SAS-XPT datasets, in order to avoid that their sizes become too large (for the gateway?). However, the SAS-XPT format itself is highly inefficient, and file sizes have grown considerably due to ever new requirements of the FDA, adding new redundant (SDTM) variables. Also we are forced to stay working with the highly inefficient two-dimensional representation, with lots of unnecessary repeats of the same information.

And then I did not speak yet about the prohibition by the FDA to submit compressed (zipped) datasets, which would reduce the file sizes by a factor of 20 and more.

It's up to you to decide whether FDA rule 036 is hypocritical or not ...

Sunday, December 18, 2016

How SAS-XPT works (well: inefficient)

In my previous posts and conference contributions, I have already shown that file size should not be an issue when doing electronic submissions to the FDA, as once the data are loaded into a database, the amount of disk space and memory used is independent of what the transport format was. But it looks as if many regulatory reviewers use the CDISC submission files "as is", and do not load them into a database at all. Some of them even seem to use the ancient "SASViewer" as their only analysis tool. The latter does not allow filtering before loading, so a reviewer needs to load the whole SAS-XPT file into memory before starting to filter. This is in contrast to the "Smart Dataset-XML Viewer", which allows filtering before the data are loaded into memory.

This has resulted in the famous rule "FDAC036" stating "Variable length should be assigned based on actual stored data to minimize file size. Datasets should be re-sized to the maximum length of actual data used prior to splitting".

Before writing my next blog "Why rule FDAC036 is hypocritical", I first need to explain how the SAS-XPT format works. In the literature one finds very little information except for the famous TS-140 document "The record layout of SAS data sets in SAS Transport (XPORT) format", which can only be understood by IT specialists who understand how the IBM mainframe number representations work (who still does?).

So let me explain a bit how SAS-XPT works as a format.
Each SAS-XPT file starts with a number of "header" records. These can be seen as containing the metadata of the file contents. The first 6 header records contain general information about the file. Each of these header records is 80 bytes long; when less information is present, the record is padded with blanks. This can be visualized by a punchcard (yes, I used them 40 years ago) that does not have all 80 columns punched:

One can already see that this is not very efficient (and not very modern either).
The next records are all 140 characters long. These are the so-called "NAMESTR" records: each of them contains the metadata of a single variable, like the variable name (max. 8 characters), the label (max. 40 characters), and whether the variable is "numeric" or "char". Again, blanks are added to the fields when there is less information. For example, when the label is "STUDY ID" (8 characters), 32 blanks are added to the field to make it up to the 40 characters defined for the variable label.
The 140-character "NAMESTR" records are put together, and then broken into 80-byte pieces. If that does not exactly fit, additional blanks are added after the last "NAMESTR" record to make the total number of bytes a multiple of 80. So one already sees that with all these added blanks, the "NAMESTR" structure is not very efficient either.
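
The padding rules described so far can be sketched in a few lines of Python. This only illustrates the arithmetic, it is not an XPT writer; the function names are hypothetical, and the sizes (140 bytes per "NAMESTR", 80 bytes per record, 40 characters for the label) are the ones given above.

```python
def pad_field(text, width):
    """Left-justify a text field and fill it with blanks up to a fixed width."""
    if len(text) > width:
        raise ValueError("value does not fit in the field")
    return text.ljust(width)


def namestr_block_size(n_variables, namestr_len=140, record_len=80):
    """Return (raw bytes, padding blanks, total bytes) for the NAMESTR block."""
    raw = n_variables * namestr_len
    padding = -raw % record_len  # blanks needed to reach a multiple of 80
    return raw, padding, raw + padding


label = pad_field("STUDY ID", 40)      # 8 characters plus 32 blanks
print(len(label), label.count(" "))    # the one blank inside "STUDY ID" is counted too
print(namestr_block_size(10))          # dataset with 10 variables
```

For a dataset with 10 variables, the 1400 bytes of "NAMESTR" records need another 40 blanks to fill up 18 records of 80 bytes.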
The first record after all the "NAMESTR" records is the "Observation header". It is an 80-character record, just stating that the actual data come after this record. It looks like:

After the "Observation header" come the data, one record for each row in the dataset. Each of these has the same length, independent of whether it contains much information or not: missing information is replaced by blanks. This makes XPT storage very inefficient. Let me explain with an example. Let us suppose that I have a variable (e.g. "MHTERM") and the longest value is "The quick brown fox jumps over the lazy dog", which is 43 characters. This is also the length declared in the header (rule FDAC036). In the next record, the value is "Hello world" (11 characters). In this record, the value will also take 43 characters, i.e. 32 blanks are added. In the third record, the value is "Yes" (3 characters), so the field is additionally filled with 40 blanks.

This can be visualized as follows:

The "yellow" bytes are bytes that are additionally filled with blanks that do not contain information ("wasted blanks"). One immediately sees that this is not efficient. The efficiency is 100% for this field for the (first) record that contains the longest string, but for the second record, the efficiency already decreases to 26% (11/43), and in the third record it decreases to 7% (3/43). The "overall" efficiency here is 44% (57 of 129 bytes).
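
The "storage efficiency" used here is simply the number of bytes carrying information divided by the number of bytes stored. A minimal Python sketch (the function name is hypothetical, and one byte per character is assumed) reproduces the numbers for the three example values:

```python
def storage_efficiency(values, declared_length):
    """Fraction of the stored bytes of a fixed-width field that carry data."""
    used = sum(len(v.encode("ascii")) for v in values)
    stored = declared_length * len(values)
    return used / stored


values = ["The quick brown fox jumps over the lazy dog", "Hello world", "Yes"]
declared = max(len(v) for v in values)  # 43, as required by rule FDAC036

for v in values:
    print(f"{len(v):2d} of {declared} bytes used: {storage_efficiency([v], declared):.0%}")
print(f"overall: {storage_efficiency(values, declared):.0%}")
```

The same function applied to a whole column of a dataset gives the per-variable percentages reported further on.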

The second thing is that SAS-XPT always stores numeric values using 8 bytes (making no difference between integers and floating point numbers). This is even done when the numbers are small (like the --SEQ or --DY values in SDTM) and could e.g. be handled by a "short" (2 bytes, range from -32,768 to 32,767). This is not such a big deal, as we often use a "too wide" data type for numeric values anyway in practice: define.xml does not have the data type "short" either.

I did an analysis of how efficient XPT storage is on a real example. It uses the well known "LZZT 2013" sample submission. I took the SDTM example, and concentrated on the LB, QS and the SUPPLB files as these are the largest (55, 33 and 33 MB respectively). Although this is a relatively "small" submission, the results of the analysis can and may be extrapolated to larger submissions (the efficiency does not change with adding more data to such a submission). The full file with all results can be obtained on request - just drop me an e-mail.

The results are surprising:


Variable Name   Length   Efficiency (%)
STUDYID         12       100
DOMAIN          2        100
USUBJID         11       100
QSSEQ           8        34
QSTEST          40       65
QSCAT           70       59
QSSCAT          26       53
VISIT           10       40
QSDTC           10       100
QSDY            8        22

One sees that the "storage efficiency" can be as low as 0.3%, but even more important is to notice that the "storage efficiency" for the longest field (QSCAT) does not exceed 60%. The "overall" "storage efficiency" for this file is 52.0%. So one could state that almost 50% of this file consists of (unnecessary?) blanks. Not very efficient indeed.

For LB we find:

Variable Name   Length   Efficiency (%)
STUDYID         12       100
DOMAIN          2        100
USUBJID         11       100
LBSEQ           8        25
LBTEST          200      7
LBCAT           10       60
LBNRIND         200      53
VISIT           19       39
LBDTC           16       100
LBDY            8        23

The "overall" "storage efficiency" is 21%. This result is however biased by the fact that the lengths for LBNRIND and LBTEST were not optimized (both were set to 200). For example, if the length for LBTEST had been set to 40, the "storage efficiency" for LBTEST would go up from 7% to 33%.

For SUPPLB we find:

Variable Name   Length   Efficiency (%)
STUDYID         12       100
USUBJID         11       100
IDVAR           8        63
IDVARVAL        200      1.3
QNAM            8        7
QLABEL          40       76
QVAL            200      1.4
QORIG           200      3.5
QEVAL           200      11

Supplemental qualifier datasets are always very difficult: one does not know in advance what values will go into it (QVAL usually contains a mix of numeric and text values). Therefore, most implementors set the length for QVAL (and often also for QORIG) to the maximum of 200, which was also done in this example. This usually means that the "storage efficiencies" for these variables are extremely low.
However, if we had set the length for QVAL to "3", which is the longest value found (almost all the values are numeric of the form "x.y"), then the efficiency would be near 100%. As QVAL can contain anything, we have however also seen very low "storage efficiencies", for example when one record has a QVAL value of 180 characters (so the length must be set to 180), and all other QVAL values are very short, e.g. "Y" for a xxCLSIG supplemental qualifier.
As such, and due to the vertical structure, supplemental qualifier datasets must be regarded as being very inefficient, at least when implemented as SAS-XPT. We have noticed that the equivalents in CDISC Dataset-XML format very often have considerably lower file sizes than the SAS-XPT representation of supplemental qualifier datasets. So using XML does not always mean larger files.

But what would have been the alternatives 20 years ago when the FDA decided for SAS-XPT?
The first and simplest would of course have been to use CSV (comma-separated values). Therefore, we did a simple test. We transformed the lb.xpt file into CSV. We found that the file size decreased from 66 MB (lb.xpt) to 9 MB (lb.csv). This means a 7-fold decrease in size! So why didn't the FDA select the simple "comma-separated values" format 20 years ago (before XML was present) as a submission format instead of SAS-XPT? CSV is really vendor-neutral (I do not consider SAS-XPT vendor-neutral as it requires almost extinct IT skills in order to implement it in anything other than SAS).

I do not know the answer to the question, maybe someone can tell me?

If the FDA insisted on having a binary format (XML is a text-based format), then why didn't they develop a format which I usually call the "VARCHAR" format? This kind of format is also used in DICOM, the worldwide standard for exchange of images in the medical world. Picking up our "quick brown fox" example again, it works as follows:

Like most binary formats, it is a continuous format, so there are no "line breaks" or the like.
For each variable, the first byte contains the length of the variable value (e.g. 43), and the following N (in this case 43) bytes contain the value itself. This is immediately followed by the length definition of the following field (in this case 3), followed by the value itself, etc.
Like this, the storage is extremely efficient, much more efficient than the SAS-XPT format. Why did the FDA never consider such an efficient format?
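
The length-prefixed layout described above can be sketched in a few lines. This is only an illustration under stated assumptions: one length byte per field (so at most 255 bytes per value), ASCII text only, and the 43-character value is assumed to be the well-known "quick brown fox" sentence. The function names are hypothetical, and this is not how DICOM itself encodes its lengths in detail.

```python
def encode_record(values):
    """Encode text values as: 1 length byte, then that many value bytes."""
    out = bytearray()
    for value in values:
        data = value.encode("ascii")
        if len(data) > 255:
            raise ValueError("one length byte can describe at most 255 bytes")
        out.append(len(data))   # length prefix
        out.extend(data)        # the value itself, no padding
    return bytes(out)


def decode_record(buffer):
    """Walk the buffer field by field, using each length byte to find the next."""
    values = []
    pos = 0
    while pos < len(buffer):
        length = buffer[pos]
        pos += 1
        end = pos + length
        if end > len(buffer):
            raise ValueError("truncated field")
        values.append(buffer[pos:end].decode("ascii"))
        pos = end
    return values


record = ["The quick brown fox jumps over the lazy dog", "1.2", ""]
encoded = encode_record(record)

print(len(encoded))            # bytes needed for the whole record
print(decode_record(encoded))  # round trip back to the original values
```

A missing value costs exactly one byte (a length of 0), and no trailing blanks are ever stored, whereas a fixed-width layout would reserve the full declared length for each of the three fields.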

The best alternative nowadays is however still XML, at least if one wants to keep the 2-dimensional structure of SDTM (which I will question in my next blog). In CDISC Dataset-XML, no "trailing blanks" are ever added, and missing ("NULL") values are just not in the file. File sizes are however usually still larger than for XPT, due to the "tags" describing what the value is about. XML files can however easily be compressed (e.g. as "zip", or "tar" or even as "tar.gz", the latter a format the FDA is using anyway), to less than 3% of the original file size, and the resulting files do NOT need to be decompressed to be read by modern software such as the "Smart Dataset-XML" software.
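
As a minimal sketch of what "reading without decompressing" can look like in practice, the following reads gzip-compressed XML as a stream and counts the records, without ever writing an uncompressed file. The element and attribute names (`Dataset`, `Record`, `Item`) are hypothetical simplifications for this example and are not the real Dataset-XML schema; the sample data is generated on the spot.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Hypothetical, highly simplified XML with 1000 supplemental qualifier records
xml_text = (
    "<Dataset>"
    + "".join(
        f'<Record seq="{i}"><Item name="QNAM" value="LBCLSIG"/>'
        f'<Item name="QVAL" value="Y"/></Record>'
        for i in range(1, 1001)
    )
    + "</Dataset>"
)
raw = xml_text.encode("utf-8")
compressed = gzip.compress(raw)


def count_records(gz_bytes):
    """Stream-parse gzip-compressed XML; decompression happens on the fly."""
    count = 0
    with gzip.open(io.BytesIO(gz_bytes), "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag == "Record":
                count += 1
                elem.clear()  # free memory for records already handled
    return count


record_count = count_records(compressed)
ratio = len(compressed) / len(raw)
print(record_count, f"{100 * ratio:.1f}% of original size")
```

Note that this artificial sample is far more repetitive than real study data, so the compression ratio it prints says nothing about real submissions; the point is only that the parser works directly on the compressed stream.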

But anyway, what are we talking about? The usual storage cost of a complete submission study is far below 1 US$ (storage cost of 1GB is US$ 0.02 in 2016).

When loaded into a database, the efficiency of the transport file does not matter at all, with the additional advantage that databases can be indexed to be able to work (and thus review) much faster.
But many of the FDA reviewers seem not to be able to use any of these modern technologies, and thus require us to implement rule FDAC036, this because the FDA made a bad decision in the choice for a transport format in the past.

So, up to my next blog: "Why rule FDAC036 is hypocritical".

Tuesday, October 4, 2016

SDTM-IG in machine-readable format

Each time a new version of the SDTM or SDTM-IG is published, people suffer. The FDA suffers as it needs to adapt its database, software vendors suffer because they need to generate new templates, builders of validation engines suffer because they must make an interpretation of the rules, and all of that starting from a ... PDF document.

If I look at the SDTM-IG, I see a lot of structured information. Sections like "Description/Overview", "Specification", "Assumptions", "Examples" appear for each described domain. "Structured" means that it must be possible to put the information in an XML document, and make good parts of it machine-readable, so that tasks such as template building, rule generation and database setup can be automated.
Using the PDF document only, lots of copy-and-paste needs to be done, leading to a lot of frustration and errors. Fortunately, the situation has somewhat improved, as parts of the IGs can now be downloaded in the form of Excel files and even a define.xml. But even then, lots of information is still only available as PDF text or tables, not to speak of the business rules that later go into validation tools (through a very subjective interpretation step) that are out of the control of CDISC.

So I started a very first attempt to do something about that. It is still very primitive, but may be a starting point for a more serious attempt. I limited myself to the EX and EC domains in the Interventions class, which can be found in the "Section 6.1 - EX and EC Domains" portfolio of the SDTM-IG 3.2 PDF document.
I started with the highly structured information. In the XML it is:

It contains all the information one can also find in the SDTM-IG, but also some additional information like the recommended Define-XML datatype (the IG only mentions the SAS-XPT datatype and mixes up controlled terminology and controlled format). For the controlled terminology, it explicitly states the type and ID of the CDISC-NCI codelist, e.g.:

and if the codelist is "sponsor defined":

All this is machine-readable, but we of course also want to have this information human-readable. So I wrote a very simple stylesheet which mimics the PDF pretty well. Applying it to the "specification" part gives the following "human-readable" view:

looking extremely similar to what is in the PDF:

For some of the variables, we can also add a "rule", like:

which is again machine-readable.

For the "Assumptions", I took a similar approach, e.g.:

translated by the stylesheet into the human-readable view again as:

again very similar to what is seen in the PDF (but the stylesheet can be further improved here).

The "Examples" part is mostly narrative text and some tables, without fixed structure, so I decided to allow embedding XHTML in my document. This is exactly the same as what HL7-FHIR is doing (and what we intend to allow for CDISC ODM 2.0), i.e.:

translated into the "human-readable" view by the stylesheet:

What still needs to be done

I haven't added all information yet (for EX and EC) to the XML structure, and I still need to complete the examples (there is a good number of them). Also, I want to further improve the stylesheet.
When everything is at an acceptable level, I will publish the XML, the stylesheet and the resulting HTML for download. So... stay tuned!

It must however be clear that it is very well possible to have the SDTM-IG (or at least a good part of it) in a machine-readable format, so that some tasks that are now cumbersome, as they are based on copy-and-paste from PDF files, can be automated.
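
To give an impression of the kind of automation this enables, here is a minimal sketch that reads a machine-readable domain specification and derives a list of required variables and codelist checks from it. The XML structure shown (element and attribute names) is invented for this sketch and is not the structure of my actual XML document; the variable rows are illustrative only and should be checked against the IG before any real use.

```python
import xml.etree.ElementTree as ET

# Hypothetical machine-readable specification; element and attribute names
# are invented for this sketch, and the rows are illustrative only.
SPEC_XML = """
<Domain code="EX" class="Interventions">
  <Variable name="STUDYID" label="Study Identifier" datatype="text" core="Req"/>
  <Variable name="EXTRT" label="Name of Treatment" datatype="text" core="Req"/>
  <Variable name="EXDOSE" label="Dose" datatype="float" core="Exp"/>
  <Variable name="EXDOSU" label="Dose Units" datatype="text" core="Exp" codelist="UNIT"/>
</Domain>
"""


def load_variables(spec_xml):
    """Return one dict of attributes per Variable element."""
    root = ET.fromstring(spec_xml)
    return [dict(var.attrib) for var in root.iter("Variable")]


def required_variables(variables):
    """Names of variables that a template must always contain."""
    return [v["name"] for v in variables if v.get("core") == "Req"]


def codelist_checks(variables):
    """Map each variable bound to a codelist onto that codelist's ID."""
    return {v["name"]: v["codelist"] for v in variables if "codelist" in v}


variables = load_variables(SPEC_XML)
print(required_variables(variables))
print(codelist_checks(variables))
```

With the specification in such a form, template building and simple validation checks become a matter of a few lines of code instead of copy-and-paste from a PDF.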