Sunday, December 18, 2016

How SAS-XPT works (well: inefficiently)

In my previous posts and conference contributions, I have already shown that file size should not be an issue when doing electronic submissions to the FDA: once the data are loaded into a database, disk space and memory usage are independent of what the transport format was. But it looks as if many regulatory reviewers use the CDISC submission files "as is", and do not load them into a database at all. Some of them even seem to use the ancient "SASViewer" as their only analysis tool. The latter does not allow filtering before loading, so a reviewer needs to load the whole SAS-XPT file into memory before any filtering can start. This is in contrast to the "Smart Dataset-XML Viewer", which allows filtering before the data are loaded into memory.

This has resulted in the famous rule "FDAC036" stating "Variable length should be assigned based on actual stored data to minimize file size. Datasets should be re-sized to the maximum length of actual data used prior to splitting".

Before writing my next blog "Why rule FDAC036 is hypocritical" I first need to explain how the SAS-XPT format works. In the literature one finds very little information except for the famous TS-140 document "The record layout of SAS data sets in SAS Transport (XPORT) format", which can only be understood by IT specialists who understand how the IBM mainframe number representations work (who still does?).

So let me explain a bit how SAS-XPT works as a format.
Each SAS-XPT file starts with a number of "header" records. These can be seen as containing the metadata of the file contents. The first 6 header records contain general information about the file. Each of these header records is 80 bytes long; when less information is present, the record is padded with blanks. This can be visualized by a punchcard (yes, I used them 40 years ago) that does not have all 80 columns punched:

One can already see that this is not very efficient (and not very modern either).
The next records are all 140 bytes long. These are the so-called "NAMESTR" records: each of them contains the metadata of a single variable, like the variable name (max. 8 characters), the label (max. 40 characters), and whether the variable is "numeric" or "char". Again, blanks are added to the fields when there is less information. For example, when the label is "STUDY ID" (8 characters), 32 blanks are added to the field to make it up to the 40 characters defined for the variable label.
The 140-byte "NAMESTR" records are put together, and then broken into 80-byte pieces. If that does not fit exactly, additional blanks are added after the last "NAMESTR" record to make the total number of bytes a multiple of 80. So one already sees that with all these added blanks, the "NAMESTR" structure is not very efficient either.
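The padding arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the 80-byte and 140-byte rules as explained here, not a working XPT writer; the function names `pad_field` and `namestr_block_size` are made up for this example.

```python
RECORD_LEN = 80     # every physical record in the file is 80 bytes
NAMESTR_LEN = 140   # one NAMESTR record per variable


def pad_field(value, width):
    """Left-justify a text field and fill it with blanks up to `width` bytes."""
    encoded = value.encode("ascii")
    if len(encoded) > width:
        raise ValueError("value does not fit in a field of %d bytes" % width)
    return encoded.ljust(width, b" ")


def namestr_block_size(n_variables):
    """Return (total bytes, padding bytes) of the NAMESTR block for n variables."""
    raw = n_variables * NAMESTR_LEN
    padding = (-raw) % RECORD_LEN   # blanks needed to reach a multiple of 80
    return raw + padding, padding


label = pad_field("STUDY ID", 40)
trailing_blanks = len(label) - len(label.rstrip())

total, padding = namestr_block_size(10)
print(trailing_blanks, total, padding)
```

For a dataset with 10 variables, the 1400 bytes of NAMESTR data need 40 extra blanks to fill 18 complete 80-byte records.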
The first record after all the "NAMESTR" records is the "Observation header". It is an 80-character record, just stating that the actual data come after this record. It looks like:

After the "Observation header" come the data, one record for each row of the dataset. Each of these records has the same length, regardless of how much information it contains: missing information is replaced by blanks. This makes XPT storage very inefficient. Let me explain with an example. Let us suppose that I have a variable (e.g. "MHTERM") and the longest value is "The quick brown fox jumps over the lazy dog", which is 43 characters. This is also the length declared in the header (rule FDAC036). In the next record, the value is "Hello world" (11 characters). In this record, the value will also take 43 characters, i.e. 32 blanks are added. In the third record, the value is "Yes" (3 characters), so the field is additionally filled with 40 blanks.

This can be visualized as follows:

The "yellow" bytes are bytes that are additionally filled with blanks that do not contain information ("wasted blanks"). One immediately sees that this is not efficient. The efficiency is 100% for this field for the (first) record that contains the longest string, but for the second record, the efficiency already decreases to 26% (11/43), and in the third record it decreases to 7% (3/43). The "overall" efficiency here is 44% (57 used bytes out of 3 x 43 = 129).
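The efficiency figures for the "quick brown fox" example can be reproduced with a small Python sketch. The function name `storage_efficiency` is only chosen for this illustration; it simply divides the number of bytes that carry information by the number of bytes a fixed-width field occupies.

```python
def storage_efficiency(values, declared_length=None):
    """Return (per-record efficiencies, overall efficiency) for one text variable.

    If no declared length is given, the longest value is used,
    which is what rule FDAC036 asks for.
    """
    lengths = [len(v) for v in values]
    width = declared_length if declared_length is not None else max(lengths)
    per_record = [n / width for n in lengths]
    overall = sum(lengths) / (width * len(values))
    return per_record, overall


values = [
    "The quick brown fox jumps over the lazy dog",
    "Hello world",
    "Yes",
]
per_record, overall = storage_efficiency(values)
for value, efficiency in zip(values, per_record):
    print(f"{len(value):2d} of 43 bytes used: {efficiency:.0%}")
print(f"overall: {overall:.0%}")
```

Passing `declared_length=200` instead shows what happens when the variable length is left at the maximum, as in some of the real datasets discussed further on.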

The second thing is that SAS-XPT always stores numeric values using 8 bytes (making no difference between integers and floating point numbers). This is even done when the numbers are small (like the --SEQ or --DY values in SDTM) and could e.g. be taken care of by a "short" (2 bytes, range from -32,768 to 32,767). This is not such a big deal, as in practice we often use a "too wide" data type for numeric values anyway: define.xml does not have the data type "short" either.
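To make the "8 bytes versus 2 bytes" point concrete, here is a minimal Python sketch using the standard `struct` module. Please note the assumption: `struct` writes IEEE 754 doubles, whereas SAS-XPT uses the IBM mainframe floating point representation, so this only illustrates the field widths, not the actual XPT byte layout.

```python
import struct

seq = 12  # a typical small --SEQ value

as_double = struct.pack(">d", float(seq))  # 8-byte floating point number
as_short = struct.pack(">h", seq)          # 2-byte signed integer

print(len(as_double), len(as_short))

# A "short" cannot hold values outside -32,768 .. 32,767
try:
    struct.pack(">h", 40000)
    fits = True
except struct.error:
    fits = False
print(fits)
```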

I did an analysis of how efficient XPT storage is on a real example. It uses the well-known "LZZT 2013" sample submission. I took the SDTM example, and concentrated on the LB, QS and SUPPLB files as these are the largest (55, 33 and 33 MB respectively). Although this is a relatively "small" submission, the results of the analysis can and may be extrapolated to larger submissions (the efficiency does not change when more data are added to such a submission). The full file with all results can be obtained on request - just drop me an e-mail.

The results are surprising:


Variable Name   Length   Efficiency (%)
STUDYID         12       100
DOMAIN          2        100
USUBJID         11       100
QSSEQ           8        34
QSTEST          40       65
QSCAT           70       59
QSSCAT          26       53
VISIT           10       40
QSDTC           10       100
QSDY            8        22

One sees that the "storage efficiency" can be as low as 0.3%, but even more important is to notice that the "storage efficiency" for the longest field (QSCAT) does not exceed 60%. The "overall" "storage efficiency" for this file is 52.0%. So one could state that almost 50% of this file consists of (unnecessary?) blanks. Not very efficient indeed.

For LB we find:

Variable Name   Length   Efficiency (%)
STUDYID         12       100
DOMAIN          2        100
USUBJID         11       100
LBSEQ           8        25
LBTEST          200      7
LBCAT           10       60
LBNRIND         200      53
VISIT           19       39
LBDTC           16       100
LBDY            8        23

The "overall" "storage efficiency" is 21%. This result is however biased by the fact that the lengths for LBNRIND and LBTEST were not optimized (both were set to 200). For example, if the length for LBTEST had been set to 40, the "storage efficiency" for LBTEST would go up from 7% to 33%.

For SUPPLB we find:

Variable Name   Length   Efficiency (%)
STUDYID         12       100
USUBJID         11       100
IDVAR           8        63
IDVARVAL        200      1.3
QNAM            8        7
QLABEL          40       76
QVAL            200      1.4
QORIG           200      3.5
QEVAL           200      11

Supplemental qualifier datasets are always very difficult: one does not know in advance what values will go into it (QVAL usually contains a mix of numeric and text values). Therefore, most implementors set the length for QVAL (and often also for QORIG) to the maximum of 200, which was also done in this example. This usually means that the "storage efficiencies" for these variables are extremely low.
However, if we had set the length for QVAL to "3", which is the longest value found (almost all the values are numeric of the form "x.y"), then the efficiency would be near 100%. As QVAL can contain anything, we have however also seen very low "storage efficiencies", for example when one record has a QVAL value of 180 characters (so the length must be set to 180), and all other QVAL values are very short, e.g. "Y" for an xxCLSIG supplemental qualifier.
As such, and due to the vertical structure, supplemental qualifier datasets must be regarded as being very inefficient, at least when implemented as SAS-XPT. We have noticed that the equivalents in CDISC Dataset-XML format very often have considerably lower file sizes than the SAS-XPT representation of supplemental qualifier datasets. So using XML does not always mean larger files.

But what would have been the alternatives 20 years ago when the FDA decided for SAS-XPT?
The first and simplest would of course have been to use CSV (comma-separated values). Therefore, we did a simple test. We transformed the lb.xpt file into CSV. We found that the file size decreased from 66 MB (lb.xpt) to 9 MB (lb.csv). This means a 7-fold decrease in size! So why didn't the FDA select the simple "comma-separated values" format as a submission format 20 years ago (before XML was present) instead of SAS-XPT? CSV is really vendor-neutral (I do not consider SAS-XPT vendor-neutral, as it requires almost extinct IT skills in order to implement it in anything other than SAS).
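The effect can be reproduced on a toy scale with Python's standard `csv` module. In the sketch below the three rows are invented for illustration; only the field widths (12, 2, 200, 200 for STUDYID, DOMAIN, LBTEST, LBNRIND) are taken from the LB table above. It is not the conversion that was used for the real lb.xpt file.

```python
import csv
import io

# Invented sample rows: STUDYID, DOMAIN, LBTEST, LBNRIND
rows = [
    ("CDISCPILOT01", "LB", "Alanine Aminotransferase", "NORMAL"),
    ("CDISCPILOT01", "LB", "Albumin", "NORMAL"),
    ("CDISCPILOT01", "LB", "Glucose", "HIGH"),
]
widths = (12, 2, 200, 200)


def fixed_width_size(rows, widths):
    """Bytes needed when every field is blank-padded to its declared width."""
    return len(rows) * sum(widths)


def csv_size(rows):
    """Bytes needed when the same rows are written as comma-separated values."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return len(buffer.getvalue().encode("utf-8"))


fixed = fixed_width_size(rows, widths)
as_csv = csv_size(rows)
print(fixed, as_csv, round(fixed / as_csv, 1))
```

The exact ratio depends entirely on how much shorter the real values are than the declared lengths, which is why non-optimized lengths of 200 weigh so heavily.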

I do not know the answer to the question, maybe someone can tell me?

If the FDA insisted on having a binary format (XML is a text-based format), then why didn't they develop a format which I usually call the "VARCHAR" format? This kind of format is also used in DICOM, the worldwide standard for exchange of images in the medical world. Picking up our "quick brown fox" example again, it works as follows:

Just like most binary formats, it is a continuous format, so there are no "line breaks" or the like.
For each variable, the first byte contains the length of the variable value (e.g. 43), and the following N (in this case 43) bytes contain the value itself. This is immediately followed by the length definition of the following field (in this case 3), followed by the value itself, etc.
Like this, the storage is extremely efficient, much more efficient than the SAS-XPT format. Why did the FDA never consider such an efficient format?
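A length-prefixed layout of this kind is easy to sketch in Python. The code below is an illustration of the idea as described above (one length byte followed by the value), with the made-up function names `encode_fields` and `decode_fields`; it is not the DICOM encoding itself, and the single length byte limits a value to 255 bytes.

```python
def encode_fields(values):
    """Write each value as one length byte followed by the value bytes."""
    out = bytearray()
    for value in values:
        data = value.encode("utf-8")
        if len(data) > 255:
            raise ValueError("a single length byte cannot describe more than 255 bytes")
        out.append(len(data))
        out += data
    return bytes(out)


def decode_fields(blob):
    """Read the values back by following the length bytes."""
    values = []
    position = 0
    while position < len(blob):
        length = blob[position]
        start = position + 1
        values.append(blob[start:start + length].decode("utf-8"))
        position = start + length
    return values


values = ["The quick brown fox jumps over the lazy dog", "Hello world", "Yes"]
blob = encode_fields(values)
print(len(blob), "bytes instead of", 3 * 43)
```

The three values take 57 bytes plus 3 length bytes, against 129 bytes in the fixed-width layout, and no blanks are stored at all.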

The best alternative nowadays is however still XML, at least if one wants to keep the 2-dimensional structure of SDTM (which I will question in my next blog). In CDISC Dataset-XML, no "trailing blanks" are ever added, and missing ("NULL") values are just not in the file. File sizes are however usually still larger than for XPT, due to the "tags" describing what the value is about. XML files can however easily be compressed (e.g. as "zip", or "tar" or even as "tar.gz", the latter a format the FDA is using anyway), to file sizes less than 3% of the original, and the resulting files need NOT be decompressed to be read by modern software such as the "Smart Dataset-XML Viewer".
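Reading a compressed XML file without first unpacking it to disk can be sketched with Python's standard `gzip` and `xml.etree.ElementTree` modules. Note the assumptions: the `<Data>` and `<Item>` elements below are invented placeholders and not the real Dataset-XML structure, and this is not how the viewer itself is implemented. The decompression happens on the fly while the parser streams through the data.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Build a small, repetitive XML document (placeholder element names)
items = "".join(f'<Item name="QVAL" value="1.{i % 10}"/>' for i in range(1000))
xml_bytes = f"<Data>{items}</Data>".encode("utf-8")

compressed = gzip.compress(xml_bytes)


def count_items(gz_bytes):
    """Stream through gzip-compressed XML and count the Item elements."""
    count = 0
    with gzip.open(io.BytesIO(gz_bytes), "rb") as stream:
        for _event, element in ET.iterparse(stream, events=("end",)):
            if element.tag == "Item":
                count += 1
                element.clear()  # free memory of elements already handled
    return count


print(len(xml_bytes), len(compressed), count_items(compressed))
```

Because XML tags repeat so often, they compress very well; the ratio obtained on this toy document says nothing about real submissions, though.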

But anyway, what are we talking about? The usual storage cost of a complete submission study is far below 1 US$ (storage cost of 1GB is US$ 0.02 in 2016).

When loaded into a database, the efficiency of the transport file does not matter at all, with the additional advantage that databases can be indexed to be able to work (and thus review) much faster.
But many of the FDA reviewers seem not to be able to use any of these modern technologies, and thus require us to implement rule FDAC036, this because the FDA made a bad decision in the choice for a transport format in the past.
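As an illustration of the database argument, here is a minimal sketch using SQLite from the Python standard library. The table layout, the index name and the three rows are invented for this example; the point is only that values are stored without any trailing blanks and that an index lets the database find one subject's records without scanning the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lb (usubjid TEXT, lbtestcd TEXT, lbstresn REAL)")
conn.executemany(
    "INSERT INTO lb VALUES (?, ?, ?)",
    [("1015", "ALT", 27.0), ("1015", "AST", 31.0), ("1023", "ALT", 45.0)],
)

# The index is what makes per-subject review queries fast on large tables
conn.execute("CREATE INDEX idx_lb_subject ON lb (usubjid)")

rows = conn.execute(
    "SELECT lbtestcd, lbstresn FROM lb WHERE usubjid = ? ORDER BY lbtestcd",
    ("1015",),
).fetchall()

# Ask SQLite how it would execute the lookup
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT lbtestcd FROM lb WHERE usubjid = ?",
    ("1015",),
).fetchall()

print(rows)
print(plan)
```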

So, up to my next blog: "Why rule FDAC036 is hypocritical".

Tuesday, October 4, 2016

SDTM-IG in machine-readable format

Each time a new version of the SDTM or SDTM-IG is published, people suffer. The FDA suffers as it needs to adapt its database, software vendors suffer because they need to generate new templates, builders of validation engines suffer because they must make an interpretation of the rules, and that all starting from a ... PDF document.

If I look at the SDTM-IG, I see a lot of structured information. Sections like "Description/Overview", "Specification", "Assumptions", "Examples" appear for each described domain. "Structured" means that it must be possible to put the information in an XML document, and make good parts of it machine-readable, so that tasks such as template building, rule generation and database setup can be automated.
Using the PDF document only, lots of copy-and-paste needs to be done, leading to a lot of frustration and error. Fortunately, the situation has somewhat improved, as parts of the IGs can now be downloaded in the form of Excel files and even a define.xml. But even then, lots of information is still only available as PDF text or tables, not to speak of the business rules that later go into validation tools (via a very subjective interpretation step) that are out of the control of CDISC.

So I started a very first attempt to do something about that. It is still very primitive, but may be a starting point for a more serious attempt. I limited myself to the EX and EC domains in the Interventions class, which can be found in the "Section 6.1 - EX and EC Domains" portfolio of the SDTM-IG 3.2 PDF document.
I started with the highly structured information. In the XML it is:

It contains all the information one can also find in the SDTM-IG, but also some additional information like the recommended Define-XML datatype (the IG only mentions the SAS-XPT datatype and mixes up controlled terminology and controlled format). For the controlled terminology, it explicitly states the type and ID of the CDISC-NCI codelist, e.g.:

and if the codelist is "sponsor defined":

All this is machine-readable, but we of course also want to have this information human-readable. So I generated a very simple stylesheet which mimics the PDF pretty well. Applying it to the "specification" part gives the following "human-readable" view:

looking extremely similar to what is in the PDF:

Also, for some of the variables, we can also add a "rule", like:

which is again machine-readable.
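To show what "machine-readable" buys us, here is a small Python sketch that turns such a specification into a dataset template. Please note that the XML below is purely hypothetical: the element and attribute names (`DomainSpecification`, `Variable`, `CodeListRef`, `DataType`, ...) are invented for this illustration and are not the actual structure of my XML document.

```python
import xml.etree.ElementTree as ET

# Hypothetical specification fragment; all element/attribute names are assumptions
SPEC = """
<DomainSpecification Domain="EX">
  <Variable Name="EXTRT" DataType="text"/>
  <Variable Name="EXDOSE" DataType="float"/>
  <Variable Name="EXDOSU" DataType="text">
    <CodeListRef Type="CDISC-NCI" ID="UNIT"/>
  </Variable>
</DomainSpecification>
"""


def build_template(spec_xml):
    """Return one dict per variable: name, datatype and codelist (if any)."""
    root = ET.fromstring(spec_xml)
    template = []
    for variable in root.findall("Variable"):
        codelist = variable.find("CodeListRef")
        template.append({
            "name": variable.get("Name"),
            "datatype": variable.get("DataType"),
            "codelist": codelist.get("ID") if codelist is not None else None,
        })
    return template


template = build_template(SPEC)
for entry in template:
    print(entry)
```

A software vendor could generate database tables or mapping templates from such a list, instead of copying the same information by hand out of the PDF.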

For the "Assumptions", I took a similar approach, e.g.:

translated by the stylesheet into the human-readable view again as:

again very similar to what is seen in the PDF (but the stylesheet can be further improved here).

For the "Examples" part, this is mostly narrative text and some tables, without fixed structure, so I decided to allow XHTML to be embedded in my document. This is exactly the same as what HL7-FHIR is doing (and what we intend to allow for CDISC ODM 2.0), i.e.:

translated into the "human-readable" view by the stylesheet:

What still needs to be done

I haven't added all information yet (for EX and EC) into the XML structure, and still need to complete the examples (there is a good number of them). Also, I want to further improve the stylesheet.
When everything is at an acceptable level, I will publish the XML, the stylesheet and the resulting HTML for download. So... stay tuned!

It must however be clear that it is perfectly possible to have the SDTM-IG (or at least a good part of it) in a machine-readable format, making it possible to automate some tasks that are now cumbersome, as they are based on copy-and-paste from PDF files.

Thursday, September 8, 2016

--LOBXFL: a follow up

Recently, my French colleague Thierry Lambert noticed "It is often not possible to derive EPOCH in a generic manner: at baseline, you assign "SCREENING" to observations done on the day of treatment start because you know from the protocol and the CRF they were done before the first drug intake. The dates being the same, you cannot assign EPOCH automatically in this kind of case."

The same essentially applies to --LOBXFL: If you only know the date of the observation (without the time), and the observation is on the same date as the first exposure/treatment, you cannot know whether the observation was before or after the first exposure. So, it would essentially be impossible to assign a value of --LOBXFL at all, unless you know that the protocol stated that the observations had to be made before the first exposure, and you trust that the investigator did exactly as is stated in the protocol. However ... trust is good, control is better ...

So, I decided to slightly change the algorithm for assigning baseline "last observation before exposure" records in the "Smart Dataset-XML Viewer". When the last observation before exposure is clearly before the first exposure, either because the time part is available for both, or because the observation was on an earlier day than the first exposure, then all is safe, and the record is marked as "last observation before exposure".

If however, we do not have the time of the observation, and the date is the same as the first exposure date, this means that we cannot be 100% sure that the observation was before the first exposure. In that case, we mark the record using a different color, and provide a tooltip providing a warning.
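The decision logic described above can be sketched in Python as follows. This is my own minimal reading of the description and not the actual source code of the viewer; the function names and the three status strings ("before", "after", "uncertain") are invented for the illustration, and the dates in the usage lines are made up.

```python
from datetime import datetime


def parse_iso(dtc):
    """Return (datetime, has_time) for 'YYYY-MM-DD' or 'YYYY-MM-DDThh:mm[:ss]'."""
    return datetime.fromisoformat(dtc), "T" in dtc


def classify(obs_dtc, first_exposure_dtc):
    """Classify an observation relative to the first exposure."""
    obs, obs_has_time = parse_iso(obs_dtc)
    exposure, exposure_has_time = parse_iso(first_exposure_dtc)
    if obs.date() < exposure.date():
        return "before"      # earlier day: safe
    if obs.date() > exposure.date():
        return "after"
    # Same day: only decidable when both carry a time part
    if obs_has_time and exposure_has_time:
        return "before" if obs < exposure else "after"
    return "uncertain"       # mark with another color and a warning


def last_observation_before_exposure(obs_dtcs, first_exposure_dtc):
    """Return (dtc, status) of the last candidate record, or None."""
    candidates = [(dtc, classify(dtc, first_exposure_dtc)) for dtc in obs_dtcs]
    candidates = [c for c in candidates if c[1] != "after"]
    if not candidates:
        return None
    return max(candidates, key=lambda c: parse_iso(c[0])[0])


print(classify("2013-12-26", "2014-01-02"))
print(last_observation_before_exposure(
    ["2013-12-26", "2014-01-02", "2014-01-16"], "2014-01-02"))
```

A record that comes back as "before" can be highlighted as usual, one that comes back as "uncertain" gets the warning color and tooltip.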

Let us take an example. First we inspect the DM dataset:

For subject 1015, we see that the date of first exposure is "2014-01-02". This indeed corresponds exactly with the earliest exposure record for that subject in EX.

Let us now take some laboratory (LB) records:

We see that some "last observation before treatment" records have automatically been assigned. For both in this picture, the observation date(time) is clearly before the date of first exposure, and in both cases the following record is clearly after the first exposure date. So this is safe.
Let us also have a look at the vital signs (VS) dataset:

Also here, for the last "HEIGHT" measurement before first treatment, the assignment is safe, as the measurement was done a week before first treatment. A bit further we do however find:

We notice that the pulse rate measurements were performed on the same date as first treatment, but as no time part is given, we do not know exactly when. We could suppose that they were done before the first exposure, but we cannot be 100% sure. Even when the first exposures were exactly registered (including a time part), we still could not be 100% sure, as the VSDTCs all have a missing time part. So in this case, we are not safe.
So the software treats this differently: another background color is assigned, and a warning tooltip is provided.

Can the FDA tools do this?

I am uploading the new executables as well as the source code to SourceForge later today. Please feel free to use the software, use the source code in any way you would like. It's Open Source!

I think my next blog entry might get the title "--LOBXFL can seriously damage your health" ...

Tuesday, August 30, 2016

Why --LOBXFL should not be in SDTM

In my previous post from last week, I argued that the new SDTM variable --LOBXFL (Last Observation Before Exposure Flag) should not be in SDTM, as it is a derived variable, and can easily be calculated "on the fly" by review tools.
I also promised to implement such an "on the fly derivation" in the Open Source review tool "Smart Dataset-XML Viewer". The latter already has features for "on the fly" calculation of other derived variables like EPOCH and --DY.

It took me about 6 hours to implement the new feature "highlight last observation before exposure" in the Viewer. When the user now selects "Options - Settings" and navigates to the "Smart features" tab, a new option becomes visible:

The new option is near the bottom "Derive and Highlight Last Observation records before first Exposure". Essentially, this corresponds to records where the future value of --LOBXFL is "Y", but here, derived "on the fly" instead of relying on the flag in the record.
Also remark the first two checkboxes, which allow "on the fly" derivation of first and last study treatment (based on EX) and displaying that as a tooltip on the DM record, essentially making RFXSTDTC (Date/Time of First Study Treatment) superfluous.

When checking the checkbox "Derive and Highlight Last Observation records before first exposure", a new dialog is displayed, asking the user to choose between two additional options:

It asks the user whether the derivation of the "last observation before first exposure" records should be based on trusting RFXSTDTC in DM, or whether the tool should itself retrieve the "first exposure" for each subject from EX. The best is of course to use the second option, as reviewers should essentially make their own judgements and not rely on derived information (which may be erroneous) submitted by the sponsor.
However, to demonstrate this, let us "trust" the submitted value of RFXSTDTC (Date/Time of First Study Treatment).

After loading the SDTM submission datasets, let us have a quick look at the DM dataset. Here it is:

One sees that the first and last date/time of study treatment exposure are displayed as a tooltip on the "USUBJID" cell, making RFXSTDTC and RFENDTC superfluous (also remark that subject 1057 was a screen failure). For subject 1015, the date/time of first exposure is 2014-01-02, as derived from the EX records.

Let us now inspect the VS (vital signs) dataset. I moved some columns around (another of the many features of the viewer) to obtain a more "natural" order of the variables.

One sees that 3 records for DIABP (diastolic blood pressure) are highlighted. Their VSDTC (date/time of collection) is identical and equal to the first treatment date.
This already leads to a first discussion point about baseline flags, which is a discussion about data quality: if treatment and observation points are not precisely collected (i.e. including the time, not only the date), one cannot always know whether an observation was made before or after the first treatment. In this case, one only knows the observations were made ON the same day as the first treatment.
Also, we see that the sponsor assigned baseline flags (VSBLFL=Y) are correct.

Let us look somewhat further in the table:

We see that the last observation for "HEIGHT" before first study treatment is highlighted, and we see that 3 records for "PULSE" (Pulse Rate) are highlighted. We however also see that for the highlighted "HEIGHT" record, the sponsor did not set a baseline flag. It might have been forgotten, or it was decided that "HEIGHT" is irrelevant for the analysis of this study. A reviewer may judge differently.

For the second subject (1023), we find:

Something is strange here! The first three records for DIABP are marked as "last observation before first study treatment", but the baseline flags set by the sponsor are not on these records, but appear for the observations in the next visit.

What happened?
Did the sponsor assign the baseline flags incorrectly? Or did something else happen?
Another possibility is that RFXSTDTC was incorrectly derived by the sponsor (in DM), and as we decided to base the "on the fly derivation" on RFXSTDTC (which reviewers should i.m.o. not do), the "last observation records" are incorrectly assigned.

So let's not trust the submitted RFXSTDTC and let the tool derive it from the EX records:

And then inspect the generated table for subject 1023 again:

We now see that the highlighted records (derived "on the fly") now correspond to the records for which the sponsor set the baseline flag to "Y".

If we go back to the DM record for this subject, everything becomes clear:

We see that some way or another, the value for RFXSTDTC was not correctly assigned by the sponsor. It states "2012-08-02", whereas the real first exposure date/time (derived "on the fly" from EX and displayed on the tooltip) is "2012-08-05".
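Such a cross-check between DM and EX is simple to automate. The Python sketch below is an illustration and not the viewer's code: the record layout (plain dicts) and the function names are assumptions, and the later EX records are made up. Only the two dates for subject 1023 and the first exposure date of subject 1015 are taken from the example above. It relies on the fact that ISO 8601 date strings sort chronologically.

```python
def first_exposure_from_ex(ex_records):
    """Return {USUBJID: earliest EXSTDTC} derived from the EX records."""
    first = {}
    for record in ex_records:
        subject, start = record["USUBJID"], record["EXSTDTC"]
        if not start:
            continue  # skip records without a start date
        if subject not in first or start < first[subject]:
            first[subject] = start
    return first


def check_rfxstdtc(dm_records, ex_records):
    """Return (USUBJID, submitted RFXSTDTC, derived value) for each mismatch."""
    derived = first_exposure_from_ex(ex_records)
    return [
        (dm["USUBJID"], dm["RFXSTDTC"], derived.get(dm["USUBJID"]))
        for dm in dm_records
        if dm["RFXSTDTC"] != derived.get(dm["USUBJID"])
    ]


dm = [
    {"USUBJID": "1015", "RFXSTDTC": "2014-01-02"},
    {"USUBJID": "1023", "RFXSTDTC": "2012-08-02"},
]
ex = [
    {"USUBJID": "1015", "EXSTDTC": "2014-01-02"},
    {"USUBJID": "1015", "EXSTDTC": "2014-01-16"},  # made-up later record
    {"USUBJID": "1023", "EXSTDTC": "2012-08-19"},  # made-up later record
    {"USUBJID": "1023", "EXSTDTC": "2012-08-05"},
]
mismatches = check_rfxstdtc(dm, ex)
print(mismatches)
```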


These results show again that:
  • derived variables should NOT be in SDTM, as they can easily be calculated or derived "on the fly" by review tools
  • derived variables mean data redundancy, which is always bad in data sets: if two values for the same data point differ in value, one can never know which one is incorrect
  • reviewers should NEVER, NEVER make decisions based on derived variables that were submitted by the sponsor, be it baseline flags, --DY values, or EPOCH values. They should use their tools for deriving them themselves directly from the real source data.
  • implementing such "on the fly derivations" in review tools is a "piece of cake". It took me just 6 hours to implement the current one in the "Smart Dataset-XML Viewer". Implementing other similar features cost me even less time.

I still need to clean up my source code a bit, and will then publish a new version of the software, including the source code, on the SourceForge project website. Once done, I will let you know through a comment.

As usual, your comments are very welcome.

Also read the follow-up post "--LOBXFL: a follow up"

Tuesday, August 23, 2016

SDTM derails: new derived variables

The "Study Data Tabulation Model v.1.5" has recently been published as part of the new SEND standard v.3.1. The SDTM Implementation Guide (SDTM-IG) describing how the SDTM model v.1.5 should be implemented in the case of human studies will probably be released for public review in the next weeks.

A quick view on the "Changes from v.1.4 to v.1.5" reveals that some new variables have been added to the model, including some "derived" ones, and some that essentially contain metadata.
However, SDTM, according to its own principles, should not contain derived data, and metadata should go into the define.xml, not into the datasets themselves.

The most obvious new variable is --LOBXFL (Last Observation Before Exposure Flag), which can only have the value Y or null. Its definition is: "Operationally-derived indicator used to identify the last non-missing value prior to RFXSTDTC" (the latter is the datetime of first study drug/treatment exposure).
This variable is clearly "derived" and should not be in SDTM. So why is it there?
The answer is found in the latest version of the FDA "Study Data Technical Conformance Guide v.3.1" (July 2016) stating: "Baseline flags (e.g., last non-missing value prior to first dose) for Laboratory results, Vital Signs, ECG, Pharmacokinetic Concentrations, and Microbiology results. Currently, for SDTM, baseline flags should be submitted if the data were collected or can be derived".
The SDTM development team seems to have taken the occasion to make this a new variable, with the possibility to phase out the --BLFL variable which was not well defined. 

In my opinion, derived variables (such as EPOCH, --DY, etc.) should be calculated by the review tools at the FDA, and not be submitted by sponsors. The reason for this is that such variables jeopardize the model (data redundancy) and lead to errors. For example, I have seen submissions where up to 40% of the --DY values were incorrect! I expect that the same will happen for --LOBXFL in future submissions. This may be highly problematic as reviewers will rely on data that are possibly erroneous due to derivation problems, instead of relying on their own "on-the-fly" derivation (trust is good, control is better).
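To show how little is needed for such an "on-the-fly" derivation, here is a minimal Python sketch for --DY. It follows the SDTM convention that there is no day 0: the reference start date (RFSTDTC) is day 1 and the day before it is day -1. The function name `study_day` is made up, and the sketch simply returns nothing for incomplete dates.

```python
from datetime import date


def study_day(dtc, rfstdtc):
    """Derive --DY from an ISO 8601 --DTC value and the reference start date."""
    if not dtc or not rfstdtc or len(dtc) < 10 or len(rfstdtc) < 10:
        return None  # partial or missing dates: no study day
    observation = date.fromisoformat(dtc[:10])   # ignore any time part
    reference = date.fromisoformat(rfstdtc[:10])
    delta = (observation - reference).days
    return delta + 1 if delta >= 0 else delta    # no day 0


print(study_day("2014-01-02", "2014-01-02"))
print(study_day("2014-01-01", "2014-01-02"))
print(study_day("2014-01-09T10:30", "2014-01-02"))
```

A review tool can compare the result with the submitted --DY value and flag every record where the two differ.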

For example, suppose I am testing a new blood pressure lowering agent, and have following values: 140/95, 120/80 and 122/82, and erroneously, the second one is assigned by the sponsor as "last non-missing value prior to dose" (VSLOBXFL=Y) instead of the first one. Can you imagine what can happen?

I haven't tried yet, but I guess that I can add a feature to the "Smart Dataset-XML Viewer" that highlights the records that contain the last value before exposure by finding it "on the fly". As on other occasions, I think I can program that in maybe 1-2 evenings (see here). Now I am not a super-programmer, so I wonder why the FDA (with many more resources than I have) has not been able to realize such simple features in its tools in the last 20 years.

Also, the following variables have been added: --ORREF (Reference Result in Original Units), --STREFC (Reference Result in Standard Format), --STREFN (Numeric Reference Result in Std Units).
I presume the "origin" in these cases can be "assigned" (but then it is metadata, which i.m.o. belongs in the define.xml), or "derived". The document gives the following example: "value from predicted normal value in spirometry tests".
Now I worked for some time in this area, and know that such values are usually derived from the age and sex of the subject, or sometimes using a few more variables (additionally height, weight, …). In such a case, it would be better if reviewers could generate these reference values themselves (rather than trusting that the sponsor has provided the correct value), e.g. by using a RESTful web service. We already developed such a RESTful web service for LOINC codes and implemented it in the "Smart Dataset-XML Viewer", and I guess it would also be very simple to create similar RESTful web services for normal values in spirometry.

In case such a reference value is independent from the subject itself (e.g. a fixed value for the specific test), I think it is to be considered as metadata, and should go into the define.xml. I realize that the define.xml needs to be extended for that, based on the "ReferenceData" element in the core ODM.

I will try to add the new feature "highlight last observation before exposure" in the "Smart Dataset-XML Viewer" next week (first taking a few days of vacation…)