In my previous posts and conference contributions,
I have already shown that file size
should not be an issue in electronic submissions to the FDA:
once the data are loaded into a database, disk space and
memory usage become independent of the transport format.
But it looks as if many regulatory reviewers use the CDISC submission
files "as is", and do not load them into a database at all.
Some of them even seem to use the ancient "SAS Viewer" as
their only analysis tool. The latter does not allow
filtering before loading, so a reviewer needs to load the whole
SAS-XPT file into memory
before any filtering can start. This is in contrast to the "Smart Dataset-XML viewer", which allows filtering before loading
the data into memory.
This
has resulted in the famous rule FDAC036, stating: "Variable
length should be assigned based on actual stored data to minimize
file size. Datasets should be re-sized to the maximum length of
actual data used prior to splitting".
Before writing my next blog, "Why
rule FDAC036 is hypocritical", I first need to explain how
the SAS-XPT format works. In the literature one finds very little information,
except for the famous TS-140 document "The record layout of
SAS data sets in SAS Transport (XPORT) format", which can only
be understood by IT specialists who understand how IBM mainframe
number representations work (who still does?).
So let me explain a bit how SAS-XPT
works as a format.
Each SAS-XPT file starts with a number of "header" records, which can be seen as containing the metadata of the file contents. The first 6 header records contain general information about the file. Each of these header records is 80 bytes long; when less information is present, the record is padded with blanks. This can be visualized by a punch card (yes, I used them 40 years ago) that does not have all 80 columns punched:
One can already see that this is not
very efficient (and not very modern either).
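The blank-padding of an 80-byte header record can be sketched in a few lines of Python. This is a simplified illustration only; the real header contents follow the TS-140 layout, and the "HEADER RECORD" string below is just an approximation of the first library header:

```python
# Sketch: every XPT header record occupies exactly 80 bytes,
# padded with blanks, regardless of how much information it carries.
RECORD_LEN = 80

def pad_record(content: str) -> bytes:
    """Pad a header record with blanks to exactly 80 bytes."""
    if len(content) > RECORD_LEN:
        raise ValueError("header record content exceeds 80 bytes")
    return content.ljust(RECORD_LEN).encode("ascii")

# Approximation of the first library header record (per TS-140):
first_header = pad_record("HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!")
print(len(first_header))  # always 80, whatever the content
```

Everything beyond the actual content is blanks, i.e. the "unpunched columns" of the punch card.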
The next records are each 140 bytes
long. These are the so-called "NAMESTR" records: each of
them contains the metadata of a single variable, such as the variable
name (max. 8 characters), the label (max. 40 characters), and whether the
variable is "numeric" or "char". Again, blanks
are added to the fields when there is less information. For example,
when the label is "STUDY ID" (8 characters), 32 blanks are
added to pad the field to the 40 characters defined for the
variable label.
The 140-byte "NAMESTR" records are put together and then broken into 80-byte pieces. If that does not fit exactly, additional blanks are added after the last "NAMESTR" record to make the total number of bytes a multiple of 80. So one already sees that, with all these added blanks, the "NAMESTR" structure is not very efficient either.
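A little arithmetic shows how much padding this re-blocking adds. The Python sketch below (the function name is mine, purely for illustration) computes the size of the NAMESTR section for a given number of variables:

```python
# Sketch: the NAMESTR section is n_vars * 140 bytes, rounded up
# to the next multiple of 80 by appending blanks.
NAMESTR_LEN = 140
BLOCK = 80

def namestr_section_size(n_vars: int) -> tuple[int, int]:
    """Return (total bytes of the NAMESTR section, padding blanks added)."""
    raw = n_vars * NAMESTR_LEN
    padded = -(-raw // BLOCK) * BLOCK  # round up to a multiple of 80
    return padded, padded - raw

print(namestr_section_size(20))  # (2800, 0): 20 * 140 happens to fit exactly
print(namestr_section_size(23))  # (3280, 60): 3220 bytes padded with 60 blanks
```

So for a dataset with 23 variables, 60 extra blanks are written just to satisfy the 80-byte blocking.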
The first record after all the "NAMESTR" records is the "Observation header". It is an 80-character record, just stating that the actual data come after this record. It looks like:
After the "Observation header"
come the data, one record for each row in the dataset. Each of these
records has the same length, regardless of how much
information it contains: missing information is replaced by blanks. This
makes XPT storage very inefficient. Let me explain with an example.
Suppose I have a variable (e.g. "MHTERM") whose
longest value is "The quick brown fox jumps over the lazy
dog", which is 43 characters. This is also the length declared
in the header (rule FDAC036). In the next record, the value is "Hello
world" (11 characters); in this record, the value will still take
43 characters, i.e. 32 blanks are added. In the third record, the
value is "Yes" (3 characters), so the field is padded
with 40 blanks.
This can be visualized as follows:
The "yellow" bytes are bytes
that are filled with blanks and do not contain any
information ("wasted blanks"). One immediately sees that
this is not efficient. The efficiency is 100% for this field in the
(first) record that contains the longest string, but for the second
record, the efficiency already decreases to 26% (11/43), and in the
third record it decreases to 7% (3/43). The "overall" efficiency
here is 44%.
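The computation behind these percentages is simple; here is a small Python sketch of it (the function name is mine, not part of any standard):

```python
def storage_efficiency(values: list[str]) -> float:
    """Fraction of a fixed-width character field that carries real characters.

    The declared width is the length of the longest value,
    exactly as rule FDAC036 requires.
    """
    width = max(len(v) for v in values)
    used = sum(len(v) for v in values)
    return used / (width * len(values))

values = ["The quick brown fox jumps over the lazy dog",  # 43 characters
          "Hello world",                                  # 11 characters
          "Yes"]                                          # 3 characters
print(round(storage_efficiency(values) * 100, 1))  # 44.2, i.e. ~44%
```

In other words: even with the length fully optimized per FDAC036, more than half of this field is still blanks.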
The second issue is that SAS-XPT always stores
numeric values (making no difference between integers and floating-point
numbers) using 8 bytes. This is done even when the
numbers are small (like the --SEQ or --DY values in SDTM) and could
e.g. be stored in a "short" (2 bytes, range from
-32,768 to 32,767). This is not such a big deal, as in practice we often use a "too wide" data type for numeric values anyway: define.xml does not have a "short" data type either.
I did an analysis of how efficient XPT
storage is on a real example: the well-known "LZZT 2013"
sample submission. I took the SDTM example and concentrated on the
LB, QS and SUPPLB files, as these are the largest (55, 33 and 33
MB respectively). Although this is a relatively "small"
submission, the results of the analysis can be extrapolated to
larger submissions (the efficiency does not change when more
data is added). The full file with all results can be
obtained on request - just drop me an e-mail.
The results are surprising:
QS:
Variable Name | Length | Efficiency (%) |
---|---|---|
STUDYID | 12 | 100 |
DOMAIN | 2 | 100 |
USUBJID | 11 | 100 |
QSSEQ | 8 | 34 |
QSTESTCD | 8 | 92 |
QSTEST | 40 | 65 |
QSCAT | 70 | 59 |
QSSCAT | 26 | 53 |
QSORRES | 20 | 11 |
QSORRESU | 7 | 0.29 |
QSSTRESC | 4 | 26 |
QSSTRESN | 8 | 8 |
QSSTRESU | 7 | 0.29 |
QSBLFL | 1 | 18 |
QSDRVFL | 1 | 26 |
VISITNUM | 8 | 34 |
VISIT | 10 | 40 |
VISITDY | 8 | 22 |
QSDTC | 10 | 100 |
QSDY | 8 | 22 |
One sees that the "storage
efficiency" can be as low as 0.3%, but even more important is to
notice that the "storage efficiency" for the longest field
(QSCAT) does not exceed 60%. The "overall" "storage
efficiency" for this file is 52.0%. So one could state that
this file consists of almost 50% (unnecessary?) blanks. Not
very efficient indeed.
For LB we find:
Variable Name | Length | Efficiency (%) |
---|---|---|
STUDYID | 12 | 100 |
DOMAIN | 2 | 100 |
USUBJID | 11 | 100 |
LBSEQ | 8 | 25 |
LBTESTCD | 7 | 50 |
LBTEST | 200 | 7 |
LBCAT | 10 | 60 |
LBORRES | 5 | 60 |
LBORRESU | 8 | 62 |
LBORNRLO | 5 | 45 |
LBORNRHI | 7 | 38 |
LBSTRESC | 8 | 49 |
LBSTRESN | 8 | 66 |
LBSTRESU | 8 | 53 |
LBSTNRLO | 8 | 52 |
LBSTNRHI | 8 | 56 |
LBNRIND | 200 | 53 |
LBBLFL | 1 | 18 |
VISITNUM | 8 | 25 |
VISIT | 19 | 39 |
VISITDY | 8 | 23 |
LBDTC | 16 | 100 |
LBDY | 8 | 23 |
The "overall" "storage
efficiency" here is 21%. This result is however biased by the fact
that the lengths for LBNRIND and LBTEST were not optimized (both were
set to 200). For example, if the length for LBTEST had been
set to 40, the "storage efficiency" for LBTEST would go up
from 7% to 33%.
For SUPPLB we find:
Variable Name | Length | Efficiency (%) |
---|---|---|
STUDYID | 12 | 100 |
RDOMAIN | 2 | 100 |
USUBJID | 11 | 100 |
IDVAR | 8 | 63 |
IDVARVAL | 200 | 1.3 |
QNAM | 8 | 7 |
QLABEL | 40 | 76 |
QVAL | 200 | 1.4 |
QORIG | 200 | 3.5 |
QEVAL | 200 | 11 |
Supplemental qualifier datasets are
always very difficult: one does not know in advance what values will
go into them (QVAL usually contains a mix of numeric and text values).
Therefore, most implementers set the length for QVAL (and often also
for QORIG) to the maximum of 200, which was also done in this example. This
usually means that the "storage efficiencies" for these
variables are extremely low.
However, if we had set the length for QVAL to 3, which is the longest value found (almost all the values are numeric of the form "x.y"), then the efficiency would be near 100%. As QVAL can contain anything, we have however also seen very low "storage efficiencies", for example when one record has a QVAL value of 180 characters (so the length must be set to 180) and all the other QVAL values are very short, e.g. "Y" for an xxCLSIG supplemental qualifier.
As such, and due to their vertical
structure, supplemental qualifier datasets must be regarded as
very inefficient, at least when implemented as SAS-XPT. We have
noticed that their equivalents in CDISC Dataset-XML format very often have
considerably lower file sizes than the SAS-XPT representation of
supplemental qualifier datasets. So using XML does not always mean
larger files.
But what would the alternatives have been 20 years ago, when the FDA decided on SAS-XPT?
The first and very simplest would of course have been CSV (comma-separated values). So we did a simple test: we transformed the lb.xpt file into CSV and found that the file size decreased from 66 MB (lb.xpt) to 9 MB (lb.csv). This means a 7-fold decrease in size! So why didn't the FDA select the simple "comma-separated values" format 20 years ago (before XML existed) as the submission format instead of SAS-XPT? CSV is really vendor-neutral (I do not consider SAS-XPT vendor-neutral, as it requires almost extinct IT skills to implement in anything other than SAS).
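The size difference is easy to reproduce on synthetic data. The Python sketch below (fabricated rows, not from the actual lb.xpt) compares fixed-width storage, with every field padded to its declared length as in XPT, against the same rows written as CSV:

```python
import csv
import io

# Fabricated example rows (STUDYID, DOMAIN, a long free-text field):
rows = [["CDISC01", "MH", "The quick brown fox jumps over the lazy dog"],
        ["CDISC01", "MH", "Hello world"],
        ["CDISC01", "MH", "Yes"]]

# Declared width of each column = length of its longest value (FDAC036):
widths = [max(len(r[i]) for r in rows) for i in range(3)]

# Fixed-width storage: every value padded with blanks to its column width.
fixed = "".join("".join(v.ljust(w) for v, w in zip(r, widths)) for r in rows)

# The same rows as CSV: no padding at all, just delimiters.
buf = io.StringIO()
csv.writer(buf).writerows(rows)

print(len(fixed), len(buf.getvalue()))  # CSV is clearly smaller
```

The more short values a long column contains, the bigger the gap becomes, which is exactly what we saw with lb.xpt.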
I do not know the answer to that
question; maybe someone can tell me?
If the FDA insisted on having a
binary format (XML is a text-based format), then why didn't they
develop what I usually call a "VARCHAR" format?
This kind of format is also used in DICOM, the worldwide standard for
the exchange of images in the medical world. Picking up our "quick
brown fox" example again, it works as follows:
Just as most binary formats, it is a continuous format, so there are no "line breaks" or the like.
For each variable, the first byte contains the length of the variable value (e.g. 43), and the following N (in this case 43) bytes contain the value itself. This is immediately followed by the length definition of the next field (in this case 3), followed by its value, and so on.
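A minimal Python sketch of such a length-prefixed "VARCHAR" encoding (my own toy implementation, not the actual DICOM encoding, which uses longer length fields):

```python
def encode_varchar(values: list[str]) -> bytes:
    """Length-prefixed encoding: one length byte, then the value itself."""
    out = bytearray()
    for v in values:
        data = v.encode("ascii")
        if len(data) > 255:
            raise ValueError("a single length byte caps values at 255 bytes")
        out.append(len(data))   # 1 byte: length of the value
        out.extend(data)        # N bytes: the value, no padding at all
    return bytes(out)

def decode_varchar(blob: bytes) -> list[str]:
    """Walk the byte stream: read a length byte, then that many bytes."""
    values, i = [], 0
    while i < len(blob):
        n = blob[i]
        values.append(blob[i + 1:i + 1 + n].decode("ascii"))
        i += 1 + n
    return values

record = encode_varchar(["The quick brown fox jumps over the lazy dog", "Yes"])
print(len(record))  # 48: 43 + 3 value bytes plus 2 length bytes
```

The only overhead is one length byte per value; no wasted blanks at all.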
Like this, the storage is extremely efficient, much more efficient than the SAS-XPT format. Why did the FDA never consider such an efficient format?
The best alternative nowadays is however still
XML, at least if one wants to keep the 2-dimensional structure of
SDTM (which I will question in my next blog). In CDISC Dataset-XML, no
"trailing blanks" are ever added, and missing ("NULL")
values are simply not present in the file. File sizes are however usually still
larger than for XPT, due to the "tags" describing what each
value is about. XML files can however easily be compressed (e.g. as
"zip", "tar" or even "tar.gz",
the latter a format the FDA is using anyway) to less than
3% of their original size, and the resulting files need not be decompressed to be
readable by modern software such as the "Smart Dataset-XML viewer".
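The compression effect is easy to demonstrate: XML tags repeat heavily, which is exactly what compressors exploit. A quick Python sketch using gzip (the element names mimic Dataset-XML, but the snippet itself is fabricated):

```python
import gzip

# A fabricated Dataset-XML-like row, repeated to mimic a large file.
# The tag and attribute names dominate, and they are highly repetitive.
row = ('<ItemGroupData ItemGroupOID="IG.LB">'
       '<ItemData ItemOID="IT.LBTESTCD" Value="ALB"/>'
       '</ItemGroupData>\n')
xml = row * 10_000

compressed = gzip.compress(xml.encode("utf-8"))
print(len(compressed) / len(xml))  # a tiny fraction of the original size
```

On real Dataset-XML files the ratio is less extreme than on this artificial example, since the data values vary, but ratios of a few percent are typical.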
But anyway, what are we talking about?
The usual storage cost of a complete submission study is far below 1
US$ (the storage cost of 1 GB is US$ 0.02 in 2016).
When loaded into a database, the efficiency of the transport file does not matter at all, with the additional advantage that databases can be indexed to be able to work (and thus review) much faster.
But many of the FDA reviewers seem
unable to use any of these modern technologies, and thus require
us to implement rule FDAC036, simply because the FDA made a bad decision
in its choice of a transport format in the past.
So, up to my next blog: "Why rule FDAC036 is hypocritical".