Sunday, April 29, 2018

V8 is not a solution!


The CDISC Europe Interchange in Berlin last week was a great success.
There was, however, one major disappointment: during the classic "Regulatory Authorities Update" session at the start of the conference, Ronald (Ron) Fitzmartin (FDA-HHS) stated (by Skype) that the FDA is considering SAS Transport 8 (V8) as a replacement for SAS Transport 5 (V5), that they have performed first tests, and that they want to try to adapt their systems for V8 by the end of the year.

If the FDA replaces V5 of the SAS Transport format (also named the "XPT format") with V8, this would be a major disaster. It would throw us back by at least 10 years and block any innovation for the next 20 years.


Moving to SAS Transport 8 (V8) would have the following advantage:

  • It would lift the 8-, 40-, and 200-character limitations (but replace them with new ones)
That is the complete list of advantages. I couldn't find any others.

And this is the list of disadvantages:

  • V8 is "two-dimensional" (tabular) data, without the possibility to express relations between data in the table. These will still require the notorious "RELREC" table. Moving to a world where data is linked and the linking is precisely described is impossible with V8 (as it was with V5).
  • V8 is "US-ASCII only". Just like V5, there is no support for any other characters, not even for "Spanish" characters like "á, é, í, ó, ú, ü, ñ, ¿". We (and also the FDA) need to take into account that 45 million people in the USA have Spanish as their first language. Just like V5, V8 also supports neither Japanese nor Chinese characters. As the Japanese PMDA usually follows the FDA in its requirements, this would also block the PMDA (and other regulatory authorities) from modernizing.
  • V8 is NOT vendor-neutral. Although the specification has been published, it is UNUSABLE. The reason is that numeric values must still be written in the extremely old "IBM mainframe representation", which is NOT used by modern computers. The published specification only provides some C code for the conversion between native floating-point numbers and the outdated IBM representation. This C code is unusable in modern computer languages like Java, C# and Python (the latter coming up for machine-learning software). SAS has not published any source code for the conversion in these modern languages, and will probably not do so, as the SAS Institute is NOT encouraging V8 for exchange of data in clinical trials (see further). This means that vendors that use modern computer languages and do not want to use SAS are put at a serious disadvantage.
    Be aware that the FDA is, by US law, mandated to be vendor-neutral. It wasn't in the past, and if it mandates V8, it will not be in the future either.
             
  • V8 has no future. It is just "duct tape" on V5, repairing some of the limitations. But it is not a modern format either. It is not used by any other industry as a standard format. For V5, the clinical research industry is the only one using it. The same should not become the case for V8.
  • I expect that V8 would further increase the file sizes. Just like V5, V8 is a "fixed record length" format. This means that one needs to define the length (in bytes) for each column/variable in advance, and that every cell in each record will then take up that full length in bytes. For example, if one has a comment in an SDTM "CO" dataset that is 700 characters in length, all other comments will also require these 700 bytes, even when the comment is only 10 characters long. The remaining 690 bytes must then be filled with blanks, which is an enormous waste. As a result, "CO" and "SUPPxx" datasets in particular may take up several times the file size they do now. "File size" was the only main issue the FDA encountered when doing the Dataset-XML pilot in 2014. For CO and SUPPxx datasets, however, the file size of a Dataset-XML file is usually smaller than that of the V5 file!
    So, when moving to V8, the FDA will "shoot itself in the foot", as file sizes will further increase. Especially for the above-mentioned datasets, it will obtain just the opposite of what it wants, as moving to XML (or JSON) would decrease the file sizes of SUPPxx and CO records.
     
  • V8 is not "extensible". All modern formats (XML, JSON, Turtle-RDF) have been created to take up any type of data in any type of language (as these use Unicode). XML stands for "extensible markup language", and JSON and Turtle work exactly the same way. These modern formats make it possible to start linking to "real world data" (RWD), data from electronic health records, data from the "internet of things" (IOT) world, etc. There is no way to do this using V8, as V8 is just 2-dimensional data in US-ASCII format only, without any extension possibility.
  • There is no native mechanism with V8 for support of audit trails, referencing source data, etc. More and more, reviewers will be asking for "show me the source data" in the future. With V8, there is no mechanism at all for doing that. Using XML, JSON or any of the RDF serializations, this is easily possible, as has already been demonstrated several times before. Especially "linked data" makes it possible to provide the 2-dimensional SDTM/SEND/ADaM data to the reviewer, and at the same time keep the link to the source data, as was recently demonstrated by Dave Iberson-Hurst at the CDISC Europe Interchange 2018 in Berlin.
  • V8 does not have native support for images, audio, etc. The modern formats mentioned do have this native support through the "base-64-binary" datatype, already supported by CDISC ODM and thus also by Define-XML and by Dataset-XML.
  • There is no standardized, open query language for V8. XML, JSON and RDF all have their highly standardized and completely open query languages (XQuery, JSON Query, SPARQL). There is no such open and free query language for V8. Again, vendors who do not use SAS or do not want to use it are discriminated against, which is a violation of the mandated vendor neutrality of the FDA.
  • Even the US "Library of Congress" discourages the use of V5 and V8 for data transport:

  • In order to be able to use the only free "viewer" that supports V8, one needs to register with an e-mail address and a password on the SAS Support website. Is this vendor-neutral?
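
To make the "fixed record length" point from the list above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the data (one 700-character comment plus 999 ten-character comments) and the function names are made up, and the code does not read or write any real V5 or V8 file. It simply pads every value to the longest one, as a fixed-width column must, and compares that with the size of the text itself.

```python
# Illustrative sketch only: names and data are hypothetical,
# and no real SAS Transport file is read or written.

def fixed_width_bytes(values, encoding="ascii"):
    """Total bytes when every cell is blank-padded to the longest value."""
    width = max(len(v.encode(encoding)) for v in values)
    records = [v.encode(encoding).ljust(width, b" ") for v in values]
    return sum(len(r) for r in records)


def actual_content_bytes(values, encoding="ascii"):
    """Total bytes of the text itself, without any padding."""
    return sum(len(v.encode(encoding)) for v in values)


# One 700-character comment plus 999 ten-character comments (made-up data).
comments = ["x" * 700] + ["short note"] * 999

fixed = fixed_width_bytes(comments)
content = actual_content_bytes(comments)
padding = fixed - content

print(f"fixed-width column: {fixed} bytes")
print(f"actual text:        {content} bytes")
print(f"blank padding:      {padding} bytes ({padding / fixed:.1%})")
```

In this made-up case almost the whole column consists of blanks. Real datasets will of course show a different ratio, but the mechanism is the same: one long value sets the width for every record.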

During his statement at the CDISC Europe Interchange last week in Berlin, Ronald Fitzmartin stated that "V8 is recommended by Phuse". I was a bit surprised by that, and so started looking for any documentation that reflects this. I didn't find anything. The latest document from the "Phuse Alternative Transports" working group doesn't even mention V8. If you know of such a document by Phuse recommending V8, please let me know, and I will add the link here.
I do know that there was a presentation about V8 at an FDA "public meeting" in 2012 (6 years ago!) where a SAS employee presented V8, but it is doubtful whether this represents the official point of view of the SAS Institute.

Interesting in this respect is an article by the SAS Institute from 2001 (!!!), referenced in many articles, blogs and tweets by those people at SAS who are really involved in standards for clinical research (strongly recommended read). In its section "future directions", which recommends XML, it states (quote):

"The XPORT format does not allow for extensibility", and "At first glance, XML looks a little wordy and verbose. The size need not be a problem because public domain ZIP software is readily available that can compress XML nicely.", followed by: "XML can bridge incompatibilities of computer systems and vastly improve web applications, basically because XML tags say what the information is, not what it looks like. This facilitates more precise declarations of content and more meaningful search results across multiple platforms. Once data has been located, it can be handed off to other applications for further processing and viewing."

This was in 2001. We have not made any progress since then. There was a pilot in 2014 testing the CDISC Dataset-XML format (an open and free standard). The pilot was strongly supported by CDISC and by SAS. Lex Jansen of SAS (also a CDISC volunteer) spent considerable time helping the FDA reviewers implement the standard in their systems.
All seemed very successful until the FDA pilot report was published.
Surprisingly, the report "repeated" many of the things we had already known for many, many years, such as that XML can handle more than 200 characters and that XML can be imported into SAS without data and information loss (this has been publicly known for over 20 years!). The major issue that was reported, however, was (quote): "Based on the file size observations, DS-XML produced much larger file sizes than XPORT, which may impact the Electronic Submissions Gateway (ESG) and may lead to file storage issues".


When we then made the argument about zipping (XML can very efficiently be zipped to about 3% of its original size or less), we got the answer that zipped files were not acceptable, and when we then asked why, the answer was very vague, essentially saying "because we always did it that way".
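
The zipping argument is easy to try out. Below is a minimal Python sketch that builds a synthetic, highly repetitive XML document and compresses it with DEFLATE (the algorithm behind ZIP) using the standard `zlib` module. The element and attribute names are invented for the example and are not real Dataset-XML, and synthetic data like this compresses better than a real submission would, so the ratio it prints is an illustration of the mechanism and not a measurement.

```python
import zlib

# Hypothetical, simplified stand-in for a tabular XML dataset.
# The element and attribute names are invented, not real Dataset-XML.
rows = [
    f'<Record seq="{i}">'
    f'<Item name="STUDYID" value="STUDY01"/>'
    f'<Item name="USUBJID" value="SUBJ-{i:05d}"/>'
    f'<Item name="COVAL" value="No comment"/>'
    f'</Record>'
    for i in range(1, 10001)
]
xml_bytes = ("<Dataset>" + "".join(rows) + "</Dataset>").encode("utf-8")

# Compress at the highest DEFLATE level and compare the sizes.
compressed = zlib.compress(xml_bytes, 9)
ratio = len(compressed) / len(xml_bytes)

print(f"original:   {len(xml_bytes)} bytes")
print(f"compressed: {len(compressed)} bytes ({ratio:.1%} of original)")

# Compression is lossless: decompressing gives back the exact same bytes.
assert zlib.decompress(compressed) == xml_bytes
```

The repeated tag and attribute names are exactly what makes XML look "wordy", and they are also exactly what a compressor removes most efficiently.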
 

Since then, nothing has happened regarding Dataset-XML. Why? What did we do wrong that we failed to convince the FDA to use it? I will write a few things down in my next blog entry (sometime in the next few days).

SAS Transport Version 8 (V8) is very probably the worst choice as a successor for SAS Transport 5. It is just "duct tape" on V5 and a "dead end road", blocking any innovation for the next 10 years.
In my opinion, the pharmaceutical industry and the whole CDISC community should strongly and loudly protest against using it as a possible format for electronic submissions.


If you too feel that V8 should NOT be considered, please tell the FDA loudly and clearly. For example, you could send a short e-mail to Ronald Fitzmartin at FDA-HHS (I cannot publish the e-mail address here, but one can find it easily on the internet) or to any of your FDA contacts.

Our submissions deserve better than a "new", but already outdated exchange format.