Saturday, July 14, 2018

RESTful web services for UCUM in CDISC

Today, I undeployed my "UCUM RESTful web services".

It is always sad to stop something, but this time the reason is a very good one: the US National Library of Medicine (NLM) is taking over the service and is (as I did) making it freely available to everyone. I am a proud man.

UCUM, by the way, stands for "Unified Code for Units of Measure", and has been developed by the Regenstrief Institute.

The "NLM UCUM Web Services" descriptions can be found at: The API is, apart from the base (the server path), exactly as I developed it 3 years ago. The returned XML, JSON or text is also very similar to that returned by the old service, so that users (and their applications of course) can easily adapt to the new situation.
The old web services are still described at but are now out of service.

What does this have to do with CDISC? Everything.

Within CDISC (especially within the SDTM and Controlled Terminology groups), there is still the belief that controlled terminology should consist of lists, so that e.g. in an SDTM or SEND submission, the term can be compared against the list. So, the Controlled Terminology (CT) team developed (and is still developing) lists of units. In the latest CT version (2018-06-29) there are over 680 "unit terms", of which 5 are new.
UCUM however is not a list, it is a notation, a system. So the number of "UCUM units" is essentially infinite. So one wonders (in the philosophy of the CT team) how one can validate whether a unit is a valid "UCUM unit" when there is no list of them.

The answer is simple: by using RESTful web services.

One of the methods of the NLM UCUM RESTful web services is a service returning whether a submitted "UCUM unit" is valid or not. Are you asking yourself how that is (technically) done?
Well, each valid "UCUM unit" can be reduced to a set of base units (meter, second, gram, radian, Kelvin, Coulomb, candela), and this is exactly what the server does. If the reduction to base units is successful, "true" is returned; if it fails, "false" is returned. It's as simple as that.
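
To make the idea concrete, here is a minimal sketch in Python of such a "reduce to base units" check. It is not the code running on the NLM server, and it is not my old implementation either: the tiny tables `ATOMS` and `PREFIXES` and the function names are made up for this illustration, only a handful of units and prefixes are covered, and parentheses, annotations and leading slashes are not supported.

```python
import re

# Illustrative subset only: exponents over the base units m, s and g.
# The real UCUM tables are far larger.
ATOMS = {
    "m": {"m": 1},
    "s": {"s": 1},
    "g": {"g": 1},
    "L": {"m": 3},                        # litre is a volume
    "Pa": {"g": 1, "m": -1, "s": -2},     # pascal
    "m[Hg]": {"g": 1, "m": -1, "s": -2},  # metre of mercury column (pressure)
}
PREFIXES = ("k", "d", "c", "m", "u")      # kilo, deci, centi, milli, micro

# A term is a symbol followed by an optional integer exponent, e.g. "m2", "s-1".
TERM = re.compile(r"^(.+?)(-?\d+)?$")


def lookup(symbol):
    """Return the base-unit exponents of an (optionally prefixed) atom, or None."""
    if symbol in ATOMS:
        return ATOMS[symbol]
    if symbol[0] in PREFIXES and symbol[1:] in ATOMS:
        return ATOMS[symbol[1:]]
    return None


def reduce_to_base(unit):
    """Reduce `unit` to base-unit exponents; return None if that fails."""
    dims = {}
    sign = 1
    # re.split with a capturing group yields: term, operator, term, operator, ...
    for i, part in enumerate(re.split(r"([./])", unit)):
        if i % 2 == 1:
            sign = 1 if part == "." else -1   # "." multiplies, "/" divides
            continue
        match = TERM.match(part)
        if match is None:                     # empty term, e.g. "mg//dL"
            return None
        atom_dims = lookup(match.group(1))
        if atom_dims is None:                 # unknown symbol
            return None
        exponent = sign * int(match.group(2) or 1)
        for base, power in atom_dims.items():
            dims[base] = dims.get(base, 0) + power * exponent
    return {base: power for base, power in dims.items() if power != 0}


def is_valid_unit(unit):
    """True when the unit could be reduced to base units."""
    return reduce_to_base(unit) is not None


print(is_valid_unit("mm[Hg]"))  # UCUM notation
print(is_valid_unit("mmHg"))    # "paper" notation, unknown to this toy table
```

Note that the check never consults a list of allowed units: anything that can be built from the known atoms, prefixes and operators reduces successfully, which is why the number of valid units is unbounded.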

RESTful web services can easily be used by every modern computer application, including the tools used by the FDA and PMDA. Other RESTful web services exist for finding out whether an SDTM/SEND variable is required, expected or permissible, what the exact label is (depending on the CT version), etc. Although implementing such services in tools is extremely easy, none of the validation tools currently used by the FDA uses any of these web services. Why?
Such RESTful web services could also easily be used to automatically update rules that have been implemented incorrectly (or cause false positives) in validation software, but this is not done either. Why? Another "not invented here"? Or are some people trying to make a lot of money with outdated technology using "lists"? We have proposed a much better method, but it requires that the FDA moves to a modern format for electronic submissions.

Back to UCUM and the NLM web services for it. UCUM notation is still not allowed by CDISC. For example, if your source is an electronic health record, blood pressure will in 99+% of cases come with a unit in UCUM notation, in this case "mm[Hg]". This will cause an error in SDTM, as SDTM only allows "mmHg" (the "paper" notation), the one in the list developed by the CDISC-CT team. If UCUM notation were allowed (it should be), it could easily be validated as a correct "UCUM unit" by the NLM RESTful web service.

There are many other advantages of using UCUM notation in CDISC electronic submissions. For example, it allows conversions to be automated, including those for the SDTM "standardized result". Using these RESTful web services delivers much more precise results and is much less error-prone than manually programming the conversions in SDTM mapping software, as is usually done. "CDISC units" do not allow conversions to be done at all. CDISC-CT does not even publish the simplest conversion factors in a machine-readable format.
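
Why a notation makes conversions automatable can be shown with a small sketch. Again, this is only an illustration under my own assumptions, not the algorithm of the NLM service: the tables `ATOMS` and `PREFIXES` are a made-up miniature subset, the function names are hypothetical, and conversions that need extra information (such as a molecular weight for mg/dL to mmol/L) are out of scope.

```python
import re

PREFIXES = {"k": 1e3, "d": 1e-1, "c": 1e-2, "m": 1e-3, "u": 1e-6}
# atom -> (factor to base units, exponents over m, s, g); illustrative subset only
ATOMS = {
    "m": (1.0, {"m": 1}),
    "s": (1.0, {"s": 1}),
    "g": (1.0, {"g": 1}),
    "L": (1e-3, {"m": 3}),   # 1 L = 0.001 cubic metre
}
TERM = re.compile(r"^(.+?)(-?\d+)?$")


def to_base(unit):
    """Return (factor, exponents) expressing `unit` in base units."""
    factor, dims, sign = 1.0, {}, 1
    for i, part in enumerate(re.split(r"([./])", unit)):
        if i % 2 == 1:
            sign = 1 if part == "." else -1   # "." multiplies, "/" divides
            continue
        match = TERM.match(part)
        if match is None:
            raise ValueError(f"cannot parse unit: {unit!r}")
        symbol = match.group(1)
        exponent = sign * int(match.group(2) or 1)
        if symbol in ATOMS:
            scale, atom_dims = ATOMS[symbol]
        elif symbol[0] in PREFIXES and symbol[1:] in ATOMS:
            scale, atom_dims = ATOMS[symbol[1:]]
            scale *= PREFIXES[symbol[0]]
        else:
            raise ValueError(f"unknown symbol {symbol!r} in unit {unit!r}")
        factor *= scale ** exponent
        for base, power in atom_dims.items():
            dims[base] = dims.get(base, 0) + power * exponent
    return factor, {base: power for base, power in dims.items() if power != 0}


def convert(value, source, target):
    """Convert `value` from `source` to `target`; both must have the same dimension."""
    source_factor, source_dims = to_base(source)
    target_factor, target_dims = to_base(target)
    if source_dims != target_dims:
        raise ValueError(f"{source!r} and {target!r} are not commensurable")
    return value * source_factor / target_factor


print(convert(100, "mg/dL", "g/L"))
```

The conversion factor is derived from the notation itself, so nobody has to type it into mapping software by hand, and an attempt to convert between incompatible units is rejected instead of silently producing nonsense.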

With the transfer of the "UCUM RESTful web services" to the National Library of Medicine, an extremely well trusted organization, there is not a single reason anymore to further forbid the use of UCUM notation for units in CDISC submissions.

Saturday, June 30, 2018

CDISC-CT 2018-06-29: the madness doesn't stop!

Today, I downloaded the new CDISC-CT version 2018-06-29.

As usual, I start working from the XML file (named "SDTM Terminology.odm.xml") for filling my databases, for use with our free CDISC RESTful Web Services and bringing the XML in a more suitable format for working with modern tools like the "SDTM-ETL" transformation software.

I quickly found out that something is TERRIBLY WRONG: in the XML file, it states "2016-06-29" everywhere (at least 5 times), where it should be "2018-06-29".

So I asked myself: "did the CDISC-CT team do quality control on these files?".

It was good that I found this, otherwise, if I had trusted the file, and used it to fill my databases, the latter would have been corrupted considerably.

Then I had a look at the changes.
I immediately found out that 23 codes have been removed completely (i.e. "deleted"). This is very bad practice! The good practice is to either "deprecate" such codes, or to "flag" them as being "not actual" anymore. So, if a sponsor worked hard on a submission with a slightly older CT version, thinks everything is all right, and then sends it to the FDA, and the FDA uses the latest CDISC-CT (as it usually does), suddenly a large number of validation errors may come up, although the sponsor did not do anything wrong - it is the CDISC-CT team who ignored good practice in standards development!
Also, I wonder why these codes were deleted. Was it that the CT team recognized that such a code was not appropriate and thus retracted it? Again, did they do QC on their earlier additions? Or did they just throw in codes with the idea that these can be deleted later anyway?

Worse than entirely deleting codes is to change the meaning of a code. Essentially and principally, this should never happen. Changing the definition of a code is extremely dangerous, as then, one and the same code is used for possibly two different things, dependent on the codelist version. So, the good practice (followed by all other SDOs that I know), is to "deprecate" the old code, and to assign a new code to the object with the new definition.
The CDISC-CT team completely ignores this principle and, just like that, changes the definitions of no fewer than 184 codes for use in SDTM.

Of course, I already can hear their argument that the definition changes are all minor changes. A change is a change however, and how can a machine understand that a change is minor or not? Machines are not that far yet, only humans can do so, but honestly, are you going to inspect all these 184 codes for which a change in definition was given?
And even then, some changes are serious! For example, for TESTCD=VITB12, the definition changes from "A measurement of the Vitamin B12 in a serum specimen" to "A measurement of the Vitamin B12 in a biological specimen", which is an enormous widening of the scope of the code.
Or did they forget to do QC when introducing that term in the earlier version 2018-03-30?
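
A machine can at least find the deleted codes and the changed definitions, even if it cannot judge whether a change is "minor". The following is a minimal sketch of how I compare two versions; the function name and the dictionary layout are my own assumptions, the two VITB12 definitions are the ones quoted above, and the other entries are hypothetical placeholders.

```python
def diff_codelists(old, new):
    """Compare two {submission value: definition} dictionaries."""
    deleted = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())
    # Same code in both versions, but a different definition text.
    changed = sorted(code for code in old.keys() & new.keys()
                     if old[code] != new[code])
    return {"deleted": deleted, "added": added, "changed": changed}


ct_2018_03_30 = {
    "VITB12": "A measurement of the Vitamin B12 in a serum specimen",
    "OLDTEST": "Hypothetical term that is dropped in the next version",
}
ct_2018_06_29 = {
    "VITB12": "A measurement of the Vitamin B12 in a biological specimen",
    "AMBRBTL": "Hypothetical definition text for the new Amobarbital term",
}

report = diff_codelists(ct_2018_03_30, ct_2018_06_29)
for kind, codes in report.items():
    print(kind, codes)
```

Had the old code been deprecated instead of deleted, and had the new definition received a new code, the "deleted" and "changed" lists would both be empty and nothing that was valid yesterday would be invalid today.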

LOINC coding has been made mandatory by the FDA for laboratory test codes. As LOINC codes are much more precise than CDISC lab test codes, there is no reason anymore for further developing laboratory test codes. Although this is obvious, the CT team has added 188 (hundred and eighty eight) new lab test codes.

Essentially, the value of LBTESTCD corresponds to the "first dimension" of the LOINC name for lab tests, i.e. the "component/analyte" that is measured. If it is the same thing, why don't we at CDISC just align that with the one from LOINC? Wouldn't that be much better?
Some arguments that might apply:
a) LOINC "component" can have more than 8 characters, CDISC-LBTESTCD is limited to 8 characters (Oh my God!)
b) We don't know LOINC (well: you should)
c) We always did it this way ...
d) Not invented here ...
e) Not in LOINC yet

In the latter case, it would be much, much better to do a "new term request" at LOINC, and not at CDISC. The process is similar to that for CDISC-CT, and the speed with which a new term can be approved is about the same, but the quality control at LOINC (taken care of by the Regenstrief Institute) is much, much better.

For the very first of the "new" lab test codes ("AMBRBTL" = "Amobarbital") I could easily find a LOINC code, for example LOINC code 72399-9 "Amobarbital [Mass/volume] in Blood by Confirmatory method". So why not just take the LOINC "component" "Amobarbital" as the new LBTESTCD? Oh yes, it is more than 8 characters ...

Sunday, April 29, 2018

V8 is not a solution!

The CDISC Europe Interchange in Berlin last week was a great success.
There was however one major disappointment: during the classic "Regulatory Authorities Update" session at the start of the conference, Ronald (Ron) Fitzmartin (FDA-HHS) stated (by Skype) that the FDA is considering SAS Transport 8 (V8) as a replacement for SAS Transport 5, that they have performed first tests, and that they want to try to adapt their systems for V8 by the end of the year.

If the FDA replaces V5 of the SAS Transport format (also named "XPT format") by V8, this would be a major disaster. It would throw us back by at least 10 years and block any innovation for the next 20 years.

Moving to SAS Transport 8 (V8) would have the following advantages:

  • It would lift the 8-, 40-, and 200-character limitations (but replace them with new ones)
That is the whole list of advantages; I couldn't find any others.

And this is the list of disadvantages:

  • V8 is "two-dimensional" (tabular) data, without the possibility to express relations between data in the table. These will still require the notorious "RELREC" table. Moving to a world where data is linked and the linking is precisely described is impossible with V8 (as it was with V5)
  • V8 is "US-ASCII only". Just like V5, there is no support for any other characters, not even for "Spanish" characters like "á, é, í, ó, ú, ü, ñ, ¿". We (and also the FDA) need to take into account that 45 million people in the USA have Spanish as their first language. Just like V5, V8 also does not support Japanese or Chinese characters. As the Japanese PMDA usually follows the FDA in its requirements, this would also block PMDA (and other regulatory authorities) in modernization.
  • V8 is NOT vendor-neutral. Although the specification has been published, it is UNUSABLE. The reason is that numeric values still must be written in the extremely old "IBM mainframe representation", which is NOT used by modern computers. The published specification only provides some C-code for the conversion between native floating-point numbers and the outdated IBM representation. This C-code is unusable in modern computer languages like Java, C# and Python (the latter coming up for machine-learning software). SAS has not published any source code for these modern languages for the conversion, and will probably not do so, as the SAS Institute is NOT encouraging V8 for exchange of data in clinical trials (see further). This means that vendors that use modern computer languages and do not want to use SAS are put at a serious disadvantage.
    Be aware that the FDA is, by US law, mandated to be vendor-neutral. It wasn't in the past, and when mandating V8, it will also not be in the future.
  •  V8 has no future. It is just a "duct tape" on V5, repairing some of the limitations. But it is not a modern format either. It is not used by any other industry as a standard format. For V5, the clinical research industry is the only one using it. This should also not be the case for V8.
  • I expect that V8 would further increase the file sizes. Just like V5, V8 is a "fixed record length" format. This means that one needs to define the length (in bytes) for each column/variable in advance, and that every cell in each record will then fully take that length in bytes. For example, if one has a comment in an SDTM "CO" dataset that is 700 characters in length, all other comments will also require these 700 bytes, even when the comment is only 10 characters long. The remaining 690 bytes must then be filled with blanks, which is an enormous waste. Like this, especially "CO" and "SUPPxx" datasets may take up several times the file size they do now. "File size" was the only main issue the FDA encountered when doing the Dataset-XML pilot in 2014. For CO and SUPPxx datasets however, the file size of a Dataset-XML file is usually smaller than that of the V5 file!
    So, when moving to V8, FDA will "shoot itself in the foot", as file sizes will further increase, and especially for the above mentioned datasets, it will obtain just the inverse of what it wishes, as moving to XML (or JSON) would decrease the file sizes of SUPPxx and CO records.
  • V8 is not "extensible". All modern formats (XML, JSON, Turtle-RDF) have been created to take up any type of data in any type of language (as these use Unicode). XML stands for "extensible markup language", and JSON and Turtle work exactly the same way. These modern formats make it possible to start linking to "real world data" (RWD), data from electronic health records, data from the "internet of things" (IOT) world, etc. There is no way to do this using V8, as V8 is just 2-dimensional data in US-ASCII format only without any extension possibility.
  • There is no native mechanism with V8 for support of audit trails, referencing source data, etc. More and more, reviewers will be asking for "show me the source data" in the future. With V8, there is no mechanism at all for doing that. Using XML, JSON or any of the RDF serializations, this is easily possible, as has already been demonstrated several times before. Especially "linked data" makes it possible to provide the 2-dimensional SDTM/SEND/ADaM data to the reviewer, and at the same time keep the link to the source data, as was recently demonstrated by Dave Iberson-Hurst at the CDISC European Interchange 2018 in Berlin.
  • V8 does not have native support for images or audio. The modern formats mentioned do have this native support through the "base-64-binary" datatype, already supported by CDISC ODM and thus also by Define-XML and by Dataset-XML.
  • There is no standardized, open query language for V8. XML, JSON and RDF all have their highly standardized and completely open query languages (XQuery, JSON Query, SPARQL). There is no such open and free query language for V8. Again, vendors who do not use SAS or do not want to use it are discriminated against, which is a violation of the mandated vendor neutrality of the FDA.
  • Even the US "Library of Congress" discourages the use of V5 and V8 for data transport:

  • In order to be able to use the only free "viewer" that supports V8, one needs to register with e-mail address and a password on the SAS Support website. Is this vendor-neutral?
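
The file-size argument for "fixed record length" formats in the list above can be checked with simple arithmetic. The sketch below only counts the bytes needed for the comment text itself, assuming one byte per character (US-ASCII) and ignoring headers and, for variable-length formats, any markup overhead. The numbers of records are made up for the illustration.

```python
def fixed_width_bytes(lengths):
    """Every cell is padded with blanks to the length of the longest value."""
    return max(lengths) * len(lengths)


def variable_length_bytes(lengths):
    """Every cell only takes the bytes of its own value."""
    return sum(lengths)


# One comment of 700 characters and 999 comments of 10 characters.
comment_lengths = [700] + [10] * 999

fixed = fixed_width_bytes(comment_lengths)
variable = variable_length_bytes(comment_lengths)
print("fixed width:    ", fixed, "bytes")
print("variable length:", variable, "bytes")
print("blank padding:  ", fixed - variable, "bytes")
```

In this made-up case, a single long comment forces 690 bytes of blanks into each of the 999 short records.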

During his statement at the CDISC Europe Interchange last week in Berlin, Ronald Fitzmartin stated that "V8 is recommended by Phuse". I was a bit surprised by that, and so started looking for any documentation that reflects this. I didn't find anything. The latest document from the "Phuse Alternative Transports" working group doesn't even mention V8. If you know of such a document by Phuse recommending V8, please let me know, and I will add the link here.
I do know that there was a presentation about V8 at an FDA "public meeting" in 2012 (6 years ago!) where a SAS employee presented V8, but it is doubtful whether this represents the official point of view of the SAS Institute.

Interesting in this respect is an article by the SAS Institute from 2001 (!!!), referenced in many articles, blogs and tweets of those people at SAS who are really involved in standards for clinical research (a strongly recommended read), where in the section "future directions", recommending XML, it is stated (quote):

"The XPORT format does not allow for extensibility ", and "At first glance, XML looks a little wordy and verbose. The size need not be a problem because public domain ZIP software is readily available that can compress XML nicely.", followed by: "XML can bridge incompatibilities of computer systems and vastly improve web applications, basically because XML tags say what the information is, not what it looks like. This facilitates more precise declarations of content and more meaningful search results across multiple platforms. Once data has been located, it can be handed off to other applications for further processing and viewing."

This was in 2001. We have not made any progress since then. There was a pilot in 2014 testing the CDISC Dataset-XML format (an open and free standard). The pilot was strongly supported by CDISC and by SAS. Lex Jansen of SAS (also a CDISC volunteer) spent considerable time helping the FDA reviewers implement the standard in their systems.
All seemed very successful until the FDA pilot report was published.
Surprisingly, the report "repeated" many of the things we did already know for many many years, such as that XML can handle more than 200 characters and that XML can be imported into SAS without data and information loss (this is publicly known for over 20 years!). The major issue that was reported however was (quote): "Based on the file size observations, DS-XML produced much larger file sizes than XPORT, which may impact the Electronic Submissions Gateway (ESG) and may lead to file storage issues". 

When we then made the argument about zipping (XML can very efficiently be zipped to about 3% or less), we got the answer that zipped files were not acceptable, and when we then asked why, the answer was very vague, essentially saying "because we always did it that way".
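
Anyone can try the zipping argument for themselves. Below is a minimal sketch using Python's standard `zlib` module (the same "deflate" algorithm that ZIP uses). The element and attribute names are made up and only loosely resemble a dataset in XML, so this is not valid Dataset-XML, and the compression ratio you get will depend entirely on your own data.

```python
import zlib

# Build a repetitive, XML-like document with 1000 made-up records.
rows = []
for i in range(1, 1001):
    rows.append(
        f'<Record seq="{i}">'
        f'<Item name="STUDYID">STUDY001</Item>'
        f'<Item name="DOMAIN">CO</Item>'
        f'<Item name="COVAL">Comment number {i}</Item>'
        f'</Record>'
    )
xml_text = "<Dataset>" + "".join(rows) + "</Dataset>"

raw = xml_text.encode("utf-8")
packed = zlib.compress(raw, 9)       # 9 = best compression

print("uncompressed:", len(raw), "bytes")
print("compressed:  ", len(packed), "bytes")
print(f"ratio:        {len(packed) / len(raw):.1%}")

# Compression is lossless: unpacking gives back exactly the same bytes.
assert zlib.decompress(packed) == raw
```

The repeated tags that make XML look "wordy" are exactly what a compressor removes most efficiently, and the receiver gets back every byte unchanged.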

Since then, nothing has happened regarding Dataset-XML. Why? What did we do wrong that we failed to convince the FDA to use it? I will write a few things down in my following blog entry (somewhere in the next days).

SAS Transport Version 8 (V8) is very probably the worst choice as a successor for SAS Transport 5. It is just "duct tape" on V5 and a "dead end road", blocking any innovation for the next 10 years.
In my opinion, the pharmaceutical industry and the whole CDISC community should strongly and loudly protest against using it as a possible format for electronic submissions.

If you too feel that V8 should NOT be considered, please tell the FDA loudly and clearly. For example, you could send a short mail to Ronald Fitzmartin at FDA-HHS (I cannot publish the mail address here, but one can find it easily on the internet) or to any of your FDA contacts.

Our submissions deserve better than a "new", but already outdated exchange format.