Saturday, September 2, 2017

Alexa, can I put mm[Hg] in VSORRESU?

This blog entry is an extension to the discussions on the Pinnacle21 forum and a discussion on the “LinkedIn SDTM Experts” forum regarding the standardization of original result units, the controversial CDISC UNIT codelist, and the use of UCUM notation.

We live in an era where artificial intelligence (AI) and machine learning (ML) are booming. But in order to apply them to CDISC SDTM, we need an information basis. This basis should probably be the SDTM-IG. Unfortunately, the SDTM-IGs are all published as PDF documents, which are essentially not machine-interpretable. This makes the information very hard to implement in software and, for example, to use as input for AI.

For example, suppose I work with electronic health records (EHRs), from which I get the blood pressure readings with the unit mm[Hg] (UCUM notation, THE international standard for unit notation). If I want to know whether I should copy mm[Hg] into VSORRESU, or whether I need to replace it with the CDISC notation mmHg from the CDISC UNIT codelist, I must (as a human) fight my way through the SDTM-IG and hope to find an answer there.

If we look at the different versions of the SDTM-IG, we observe that the published (PDF) document grows considerably with each new version. For example, SDTM-IG v.3.1.2 (2008) has 298 pages, whereas version 3.2 (2013) has 398 pages, excluding the IGs for medical devices (an additional 60 pages) and for associated persons (an additional 31 pages). When I first took the 2-day SDTM course a number of years ago, the trainer (Peter VR) could explain everything (including all domains and the “assumptions”) in these 2 days. I recently asked him whether this is still possible nowadays, and he answered that it is not: the training can only treat the principles, and not even all the domains. The complexity has also increased and, in my opinion, the “learning curve” has steepened considerably with each new version of the IG, making it very hard for beginners to start working with SDTM. I sometimes think about asking our university to set up a master's degree in CDISC-SDTM, with an additional study year added for each newly published version of the SDTM-IG ...

At the same time, the SDTM validation software used by the FDA (developed by a company with no ties to CDISC) has contributed to the confusion by defining validation rules that are over-interpretations of the SDTM-IG, are just wrong, or throw a large number of false positives.
The “SDTM-IG Conformance Rules” published by CDISC itself were a great step forward, but essentially came too late: they should have been published together with the SDTM-IG itself or, even better, as part of the IG. These rules have also been implemented under the “OpenRules for CDISC Standards” initiative in a modern, completely open format.

SDTM and artificial intelligence (AI)
Instead of needing to have a “master's degree in SDTM”, wouldn’t it be better if one (or even better, our computer programs) could just ask something like “Alexa, can I put mm[Hg] in VSORRESU?”? Alexa would then probably answer something like “No, Jozef, you need to replace mm[Hg] by mmHg, as both are semantically the same and the latter is part of CDISC controlled terminology and the former is not, and CDISC unfortunately does not allow UCUM notation yet”.

What would such an SDTM-AI system need to be able to answer such questions? First, it needs to know that “mm[Hg]” and “mmHg” are semantically identical. However, “mm[Hg]” isn’t even listed in the CDISC controlled terminology as a “synonym”, even though it is a term from the internationally standardized notation used in science, healthcare, engineering, and business. Secondly, it would require that the SDTM-IG is in a machine-readable format (PDF isn’t), with clearly defined rules (if possible machine-interpretable, like in the CDISC “SDTM-IG Conformance Rules”).
A first simple proposal for a machine-readable SDTM-IG has been made in the past, but this proposal seems to have gone almost unnoticed by the SDTM team (SDTM team members: please correct me if I am wrong!), and the next version of the IG (likely to have 500 pages or more?) will be published as … PDF. A request to also publish the next SDTM-IG as XML has unfortunately been turned down by the CDISC SDTM team.

In order for the "SDTM-Alexa" (SDTM-AI system) to provide an answer to the question whether "mm[Hg]" (from the EHR) can be put into VSORRESU, the system needs to find the guidance in the SDTM-IG. Here it is (with many thanks to Carlo R for looking it up and bringing it up in the discussion), from “Assumption 7” in the LB (“laboratory”) section of the SDTM-IG:

essentially stating that one should first check whether one's own term/unit (in this case “mm[Hg]”) is listed as a “synonym” for “something else” in the CDISC controlled terminology, and if it is not found, a “new term request” should be submitted to CDISC.
Honestly, the latter is not a real option, as this process usually takes 6 months or more, and if the request is turned down, zero progress is made.

In our earlier proposed prototype of an SDTM-IG in XML, each Assumption is its own XML element instance, for example:

Although structured, this doesn’t make it machine-executable, nor suitable for AI. In order to make it usable for AI or ML, we need a machine-executable expression or an algorithm, which could look like:

a) Submit the “suspected synonym” to a web service (or other system) that checks whether the value (“mm[Hg]” in this case) has been published by CDISC-CT as a synonym and, if so, of which term it is a synonym.
b) If the answer is “no” (or “null”), automatically make a request to the NCI 

As the latter is not really an option, b) could be replaced by:

c) Extend the “UNIT” codelist in the define.xml with one's own term, and put that term in VSORRESU.

For step a) I created a RESTful web service this morning which is documented at:
If one submits our example mm[Hg] to the RESTful web service, one obtains:

The response is empty, meaning that “mm[Hg]” is not found to be a “synonym” of anything. This is rather strange, as this is the mandatory notation in EHRs, but this does not seem to be honored yet by the CDISC-CT team.

Another example would be that we have measured a concentration in “mol/m3”, for which we then submit a request to the RESTful web service, with the result:

stating that there IS a synonym for mol/m3 and that we need to replace it by mmol/L in LBORRESU.
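The lookup-then-fallback flow of steps a) and c) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names and the two small tables are hypothetical stand-ins, the tables contain nothing but the units mentioned in this post, and a real system would query the web service and the full CDISC controlled terminology instead.

```python
from typing import Optional, Set

# Hypothetical stand-ins for CDISC controlled terminology (assumption:
# a real system would query a service instead of hard-coded tables).
CDISC_UNITS = {"mmHg", "mmol/L"}            # units named in this post
PUBLISHED_SYNONYMS = {"mol/m3": "mmol/L"}   # synonym example from this post


def lookup_synonym(unit: str) -> Optional[str]:
    """Step a): return the CDISC term the unit is a synonym of, or None."""
    return PUBLISHED_SYNONYMS.get(unit)


def resolve_orresu(unit: str, extended_terms: Set[str]) -> str:
    """Return the value to store in --ORRESU for a collected unit."""
    if unit in CDISC_UNITS:
        return unit  # already a CDISC term, nothing to do
    cdisc_term = lookup_synonym(unit)
    if cdisc_term is not None:
        return cdisc_term  # published synonym: use the CDISC term
    # Step c): no synonym published, so keep the own term and remember
    # that the UNIT codelist in the define.xml must be extended with it.
    extended_terms.add(unit)
    return unit


extensions: Set[str] = set()
for collected in ("mol/m3", "mm[Hg]", "mmHg"):
    print(collected, "->", resolve_orresu(collected, extensions))
print("codelist extensions:", sorted(extensions))
```

With these stand-in tables, “mol/m3” is replaced by its CDISC term, whereas “mm[Hg]” is kept and flagged as a codelist extension, mirroring the two web service responses discussed above.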

So, all that would be needed is that such an algorithm is expressed as a machine-readable expression and added as such (probably by using a child element) to the “Assumption” element in the XML version of the SDTM-IG.
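As a rough illustration of the child-element idea, the sketch below attaches a rule to an “Assumption” element using Python's standard xml.etree.ElementTree module. All element and attribute names (MachineReadableRule, Step, Variable, Order, Action, Domain, Number) and the action labels are hypothetical placeholders invented for this example; they are not taken from the actual prototype.

```python
import xml.etree.ElementTree as ET
from typing import List


def attach_rule(assumption: ET.Element, variable: str, steps: List[str]) -> ET.Element:
    """Append a machine-readable rule as a child of an Assumption element.

    Element and attribute names are hypothetical placeholders.
    """
    rule = ET.SubElement(assumption, "MachineReadableRule", {"Variable": variable})
    # Number the steps so that software can execute them in order.
    for order, action in enumerate(steps, start=1):
        ET.SubElement(rule, "Step", {"Order": str(order), "Action": action})
    return rule


# Assumption 7 of the LB section, as referred to in the text.
assumption = ET.Element("Assumption", {"Domain": "LB", "Number": "7"})
attach_rule(assumption, "LBORRESU", ["lookup-synonym", "extend-codelist-if-not-found"])
print(ET.tostring(assumption, encoding="unicode"))
```

The point of the design is that the human-readable assumption text and its executable counterpart live in the same element, so they cannot drift apart between IG versions.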

Some other rules or assumptions that could easily be implemented in a machine-readable SDTM-IG, and could then be used for AI, are things like “--DY values are not allowed to be 0”, so that a system could answer questions like “Alexa, can VSDY be 0?”
This is just one of the first ideas I have for first coming to a machine-readable SDTM-IG and then to smart SDTM systems using artificial intelligence or machine learning. This would not only greatly help to flatten the very steep (and with each IG version steeper) learning curve, but also make it possible to automate mapping steps that are now done manually and, maybe even more important, help to avoid the many different interpretations of the SDTM-IG.
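The --DY rule is a good example of something that is trivial to express in code. The sketch below assumes the usual SDTM study day convention (the reference start date is day 1, the day before it is day -1, so there is no day 0); the function names are made up for illustration.

```python
from datetime import date


def study_day(event: date, reference_start: date) -> int:
    """Compute a --DY value relative to the reference start date."""
    delta = (event - reference_start).days
    # On or after the reference date we add 1, so the reference date
    # itself is day 1; earlier dates keep their negative difference.
    return delta + 1 if delta >= 0 else delta


def is_valid_dy(value: int) -> bool:
    """Answer the 'can VSDY be 0?' question: any integer except 0 is allowed."""
    return value != 0


reference = date(2017, 9, 2)
print(study_day(date(2017, 9, 2), reference))  # reference date itself
print(study_day(date(2017, 9, 1), reference))  # one day earlier
print(is_valid_dy(0))
```

Because the calculation jumps from -1 to 1, a conforming system never produces 0, and a validator only needs the one-line check.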

It does, however, require that the SDTM development team moves away from generating the SDTM-IG from Word documents (CDISC-JIRA may be of help here) towards highly structured content (which can be done using a database), and that the team allows specialists from other domains (XML, AI, ...) to work with them and have a voice in the development.

We have self-driving cars, but for SDTM, we still rely on 30-year-old technology. It is high time that we do something about this.


