Working on and with CDISC Standards

Saturday, September 2, 2017

Alexa, can I put mm[Hg] in VSORRESU?

This blog entry is an extension to the discussions on the Pinnacle21 forum and a discussion on the “LinkedIn SDTM Experts” forum regarding the “standardization” of “original result units”, the controversial CDISC “UNIT” codelist, and the use of UCUM notation.

We live in an era where artificial intelligence (AI) and machine learning (ML) are booming. But in order to apply them to CDISC SDTM, we need an information basis. This basis should probably be the SDTM-IG. Unfortunately, the SDTM-IGs are all published as PDF documents, which are essentially not machine-interpretable. This makes the information very hard to implement in software and, for example, to apply AI to it.

For example, when I work with electronic health records (EHRs), for which I get the blood pressure readings with the unit “mm[Hg]” (UCUM notation – THE international standard for unit notation), and I want to know whether I should copy “mm[Hg]” into VSORRESU, or whether I need to replace it with the CDISC notation “mmHg” from the CDISC “UNIT” codelist, I must (as a human) start “fighting” myself through the SDTM-IG and hope to find an answer there.

If we look at the different versions of the SDTM-IG, we observe that the published (PDF) document grows considerably with each new version. For example, SDTM-IG v.3.1.2 (2008) has 298 pages, whereas version 3.2 (2013) has 398 pages, excluding the IGs for medical devices (an additional 60 pages) and for associated persons (an additional 31 pages). When I first took the 2-day SDTM course a number of years ago, the trainer (Peter VR) could explain everything (including all domains and the “assumptions”) in these 2 days. I recently asked him whether this is still possible nowadays, and he answered that it is not: the training can only treat the principles, and not even all the domains. The complexity has also increased, and in my opinion, the “learning curve” has steepened considerably with each new version of the IG, making it very hard for beginners to start working with SDTM. I sometimes think about asking our university to set up a master’s degree in CDISC-SDTM, with an additional study year for each newly published version of the SDTM-IG ...

At the same time, the SDTM validation software used by the FDA (developed by a company with no ties to CDISC) has contributed to the confusion by defining validation rules that are over-interpretations of the SDTM-IG, are simply wrong, or throw a large number of false positives.
The “SDTM-IG Conformance Rules” published by CDISC itself were a great step forward, but essentially came too late – they should have been published together with the SDTM-IG itself, and even better, as part of the IG. These rules have also been implemented under the “OpenRules for CDISC Standards” initiative in a modern, completely open format.

SDTM and artificial intelligence (AI)

Instead of needing to have a “master’s degree in SDTM”, wouldn’t it be better if one (or even better, our computer programs) could just ask something like “Alexa, can I put mm[Hg] in VSORRESU?”? Alexa would then probably answer something like “No, Jozef, you need to replace mm[Hg] by mmHg as both are semantically the same and the latter is part of CDISC controlled terminology and the former is not, and CDISC does unfortunately not allow UCUM notation yet”.

What would such an SDTM-AI system need to be able to answer such questions? First, it needs to know that “mm[Hg]” and “mmHg” are semantically identical. However, “mm[Hg]” isn’t even listed in the CDISC controlled terminology as a “synonym”, even though it is a term from the international standard notation used in science, healthcare, engineering, and business. Secondly, it would require that the SDTM-IG is in a machine-readable format (PDF isn’t), with clearly defined rules (if possible machine-interpretable, like in the CDISC “SDTM-IG Conformance Rules”).

A first simple proposal for a machine-readable SDTM-IG has been made in the past, but this proposal seems to have gone almost unnoticed by the SDTM team (SDTM team members: please correct me if I am wrong!), and the next version of the IG (likely to have 500 pages or more?) will be published as … PDF. A request to also publish the next SDTM-IG as XML has unfortunately been turned down by the CDISC SDTM team:

In order for the "SDTM-Alexa" (SDTM-AI system) to provide an answer to the question whether "mm[Hg]" (from the EHR) can be put into VSORRESU, the system needs to find the guidance in the SDTM-IG. Here it is (with many thanks to Carlo R for looking it up and bringing it up in the discussion), from “Assumption 7” in the SDTM-IG, LB (“laboratory”) section:

essentially stating that one should first check whether the own term/unit (in this case “mm[Hg]”) is listed as a “synonym” for “something else” in the CDISC controlled terminology, and if not found, a “new term request” should be submitted to CDISC.
Honestly, the latter is not a real option, as this process usually takes 6 months or more, and if the request is turned down, no progress has been made at all.

In our earlier proposed prototype of a SDTM-IG in XML, each “Assumption” is an own XML element instance, for example:

Although structured, this doesn’t make it machine-executable. In order to be able to use this in AI, we need a machine-executable expression or an algorithm, which could look like:

a) Submit the “suspected synonym” to a web service (or other system) that looks whether the value (“mm[Hg]” in this case) has been published by CDISC-CT as a synonym, and if so, for what it is a synonym.
b) If the answer is “no” (or “null”), automatically make a request to the NCI

As the latter is not really an option, b) could be replaced by:

c) Extend the “UNIT” codelist in the define.xml with the own term, and put the “own term” in VSORRESU.
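
The steps above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not the implementation behind any existing service: `resolve_unit`, `lookup_synonym` and the small in-memory tables are hypothetical names and data for illustration, and the synonym lookup is passed in as a function so that any source can be plugged in. The first check (term already in the codelist) is an extra guard that is not one of the steps listed above.

```python
from typing import Callable, Optional, Set, Tuple


def resolve_unit(
    term: str,
    codelist: Set[str],
    lookup_synonym: Callable[[str], Optional[str]],
) -> Tuple[str, bool]:
    """Return (value to put in --ORRESU, whether the codelist must be extended).

    codelist: submission values of the "UNIT" codelist (assumed to be a plain set).
    lookup_synonym: hypothetical callable returning the CDISC term for which
    `term` is a published synonym, or None when no such term exists.
    """
    # Extra guard (assumption): the term already is a CDISC submission value.
    if term in codelist:
        return term, False
    # Step a): is the term a published synonym of a CDISC term?
    cdisc_term = lookup_synonym(term)
    if cdisc_term is not None:
        return cdisc_term, False
    # Step c): nothing found, so keep the own term and extend the
    # "UNIT" codelist in the define.xml.
    return term, True


# Illustrative data only; the dictionary stands in for a real synonym lookup.
unit_codelist = {"mmHg", "mmol/L"}
known_synonyms = {"mol/m3": "mmol/L"}

print(resolve_unit("mol/m3", unit_codelist, known_synonyms.get))  # ('mmol/L', False)
print(resolve_unit("mm[Hg]", unit_codelist, known_synonyms.get))  # ('mm[Hg]', True)
```

Passing the lookup as a parameter keeps the decision logic testable without any network access.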

For step a) I created a RESTful web service this morning which is documented at:
http://xml4pharmaserver.com/WebServices/#sdtmtermsynonym

If one submits our example “mm[Hg]” to the RESTful web service (http://www.xml4pharmaserver.com:8080/CDISCCTService/rest/SDTMTermFromSynonym/mm%5BHg%5D), one obtains:

The response is empty, meaning that “mm[Hg]” is not found to be a “synonym” of anything. This is rather strange as this is the mandatory notation in EHRs, but this does not seem to be honored yet by the CDISC-CT team.
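
The request URL used above can also be built programmatically. Here is a minimal Python sketch that assumes nothing more than the URL pattern shown in this post; `synonym_url` is a hypothetical helper name, and because the response format is not reproduced here, the sketch stops at constructing the URL.

```python
from urllib.parse import quote

# Base address of the RESTful web services, as quoted in this post.
BASE = "http://www.xml4pharmaserver.com:8080/CDISCCTService/rest"


def synonym_url(term: str) -> str:
    """Build the SDTMTermFromSynonym request URL for a candidate synonym."""
    # safe="" makes quote() percent-encode every reserved character,
    # including the square brackets in UCUM terms such as mm[Hg].
    return f"{BASE}/SDTMTermFromSynonym/{quote(term, safe='')}"


print(synonym_url("mm[Hg]"))
```

For units containing a slash the same encoding applies, but how the service treats an encoded slash is not stated here, so that case is left open.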

Another example would be that we have measured a concentration in “mol/m3”, for which we then submit a request to the RESTful web service, with the result:

stating that “mol/m3” IS a published synonym and that we need to replace it by “mmol/L” in LBORRESU.

So, all that would be needed is that such an algorithm is expressed as a machine-readable expression, and add it as such (probably by using a child element) to the “Assumption” element in the XML version of the SDTM-IG.

Some other “rules” or “assumptions” that could easily be implemented in a machine-readable SDTM-IG and then be used for AI are things like “--DY values are not allowed to be 0”, so that one could ask a system questions like “Alexa, can VSDY be 0?”
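
Such a rule is easy to express as code. The Python sketch below follows the usual SDTM study-day convention (the day of the reference start date is day 1, the day before it is day -1), which is why 0 can never occur. `study_day` and `is_valid_dy` are hypothetical names, and complete calendar dates are assumed.

```python
from datetime import date


def study_day(obs: date, ref_start: date) -> int:
    """Study day of `obs` relative to the reference start date (never 0)."""
    delta = (obs - ref_start).days
    # On or after the reference start date, counting starts at 1;
    # before it, the plain (negative) difference is used.
    return delta + 1 if delta >= 0 else delta


def is_valid_dy(value: int) -> bool:
    """A --DY value of 0 violates the rule; any other integer passes."""
    return value != 0


ref = date(2014, 1, 2)
print(study_day(date(2014, 1, 2), ref))  # 1
print(study_day(date(2014, 1, 1), ref))  # -1
print(is_valid_dy(0))                    # False
```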

This is just one of the first ideas I have for first coming to a machine-readable SDTM-IG and then to “smart SDTM systems” using “artificial intelligence” or “machine learning”. This would not only greatly help to flatten the very steep learning curve (which becomes steeper with each IG version), but also allow automating mapping steps that are now done manually and, maybe even more important, help to avoid the many different interpretations of the SDTM-IG.

It however requires that the SDTM development team moves away from generating the SDTM-IG from Word documents (CDISC-JIRA may be of help here) towards highly structured content (which can be done using a database), and that the team allows specialists from other domains (XML, AI, …) to work with them and have a voice in the development.

We have self-driving cars, but for SDTM, we still rely on 30-year-old technology. It is high time that we do something about this.

Wednesday, August 9, 2017

Implementing SDTM 1.5 in software: first impressions

Last Monday, due to a short break in my vacation (thunderstorms and mountaineering do not fit well together), I started implementing SDTM 1.5 in our popular SDTM-ETL mapping software.
The reason is that some of our customers want to start working with SEND 3.1, which is the first implementation of SDTM 1.5. Note that there is no SDTM-IG yet based on SDTM 1.5, only a SEND-IG.

What were the difficulties encountered? What were my first impressions on how easy or difficult it is to implement SDTM 1.5 in software?

First of all, there is no very good (i.e. a define-XML template) machine-readable version of SDTM 1.5. There is an Excel file available from SHARE, with a list of the variables, and an Excel list of differences with version 1.4. From the former, I could generate an XML file with variables, which I could use for the automated generation of the "CDISC Notes" in the software, and also help me somewhat generating a SEND define.xml 3.1 template for SDTM-ETL.
All the other things had to be done using the "good old" methods, i.e. copy-and-paste from the PDF documents. As I am not paid by the hour, you can guess that I didn't like this too much.

Once the SEND 3.1 define.xml template was generated, I could start on the nitty-gritty details. They require carefully reading the specification or IG, interpreting what is written there, and programming it in the software. Interpreting a specification is always dangerous, as can be seen from the very many false positives generated by the validation software used by the FDA (No, not generated by our company).

The first problem I encountered is that the list of SDTM variables that are "never used in SEND" (see SEND-IG 3.1) does not come as machine-readable information. So, copy-and-paste was necessary.
SDTM-IG 3.2 (based on SDTM 1.4) provided a list of "not generally used" variables. As there is no SDTM-IG yet based on SDTM 1.5, I did not implement this yet, and just copied the list from 1.4 for the moment.

The new "--LOBXFL" ("Last Observation Before Exposure Flag") which I already criticized in the past, as it essentially is a derived variable (derived variables do not belong in SDTM), is something I already implemented in the software a few months ago, as I realized it has a major impact on the software. The user can now choose between generating/writing a mapping script himself, or to auto-generate the values during SDTM generation execution. The latter then requires an extra step, as the generated SDTM data needs to be ordered by subject and test, and compared with RFXSTDTC which is in another dataset. It must also be said that the text in the SDTM 1.5 specification is very undetailed. It says "Operationally-derived indicator used to identify the last non-missing value prior to RFXSTDTC. Should be Y or null." It doesn't state anything about whether this is "per unique test". I presume it is (opening the discussion again about what a "unique test" is). I strongly believe standards specification should be exact and precise. The definition of "--LOBXFL" in SDTM 1.5 isn't.
Also remark that "--LOBXFL" is not even mentioned in the SEND-IG 3.1.

New in SDTM 1.5 is also the "Domain-Specific Variables for the General Observation Class" (see p. 23 of the specification). Although I understand the reasons for these, SDTM was always sold to us as containing "generic" variables, applicable to all kinds of clinical research data. I never believed in that concept. One of the reasons is surely that SDTM still wants to represent everything as 2-dimensional tables, although we all know that "the world is not flat and neither is clinical data".
As this (fortunately) short list of variables is only in the PDF, it required some extra programming with another copy-and-paste activity.

Unfortunately, the list also contains an error, or at least a severe unclarity. It states that "EXMETHOD" is such a domain-specific variable, stating "these variables are for use only in a specific domain ...". If we take this literally, this would mean that e.g. EGMETHOD and LBMETHOD are not allowed anymore. Really?
Or was it meant that "--METHOD" may only be used in combination with "EX" in the "Interventions" class? That's my interpretation so far. But specifications shouldn't be open to different interpretations, should they?

A lesser problem is that the SDTM 1.5 specification also contains two new domains which i.m.o. should only appear in the SDTM-IG: "Subject Disease Milestones" and "Trial Disease Milestones" (TM). For the former, I couldn't even find the two-character domain abbreviation, so how could I implement this? I didn't care too much for now, as these two domains do not appear in the SEND-IG 3.1, so I need to wait until the new SDTM-IG is published.

Friday, June 16, 2017

SDTM, XPT and the constitution

Imagine that the constitution of your country would state "cars must be powered by gasoline".
Would you find that acceptable?
Now, we all know that most cars are powered by gasoline, but such a statement in the constitution would give electrical cars no chance at all, even when these are more friendly to the environment.

Something very similar happens at CDISC: the new SDTM Model v.1.6 (so not the Implementation Guide) has been written with only 1 implementation in mind: SAS-XPT format.

Standards models should be developed and published independent of the transport format. A very good example is HL7-FHIR for which there are three technical implementations: XML, JSON and RDF. The documentation has been published in a transport format independent way. It is only when you go to the examples (which you can consider as an Implementation Guide) that you will see something about the transport format.

So, as part of the public review of the SDTM Model v.1.6, I asked the SDTM team to change the text of the model in such a way that it is transport format neutral. This would then allow other transport formats such as XML (e.g. Dataset-XML), JSON and RDF for porting SDTM data in the future.

My request was turned down.
Here is the justification of the SDTM team (snapshot from the JIRA site):

"Considered for future" is the usual expression of the team for "refused".

This answer is indeed a "doom loop": it gives the FDA a reason for further refusing to allow a modern format. When asked about it, they can then say "we can't do that, it is not allowed by the SDTM model".

Over the last 5-10 years, I have observed that the SDTM model and standard have evolved in such a way that all first principles have been thrown away, such as avoidance of data redundancy, no derived data, and separation between model and implementation. This makes the standard more and more difficult to implement and ruins data quality. Essentially, one can say that it has been steered into a "dead end".

How can this be changed?
I must honestly say that I do not know the answer. "The train has left the station, but is it on the right track?" is a question that is not even posed within CDISC, and especially not within the SDTM team. Maybe the team needs some strong guidance itself, or responsibilities must be reassigned. There are some bright, progressive people within CDISC, but these are not involved in SDTM. Maybe it is time to give them the lead in SDTM development.

Friday, April 7, 2017

--LOBXFL can seriously damage your health

The addition of the new variable --LOBXFL (Last Observation before Exposure Flag) in SDTM 1.5 remains a controversial topic (as discussed here and here). According to the definition, --LOBXFL is "operationally derived", but the SDTM 1.5 specification does not say "how" it should be derived. There have been several complaints about this during the review period, but they were waved away with the argument that they "should be addressed in any implementation guide". I am curious ...
My own request to "please provide guidance" was answered by:

which I don't understand...

Now you may ask why I am so concerned about the addition of this new "derived" variable. Here are some issues:

derived variables should not appear in SDTM. Again, the SDTM team has given in on a request from the FDA caused by primitive and immature review tools used by some FDA reviewers
baseline flags should not appear in SDTM - they belong to ADaM
sponsors should not be asked to do the work of FDA reviewers - the latter have to make their own decisions of which of the data points is "the" baseline data point.
Assigning --LOBXFL is used to "camouflage" bad data quality. SDTM datasets with bad data quality should not be used in submissions and should not be accepted by the FDA.

Let me give an example.

The following is a snapshot of a VS dataset with measurements done on the date "2014-01-02" which is also the date of first exposure (according to EX, and to RFXSTDTC in DM - another unnecessary derived variable) using the open source "Smart Dataset-XML Viewer":

(remark that some columns have been swapped for better visibility)

According to the protocol, all vital signs measurements during this visit must be done before first drug intake. So the sponsor assigned the VSLOBXFL to the diastolic blood pressure measurement with the value "76". What the sponsor however doesn't know is that the researcher did the measurement immediately after the intake of the medication. As is however too often the case, only the date (for the measurement as well as for the drug intake) was recorded, not the exact time.
Of course, the sponsor could also have assigned VSLOBXFL to all three measurements on the date 2014-01-02, but as the standard does not specify "how" the derivation should be made ...
The same applies to the "PULSE" and "SYSBP" (systolic blood pressure) measurements:

If one inspects the data carefully, one will see that each of the "VSLOBXFL" records shows an increased value for that specific measurement. This increased value may have been caused by the intake of the study drug. However, this is neither visible nor detectable, as no times have been collected (as is very usual) for either the measurement or the drug intake. Even worse, the increased value is marked as the baseline value, which may mean that the reviewer, when looking at later data points, comes to the conclusion that the drug is lowering blood pressure and pulse, whereas it is exactly the opposite...

How does the "Smart Dataset-XML Viewer" deal with such a situation?

One of the options of the "Smart Dataset-XML Viewer" is:

When using it on a data point that is undoubtedly (as it is on another day) the last measurement (for a specific test code) before first exposure, the viewer highlights the record:

When however the measurement is on the same day as the first exposure, and either the time part of the measurement or of the first exposure is not provided, the "Smart Dataset-XML Viewer" will highlight the record or records and provide a warning:

I pointed the SDTM team to all this in an additional review comment, which was answered as:

to which I responded:

Also the FDA reviewers have free access to the "Smart Dataset-XML Viewer", so they could use it too. On the other hand, the algorithm can also easily be implemented in SAS or any other modern review software.
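
To illustrate how little is needed, here is a minimal Python sketch of such an algorithm. It is not the code of the "Smart Dataset-XML Viewer"; the function names, the record layout (`DTC`, `ORRES` keys) and the earlier measurement in the sample data are assumptions for illustration. It works for one subject and one test code at a time, and assumes ISO 8601 values without time zone offsets, so that they can be compared as plain strings.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, str]


def classify(dtc: str, rfxstdtc: str) -> str:
    """Classify an observation as 'before', 'after' or 'ambiguous'."""
    obs_date, ref_date = dtc[:10], rfxstdtc[:10]
    if obs_date != ref_date:
        # Different days: ISO 8601 dates compare correctly as strings.
        return "before" if obs_date < ref_date else "after"
    # Same day: without both time parts, the order cannot be determined.
    if "T" not in dtc or "T" not in rfxstdtc:
        return "ambiguous"
    return "before" if dtc < rfxstdtc else "after"


def last_before_exposure(
    records: List[Record], rfxstdtc: str
) -> Tuple[Optional[Record], List[Record]]:
    """Return (last record clearly before exposure, ambiguous records)."""
    usable = [r for r in records if r["ORRES"]]  # skip missing results
    before = [r for r in usable if classify(r["DTC"], rfxstdtc) == "before"]
    ambiguous = [r for r in usable if classify(r["DTC"], rfxstdtc) == "ambiguous"]
    last = max(before, key=lambda r: r["DTC"]) if before else None
    return last, ambiguous


# Hypothetical diastolic blood pressure records for one subject.
diabp = [
    {"DTC": "2013-12-19", "ORRES": "70"},
    {"DTC": "2014-01-02", "ORRES": "76"},
]
last, ambiguous = last_before_exposure(diabp, "2014-01-02")
print(last["ORRES"])                    # 70
print([r["ORRES"] for r in ambiguous])  # ['76']
```

The same-day record is reported separately instead of being silently flagged as baseline, which is exactly the case that deserves a warning.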

In conclusion, --LOBXFL is not only unnecessary, it also camouflages bad data quality. For reviewers, it is even potentially dangerous to rely on it, as demonstrated above.
With --LOBXFL, it is just waiting for the first patient having his/her health seriously damaged ...

Sunday, March 5, 2017

Validating SDTM labels using RESTful web services

About a month ago, I reported on my first experiences with implementing the new CDISC SDTM-IG Conformance Rules. I have now made considerable progress, having >60% of the rules implemented. These implementations are available for download and usage from here.

Today I want to elaborate a bit on how I implemented rule CG0303 "Variable Label = IG Label", using RESTful web services. Earlier implementations from others were based on copying/pasting the labels from the SDTM-IG and then hard-coding them in software. This not only means a lot of work, it is also error-prone, with the disadvantage that a software update is needed each time an error in the implementation is found. For example, if you search the forum of the validation software that the FDA is using for the wording "label mismatch", you will find many hits, especially about false positive errors. In some cases, one even gets an error on a label that looks 100% correct, but the software does not tell you what text it expects for the label. "Let the guessing begin"!
So we definitely need something better. Wouldn't it be better to use the SHARE content, load it into a central database, and query that database using a modern (easy-to-implement) RESTful web service?

That is exactly what we did. All SDTM-IG information (from different IG versions) and all CDISC controlled terminology that is electronically available was loaded into a database, and RESTful web services were developed to make them available to anyone, and to any application. These RESTful web services (over 30 of them) are described here. Adding a new service usually takes 1-2 hours, sometimes even less.

One of these services allows retrieving all necessary information for a given variable in a given domain for a given SDTM-IG version. The RESTful query string description is:

http://www.xml4pharmaserver.com:8080/CDISCCTService/rest/SDTMVariableInfoForDomainAndVersion/{sdtmigversion}/{domain}/{varname}

which is pretty self-explanatory. For example, to get all the information about the variable ECPORTOT in the domain EC for SDTM-IG 3.2, the query string is:

http://www.xml4pharmaserver.com:8080/CDISCCTService/rest/SDTMVariableInfoForDomainAndVersion/3.2/EC/ECPORTOT
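
An application would normally fill in this query string itself. The following Python sketch shows one way to do so; it relies only on the URL pattern given above, and `variable_info_url` is a hypothetical helper name.

```python
from urllib.parse import quote

# Base address of the RESTful web services, as quoted in this post.
BASE = "http://www.xml4pharmaserver.com:8080/CDISCCTService/rest"


def variable_info_url(sdtmig_version: str, domain: str, varname: str) -> str:
    """Fill in {sdtmigversion}/{domain}/{varname} of the query string."""
    # Each path segment is percent-encoded separately; plain values such
    # as "3.2" or "EC" pass through unchanged.
    segments = [quote(s, safe="") for s in (sdtmig_version, domain, varname)]
    return f"{BASE}/SDTMVariableInfoForDomainAndVersion/" + "/".join(segments)


print(variable_info_url("3.2", "EC", "ECPORTOT"))
```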

This service can now easily be used to validate labels in submissions, like in implementations of rule CG0303. Let's do so for a sample SDTM submission.
In our case, the SDTM submission resides in a native XML database (something the FDA SHOULD also do instead of messing around with SAS-XPT datasets). Here is the implementation of rule CG0303 in XQuery, an easy-to-learn language that is both human-readable and machine-executable (so the rules are 100% transparent):

In the first part, the XML namespaces are declared and the location of the define.xml for this submission is set (usually this will be done by passing these as parameters from within the calling application). Also the base of the RESTful web service is declared.

Here is the second part:

For each dataset in the submission (by iterating over all the dataset definitions "ItemGroupDef"), we get the domain name either from the dataset name or from the "Domain" attribute in the define.xml (goes into $domain), and then start iterating over all the variables declared for the current dataset:

The variable name is obtained, and the label taken from the define.xml (remark that when using SDTM in XML, the label is in the define.xml and NOT in the dataset itself - which follows the good practice of separating data from metadata). The web service is then triggered returning the expected label from the database (can be SHARE in future), and the actual and expected label are compared. Remark that for some variables, there will not be a label from the SDTM-IG, as the variable is just not mentioned in the SDTM-IG, although it is allowed for that domain. In that case, there is nothing to compare.
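
For readers who do not use XQuery, the same comparison can be sketched in Python. This is not the XQuery implementation discussed in this post, only an illustration under assumptions: the define.xml is assumed to use the ODM 1.3 namespace with labels in Description/TranslatedText, the `check_labels` function and the sample data are hypothetical, and the expected label is supplied by a stub instead of a web service call.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Optional

ODM = "{http://www.cdisc.org/ns/odm/v1.3}"  # assumed ODM 1.3 namespace


def check_labels(
    define_xml: str,
    expected_label: Callable[[str, str], Optional[str]],
) -> List[Dict[str, str]]:
    """Compare each variable label in a define.xml with the expected label."""
    root = ET.fromstring(define_xml)
    # Index all variable definitions by their OID.
    item_defs = {i.get("OID"): i for i in root.iter(f"{ODM}ItemDef")}
    findings = []
    for group in root.iter(f"{ODM}ItemGroupDef"):
        # Domain from the "Domain" attribute, else from the dataset name.
        domain = group.get("Domain") or group.get("Name")
        for ref in group.iter(f"{ODM}ItemRef"):
            item = item_defs.get(ref.get("ItemOID"))
            if item is None:
                continue
            name = item.get("Name")
            text = item.find(f"{ODM}Description/{ODM}TranslatedText")
            actual = (text.text or "").strip() if text is not None else ""
            expected = expected_label(domain, name)
            if expected is None:
                continue  # variable not in the IG: nothing to compare
            if actual != expected:
                findings.append({"domain": domain, "variable": name,
                                 "actual": actual, "expected": expected})
    return findings


# Hypothetical, heavily reduced define.xml with a deliberately wrong label.
SAMPLE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="S1"><MetaDataVersion OID="MDV1">
    <ItemGroupDef OID="IG.VS" Name="VS" Domain="VS">
      <ItemRef ItemOID="IT.VS.VSORRESU"/>
    </ItemGroupDef>
    <ItemDef OID="IT.VS.VSORRESU" Name="VSORRESU">
      <Description><TranslatedText>Original Unit</TranslatedText></Description>
    </ItemDef>
  </MetaDataVersion></Study>
</ODM>"""

# Stub standing in for the web service; the label is illustrative.
stub = {("VS", "VSORRESU"): "Original Units"}
for finding in check_labels(SAMPLE, lambda d, v: stub.get((d, v))):
    print(finding)
```

Because the lookup is injected, the stub can be swapped for a real web service call without touching the comparison logic.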

If the two labels do not correspond, an error (in XML) is returned. An example is:

showing both the actual and the expected label.

As the validation errors ("deviation" or "discrepancy" would in fact be a better word) come in XML, they can (unlike Excel or CSV) be used in many ways, and even ... stored in a native XML database ;-).