Sunday, March 18, 2018

Is SDTM a labyrinth (standard)?

This weekend (starting on Friday) I started implementing SDTM versions 1.5 and 1.6 into our famous SDTM-ETL software, in order to better support the generation of SEND datasets.

It wasn't easy.

The reason for this is that there are no machine-readable versions of SDTM 1.5 and 1.6. One can download some Excel files from SHARE, but that does not help very much (I do not consider Excel to be "machine-readable", and it is not vendor-neutral either).

For the last 7-8 years, each time a new SDTM (or IG) version becomes available for public review, I have requested (using the JIRA tracker) that the standard and implementation guide additionally be published in a machine-readable form. Each time, the final "CDISC Disposition" is "considered for the future". You can guess that I am sick and tired of always getting this answer.

So I started making my own machine-readable versions in XML. No, I am not going to donate it to the SDTM team this time (I already have donated so much to CDISC), as this is something THEY should have done.
This is what it looks like (for SDTM 1.6):

Although you might not like it, I used the old name "SDS", as I consider the new name "SDTM" very confusing: it does not reflect that the SDTM-IG and the SEND-IG are different implementations of the same standard. I never understood why CDISC made this confusing name change.

What I also changed, are the designations "Char" and "Num" for the data types. Instead, I assigned the "modern" data types "text", "integer", "float", "datetime" and "durationDatetime", as also used by define.xml 2.0 (one of the few modern standards we have at CDISC). 
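A minimal sketch of how such a type refinement can be automated (the heuristics, the function name and the per-variable choices are my own, not part of any CDISC publication):

```python
# define.xml 2.0 data types used as the refinement targets.
DEFINE_XML_DATATYPES = {"text", "integer", "float", "datetime", "durationDatetime"}

def refine_datatype(sdtm_type: str, variable_name: str) -> str:
    """Refine a coarse SDTM 'Char'/'Num' designation into a define.xml 2.0 type.

    Heuristic only: --DTC variables become 'datetime', --DUR variables
    'durationDatetime', sequence and study-day numbers 'integer',
    other numeric variables 'float', everything else 'text'.
    """
    if sdtm_type == "Num":
        if variable_name.endswith(("SEQ", "DY")):
            return "integer"
        return "float"
    # sdtm_type == "Char"
    if variable_name.endswith("DTC"):
        return "datetime"
    if variable_name.endswith("DUR"):
        return "durationDatetime"
    return "text"

for sdtm_type, var in [("Char", "LBDTC"), ("Num", "LBSEQ"), ("Num", "LBSTRESN")]:
    refined = refine_datatype(sdtm_type, var)
    assert refined in DEFINE_XML_DATATYPES
    print(var, "->", refined)
```

In the real XML file the type is of course assigned per variable, not guessed; the heuristic only shows that the refinement is mechanical once the variable list is machine-readable.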

Adding all the variables and their descriptions (named "CDISC Notes" in the IG and "Description" in the model; just a matter of consistency) was the easy part of the work.

Then the labyrinth started ...

SDTM-IG 3.2 and SEND-IG 3.0 (both based on model 1.4) contain a lot of "assumptions" like "The xxx variables are usually not used for this domain...", which has led to validation software (also used by the FDA) marking such variables as non-compliant with the standard (and thus needing to be "apologized for" in the reviewer's guide). In the SEND-IG 3.1 (based on model version 1.5), however, I find:

Does this mean that one can use these variables again without limitations? Are they no longer discouraged?

In my XML file for SDTM 1.4 I still had entries like:

The number of them is 167! I had to copy-paste these manually from the SEND-IG PDF file, as there is no machine-readable version of the IG, from texts like:

For the SDTM-IG 3.2, the number of such "discouraged" variables is 425.
So I added 167+425=592 variables manually to my XML file that will be read into the software, so that the user is at least informed that adding such a variable to a specific domain is unusual (and will lead to an error or warning in validation software that incorrectly interprets the IG).

For some domains, the number of such "discouraged" variables is even considerably higher than the number of variables listed for that domain in the IG.
SDTM: a standard of prohibitions and exceptions?

This was not the end of the tragedy. SDTM models 1.5 and 1.6 (PDF) contain statements like:

describing that a specific variable may not be used in SDTM implementations (so: only in SEND). Again, this is not machine-readable. As an implementor (in software), one must carefully go through each paragraph of the PDF document and look at the descriptions to find out.

The very same "description" for "--EXCLFL" also reveals other interesting things. It states: "Expected to be Y or null". What does "expected" mean here? Is it a non-compliance when the value is "N"? I also "expect" my children not to lie to me ...
And when looking into the IG, one finds for the codelist:

stating that the "NY" codelist must be used. However, when looking into the CDISC-CT for this codelist, one finds 4 entries (!):

This comes from the published PDF, which, like the published Excel file, is not machine-readable.
It is ONLY thanks to the XML team (especially Lex Jansen) that nowadays the CDISC controlled terminology is also published as machine-readable XML.

So, how can my software know that --EXCLFL may not be used in human clinical trials, and that for its SEND implementation in BWEXCLFL the "No Yes Response" codelist must be used, but not all of it, only the "Y" value (or null)?

My software can't.
Unless I hardcode it in the program, or first subset the "No Yes Response" codelist into a "Yes Only" codelist and then state in my XML file that the "Yes Only" codelist must be assigned to the variable "BWEXCLFL" and to "--EXCLFL" variables in general.
All this needs to be done manually!
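What that manual subsetting amounts to can be sketched in a few lines of Python (the NY codelist codes are as I recall them from the published CT; the function name and structure are my own illustration):

```python
# The published "No Yes Response" (NY) codelist, simplified to NCI-code/value pairs.
NY_CODELIST = {
    "C49488": "Y",
    "C49487": "N",
    "C48660": "NA",
    "C17998": "U",
}

def subset_codelist(codelist: dict, allowed_values: set) -> dict:
    """Create a sponsor-defined subset of a CDISC codelist,
    keeping only the entries whose submission value is allowed."""
    return {code: value for code, value in codelist.items() if value in allowed_values}

# "Yes Only" subset, to be assigned to --EXCLFL variables such as BWEXCLFL:
YES_ONLY = subset_codelist(NY_CODELIST, {"Y"})
print(YES_ONLY)  # {'C49488': 'Y'}
```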

And how can my software know that "--EXCLFL" should not be used when the value of "--STAT" is "NOT DONE"? Again, it can't, unless I hardcode it, or have it as a simple, but machine-readable rule in my XML.
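A sketch of how such a rule, once machine-readable, could be evaluated (the variable names come from the standard; the rule encoding and the function are purely my own illustration):

```python
def check_exclfl_rule(record: dict, domain: str) -> list:
    """Check two machine-readable rules for --EXCLFL:
    it should be null when --STAT is 'NOT DONE',
    and its value, when present, must be 'Y'."""
    errors = []
    exclfl = record.get(domain + "EXCLFL")
    stat = record.get(domain + "STAT")
    if stat == "NOT DONE" and exclfl:
        errors.append(domain + "EXCLFL must be null when " + domain + "STAT is 'NOT DONE'")
    if exclfl not in (None, "", "Y"):
        errors.append(domain + "EXCLFL must be 'Y' or null")
    return errors

print(check_exclfl_rule({"BWSTAT": "NOT DONE", "BWEXCLFL": "Y"}, "BW"))
```

If rules like this were published by CDISC in a structured form, no implementor would ever have to hardcode them.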

It was promised that SHARE would provide such information. However, it does not yet. Why?

The reason is simple: SHARE is populated AFTER the SDTM/SEND models and IGs are published. So the people doing this must go through exactly the same painful process as I do, mostly using copy-paste. I hope to hear otherwise from them.
The background is that the SDTM team is still generating the standards starting from Excel and Word, and that they put information that can easily be structured (like "not to be used for human clinical trials") as narrative (text) in a "description".
As far as I know, the SDTM team is not using databases (essentially, SHARE is or should be a database) for developing the standards. They should.
If they did, generating a machine-readable form of the standard at the end of the process would be a matter of pushing a button.

Is this an illusion? Four of my undergraduate students proved otherwise.
They generated an XML version of the SDTM-IG 3.2, very highly structured, with an XML element for each variable (and attributes on it), elements for "assumptions", etc. They did this as part of their "bachelor project", a relatively small project in the 5th semester of their eHealth studies in Graz.
They also developed a stylesheet to transform the XML into HTML and thus display the SDTM-IG 3.2 in a browser. The result (display in the browser) looks almost 100% identical to what one sees in the PDF published by CDISC.

We will present this as a poster at the European CDISC Interchange in Berlin in April this year.

If four students can do this as part of a relatively small project, why is the SDTM team (with over 100 members?) not capable of doing so too? "Considered for the future" is just not good enough anymore!

A few other things that caught my eye when doing these implementations:
  • In SDTM 1.6, all mentions of "ISO 8601" have been removed. Why?
    Maybe it was argued that that is an implementation issue (these are mentioned in the SEND DART IG)? But then "Char" and "Num" should also be removed, shouldn't they?
  • In a number of cases, important words have been truncated. For example, for the variable RPPLSTDY, the "label" is: "Planned Repro Phase Day of Obs Start".
    For a human, it is obvious that "Obs" here means "Observation" and "Repro" means "Reproduction", but again, how can a machine understand this? Or was it again the "40 characters XPT slavery" that made this happen? How can we ever use modern technologies like "machine learning" to come to "smart SDTM systems" if we accept such limitations?
    Ideally, the model should be independent of the transport format. HL7 FHIR shows us how easily this can be done: it has 3 different transport formats (XML, JSON and RDF) for the same model.
  • For many of the (new) variables, I immediately thought: "hey, this would be a resource (or profile) in FHIR". I have been digging a lot into FHIR in the last 1-2 years (who of the SDTM team ever had a look? I have my doubts) and I am getting more and more convinced that our standards should be "FHIR-like": a model independent of the transport format (SDTM is enormously bound to XPT), using modern terminologies like LOINC (for tests), SNOMED-CT, NCBI, UMLS, ... and modern technologies like RESTful web services (SHARE is starting to do so).
  • SDTM and IGs very often contain the word "usually". Whereas this would be acceptable in examples or explanations in separate documents, it should never appear in the standard itself.
For me, it becomes more and more obvious that SDTM with its table structure is a "dead end". The tragedy is that we believed SDTM to be a "database". However, it isn't. It is a "view" on a database, and not even a good one. Due to all the redundancies, derived variables, flags, etc. that have been added at the request of FDA reviewers to make it "more reviewer friendly", the real data is hidden, and with each new version it becomes worse. Or as my good friend and colleague DIH recently stated during a telephone conference: "mapping to SDTM is attempting to add information that was previously lost in the process". However, many people (including reviewers) still believe that SDTM "is the data".
At the start, SDTM (still named SDS at that time) was developed to be a "universal" model for both human and animal studies. After many years, this has proved to be the wrong approach. For example, we now see that the newest versions 1.5 and 1.6 are essentially only meant for animal studies (a lot of "not to be used with human clinical trials" variables), that we now have "general" variables that are only allowed in a single domain (e.g. EGLEAD), and more exceptions ("this variable is generally not used...") than recommendations (required and expected variables in the IG). I guess that the next SDTM-IG will be based on a new version of the model too (want to bet a beer on it?).

So, for implementors, SDTM and its IGs have become a labyrinth of definitions that are not definitions, vague rules, and hundreds of exceptions to them, all in a format that is not machine-readable.

Time for the "Blue Ribbon Commission" that was announced by the new CEO David Bobbitt to rethink a lot ...

Saturday, February 10, 2018

CDISC Biomedical concepts, LOINC and other standards: Microbiology

In a previous blog entry, I discussed the relation between Biomedical Concepts (BCs) and LOINC coding, and whether these fit in CDISC SDTM in the case of laboratory tests and vital signs. I argued that the "6 dimensions of LOINC" could very well represent the ingredients of BCs, at least in the case of laboratory tests and vital signs.

But how well does this work for other domains, such as microbiology?

When looking at the microbiology domains in CDISC SDTM, MB "Microbiology Specimen" and MS "Microbiology Susceptibility", one immediately notices that these are highly related, i.e. the records in the MS dataset provide more detailed information about records in the MB dataset. This can be seen from the fact that MSGRPID is "required", with the description "In MS, used to link to organism in MB". So there cannot be a record in MS for which there is no record in MB. --GRPID (group ID) is used as a "foreign key" in MS to a record in MB (although the SDTM-IG does not use the word "foreign key"). This is atypical for SDTM, as relations between records in tables usually need to be put in the so-called RELREC ("Related Records") table. As such, the "--GRPID" construct in MB-MS can be seen as a "workaround" or "shortcut" to avoid having to use RELREC (or: "how to violate your own principles").
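In database terms, the --GRPID construct boils down to an implicit join. A sketch (the column subset and the example values are invented for illustration only):

```python
# MB records: one per organism found; MBGRPID groups the organism record.
mb = [
    {"USUBJID": "001", "MBGRPID": "1", "MBTESTCD": "ORGANISM",
     "MBORRES": "KLEBSIELLA PNEUMONIAE"},
]
# MS records: susceptibility results, pointing back to MB via MSGRPID.
ms = [
    {"USUBJID": "001", "MSGRPID": "1", "MSTESTCD": "PENICILIN",
     "MSORRES": "RESISTANT"},
]

def join_mb_ms(mb_rows, ms_rows):
    """Resolve the MSGRPID 'foreign key' to the MB organism record,
    i.e. the join that RELREC was supposed to make explicit."""
    index = {(r["USUBJID"], r["MBGRPID"]): r for r in mb_rows}
    return [
        {**ms_row, "organism": index[(ms_row["USUBJID"], ms_row["MSGRPID"])]["MBORRES"]}
        for ms_row in ms_rows
    ]

for row in join_mb_ms(mb, ms):
    print(row["MSTESTCD"], "->", row["organism"])
```

In a format such as XML, JSON or RDF, this join would simply disappear: the susceptibility results would be nested under the organism record.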

It also demonstrates the weaknesses of SDTM as a model: it relies completely on the concept of "tables", with or without references to information in other tables, inconsistently implemented using RELREC, a group ID, or something else for referencing. If SDTM were not tied to the concept of "tables", the information of both MB and MS could be delivered as a single dataset, using e.g. XML, JSON or RDF (linked data) as a format. This would make the work of the reviewers considerably easier, as they currently need to "swap" between two tables.

It is also interesting that for each organism found (test of presence, reported in MBORRES) there must be a unique identifier in MBSPID ("sponsor ID"). This may sound a bit surprising, but may be due to the fact that MBORRES is essentially free text, so that there may be different ways of writing the same microorganism between visits and especially between sites. This also means that assigning the organism identifier to MBSPID can be an enormous task, requiring much microbiology knowledge.

When looking into these two microbiology domains in the SDTM-IG, it is a bit surprising that it does not speak about coding of tests or organisms at all. There is a column "MBLOINC", but it is "permissible" and is defined as "Dictionary-derived LOINC Code for MBTEST", which is of course nonsense: LOINC codes should not be derived, they should be the source. Also, neither the SNOMED-CT, nor the NCBI, nor the ATCC coding systems are mentioned. As the sponsor-assigned MBSPID / MSSPID values are used as the unique identifiers of the organism, how can the regulatory authorities then compare microbiology data between studies? Just by visual inspection?

Let us take an example from the SDTM-IG and try to code one or more records using one or more of the above-mentioned coding systems. Note that this should essentially not be done by the sponsor; the codes should already be present when the tests were done (i.e. in the lab). In the US this should not be a problem, as the Meaningful Use program requires the use of LOINC codes and SNOMED-CT for microbiology: in general, LOINC coding is used to report what test was done (e.g. a viral culture, or a drug susceptibility), whereas SNOMED-CT is used to provide organism identification and in some cases specimen identification (e.g. sputum, urine).

Note that CDISC developed its own codelist for specimen identification (used by MBSPEC and MSSPEC), and that essentially the use of SNOMED-CT coding is not allowed for these. As far as I know, there is also no mapping table available for mapping between SNOMED-CT codes for "specimen" and CDISC codes.

In the SDTM-IG 3.2, the MB examples show the following:

Comparison of rows 1-2 with the other rows demonstrates that MB is used both for detection of organisms (rows 3-6) and for measurement (counting) of specific organisms or types of organisms. The first two rows measure gram-negative bacteria, with more specific counting for either cocci (round-shaped) or rod-shaped bacteria. The test codes are however not covered by the CDISC controlled terminology (i.e. they are sponsor-defined) and are thus not interoperable. When looking into SNOMED-CT, however, one easily finds the codes 18383003 (gram-negative cocci) and 87172008 (gram-negative rods).

Why the hell was SNOMED-CT coding not used here in MB? Another case of CDISC "not invented here"? Also the result designation (MBORRES, MBSTRESC) is not standardized at all and thus useless in comparisons between studies; at least MBSTRESC should be standardized.

For rows 3-6, the LOINC code to be used is probably 623-9: "Bacteria identified in Sputum by Cystic fibrosis respiratory culture". Note that LOINC describes tests to be performed, without saying anything about the outcome. The outcome, "streptococcus" or "klebsiella", then goes into MBORRES. In SDTM, the codes for the bacteria that were detected go into MBSPID, so in this case assigned by the sponsor as "ORG01" and "ORG02", which is of course not interoperable. Another sponsor, or even the same sponsor in another study, may have assigned completely different codes.

We can, however, easily find codes for "Streptococcus pneumoniae" and "Klebsiella pneumoniae" in SNOMED-CT. For "Streptococcus pneumoniae" the code is 9861002 "Streptococcus pneumoniae (organism)" and for "Klebsiella pneumoniae" the code is 56415008 "Klebsiella pneumoniae (organism)".

So, if we were allowed to code the value of at least MBSTRESC (here using SNOMED-CT), the regulatory reviewers would be able to compare between studies, which is currently not possible. As said, the SDTM-IG also does not say anything about coding for the "standardized result" in MBSTRESC.
If coding the test itself were allowed, the LOINC code alone would be sufficient, and MBTESTCD, MBTEST, MBSPEC and probably also MBMETHOD would be superfluous.

In the MS dataset (microbiology susceptibility), the details and further results for each of the tests in MB (where applicable) are provided.

The test code and test name (MSTESTCD, MSTEST) and also the test category (MSCAT) fall under "sponsor-defined controlled terminology", i.e. the sponsor can assign codes as he/she wishes. More recently, CDISC has published controlled terminology for these, but this is not visible in the SDTM-IG.
The CDISC lists for MSTESTCD/MSTEST contain a few codes and names for traditional susceptibility tests, but there are no codes for "EXTGROW" or "PENICILIN". So, in practice, MSTESTCD/MSTEST will contain a mixture of test codes and names that semantically have nothing to do with each other (i.e. with no coherence), a phenomenon we regularly observe in CDISC codes.

Some of the MS tests here may be covered by LOINC, others not. For example, "penicillin susceptibility" is covered by the LOINC codes 7041-7 and 7042-5, depending on whether "Penicillin G" (sodium or potassium salt) or "Penicillin V" is meant. A quick look into the LOINC database shows that there are about 1700 such "susceptibility" tests described. Interesting for our example is that LOINC dropped the designation "by E-Test", as it is a proprietary name; it was replaced by "by gradient strip". For the susceptibility test for the sponsor drug, one cannot of course expect that there is a LOINC code. In such a case, the standardized LOINC value "susc" can however be used for the "property". Note that SNOMED-CT also has a number of codes for drug susceptibility tests (as "procedures").

All this demonstrates that the MB and MS domains have not been well designed. The primary cause is probably the table structure of SDTM, containing database views instead of representing a true relational database. Additional causes are the refusal to move to a multi-hierarchical model for SDTM, the intertwining of SDTM with the SAS-XPT format, and the refusal to use (or even allow the use of) coding systems from healthcare (especially LOINC and SNOMED-CT), i.e. the "not-invented-here syndrome".

It is time that CDISC starts developing Biomedical Concepts (BCs) for microbiology, instead of further developing "not-invented-here" controlled terminology for microbiology tests. Furthermore, CDISC should start encouraging the use of LOINC and SNOMED-CT for tests and results in microbiology. It should recognize that LOINC is not only about classic laboratory tests (where the FDA overruled CDISC), but is also very useful in other domains.


Saturday, January 27, 2018

Biomedical Concepts, LOINC and CDISC

Very recently, my colleagues from A3 Informatics published a very interesting article on "Understanding Biomedical Concepts", explaining how a "metadata repository" (MDR) with biomedical concepts (BCs) can help in the development process, starting from the CRF.

Unfortunately, most protocols are imprecise about what exactly should be measured. Even our "CDISC Therapeutic Area User Guides" (TAUGs) suffer from the same problem. Here is an extract from the "TAUG-Diabetes", which can be downloaded from the CDISC website:

It lists a number of relevant tests, but does not describe them in detail at all. So it leaves it open to the sponsor, the CRF developers or, in the worst case, the site to decide what test exactly will be performed. This means that when the FDA wants to compare the results from ten diabetes studies, it might find twenty different ways of measuring e.g. triglycerides, or even worse, find values that originate from different tests but received the same values for LBTESTCD, LBCAT and LBMETHOD in the SDTM. So to FDA reviewers it looks as if the triglycerides measurement from study A is identical to that of study D, although this is not the case at all. The reason is that the combination of LBTESTCD, LBCAT, LBSPEC and LBMETHOD does NOT uniquely describe a laboratory test.
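A toy illustration of this collapse (the two LOINC codes are real entries as I read them in the LOINC database; the SDTM mapping below is my simplification):

```python
# Two genuinely different LOINC triglyceride tests (same analyte and specimen,
# different property: mass/volume vs. moles/volume):
loinc_tests = {
    "2571-8": "Triglyceride [Mass/volume] in Serum or Plasma",
    "14927-8": "Triglyceride [Moles/volume] in Serum or Plasma",
}

def sdtm_identifying_tuple(loinc_code: str):
    """What typically survives the mapping to the SDTM 'identifying variables':
    the property distinction is lost, so both tests map to the same tuple."""
    return ("TRIG", "SERUM", None)  # (LBTESTCD, LBSPEC, LBMETHOD)

tuples = {sdtm_identifying_tuple(code) for code in loinc_tests}
print(len(loinc_tests), "LOINC tests ->", len(tuples), "SDTM tuple(s)")
```

Two distinct tests, one indistinguishable SDTM representation: exactly the reviewer's problem described above.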

So, I started annotating the TAUG-Diabetes, or at least tried to. Here are some first results, from page 20, "Lipid Panel":
  • Amylase/Serum: LOINC 1798-8: Amylase [Enzymatic activity/volume] in Serum or Plasma
  • Triglycerides, Serum, Plasma: LOINC 2571-8: Triglyceride [Mass/volume] in Serum or Plasma
  • Total Cholesterol, Serum, Plasma: LOINC 2093-3: Cholesterol [Mass/volume] in Serum or Plasma
and so on. I did not try to annotate everything; that is something that the TAUG development team, with much better use-case knowledge, should have done (they didn't).
When it is about lab tests, using LOINC coding is most appropriate. It is therefore very surprising that the word "LOINC" does not even appear in the TAUG-Diabetes.
Also note that such an annotation exercise does not always lead to a single test (on the contrary). For example, "glucose in serum, plasma" leads to 645 (!) different tests!
On page 21 of the TAUG-Diabetes, table "Kidney function", a test "blood urea nitrogen" is listed, for which one can find 11 different tests in LOINC:

which will mostly get the same combination of "identifying variable" values in SDTM, suggesting that the results come from exactly the same test although this may not be true.

Coming back to Biomedical Concepts (BCs): the more I think about it, the more I get convinced that LOINC codes are just implementations of BCs.
The typical example that is given when explaining BCs is "diastolic blood pressure" as a "vital sign". Please see the A3 Informatics article for a picture.

The BC of "systolic blood pressure" consists of the test itself (CDISC-coded SYSBP or NCI-coded C25298; the hyperlinks will lead you to the RESTful web services that you can use in your own applications), a body position (sitting, standing, supine), a unit (almost always millimeter mercury column), and a result expressed as an integer or a floating point number.

Is this the complete picture? No, it isn't. "Systolic blood pressure" is one of the tests in a "blood pressure panel", which is part of the "vital signs test panel". In CDISC, the latter is an SDTM domain (VS), but for the middle part ("blood pressure panel") there is no CDISC term as far as I know.

When using LOINC, each of these can be assigned a LOINC code, with the remark that LOINC codes are "pre-coordinated". So you will find a LOINC code for each combination of the parts of the BC that makes sense. SDTM-CT is mostly "post-coordinated", meaning that you take the parts and then assemble them using LBTESTCD (which is not the test code, but essentially the code for the analyte or compound that is measured), LBSPEC (specimen, e.g. "blood", "serum", "urine", ...), LBMETHOD (e.g. "dip stick") etc., and combine these in a record AFTER (therefore "post") you did the test. This post-coordination requires extensive validation in order to find out whether the combination makes sense, whereas you know in advance that the combination makes sense when you use a pre-coordinated code. For example, it does not make sense in SDTM to combine test code "height" with position "sitting", and you will need to write software to check this. In LOINC, you just won't find a code for "body height, sitting".
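The difference in validation burden can be sketched as follows (the valid-combination table is a tiny invented stand-in for what a real post-coordination checker would need):

```python
# Post-coordination: the sponsor assembles the parts, and a checker must
# validate the combination afterwards against a (large) table of valid pairs.
VALID_COMBINATIONS = {
    ("SYSBP", "SITTING"),
    ("SYSBP", "STANDING"),
    ("SYSBP", "SUPINE"),
    ("HEIGHT", "STANDING"),
    # ... hundreds more in a real implementation
}

def is_valid_post_coordinated(testcd: str, position: str) -> bool:
    """Check a post-coordinated (test code, position) pair after the fact."""
    return (testcd, position) in VALID_COMBINATIONS

print(is_valid_post_coordinated("HEIGHT", "SITTING"))  # False: nonsensical
print(is_valid_post_coordinated("SYSBP", "SITTING"))   # True

# Pre-coordination: the combination is baked into a single code, so the
# check reduces to "does the code exist?" (e.g. LOINC 8459-0 below).
PRE_COORDINATED = {"8459-0": "Systolic blood pressure--sitting"}
print("8459-0" in PRE_COORDINATED)  # True
```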

The more I think about it, the more I get convinced that LOINC codes are "implementations" of BCs. For example, if you take the BC "systolic blood pressure", select "sitting" for the position, and want a number as an outcome ("quantitative measurement"), this will lead you to LOINC code

8459-0 (Systolic blood pressure--sitting, quantitative).
BUT, you can also easily find out that this test is a member of the "blood pressure panel" with LOINC code 35094-2:

but it is also a member of the "Orthostatic blood pressure panel" (LOINC code 34553-8):

which additionally contains 3 "types" of heart rate.

and all of these are "vital signs measurements" (LOINC code 29274-8). Other such "panels" are "Vital signs, weight & height panel" (LOINC code 34565-2) and "Vital signs, weight, height, head circumference, oxygen saturation & BMI panel" (LOINC code 85353-1), each of them forming a tree structure. For example, for "Vital signs, weight & height panel" (LOINC code 34565-2):

where it is interesting to see that this tree structure also contains "Body position with respect to gravity" (LOINC code 8361-8) with possible values "standing", "sitting" and "lying":

And here we see that even LOINC is not perfect or complete: it does not differentiate between "supine" (lying horizontally with the face and torso facing up) and "prone" (face and torso down). Although we do find a (pre-coordinated) term for "systolic blood pressure, supine" (8461-6), I did not find a LOINC code for "systolic blood pressure, prone", which may be related to the fact that the position seems to make no difference for the value itself. When I then looked back at the picture in the A3 Informatics article, I found that it lists "supine" but not "prone". However, in the CDISC "VS Codetable", the combination of "systolic blood pressure" with "prone" is listed as valid. Something to discuss when we develop and coordinate BCs ...

Now, is our BC picture for "systolic blood pressure" perfect? Not at all!
It does not account for tests like "maximum systolic blood pressure in a time period of 24 hours" or "mean systolic blood pressure in 10 hours". These kinds of tests cannot be handled by CDISC controlled terminology at all! That such tests are important was recently discussed during the development of the "TAUG-Ebola", where "maximum body temperature within 24 hours" is a very important indicator. But CDISC-SDTM and CDISC-CT could not find a way to represent this test in SDTM! The same applies, for ebola, to "highest pulse in 24 hours".
So for our systolic blood pressure BC, we still have to add information about timing. And this is again where LOINC helps us enormously, as one of the "dimensions" of the LOINC system is the "time aspect". For an "ordinary" systolic blood pressure, the value for the "time aspect" is "Pt" ("point in time"). But we also find many other systolic blood pressure tests where this value is not "Pt" (or, otherwise said, not "now"):

and a number more ...
Similarly, we find different tests with different values for the "time aspect" for "body temperature" and "pulse", but also for "glucose in urine", where a very important one, "Glucose [Mass/volume] in 24 hour Urine" (LOINC code 21305-8), is again not well covered by CDISC-CT and SDTM (SDTM suggests doing something with adding "end of collection" to "start of collection"; see further on).

Ready? Looks like it ... But how do we state "high systolic blood pressure"? This may be of interest e.g. when the patient was asked "did you already have high blood pressure five years ago?" and there is no possibility to find out the exact numbers. Also for this, LOINC has codes; for example, for the question "Do you have high blood pressure?", the LOINC code is 64496-3.

We made some statements about relations between LOINC codes. How can you find out about these? You can of course start browsing through the "LOINC details" pages and follow links, but a better way is probably to write an application using the UMLS RESTful web services of the National Library of Medicine, and filter the results on LOINC as the coding system: the UMLS tries to describe relations between ALL possible coding systems in the medical world (including CDISC-CT). One of the students at the university is currently developing an interactive graphical user interface to build such "networks". This GUI will e.g. allow filtering on LOINC and SNOMED-CT, so that you can also include the SNOMED-CT terms and relationships.
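A sketch of how such a lookup could be assembled against the NLM UMLS REST API (the endpoint pattern and the `sabs` source-vocabulary parameter are as documented by NLM; the API key is a placeholder, and no request is actually sent here):

```python
from urllib.parse import urlencode

UMLS_BASE = "https://uts-ws.nlm.nih.gov/rest"

def build_search_url(term: str, vocabularies: list, api_key: str) -> str:
    """Build a UMLS search URL restricted to given source vocabularies
    (LNC = LOINC, SNOMEDCT_US = SNOMED CT)."""
    query = urlencode({
        "string": term,
        "sabs": ",".join(vocabularies),
        "apiKey": api_key,  # placeholder; a real UTS API key is required
    })
    return f"{UMLS_BASE}/search/current?{query}"

url = build_search_url("systolic blood pressure", ["LNC", "SNOMEDCT_US"], "YOUR-KEY")
print(url)
```

Fetching the URL (with a real key) returns JSON with concept identifiers, from which the relations between the LOINC and SNOMED-CT entries can be followed.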

If we look at the "6 dimensions" of LOINC, I get more and more convinced that, in the case of vital signs and laboratory tests, these six dimensions form the "ingredients" of the BCs for these domains. For other domains, other coding systems may be suitable. For example, for domains about microorganisms, the NCBI coding and taxonomy is probably very suitable. I haven't looked into this yet, however. For many other domains, SNOMED-CT is probably very suitable.
And to think that SNOMED-CT is not used at all in SDTM or CDISC anyway, except for a few parameters in the "trial summary" domain.

How does this fit with SDTM? Not well.
For laboratory tests, we have already known for a long time that the "identifying variables" (LBTESTCD/LBTEST, LBSPEC, LBMETHOD, ...) do not uniquely identify lab tests. For "24h urine", the SDTM-IG states that "the start date/time of the collection goes into LBDTC and the end date/time of collection goes into LBENDTC", which at first glance seems OK. However, when the data come from an EHR or from the hospital information system (HIS), the exact "start of collection" and "end of collection" times will often not be known, and the sponsor will probably (need to) derive LBENDTC by simply adding 24 hours to the collection date/time. You can already guess where this leads when it is known that it was a "24h urine" but only the start-of-collection (or only the end-of-collection) date was known, with no time. This is what we call "imputation".
In vital signs it is even worse, as has been shown by the "Ebola" case "maximum temperature in 24 hours", which cannot be modeled in VS at all.

So essentially, when the LOINC code is known (from the lab itself, or because it was predefined), there is no reason at all to populate --TESTCD, --SPEC, --METHOD (and --POS in the case of VS), as it is all already in the code. Even worse, "deriving" these variables may and will lead to confusion. Therefore, in the case that the data come e.g. from EHRs, we should use alternative SDTM domains that are suited for EHRs or other systems where a pre-coordinated code is available. For LB, I made a proposal for such a domain 3 years ago already, which can be used for both the pre-coordinated and the post-coordinated use case. It is a proposal that needs to be adapted for the more general case, also allowing more flexibility in SDTM. For example, if an exact code for a test is available (whatever the domain), one essentially only needs to provide three or four values: the code, the code system, the value, and the unit. Sometimes additional variables will need to be added, but whether this is really needed essentially depends on the code system. For example, when a LOINC code is used, there is no need at all to provide a categorized "specimen" or "method". But these may be necessary when an NCBI code is used for the microorganism that is tested. What kind of information needs to be added in addition to the code itself is something we need to investigate for each domain (and with "we" I essentially mean the SDTM teams).
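What such a pre-coordinated record could reduce to, as a sketch (the field names are my own illustration, not an accepted CDISC structure):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreCoordinatedResult:
    """A finding identified by a pre-coordinated code: three or four values
    suffice, because specimen, method, etc. are implied by the code itself."""
    code: str          # e.g. a LOINC code
    code_system: str   # e.g. "LN" for LOINC, "SCT" for SNOMED CT
    value: str
    unit: Optional[str] = None  # UCUM notation where applicable

# A triglyceride result, fully identified by LOINC 2571-8:
result = PreCoordinatedResult(code="2571-8", code_system="LN",
                              value="130", unit="mg/dL")
print(result)
```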

This will also be a very nice exercise in trying to develop BCs for more complicated cases, like the microbiology domains MS and MB (using NCBI or SNOMED-CT coding) or MI (microscopic findings). In my opinion, CDISC should free resources by stopping the development of some "reinvention of the wheel" codelists and assign them to the development of BCs.

Friday, January 5, 2018

CDISC-CT 2017-12-22: PK Units

A few days ago, I reported about "more madness" in the newest CDISC controlled terminology (version 2017-12-22), especially regarding the addition of more CDISC lab test codes whereas LOINC coding is made mandatory by the FDA anyway. When I see this, I sometimes ask myself who is the "better standardization organization": the FDA, or us, CDISC?

Even the survey that CDISC did on LOINC (under the lead of the former CSO, who blocked every progress) was shaped in such a way that it was mostly about the difficulties of LOINC adoption, rather than about any of the advantages and opportunities of the use of LOINC.

But today I want to discuss another part of the new controlled terminology. If you inspect the "changes" file, you will notice that 235 "PK units of measure" have been added, bringing the number of CDISC "PK Units" to a total of 528.
This is crazy! I will explain why.

Units for PK (pharmacokinetics) usually consist of a relatively high number of parts. An example is "g/mL/(mg/kg/day)" (gram per milliliter per (milligram per kilogram per day)). CDISC publishes these "units" as a list, and not as a system (as UCUM does). Taking into account that any of these parts can vary enormously in magnitude (e.g. the first part from nanogram to kilogram), you can already imagine the number of possible combinations that need to be added to the "list" to achieve complete coverage. So, in principle, this list may and will grow towards infinity. UCUM however is a "system" in which any possible combination can be tested for validity, e.g. using the NLM RESTful web services and website, as well as our own RESTful web services. UCUM also has the additional advantage that conversions can be completely automated, e.g. also using RESTful web services.

So, what I did was add the UCUM notation for each of the newly published CDISC "units". If you want a copy of the file, please just send me an e-mail. We will also soon add these UCUM notations to our web service that finds the UCUM notation for any CDISC "unit".

Why does CDISC keep extending "lists" of units that "must" grow towards infinity and can never be complete? Why doesn't it allow UCUM notation, which is used by 99% of the medical world, whereas the CDISC "units" are used by less than 1% of it? "Not invented here"?

A few arguments I have heard or found in the past:

  • "UCUM expressions, in order to support computability, represent familiar units in unfamiliar ways, with curly brackets and other symbols" (CDISC CT team - see image above).
    When I inspect the UCUM notation I assigned to the new "PK units", however, I see nothing "unfamiliar", and even if that were the case, the advantages (such as the automation of unit conversions) far outweigh any disadvantages. Furthermore, we must take into account that implementers must also learn the CDISC "units" in addition to their own notation, so this argument is nonsense. Why should people be forced to learn a notation that is not used in the healthcare world anyway? UCUM, on the other hand, is extremely popular in the healthcare world.
  • "The CDISC notation is very similar and strongly overlaps with UCUM notation. So there is no problem". I checked for this case (PK units) and found that for only about 80 of the 235 "PK units", so about one third, the CDISC "unit" and UCUM notation are identical.
  • "UCUM allows some alternative representations, like l or L for liter. For aggregators and others who want to have a single expression, this is not ideal". This is really nonsense! Any computer system can easily be taught that "1L" = "1l". It can even be automated, using a RESTful web service. For example, try for yourself:
UCUM also allows unit conversions to be automated easily, as it is a "system" and any UCUM notation can be reduced to a set of base units. For example, try to find out how many "millimeters of mercury" correspond to 25 "pounds per square inch". Can you find the conversion factor from what has been published by the CDISC-CT team?
Using UCUM, this is a "piece of cake", as both can easily be reduced to the base units ("g.s-2.m-1" in this case). Using UCUM, the answer can easily be found using one of the RESTful web services available (and "YES", you can also use these from within your SAS programs). Try it yourself:[psi]/to/mm[Hg]
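The reasoning can be illustrated in a few lines of code (a sketch only, not a full UCUM implementation; the pascal equivalents used below are the standard defined values): because both [psi] and mm[Hg] reduce to the same base units, the conversion factor is just the ratio of their scale factors.

```python
# Standard pascal equivalents of the two pressure units.
PA_PER_PSI = 6894.757    # 1 [psi] in Pa
PA_PER_MMHG = 133.3224   # 1 mm[Hg] in Pa

def psi_to_mmhg(value_psi):
    """Convert a pressure value from [psi] to mm[Hg] via the base units."""
    return value_psi * PA_PER_PSI / PA_PER_MMHG

print(round(psi_to_mmhg(25), 1))  # about 1292.9 mm[Hg]
```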

One of the things I did is the following: I tried to find out how many of the 235 "PK units" can be converted into one another, using the UCUM notation and our RESTful web services. Using the CDISC notation, this number is zero, as CDISC-CT does not provide any information at all about what the "units" mean and how they relate to other units.

So I wrote a "quick-and-dirty" Java program, using the aforementioned "UCUM RESTful web services", and found that of the 54,990 possible combinations (235x234), there is a conversion factor for 1158 of them, meaning that the two units represent the same property. For a good number of them, we found that the conversion factor is a power of 10, meaning that they just differ in order of magnitude (just like "cm" and "m"). For example:
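A hedged sketch of that pair-wise check (the conversion factor itself would come from a UCUM web service; the helper below only classifies it):

```python
import math

def is_power_of_ten(factor, tol=1e-9):
    """True if a conversion factor is (within tolerance) an integer power
    of 10, i.e. the two units differ only in order of magnitude."""
    if factor <= 0:
        return False
    exponent = math.log10(factor)
    return abs(exponent - round(exponent)) < tol

print(is_power_of_ten(1000.0))   # e.g. g/L versus mg/L: only a magnitude shift
print(is_power_of_ten(51.7149))  # e.g. [psi] versus mm[Hg]: same property, other factor
```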

However, when using the CDISC-CT term, there is no way at all to find out what that conversion factor is (note that in CDISC-CT, the notation "day" is used instead of the internationally recognized notation "d"), or that two "units" refer to the same property.

We also found a number of related terms for which the conversion factor is exactly 1.0. For example:

or: two ways of writing the same unit, which is forbidden by the SDTM-IG.
"Wait a minute" you will say, "these pair members correspond to different properties!".
And yes, you are right, but to what properties?

For example, for the first entry in the above list (here in CDISC notation): "day*g/mL/g" and "mg/mL/(mg/day)" have a conversion factor of 1 (difficult to find out when using the CDISC notation) but do indeed correspond to different properties, the first being something like "day times gram (of what?) per milliliter (of what?) per gram (of what?)". Using the CDISC notation, you will never find out about the "what?". Using UCUM, you can easily do so using "annotations".
For example (fictitious - I am not a specialist in pharmacokinetics): "d.g{analyteXYZ}/mL{blood}/g{drug}", explaining very well what the unit is about, without endangering automated conversions - the annotations in curly brackets can be taken into account automatically. These "annotations" are exactly what the CDISC-CT team does not like at all: "... represent familiar units in unfamiliar ways, with curly brackets and other symbols. This is off-putting to some users." (sic). That annotations help enormously was even recognized by LOINC, where such annotations have been standardized. E.g.:

showing that LOINC standardized on annotations like "RBCs" (red blood cells), "titer", "creat" (creatinine) and many others. Of the over 80,000 LOINC codes, there are over 7,600 that have such an annotation in the "preferred UCUM unit", which is almost 10%.
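Since, per the UCUM specification, annotations in curly brackets have no effect on the computable unit, a conversion tool can simply strip them before doing the math. A sketch:

```python
import re

def strip_annotations(ucum_unit):
    """Remove {...} annotations from a UCUM expression; what remains is
    the purely computable unit."""
    return re.sub(r"\{[^}]*\}", "", ucum_unit)

print(strip_annotations("d.g{analyteXYZ}/mL{blood}/g{drug}"))  # d.g/mL/g
print(strip_annotations("{RBCs}/uL"))                          # /uL
```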

So, rather than extending an ever-growing list of "units" over and over again, the CDISC-CT team would do better to concentrate, in close cooperation with LOINC (the Regenstrief Institute), on standardizing such annotations for use in clinical research.
As LOINC coding for lab tests in SDTM is required by the FDA anyway, the use of UCUM notation should be allowed immediately; the CDISC-CT team should stop generating lists of units, and should work on "UCUM annotations" for use in clinical research instead.
This should bring the usability of SDTM to a much higher level than it has today.

Saturday, December 30, 2017

CDISC-CT 2017-12-22: more madness

In my previous posts, I reported about the madness that goes on in the development of CDISC controlled terminology, especially lab test codes, and the problems related to the CDISC approach. For example, the CDISC approach does not allow one to define tests like "maximum in the last 24 hours" or "average over the last 24 hours" (e.g. for vital signs such as temperature or blood pressure, or the concentration of a substance in blood or urine). Such definitions are however an integral part of the LOINC coding system, via the "time aspect".

Now that the FDA has mandated the use of LOINC coding for laboratory tests, it would be expected that CDISC stops the development of an alternative system for lab tests. The latest CDISC controlled terminology (dated 2017-12-22) however again contains over 40 new lab test codes.


There are several reasons for this.

First of all, we need to take into account that CDISC lab test codes are NOT lab test codes: they only specify "what" is measured. This corresponds to the "analyte/component" part in LOINC. So, for example, the CDISC "GLUC" ("glucose") "test code" essentially represents hundreds of different tests in which glucose is somehow (presence, qualitatively or quantitatively) measured. So, CDISC-CT is "post-coordinated", meaning that it needs to be combined with content from other variables to uniquely describe a test. In practice, however, this does not work: with the CDISC system, reviewers can never find out whether test A in one study of one sponsor is the same as test B in another study from another sponsor. Only the LOINC code can do this, and this is exactly why the FDA started requiring LOINC coding for lab tests.
If we read the latest "Recommendations for submissions of LOINC codes", published by the FDA, CDISC and the Regenstrief Institute, we read that even when LOINC codes are submitted, it is still mandatory to populate the CDISC-CT "lab test code" (which it isn't), and all the other "classic" "identifying variables" such as the specimen, the method, etc. I.m.o. this is stupid, as it adds redundancy to the record. For example, if the provided LOINC code has contents that deviate from the contents of LBTESTCD, LBSPEC and LBMETHOD, which of the two then contains the truth? The LOINC code or the CDISC test code? I.m.o., this testifies that CDISC is still not ready to give up its own system (which is not a system, but just a list based on tradition), but had to accept the decision of the FDA, though with displeasure.
One of the arguments of CDISC for their "post-coordination" approach has always been that "research is unique", that it "does not dictate any tests", and that for many lab tests in research there is no LOINC code. The latter is essentially not correct, as I have found out in recent years. I estimate that for over 80% (if not over 90%) of the published "test" codes, there is at least one LOINC code (often many more) in the LOINC system. As I stated, LBTESTCD essentially corresponds to the "analyte/component" part of the LOINC system, and my conservative estimate is that for over 98% of the CDISC "test codes", there is an entry in the "analyte/component" list of LOINC (the latter can be obtained as a separate database from the LOINC website).
The real reason for CDISC not giving up their system is probably (besides "not invented here") that CDISC is sticking to the 8-character limit for LBTESTCD. The "analyte/component" part in LOINC does not have this limitation.

What we see in the newest (2017-12-22) version of the CDISC-CT for lab tests is that for almost every one of the new terms (when looking at the "CDISC definition"), a corresponding entry in the "analyte/component" part can be found. The only major difference is that CDISC then additionally assigns an 8-character (or shorter) code to it. So, we are seeing the CDISC LBTESTCD values evolving into an 8-character (or shorter) representation of the "analyte/component" part of LOINC - if they weren't that already.

In the next months, I want to try to do some research on how "equal" LBTESTCD/LBTEST is to the "analyte/component" part, using a quantitative approach, for example with text comparison techniques such as calculating the "Levenshtein distance" between the value of LBTEST (or the CDISC definition) and the "analyte/component" part of LOINC.
The hypothesis of my research will be that LBTESTCD is nothing else than a copy of the "analyte/component" part of LOINC, restricted to 8 characters.
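To make the planned comparison concrete, here is a minimal sketch of such a text comparison (a textbook dynamic-programming Levenshtein distance; the example strings are illustrative only):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# In practice one would normalize (e.g. lowercase) both strings first:
print(levenshtein("glucose", "Glucose".lower()))  # 0: identical after normalization
```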

If the hypothesis is found to be true, we might as well replace LBTESTCD/LBTEST with the "analyte/component" part of LOINC if we do want to keep a "post-coordinated" approach for SDTM (which I doubt we really need). This would essentially correspond to what I proposed a few years ago in my article "An Alternative CDISC-Submission Domain for Laboratory Data (LB) for Use with Electronic Health Record Data", which i.m.o. combines the "best of both worlds".

In order to have such a "best of both worlds" approach (my article can just be a starting point), we do however need to remove the 8-character limitation on xxTESTCD, which is there for historical reasons only, and not for any technical reason anymore. The SDTM team however seems not to be prepared to change anything there.

In my next blog entry, I will probably write something about the more than 230 "PK units" that have been added to the newest CT version, although there is a UCUM notation for each of them. 
Unfortunately, the title of that post will probably also need to contain the wording "CDISC-CT madness" ...

Sunday, December 17, 2017

The future of SDTM

Today, I looked into the newly published SDTM v.1.6 and the new SEND-IG-DART.

This new version is solely meant for SEND-DART (non-clinical submission datasets: Developmental and Reproductive Toxicology). When going through both new standards, I found quite a number of very disturbing things (at least for me):

  • There are no machine-readable files. The SDTM v.1.6 comes as HTML, the SEND-IG-DART as a PDF. Essentially, this means a lot of frustrating copy-and-paste for those who want to implement these standards into their systems and software.
  • As there is no machine-readable version, all the "rules" and "assumptions" are not machine-readable either, thus leaving them open to different interpretations. It is then also foreseeable that a certain company that is working for the FDA will "hijack" the interpretation of the rules and use it for commercial purposes.
  • This version of SDTM is solely meant for SEND-DART. This is very worrying. SDTM was named "SDS" (Submission Data Standard) in earlier days and has always been meant to be a "universal" model both for SDTM (human trials) and for SEND (non-clinical / preclinical studies). Here is the "big picture" (copied from the CDISC website):

    When we start naming different "flavors" of the "universal" SDTM standard "versions", we are doing something really wrong. "Standards versions" should be sequential, with a newer version replacing the older one. Unfortunately, this is not the case anymore.
  • We see more and more that some SDTM variables are only allowed/meant to be used in a single domain. Also here, some new variables have been added that can only be used in one or only a few domains. For me, this evolution demonstrates the failure of the SDTM model anyway.
  • The model is again tightly coupled to the outdated SAS-XPT format: variable names and test codes no longer than 8 characters, labels no longer than 40 characters, and values no longer than 200 characters. Only US-ASCII is allowed. Such a direct coupling between a model and a transport format is nowadays an "absolute no-go" in modern informatics.
  • As in prior versions, the model and IG contain a lot of derived variables. As SDTM is essentially about "captured data", derived variables should not be in SDTM.
  • Also this version of SDTM sticks to (2-dimensional) "tables". Now, there is nothing against tables, but in order to guarantee high data quality, there should be explicit relations between the tables, without any data redundancy. This is what relational databases are based on.
    SDTM however breaks almost every rule of a good relational database, with lots of data redundancy (inevitably leading to reduced data quality) and with many unnecessary variables added "for the sake of ease of review" (sic), essentially ruining the model.
So, what can be done? What should the future of SDTM look like? Let us make a "5-year plan". We can divide this into short-term, mid-term and long-term actions.

Short term actions
  • In case the FDA and PMDA cannot guarantee that they will accept Dataset-XML very soon, replace SAS-XPT by a simple, easy-to-use, vendor-neutral format that does not overstrain the FDA and PMDA.
    It is now clear that the FDA and PMDA do not have the capability (or do not want) to switch from XPT to the modern Dataset-XML format. Concerns about file sizes ("a submission might not fit on a memory stick") and general inexperience with XML seem to be the current "show stoppers".
    As a temporary solution (the "better than nothing" solution), but one already solving a lot of the limitations of SAS-XPT, simple "bar-delimited" (also named "pipe-delimited") but UTF-8 encoded text files can be used ("HL7-v2 like"). For example (LB dataset):

    Such datasets are very compact (on average they take only 25% of the corresponding SAS-XPT file size) and are easy to import into any software. There is no 8-, 40-, or 200-character limit, and they can easily handle non-ASCII characters such as Spanish (USA) and Japanese (Japan) characters.
    The metadata is all in the define.xml, but if even this is a problem for the systems at the regulatory authorities, the first row can contain the variable names.
    Acceptance of this format could (technically) easily be established in a period of 6 months or less. However, this step can be skipped if the FDA and PMDA implement Dataset-XML within a reasonable (<2 years) time.
    Once this is done, we are at least freed from being held hostage by the SAS-XPT format limitations, allowing us to take the next steps (SAS-XPT is currently the "show stopper" for any innovation). The acceptance of XPT should then be stopped by the FDA and PMDA within 2-3 years, to allow sponsors to adapt - even though "bar-delimited" and Dataset-XML files can easily be generated from XPT files.
  • Stop developing controlled terminology for which there is considerably better controlled terminology in the healthcare world. This comprises controlled terminology for LBTESTCD, LBTEST and UNIT. Investigate whether this should also apply to other CDISC controlled terminology (e.g. microorganisms?).
    This step does not mean that the use of the already developed terms is no longer allowed; it means that no effort is wasted on developing new terms anymore. Also note that this may mean that some subteams are put on hold.
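Returning to the bar-delimited format proposed above, a short sketch of how such an LB dataset could be written and read (the field names are ordinary LB variables; all values are made up):

```python
import csv
import io

# Write a tiny LB dataset in the proposed bar-delimited, UTF-8 format,
# with the variable names in the first row.
LB_ROWS = [
    ["STUDYID", "DOMAIN", "USUBJID", "LBTESTCD", "LBORRES", "LBORRESU"],
    ["STUDY01", "LB", "STUDY01-001", "GLUC", "92", "mg/dL"],
    ["STUDY01", "LB", "STUDY01-002", "ALB", "4.2", "g/dL"],
]

buffer = io.StringIO()
csv.writer(buffer, delimiter="|", lineterminator="\n").writerows(LB_ROWS)
dataset = buffer.getvalue()
print(dataset)

# Reading it back is just as trivial, in any language:
rows = list(csv.reader(io.StringIO(dataset), delimiter="|"))
```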
Mid term actions

Once we are "freed" from SAS-XPT, we can take the next steps:
  • Decide which controlled terminology should be deprecated. For example, I expect the "COMPONENT" part of LOINC to be a better alternative for LBTESTCD/LBTEST. Databases for these are already available. Note that "COMPONENT" in LOINC is limited to 255 characters in length, so considerably more than the ridiculous 8 characters of LBTESTCD. But that is not a problem, as the transport format has no length limitations for fields at all.
    The "deprecation time" in which the old terminology is faded out can then be agreed to be e.g. 5 years. For UCUM, I think the case is clear: we can no longer afford to disconnect from e-healthcare.
  • Considerably improve our relationships with other SDOs (HL7, Regenstrief, NLM) in healthcare, not considering them as "the enemy" anymore, but being prepared to learn from them, even deprecating some of our standards in favor of well-established ones in healthcare.
  • As SDTM is not fit for e-Source, develop new Findings domains that are fit for e-Source, probably using LOINC and other modern coding systems as identifiers for tests. As long as not everything is e-Source, these domains will probably exist in parallel with the existing domains. This time, do it right: do not allow derived and data-redundant variables.
    This step is not as easy as it looks: it would mean that for these domains, SDTM becomes a real relational database, which also has the consequence that the data-redundant variables that were introduced "for ease of review" will not be present anymore (leading to higher data quality), and that review tools at the regulatory authorities will need to be adapted, i.e. they will need to implement "JOINs" between tables (for relational databases, this might mean creating "VIEW" tables).
    This step will require a change in mentality for both CDISC and the regulatory authorities: for CDISC, from "we do everything the FDA/PMDA (reviewers) ask us" to a real partnership, in which CDISC helps the FDA and PMDA implement these new, improved domains. This may mean that CDISC's own consultants work at the FDA and PMDA for some time to help adapt their systems. This looks more difficult than it is, as it essentially reduces to implementing foreign keys and creating "VIEW"s on tables. Essentially, it would also mean that CDISC and the FDA work together on the validation rules and their technical implementation, so that high-quality validation rules and implementations of them become available, very probably as "real open source" (the current validation software used by the FDA and PMDA is suboptimal and based on a commercial company's own interpretation of the SDTM-IGs).
  • Switch from a simple transport format to a modern one (which might, but need not, be Dataset-XML), allowing for modern review using RESTful web services, as e.g. delivered by the National Library of Medicine and others, allowing "Artificial Intelligence" for considerably higher quality (and speed) of review.
  • Start thinking about the SDTM of the future. Must it be tables ("the world is not flat, neither is clinical data" - Armando Oliva, 2009)? Execute pilots with submissions of "biomedical concepts" and "linked data", using a transport format that is independent of the model.
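The "JOIN"/"VIEW" idea from the list above can be sketched in a few lines, here with an in-memory SQLite database (table and column names are illustrative, and the values are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- demographics stored once per subject ...
    CREATE TABLE dm (USUBJID TEXT PRIMARY KEY, AGE INTEGER, SEX TEXT);
    -- ... and findings referencing them, without redundant copies
    CREATE TABLE lb (USUBJID TEXT REFERENCES dm(USUBJID),
                     LBTESTCD TEXT, LBORRES REAL, LBORRESU TEXT);
    INSERT INTO dm VALUES ('STUDY01-001', 34, 'F');
    INSERT INTO lb VALUES ('STUDY01-001', 'GLUC', 92, 'mg/dL');
    -- the familiar "flat" table for the reviewer, computed on the fly
    CREATE VIEW lb_review AS
        SELECT lb.*, dm.AGE, dm.SEX
        FROM lb JOIN dm ON lb.USUBJID = dm.USUBJID;
""")
row = con.execute("SELECT LBTESTCD, LBORRES, AGE, SEX FROM lb_review").fetchone()
print(row)
```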
Whereas the "short term" can be limited to something like 6 months, the "mid term" will probably take something like 2-3 years. This step will surely require "buy-in" from the FDA and PMDA. Within CDISC, there is so much expertise that is currently not used: we have a good number of brilliant volunteers (some of whom we lost to other organizations such as PHUSE) who can help bring ourselves and the regulatory authorities to the next level of quality in review.

Long term actions
  • SDTM is most probably not the ideal way to submit information to the regulatory authorities. Even when "cleaned", removing unnecessary and redundant information, a set of tables should not be the "model"; it should only be one of the many possible "views" on the data. At this moment, essentially only the "table view" is used, unless some (but not all) reviewers have their own "trick box" (own tools) to get more out of the data.
  • In the "mid term" period, we should already start looking into using "biomedical concepts" for submission, following ideas already developed by some of our volunteers and a number of companies. We might even already do pilots with the regulatory authorities at this point.
  • In the "long term", we must come to a better way of submitting information, part of which will be in the form of "biomedical concepts". When looking at HL7-FHIR, I see that their "resources" and "profiles" are extremely successful and very near to what we need in clinical research, also for submissions.
  • Work together with other organizations to come to a single model for care and research in the medical world. With the rise of wearables, site-less studies, and interoperable electronic health records in many countries, we can no longer afford to work in isolation (or even claim that clinical research is "special").
Personally (but who am I?), I would not be surprised if we make it happen that e.g. FHIR and the CDISC standards evolve into a single standard 10 years from now.

Saturday, December 2, 2017

An e-Protocol Annotation Tool

As part of my professorship in medical informatics at the Institute of e-Health at the University of Applied Sciences FH Joanneum, I also have a little bit of time to do some more "basic" research. This research is often not funded, as "applied sciences" universities in Austria only get money from the state for teaching activities.

In the last days, I started working on a clinical research protocol tool.
It is still extremely primitive, but I want to share my first results with you anyway.

The tool makes extensive use of RESTful web services, for example the NLM RESTful web services, web services from HIPAA (which require an account and token), the UMLS RESTful web services, and of course our own RESTful web services.

CDISC-SDTM annotation

Much of the information in the protocol finally goes into the SDTM submission to the FDA or PMDA. For example, there is a lot of information that goes into the SDTM "TS" (trial summary) dataset. The protocol can be annotated with the information about where each piece of information needs to go in the TS dataset and under which parameter name.
The same info also goes into clinical trial registry submissions, ideally using the CDISC CTR-XML standard.
Here is a short demo about how the annotation works:

... and so on ...
As one can see, the user can not only annotate the part to which the code should be assigned (yellow), but also the value of the code or parameter (green).
This of course is "easy prey" for an artificial intelligence program. So in my opinion, assigning and retrieving such "trial summary parameters" can easily be automated.

LOINC annotation

With this tool, annotating laboratory tests with their LOINC code becomes very easy. A simple demonstration is shown here:

SNOMED-CT annotation

For SNOMED-CT annotation, I used the UMLS RESTful web services API. Please note that these require a UMLS account and API token, and possibly a (country) SNOMED-CT license. A short demo is shown here:

If you do not have a UMLS account and API token, you can of course always start a "Google Search", which can be launched from within the tool.

Other types of annotations that can currently be used are UMLS, ICD-10 (WHO) and the ATC (Anatomical, Therapeutic, Chemical) classification system for therapeutic drugs.