Monday, January 9, 2023

CDISC SDTM codetables, Define-XML ValueLists and Biomedical Concepts

Yesterday, I started an attempt to implement the "CDISC CodeTables" in software to allow even more automation when doing SDTM mapping using our well-known SDTM-ETL software.
As the name says, CDISC has published these as tables, and so far only as Excel worksheets. Unfortunately, this information is not in the CDISC-Library yet; otherwise it would only have cost me a relatively simple script to access the CDISC-Library API and a few hours to get all the information implemented as Define-XML "ValueLists".

Essentially, I do not really understand (others will probably say "he does not want to understand") why these codetables were not published as Define-XML ValueLists right from the start. Was it that the authors have limited or no Define-XML knowledge (there are CDISC trainings for that ...)? Or is it still the thinking that Define-XML is something that one produces after the SDTM datasets have been produced (often using some "black box" software of a specific vendor), rather than using Define-XML upfront (pre-SDTM-generation) as a "specification" for the SDTM datasets to be produced (the better practice)? Or is it just still the attitude of using Excel for everything ...: "if all you have is Excel, everything is a table".
Now, I do not have anything against tables. I have been teaching relational databases at the university for many years, and these are indeed based on ... tables. The difference, however, is that in a relational database the relations are explicit (using foreign keys), whereas in all the CDISC tables (including those for SDTM, SEND and ADaM) the relations are mostly implicit, described in some PDF files.

When I started looking into the Excel files, I immediately had to say "OMG" ...

Each of the Excel files seems to have a somewhat different format, some with and others without empty columns, and with completely different headers. So even if I wrote software to read out the content, I would still need to adapt the code (or use parameters) for each input file to have at least some chance of success. Although this is far from ideal, I then wrote such a little program, and could at least produce some raw XML CDISC CodeLists, although the results still require a lot of rework.
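
A minimal sketch of such a parameter-driven reader (not the actual program) might look as follows. It assumes the worksheet rows have already been read into plain Python lists; the `SheetLayout` class, the function name, the OID and the placeholder rows are all hypothetical, and ODM namespaces are left out for brevity.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class SheetLayout:
    """Hypothetical per-file description of where the useful columns sit."""
    header_rows: int  # number of leading rows to skip
    code_col: int     # zero-based column index of the coded value
    decode_col: int   # zero-based column index of the decoded text


def rows_to_codelist(rows, layout, oid, name):
    """Build a raw ODM-style CodeList element from worksheet rows."""
    codelist = ET.Element("CodeList", OID=oid, Name=name, DataType="text")
    seen = set()
    for row in rows[layout.header_rows:]:
        # Skip rows that are too short to contain both columns
        if len(row) <= max(layout.code_col, layout.decode_col):
            continue
        code = (row[layout.code_col] or "").strip()
        # Skip empty rows and repeated codes
        if not code or code in seen:
            continue
        seen.add(code)
        item = ET.SubElement(codelist, "CodeListItem", CodedValue=code)
        decode = ET.SubElement(item, "Decode")
        text = ET.SubElement(decode, "TranslatedText")
        text.text = (row[layout.decode_col] or "").strip()
    return codelist


# Placeholder rows, NOT real codetable content
rows = [
    ["Submission Value", "", "Decode"],  # header row with an empty column
    ["CODE1", "", "First placeholder term"],
    ["CODE2", "", "Second placeholder term"],
    ["", "", ""],                        # empty row
]
layout = SheetLayout(header_rows=1, code_col=0, decode_col=2)
codelist = rows_to_codelist(rows, layout, "CL.EXAMPLE", "Example")
print(ET.tostring(codelist, encoding="unicode"))
```

Every input file then only needs its own `SheetLayout` instead of its own code, which is about as far as one can get when every worksheet has a different layout.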

So I started with the DS (Disposition) codetable, which went pretty smoothly.

Then I decided to tackle a more complicated one, the codetable for EG (ECG - Electrocardiogram).
I knew this would be a non-trivial one, as the EG domain itself is pretty weird. In contrast to normal CDISC practice, EGTESTCD and EGTEST have 2 codelists as can be seen in the CDISC-Library Browser

i.e. one for classic ECGs and one for Holter Monitoring tests.

Personally, I consider this very bad practice. The normal (good) practice is to have a single codelist, and then use Define-XML ValueLists with "subset" codelists for different use cases. This is a practice also followed by CDISC for other domains, e.g. by publishing a subset-codelist for units specifically for Vital Signs tests.

Also, when creating SDTM datasets, we define subset codelists all the time in our define.xml, e.g. based on the category (--CAT variable), but we also generate a subset codelist with only the tests that appear in our CRFs or were defined in the protocol. For example for LB (Laboratory) we will not submit all 2500+ terms for LBTESTCD and LBTEST, but only the ones we used or planned to use.
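
Generating such a "used terms only" subset is simple to sketch. The function name and the input structure (a dictionary of code to decode) are assumptions for illustration, and the three LB terms are just a tiny excerpt of the full codelist.

```python
def subset_codelist(full_codelist, used_codes):
    """Return the terms of full_codelist (code -> decode) that were actually used,
    plus the used codes that are missing from the full codelist."""
    used = set(used_codes)
    subset = {code: decode for code, decode in full_codelist.items() if code in used}
    unknown = used - set(full_codelist)
    return subset, unknown


# Tiny illustrative excerpt of LBTESTCD/LBTEST
lbtestcd = {
    "ALT": "Alanine Aminotransferase",
    "AST": "Aspartate Aminotransferase",
    "GLUC": "Glucose",
}
# Codes collected from the CRFs or the protocol; "XYZ" is a deliberate non-term
subset, unknown = subset_codelist(lbtestcd, ["ALT", "GLUC", "XYZ"])
print(subset)
print(unknown)
```

The `unknown` set is useful for QC: anything in it is a test code used in the study that is not in the published codelist.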

But maybe the authors of this part of the standard were unaware of define.xml, subset codelists, and especially Define-XML "ValueLists" and the nice possibility to work with "WhereClauses".

So, the codetable for EG, in Excel format, comes with two tabs: "EG_Codetable_Mapping" and "HE_Codetable_Mapping":

 

That the latter is for the "Holter Monitoring Case" is not immediately obvious: there is not even a "README" tab explaining the use cases.

As usual (and unfortunately), there are different sets of columns for the different variables the subsets of codes apply to:


This makes it hard to automate anything to use it in software: either one needs to revamp the columns, or do a huge amount of copy-and-paste (as before the CDISC-Library days).

When comparing the contents of the tabs, things get even more complicated.
Some subset codelists appear in both tabs, others such as the ones for units (for EGSTRESU, depending on the value of EGTESTCD) only in the first. Does this mean the units subsets are not applicable to the Holter Monitoring use case?

When comparing the subsets for the value of EGSTRESC (depending on EGTESTCD) in both tabs, some are equal (e.g. for EGTESTCD=AVCOND), while others differ, ranging from a single term to a larger set of terms.
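
Such a comparison between the two tabs can be sketched as a set comparison per test code. This assumes each tab has already been reduced to a dictionary of test code to set of allowed terms; the function name, the test code `TESTCD_X` and all term values are placeholders, not real codetable content.

```python
def compare_subsets(eg_tab, he_tab):
    """Compare two mappings of test code -> set of allowed result terms."""
    report = {}
    for testcd in sorted(eg_tab.keys() | he_tab.keys()):
        eg_terms = eg_tab.get(testcd)
        he_terms = he_tab.get(testcd)
        if eg_terms is None:
            report[testcd] = "only in HE tab"
        elif he_terms is None:
            report[testcd] = "only in EG tab"
        elif eg_terms == he_terms:
            report[testcd] = "identical"
        else:
            # Keep the differences in both directions
            report[testcd] = {
                "only_eg": eg_terms - he_terms,
                "only_he": he_terms - eg_terms,
            }
    return report


# Placeholder terms; only the test code AVCOND is taken from the text above
eg_tab = {"AVCOND": {"TERM_A", "TERM_B"}, "TESTCD_X": {"TERM_C", "TERM_D"}}
he_tab = {"AVCOND": {"TERM_A", "TERM_B"}, "TESTCD_X": {"TERM_C", "TERM_E"}}
report = compare_subsets(eg_tab, he_tab)
print(report)
```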

I tried to resolve all this by adapting my software - it didn't work well. So I started doing ... copy and paste ...

This results in subset codelists like:


with some codelists coming in two flavors, one for the normal case and one for the Holter Monitoring case - of course I gave these different OIDs.

For the units, the organization in the worksheet is pretty unfortunate, leading to e.g.:


stating that for each of EGTESTCD being JTAG, JTCBAG, JTCBSB and JTCFAG the only allowed unit is "msec" (milliseconds) for EGSTRESU.
This is valid for use in Define-XML "ValueLists". The "WhereClause" would then e.g. say:
"Use codelist CL.117762.JTAG.UNIT for EGSTRESU when EGTESTCD=JTAG".

The better way, however, is to define one codelist, e.g. "ECG_Interval", and define a WhereClause stating when it should be used for EGSTRESU. For the Define-XML ValueList and WhereClause, this leads to e.g.:


with the subset item and codelist defined as:

 

and the ValueList of course assigned to EGSTRESU:
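
As a rough sketch, the three pieces described above (a ValueList with a WhereClause using the "IN" comparator, the subset item with its single codelist, and the ValueList reference on EGSTRESU) could be generated like this. The element and attribute names follow Define-XML 2.1, but all OIDs apart from the codelist name "ECG_Interval" are hypothetical, and required details such as Length, descriptions, origins and NCI codes are omitted.

```python
import xml.etree.ElementTree as ET

ODM_NS = "http://www.cdisc.org/ns/odm/v1.3"
DEF_NS = "http://www.cdisc.org/ns/def/v2.1"
ET.register_namespace("", ODM_NS)
ET.register_namespace("def", DEF_NS)


def odm(tag):
    """Qualified name in the ODM namespace."""
    return f"{{{ODM_NS}}}{tag}"


def dfn(tag):
    """Qualified name in the Define-XML namespace."""
    return f"{{{DEF_NS}}}{tag}"


def build_unit_valuelist(test_codes, unit, codelist_name):
    """One codelist for EGSTRESU, applied to several EGTESTCD values at once."""
    # Hypothetical OIDs, for illustration only
    cl_oid = f"CL.{codelist_name}"
    wc_oid = f"WC.EG.EGSTRESU.{codelist_name}"
    vl_oid = "VL.EG.EGSTRESU"
    vl_item_oid = f"IT.EG.EGSTRESU.{codelist_name}"

    mdv = ET.Element(odm("MetaDataVersion"), OID="MDV.EXAMPLE", Name="Example")

    # ValueList: one entry, pointing to the WhereClause
    vld = ET.SubElement(mdv, dfn("ValueListDef"), OID=vl_oid)
    item_ref = ET.SubElement(vld, odm("ItemRef"), ItemOID=vl_item_oid,
                             OrderNumber="1", Mandatory="No")
    ET.SubElement(item_ref, dfn("WhereClauseRef"), WhereClauseOID=wc_oid)

    # WhereClause: EGTESTCD IN (list of test codes)
    wcd = ET.SubElement(mdv, dfn("WhereClauseDef"), OID=wc_oid)
    check = ET.SubElement(wcd, odm("RangeCheck"), {
        "Comparator": "IN", "SoftHard": "Soft", dfn("ItemOID"): "IT.EG.EGTESTCD"})
    for code in test_codes:
        ET.SubElement(check, odm("CheckValue")).text = code

    # The EGSTRESU variable itself, with the ValueList assigned to it
    var = ET.SubElement(mdv, odm("ItemDef"), OID="IT.EG.EGSTRESU",
                        Name="EGSTRESU", DataType="text")
    ET.SubElement(var, dfn("ValueListRef"), ValueListOID=vl_oid)

    # The subset item and its codelist
    sub = ET.SubElement(mdv, odm("ItemDef"), OID=vl_item_oid,
                        Name="EGSTRESU", DataType="text")
    ET.SubElement(sub, odm("CodeListRef"), CodeListOID=cl_oid)
    cl = ET.SubElement(mdv, odm("CodeList"), OID=cl_oid,
                       Name=codelist_name, DataType="text")
    ET.SubElement(cl, odm("EnumeratedItem"), CodedValue=unit)
    return mdv


mdv = build_unit_valuelist(["JTAG", "JTCBAG", "JTCBSB", "JTCFAG"], "msec", "ECG_Interval")
print(ET.tostring(mdv, encoding="unicode"))
```

Compared with one codelist per test code, this needs a single codelist and a single WhereClause, and adding another interval test only means adding one more CheckValue.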

 

Essentially, this is all very related to Biomedical Concepts!
For example, the concept "JTAG" (with name "JT Interval, Aggregate") would then have the property that it is an ECG test (and thus related to EGTESTCD/EGTEST in SDTM), with the property that the unit for it can only be "msec", at least when using "CDISC notation" for the unit. It would however be better to use the UCUM notation, which is "ms" and which is used everywhere in health care except at CDISC ..., and which has the advantage of allowing automated unit conversion, which is not possible with CDISC units.
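
To illustrate why UCUM notation lends itself to automation, here is a toy sketch. It only handles a base unit with a few metric prefixes, which is a tiny fraction of what a real UCUM library does; the function names and the one-entry mapping table are assumptions for illustration.

```python
# Mapping from CDISC notation to UCUM notation; only the pair named in the text
CDISC_TO_UCUM = {"msec": "ms"}

# A few UCUM metric prefixes ("u" is the UCUM ASCII notation for micro)
UCUM_PREFIXES = {"": 1.0, "m": 1e-3, "u": 1e-6, "k": 1e3}


def to_base_factor(ucum_unit, base):
    """Factor that converts a prefixed unit to its base unit, e.g. ms -> s."""
    if not ucum_unit.endswith(base):
        raise ValueError(f"{ucum_unit!r} is not based on {base!r}")
    prefix = ucum_unit[: len(ucum_unit) - len(base)]
    if prefix not in UCUM_PREFIXES:
        raise ValueError(f"unsupported prefix {prefix!r}")
    return UCUM_PREFIXES[prefix]


def convert(value, from_unit, to_unit, base):
    """Convert between two prefixed forms of the same base unit."""
    return value * to_base_factor(from_unit, base) / to_base_factor(to_unit, base)


# A JT interval of 250 msec (CDISC notation), expressed in seconds
ucum_unit = CDISC_TO_UCUM["msec"]
seconds = convert(250, ucum_unit, "s", base="s")
print(ucum_unit, seconds)
```

Because the prefix and the base unit can be derived from the UCUM string itself, no lookup table of conversion factors per unit pair is needed, which is exactly what the CDISC unit terms do not allow.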

CDISC has now published its first Biomedical Concepts in the CDISC-Library which can be queried using the Library RESTful API:
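
A request of this kind might be built as follows. The endpoint path and the `api-key` header name used here are assumptions that should be checked against the current CDISC-Library API documentation; the function names and the concept identifier are hypothetical, and nothing is sent until `fetch_json` is called with a valid key.

```python
import json
import urllib.request

BASE_URL = "https://library.cdisc.org/api"
# ASSUMPTION: path of the Biomedical Concepts endpoint; verify in the API documentation
BC_PATH = "/cosmos/v2/mdr/bc/biomedicalconcepts"


def build_bc_request(api_key, concept_id=None):
    """Build (but do not send) a request for all BCs or for a single BC."""
    url = BASE_URL + BC_PATH
    if concept_id:
        url += f"/{concept_id}"
    # ASSUMPTION: the API key is passed in an "api-key" header
    headers = {"api-key": api_key, "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers)


def fetch_json(request):
    """Send the request and parse the JSON response (needs a valid API key)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# "CONCEPT_ID" is a placeholder, not a real concept identifier
request = build_bc_request("YOUR_API_KEY", "CONCEPT_ID")
print(request.full_url)
```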


For example, for the BC "Aspartate Aminotransferase Measurement", the API response (in JSON) is:

 

As I understand it, CDISC is also working on generating BCs starting from codetables, especially for the oncology domains and codelists, where we have similar dependencies between standardized values (--STRESC) and possible units (--STRESU).

It would then be great if we could see all the codetables published by CDISC published as BCs, and made available by the CDISC-Library through the API. With the SDTM information then added, these correspond to the ValueLists in the define.xml of our SDTM submission.

But I will start with converting these awful Excel codetables to Define-XML CodeLists and ValueLists (with the corresponding WhereClauses of course) first.

Essentially, CDISC should not be allowed to publish standards (or even drafts of them) as Excel files; only a real, standardized machine-readable form, such as one based on XML or JSON, should be allowed. This would finally allow much better QC for the draft standards (instead of visual inspection!) and make the standards immediately usable in systems and software.

I presume many of you will disagree, so your comments are always welcome!

 

 

1 comment:

  1. Fundamentally too many people in this industry, even within CDISC, are still missing the point of metadata.
    If they aren't willing to learn Define well enough to use it properly, it doesn't bode well for BCs.

    Proper metadata practice (shared public Define up-front as a data contract, with BCs implemented in VLM) needs to be enforced if it is going to happen - either by regulators or by adoption of an industry-wide MDR

    For the latter to happen I'd like to think people simply need to see how much more powerful shared up-front metadata can be i.e. crowdsourcing BCs and code, automating dataset creation, never having a consistency issue ... but really this is a people problem. You can lead a horse to water but you can't make it drink
