Sunday, September 28, 2025

SDTM-ETL and R: Best of both Worlds?

Some time ago, there was an interesting discussion on LinkedIn (unfortunately I cannot find it anymore) about R versus SAS for generating SDTM datasets. My response was that I considered both inefficient and expensive (in training cost and/or license costs), as there are now software tools on the market that have a graphical user interface (GUI) and use "drag-and-drop", combined with wizards, to automate the mappings and generate the SDTM datasets. Examples of such software are our own SDTM-ETL, but also Ryze by Formedix (now owned by Certara) and XClinical's Marvin-Tabulator. This triggered a discussion about the advantages and disadvantages of such "specialized for SDTM" software versus general-purpose software such as R. One of the arguments by a contributor (not me) was that R is more flexible, whereas SDTM-specialized software such as SDTM-ETL, due to its user-friendliness, will probably be considerably more efficient for 80% of the studies. For very complicated studies however, and especially for studies with low-quality source data, R would then be the better choice due to its flexibility. Another contributor then asked whether there is "SDTM-specialized" software that has a user-friendly GUI, uses "drag-and-drop", has built-in SDTM knowledge, but uses R as the scripting language.

As far as I know and could find out, there is no such software available.

SDTM-ETL uses "EasyXSL" as the scripting language (80-90% of the scripts are generated automatically), an easy-to-learn language built on top of XSLT, the very powerful XML transformation language. This means that upon execution of the mappings, the EasyXSL scripts are combined (per domain or dataset definition) and translated into XSLT, which is then executed on the source data, generating the SDTM datasets in either SAS-XPT, the modern CDISC Dataset-JSON 1.1 format, or CSV.

So the idea was born: is it possible to have a version of SDTM-ETL, with all its advantages, that uses R as the scripting language instead of EasyXSL? This would mean that the R scripts generated by SDTM-ETL can be fed into an R engine and executed there. Essentially, it would make it possible to combine the user-friendliness (and built-in SDTM knowledge) of SDTM-ETL with the flexibility and extensibility of R.
This could then be a "best of both worlds".

So we decided to start developing a prototype (just as a "proof of concept") of such software. For the moment, we call it "SDTM-ETL-R".

What is important for us is that all existing "drag-and-drop" features, "SDTM knowledge" and "wizards" remain intact, and that the only difference is that the final generated script is now in R instead of EasyXSL. For the moment, the execution is then done in R-Studio.

We started with DM as the first domain, which was pretty easy, as DM has only one record per subject. For example, the mapping for RACE, automatically generated by SDTM-ETL, is then found to be:


 

and when tested on a set of clinical data (in CDISC ODM format), as exported from an EDC system, using the button "Test - Run in R":

and providing a file with clinical data by a "file chooser" (not shown here), the script is extended and reorganized in order to run for each subject in the source, and then opens in R-Studio:
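The generated script itself is shown in the screenshot. Purely to give an idea of the principle, an illustrative sketch (not the actual generated code; the item OID "I_DM_RACE" is hypothetical) of such a per-subject iteration over an ODM file, using the xml2 package, could look like:

# Illustrative sketch only - not the script generated by SDTM-ETL-R.
# Assumes an ODM 1.3 clinical data file and a hypothetical item OID "I_DM_RACE".
library(xml2)

odm <- read_xml("clinicaldata.xml")
xml_ns_strip(odm)  # drop namespaces so that the XPath expressions stay simple

subjects <- xml_find_all(odm, "//ClinicalData/SubjectData")

dm <- do.call(rbind, lapply(subjects, function(subj) {
  data.frame(
    USUBJID = xml_attr(subj, "SubjectKey"),
    RACE    = xml_attr(xml_find_first(subj, ".//ItemData[@ItemOID='I_DM_RACE']"), "Value"),
    stringsAsFactors = FALSE
  )
}))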

 

When then run, the "DataFrame" is generated and displayed in R-Studio:

 

As this was "too easy", we started working on LB (Laboratory), which is always more difficult. 
The output from the EDC system (in ODM) is organized as follows:

  

Do not mind the colors too much for the moment - they just mean that the items have annotations attached, already stating where they are supposed to go in SDTM. 
We always first start with the "topic" variable, which is LBTESTCD in this case. What we need to accomplish is to obtain one record per test (described by LBTESTCD/LBTEST) per visit per subject. As one can easily deduce, there are two tests here: "Erythrocytes" and "Leukocytes" (this is a very simple example study, for demo purposes only).
The target SDTM variable is shown in the table on the right side of the screen:

So we drag either "Erythrocytes" or "Leukocytes" from the left to the cell "LB.LBTESTCD". When we drag-and-drop "Erythrocytes", a dialog appears:

It automatically suggests that we take the identifier (Item-OID) from the source, not the value.
But we also want to have an entry in the output for Leukocytes, not only for Erythrocytes, and we want to have that for each visit.
So we select the checkbox "Generalize for all StudyEvents", and also check the checkbox "Generalize for all items", followed by clicking "Only for":

 

A list is then displayed from which we need to choose - we select "Erythrocytes" and "Leukocytes":

 

We do not want to use any of the other items for now; we will need them later to e.g. populate LBSTAT, LBORNRLO, LBORNRHI, LBNAM, etc.

Clicking OK then leads to the next "wizard":

 

The system loads the SDTM Controlled Terminology (CT) for LBTESTCD from the underlying define.xml, and presents all the values in the comboboxes:

 

Again we are lazy and allow the system to find out what the mappings need to be based on word similarity, if necessary additionally supported by CDISC's "synonym list" or our company synonym list. The latter can e.g. be useful when the lab test names are obtained in a language other than English. So, we click "Attempt 1:1 mappings". The system, in all its wisdom, then proposes:

 

which looks fine …

At this stage, we can also generate the subset codelist for LBTESTCD and have it stored in the underlying define.xml.
Reason is that in SDTM-ETL, everything is stored in an underlying define.xml, which also means that at the end of the process, we do already have a define.xml, which we only need to clean up a bit to come to a "submission-ready" define.xml.

Clicking "OK" then generates the mapping script completely automatically:

For those who understand "path-systems" a bit: in line 6, it selects the OIDs (identifiers) in the source data for "I_LB_RBC" and "I_LB_WBC" in the subform with identifier "IG_LB_HEMATOLOGY" in the form "F_LAB", and that for all visits (StudyEvents).
It then executes the mapping from the original item identifier to SDTM-CT, stores it in a temporary variable "NEWCODEDVALUE", and in the last line stores it in the variable LB.LBTESTCD.
Simple, isn't it? And all that without writing a single line of code.
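To give an idea of what the generated script does, the same logic could be written in R roughly as follows (an illustrative sketch only, not the literal generated code; "subj" is one SubjectData node, as in the per-subject loop sketched earlier):

# Illustrative sketch of the LBTESTCD mapping - not the literal generated script.
library(xml2)

# "line 6" equivalent: select the item OIDs for I_LB_RBC and I_LB_WBC in
# ItemGroup IG_LB_HEMATOLOGY of form F_LAB, for all StudyEvents (visits)
items <- xml_find_all(subj, paste0(
  ".//StudyEventData/FormData[@FormOID='F_LAB']",
  "/ItemGroupData[@ItemGroupOID='IG_LB_HEMATOLOGY']",
  "/ItemData[@ItemOID='I_LB_RBC' or @ItemOID='I_LB_WBC']"))
item_oids <- xml_attr(items, "ItemOID")

# map the original item identifiers to SDTM controlled terminology
ct_map <- c(I_LB_RBC = "RBC", I_LB_WBC = "WBC")
NEWCODEDVALUE <- unname(ct_map[item_oids])

# last step: store the result in the SDTM variable
LB.LBTESTCD <- NEWCODEDVALUE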

If then the button "Test - Run in R" is clicked, the script is extended and slightly reorganized to be able to run "standalone" in R-Studio:

and when run, leads to the following DataFrame in R-studio:

 

It doesn't show any other information yet, like the visit number, the date/time of collection, etc., as it is only "local" testing. It is only when the R scripts for each of the variables are combined that the full picture will appear.

For LBORRES, we can do exactly the same drag-and-drop, and as the system remembers the previous settings and now knows it needs the value, not the item identifier, this just leads to a one-liner in the mapping script editor. When then clicking "Test - Run in R", this leads to the following in R-Studio:

Good observers have already noticed that the upper part is always (almost) identical, setting up things, and loading a set of functions. In the lower part, the mapping script from the "Mapping editor" is just embedded into an iteration over all subjects. The way we do this may however change in future.

For LBTEST (test name), we could follow the same procedure as for LBTESTCD, as the system already knows what the codelist for LBTEST is. However, in "regular" SDTM-ETL, there is a smarter way: it has a function "decode", which takes the value of a --TESTCD variable, and then "decodes" it to the value of LBTEST.
Although not everything is ready yet, we already made a function for that in R: 

It just accepts the location of the underlying define.xml, the OID of the codelist (which is in the underlying define.xml anyway) and the value of --TESTCD.
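A minimal sketch of what such a decode function could look like, using the xml2 package (an illustration only; the actual SDTM-ETL-R implementation may differ), is:

# Illustrative sketch of a define.xml-based decode function (actual implementation may differ).
library(xml2)

decode <- function(define_xml_path, codelist_oid, coded_value) {
  define <- read_xml(define_xml_path)
  xml_ns_strip(define)  # ignore the ODM/Define-XML namespaces for simple XPath
  # find the CodeListItem with the given coded value in the given CodeList
  decode_node <- xml_find_first(define, sprintf(
    "//CodeList[@OID='%s']/CodeListItem[@CodedValue='%s']/Decode/TranslatedText",
    codelist_oid, coded_value))
  xml_text(decode_node)
}

# Example (the codelist OID is hypothetical):
# decode("define.xml", "CL.LBTESTCD.SUBSET", "RBC") would return "Erythrocytes"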

In the mapping for LBDTC (date/time of collection), there is a peculiarity that we need to concatenate the value of the visit date and the visit time:

 

So we need two drag-and-drops.
The peculiarity here is that at the second drop, the path needs to be relative to the first drop. This, too, is something that we solved, leading to the "test" R-script in R-Studio:
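The essential mapping step itself is just an ISO 8601 concatenation of the two dropped values, e.g. (a minimal sketch with illustrative variable names):

# Illustrative sketch: combine the visit date and visit time into an ISO 8601 datetime.
# 'visit_date' and 'visit_time' would come from the two drag-and-dropped ODM items.
visit_date <- "2025-09-28"
visit_time <- "09:15"

LB.LBDTC <- if (is.na(visit_time) || visit_time == "") {
  visit_date                           # date only when no time was collected
} else {
  paste0(visit_date, "T", visit_time)  # e.g. "2025-09-28T09:15"
}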

 
Resulting in the DataFrame:

 


Some of you may already ask themselves: "What is the relationship with Oak, the CDISC COSA (CDISC Open Source Alliance) project?"

Oak is a great library of R functions for use in developing mappings to SDTM. Oak doesn't have any built-in knowledge of CDISC standards though. As far as I could find out, it doesn't have functions for ODM import (or for selecting items from ODM for mapping); it is still up to the user to import the data (usually in a "flat" format) and make it suitable for mapping to SDTM. Neither does it have an export function to SAS-XPT (that can however be realized using the xportr package), nor one for the modern CDISC Dataset-JSON format (which can be done using the "datasetjson" package). It also has no functions for e.g. automated codelist-to-codelist mapping based on word similarity, synonyms or company-specific libraries. It does not use or produce define.xml either; in Oak, CDISC codelists are to be read in as CSV files. Oak also doesn't seem to use RESTful Web Services (RWS), doesn't know what the CDISC Library is or does, nor does it have an automated way of generating LBTESTCD, LBTEST, LBMETHOD, etc. starting from the provided LOINC code using the famous "LOINC-CDISC Mapping" services. This is one of the functionalities that is most appreciated by our users in our "regular" SDTM-ETL. Such a function can however also easily be implemented in R starting from the following code:
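(As a rough illustration of the principle only: such a lookup could be sketched in R with the httr and jsonlite packages as below; the service URL and the structure of the response are hypothetical here, not the actual service we use.)

# Illustrative sketch only: query a LOINC-to-CDISC mapping RESTful web service.
# The endpoint URL and the response structure are hypothetical.
library(httr)
library(jsonlite)

loinc2sdtm <- function(loinc_code) {
  url <- paste0("https://www.example.org/loinc2sdtm/", loinc_code)  # hypothetical endpoint
  response <- GET(url)
  stop_for_status(response)
  # e.g. a JSON object with LBTESTCD, LBTEST, LBSPEC, LBMETHOD, ...
  fromJSON(content(response, as = "text", encoding = "UTF-8"))
}

# Example: loinc2sdtm("34183-2")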

and as an example result for LOINC code 34183-2 (Total Beta globulin/Protein in 24 hour Urine):

 

So one cannot see Oak as a "complete solution". It is a set of useful functions for R, i.e. provides some easy-to-use "building blocks". This of course also provides the flexibility I talked about before.
Some of these Oak functions can however also be used in SDTM-ETL-R, as Oak has a good number of functions for "post-processing", such as assigning --SEQ values, deriving the --LOBXFL (Last Observation before first Exposure Flag), deriving study days (--DY values), a function to move "non-standard variables" (NSVs) from the original domain to a SUPPxx dataset (unfortunately the information about the NSVs needs to be delivered as … CSV …), and deriving DM.RFXSTDTC (Date/Time of First Study Treatment) and DM.RFXENDTC (Date/Time of Last Study Treatment) from the information in the EX dataset. Such "post-processing" steps are done by XSLT or streaming parsing (SAX) in "regular" SDTM-ETL. What I did miss in Oak, however, and which is present in "regular" SDTM-ETL, is a function for automated assignment of VISITNUM for unscheduled visits. There also do not seem to be any functions for generating RELREC or SV (Subject Visits) in Oak. These are easily generated in "regular" SDTM-ETL though.
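As a generic illustration of what such a post-processing step amounts to (plain dplyr here, deliberately not Oak's actual API), --SEQ assignment is essentially a numbering of records within each subject, given a chosen sort order:

# Generic illustration of --SEQ assignment as a post-processing step (not the sdtm.oak API).
library(dplyr)

lb <- tibble::tribble(
  ~USUBJID, ~LBTESTCD, ~LBDTC,
  "S001",   "RBC",     "2025-09-01T09:15",
  "S001",   "WBC",     "2025-09-01T09:15",
  "S002",   "RBC",     "2025-09-02T10:00"
)

lb <- lb %>%
  group_by(USUBJID) %>%
  arrange(LBTESTCD, LBDTC, .by_group = TRUE) %>%  # a chosen sort order within each subject
  mutate(LBSEQ = row_number()) %>%
  ungroup()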

So yes, we will very probably use Oak in SDTM-ETL-R, especially for some of the "post-processing" steps such as --SEQ assignment, --DY generation, --LOBXFL assignment, and maybe generation of SUPPxx. For the latter however, we do not want to use CSV files, as in SDTM-ETL, the information of what an NSV is (and its properties) is already in the underlying define.xml. This makes more sense.

Even when we use these Oak functions, we do not need to change anything in the Graphical User Interface:

 

What we will need to do however is to replace the existing "post-processing" modules by (calls to) Oak R-functions.

What are the next steps?

First we need to further adapt our existing mapping functions (we have about 40 of them) to R. This will also enable us to extend our example R-mappings as a "test bed" for how to proceed.
We will also start working on assembling the individual mappings for the variables for a single domain into a single R-script. This can be a script that calls (or imports) other R-scripts, or assembles everything into a single script. We still need to find out what is best. Of course, this "domain" script (or scripts) will call other functions, our own ones as well as Oak ones.

We will regularly report about our progress.

Personally, I am still convinced that, due to its user-friendliness, built-in SDTM knowledge, and many functionalities for which there is no equivalent in R nor in Oak, "regular" SDTM-ETL can be the first choice for 80% of the studies, if not for 90%. The use of R however has the advantage of more flexibility, which can be an advantage for the remaining 10-20%, for example for very complicated studies, or for studies where data quality is low and the data need a lot of "rearrangement". The use of R allows one to e.g. change the order in which scripts are executed, or insert "special" scripts for exceptional cases. This is also possible with the XSLT generated by SDTM-ETL (which most users never see), but there are not so many people who have such a good knowledge of XSLT that they can "adapt" this XSLT. For R however, there is a much larger "specialist" base, especially as R is used a lot in academia, and millions of people are educated in it (even my daughter, who studied "international development studies", had to learn it).

So, a software tool with the user-friendliness and built-in SDTM knowledge of SDTM-ETL, but that uses R and Oak for the scripts, may have a place in the market for SDTM (but also SEND) mapping tools. As such, it may combine the "best of both worlds".

Do you think we should further develop SDTM-ETL-R? Please submit a comment here, or send us an e-mail.

Saturday, June 7, 2025

Adding Custom Rules to CDISC CORE - first experiences

 

Today, I started gaining experience with the CDISC CORE feature that allows one to add one's own "custom" rules to the CORE engine. Essentially, this is a great feature, not offered at all by e.g. Pinnacle21.

How to add custom rules to the CORE engine is explained on the CORE GitHub website, but I still had to experiment a bit to get it all right. That is also the reason for this blog, so that the reader has a clear path and explanation of how this can be accomplished.

As I did already have some experience with the development of CDISC CORE rules as a volunteer (I was made responsible for the SEND rules and some of the FDA Business Rules), I could make a jump start.

So, I first made a custom rule that all subjects in my studies may not be older than 65 years:

simply using NotePad++.

Important here (as I found out mostly by trial and error) are:
- You can and should define your own "Organization"
- Normally, you will define custom rules as part of your own (set of) standard(s), in this case, "XML4Pharma_Standard". Originally, I had "CDISC" here, which caused my rule not to run. In this screenshot, that is already corrected. CROs may have several such sets of standards, e.g. for each sponsor customer.
- Assign your own ID to the rule, different from any of the CDISC CORE IDs

I also added a second one (with ID XML4P2) stating that AEDECOD must always be populated.
Whereas the first rule is more meant for specific studies, the second is more a "Quality Assurance" rule for submissions to regulatory authorities.

I then added the YAML files (one per custom rule) to a directory where we will manage all our custom rules. In our case, we used the directory "D:\CDISC_Standards\CDISC_CORE_extensions":


 

Then, using Windows Powershell and the CORE Windows distribution, I used the command:

.\core.exe update-cache --custom_rules_directory D:\CDISC_Standards\CDISC_CORE_extensions

which failed, as the engine requires the API key; the message was:

Error: Missing option '--apikey'.

So I added it:

.\core.exe update-cache --custom_rules_directory D:\CDISC_Standards\CDISC_CORE_extensions --apikey {apikey}

where {apikey} is replaced by my real API key.

This led to the message:

Added 2 rules: XML4P2, XML4P1
Cache updated successfully.

One also needs to set up a file that links the custom rules with a "custom standard", which is done in a JSON file.
In our case, I defined a "custom standard" named "xml4pharma_standard", with the contents:


  "xml4pharma_standard/1-0": [ 
    "XML4P1", "XML4P2" ]
}

Remark that I used lowercase for "xml4pharma_standard": the value is later used case-sensitively when starting a validation. It took me some time to realize that …
The value after the slash is the version of our custom standard. This will enable versioning of rules and standards.

This JSON file with links between the custom standard and custom rules is then submitted to the engine using:

.\core.exe update-cache --custom_standard D:\CDISC_Standards\CDISC_CORE_extensions\XML4Pharma_Standard.json --apikey {apikey}

Leading to the message:

Added 1 new standards
Cache updated successfully

One can then check whether these are found, using:

 .\core.exe list-rules --custom_rules

which shows both custom rules, displayed in JSON format (although I defined them in YAML):

or, leading to the same result, as we only defined one "standard":

.\core.exe list-rules --custom_rules -s xml4pharma_standard -v 1-0

One can however imagine that a CRO may want to define several "custom" standards, one for each sponsor customer.

However, the command:

.\core.exe list-rule-sets

does not list the newly defined standard - it only lists the rules defined by CDISC.

For running against the custom rules, I then used:

.\core.exe validate -cs -s xml4pharma_standard -v 1-0 -d D:\MetaDataSubmissionGuide_2_0_JSON_1-1_Files_testing\SDTM_Dataset-JSON

I found that two things are very important to get this to work:

* one really needs to add the "-cs" keyword
* for the standard name, provided immediately after "-s", this is case sensitive.
When one e.g. uses "-s XML4Pharma_Standard", validation executes, but in the resulting Excel report, the tab "Rules_Report" will not list any results. So the name of the standard in ".\core.exe validate -s" must exactly match what is in the file XML4Pharma_Standard.json:


  "xml4pharma_standard/1-0": ["XML4P1", "XML4P2" ]
}

I then noticed that, although my study has participants over 65 years old (DM.AGE, with DM.AGEU=YEARS), no issues were reported for DM. The reason was that my rule originally stated that it is an SDTMIG rule, instead of a rule of my custom standard. So I corrected the YAML, and then used the "update-cache" command with the "--update_custom_rule" option:

.\core.exe update-cache --update_custom_rule D:\CDISC_Standards\CDISC_CORE_extensions\AGE_65.yaml --apikey {apikey}

leading to the message:

Updated rule: XML4P1
Cache updated successfully

I then ran:  

.\core.exe validate -cs -s xml4pharma_standard -v 1-0 …

again, and indeed, the engine reported issues for all the subjects in DM that are over 65 years old:


Other important remarks:

* I used the CORE Windows distribution 0.10.0 for adding these custom rules. This will of course also work in the same way when using the native Python distribution: one just needs to replace ".\core.exe" by "python core.py".
For Linux (and probably for Mac), one should just use "./core" instead of ".\core.exe".
* I haven't found out yet how one can run the CDISC/FDA rules together with the custom rules in just one run. When I do find out (if it is possible at all), I will update this blog.

Future work:

We recently developed software that uses the CDISC Library and its API to query the CDISC "Dataset specializations".
Essentially, these describe combinations of variables and their properties that belong together for individual cases. For example, in SC (Subject Characteristics), when SCTESTCD=MARISTAT (Marital Status), it is expected that the value of SCSTRESC (Standardized result) is one of:
[ANNULLED, DIVORCED, DOMESTIC PARTNER, INTERLOCUTORY, LEGALLY SEPARATED, MARRIED, NEVER MARRIED, POLYGAMOUS, SEPARATED, WIDOWED],
which can be translated into a CORE check:
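The logic of that check, expressed here in R purely for illustration (the actual CORE rule is of course written in YAML), is essentially:

# Illustrative R expression of the check logic (the real CORE rule is written in YAML).
allowed_maristat <- c("ANNULLED", "DIVORCED", "DOMESTIC PARTNER", "INTERLOCUTORY",
                      "LEGALLY SEPARATED", "MARRIED", "NEVER MARRIED", "POLYGAMOUS",
                      "SEPARATED", "WIDOWED")

sc <- data.frame(
  USUBJID  = c("S001", "S002"),
  SCTESTCD = c("MARISTAT", "MARISTAT"),
  SCSTRESC = c("MARRIED", "SINGLE"),   # "SINGLE" is not in the allowed list
  stringsAsFactors = FALSE
)

# flag records where the standardized result is not one of the allowed terms
violations <- subset(sc, SCTESTCD == "MARISTAT" & !(SCSTRESC %in% allowed_maristat))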



In the define.xml, this will usually be defined by a "ValueList" (yes, we also have already automated the generation of such valuelists using the CDISC Library API).

We are currently setting up a library (as a separate "standard") that we will add to the CORE engine as a set of custom rules. This will be explained in a separate blog.

Another interesting thought is about the use of RESTful web services (RWS) to have "on-the-fly" rules. For example, to test whether the values of LBTESTCD, LBTEST, LBMETHOD, LBRESTYP, LBRESSCL, …, are in accordance with the value in LBLOINC.
We do already have the mappings between LOINC and SDTM for about 10,000 LOINC codes and an API for them, so taking LBTESTCD, LBTEST, LBMETHOD, LBRESTYP, LBRESSCL, ... into account, this would lead to >40,000 rules.
Is that what we want? Probably not …
We will probably want to have only 5-6 rules that take the value of LBLOINC, use the RWS, and check whether the value in LBTESTCD matches. The same for LBTEST, LBMETHOD, etc.

One could also think about having something similar for checking e.g. whether the value in AEDECOD (which usually takes the "preferred term" in MedDRA) and the value in AELLT (MedDRA "Lower Level Term") match. However, in such a case, we cannot use a public RESTful Web Service due to license restrictions, but we can think of companies having their own (licensed) implementation of MedDRA as a central database, and having CORE rules that query this database, e.g. using an RWS.

Essentially, we are only at the beginning of exploring the opportunities that CORE, as open-source software, brings us. We can even start thinking about the use of "Dataset Specializations" and "CORE on the fly" during the development of the mappings. Imagine for example someone coding that the standardized unit (--STRESU) for --TESTCD=WEIGHT is "cm"; the system would then immediately protest, stating that only "kg" and "LB" are allowed.

The only real limitation is (as so often): lack of imagination ...

Sunday, March 16, 2025

New PMDA rules for CDISC standards published March 2025

PMDA (the Japanese regulatory authority) has recently published new rules for the use of CDISC standards.

The Excel file with the rules can be downloaded from: https://www.pmda.go.jp/files/000274354.zip

Unfortunately, the rules are published only in the form of an Excel file, so not in a "vendor-neutral" format, and they are barely or not at all usable for execution in software: the content is "text only", not containing any "machine-readable" instructions, not even in a "meta-language". For example, I am missing "precondition" statements (except for "domain") clearly stating (ideally in machine-readable form) when a rule is applicable or not. Preconditions act like filters, and are key in the description of rules that are to be applied in software.

With the exception of additional rules, e.g. for SDTMIG 3.4, and some clarifications, there is not much new under the (Japanese) sun; even the format of the Excel worksheets hasn't changed since 2019. One may see this as a good sign of consistency, but I have another opinion.

Even more problematic is that many of the rules keep being open for interpretation, meaning that different implementers (such as software vendors) may have different implementations, leading to different results. That is of course unacceptable. CDISC CORE, as a "reference implementation", does a much better job here.

Furthermore, many of the rules have too vague a description (as was already the case in the past) and do not contain the necessary information so that any software company can implement them. So I ask myself whether these rules have really been developed by the PMDA, or by an external party that has an interest in keeping the rule descriptions vague.

Let's have a look at a few examples:

What does this rule mean? It does not even mention the define.xml, which is the place where value-level conditions are defined, and where one can find the associated codelists for value-level variables. And what is meant by "extensible codelist"? Is the CDISC codelist meant? Or the "extended" (not "extensible") CDISC codelist in the define.xml?
So, enormously open for different interpretations …

Another one that requires more and better explanation is rule SD0003:

 

It says "must conform the ISO 8601 international standard". It does not say which part of that standard … If one takes the rule literally, it allows e.g. P25D (25 days) as a value for e.g. LBDTC. This is probably not what is meant, probably what is meant is that it must comply with the ISO 8601 notation for dates, datetimes, partial-dates and so one, so e.g. "2025-03-16T09:22:33". But the rule doesn't say that …

Another questionable rule is SD0052:


Does this also apply to "unscheduled" visits? When I have two visits which both have VISIT = "Unscheduled Visit", must they have the same value for VISITNUM? I doubt it, as unscheduled visits typically get VISITNUM values like "3.1", "3.2", "4.1". Or should one take care that in such a case the value for VISIT is e.g. "Unscheduled Visit 3.1"? The rule does not say anything about this …
The SDTMIG-3.4 in section 4.4.5 states:

If one follows this, both the cases "left null" and "generic value" would violate PMDA-SD0052, at least when taking the rule literally.

Another one, related to the use of SAS Transport 5 ("XPT" format), is rule SD1212:


This is problematic, as the (still mandated) SAS-XPT format stores numbers in "IBM mainframe" notation, which is not compatible with the IEEE notation used by modern computers, and --STRESN is numeric whereas --STRESC is character. So, what is the meaning of "equal"? Is it a violation when e.g. LBSTRESC is ".04" and the visual presentation of LBSTRESN (e.g. in the SAS Universal Viewer) is "0.04"? I have seen lots of what are, in my opinion, "false positives" in the implementation of one vendor.
It is time that PMDA also moves to the modern CDISC Dataset-JSON format for submissions.
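A check that avoids such false positives compares the two values numerically rather than as text, e.g. (a sketch in R):

# Sketch: compare --STRESC and --STRESN numerically to avoid text-representation
# false positives such as ".04" versus "0.04".
lbstresc <- ".04"
lbstresn <- 0.04

identical(lbstresc, format(lbstresn))              # FALSE: naive text comparison
isTRUE(all.equal(as.numeric(lbstresc), lbstresn))  # TRUE:  numeric comparison

# Of course, this only applies when --STRESC holds a numeric value at all;
# for purely character results, no comparison with --STRESN is meaningful.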

What I also found interesting is that in the Excel, "Message" comes before "Description", as if the rule is already pre-attributed to be implemented by a single vendor. It is also questionable whether a vendor-neutral organization like the PMDA should impose on the vendors what the message in a software implementation is. If "Message" were replaced by "Example message" and came after the "Rule description" in the Excel, I would already feel better.

Let's now have a look at the Define-XML rules, an area where there have been a lot of inaccuracies in the rules in the past.

Just take the first ones from the Excel: 

 

 

The first rule DD0001 already astonishes me. The rule description "There is an XML schema validation issue …" is surely not a rule, it is an observation. The rule should sound something like: "Any define.xml file must validate without errors against the corresponding XML-schema". Also the sentence "it is most likely an issue with element declarations" does not belong in a rule description.

Rule DD0008 is essentially not necessary as when the element is not in the correct position, the define.xml will not validate without errors against the corresponding XML-Schema. Again, this is an observation, not a rule.
Also rule DD0002 is not necessary, as when one of the namespaces is missing, the define.xml will not validate against the schema. What is in the "description" is essentially not more than a recommendation, or "requirements lookup".

And what to think about the wording in rule DD0010 "… should generally not …". What does "generally" mean here? Does it mean that there are exceptions? If so, what are they?

What surprised me (well, not really) is the use of the wording "should". I found it 26 times in the Excel file. Essentially, the word "should" should never appear in a rule. "Should" is an expectation, and expectations do not belong in rule formulations.

I always compare it to saying to your teenage daughter or son: "you should be back home by midnight". Do you think he or she will then really be back at home at midnight on a Saturday night? If you say "you must be back home by midnight", that already sounds stricter, doesn't it?

I did not check every rule in the Excel in detail. That would be too frustrating …

Unfortunately, such rule definitions open the door for different interpretations, leading to different results when using different implementations, of which some will surely be "false positives", or conflict with each other. This is something we observed in the past with the SDTM rules of one vendor: when you did something way "A", you got an error or warning "X", and when you did it way "B", the warning/error X disappeared, but you got an error/warning Y. So, whatever you did, you always got an error or warning. Pretty frustrating, isn't it?

Does this publication show improvement when compared with the older versions? I would say "very little".
Neither the way the rules are published (as non-machine-executable Excel) nor the way they have been written (some seem to have been copy-pasted from leaflets in the "complaint box" at the entrance of the PMDA restaurant) shows much progress.

As CDISC CORE will soon also start working on the PMDA rules, I presume CDISC will be in contact with PMDA to discuss every single rule, check it for necessity, reformulate it to be precise in cooperation with PMDA coworkers, and then publish the rules again, together with the open-source implementation (which currently is in YAML/JSON code that is both human-readable and machine-readable). The CDISC open implementation can then serve as the official "reference implementation".

Saturday, November 25, 2023

The Need for Speed: Why CDISC Dataset-JSON is so important.

The CDISC community has been suffering for 20 years or more from the obligation, imposed by the FDA (and other regulatory authorities following the FDA), to submit datasets using the SAS Transport 5 (XPT) format.
The disadvantages and limitations of XPT are well known: limits of 8, 40 and 200 characters, US-ASCII encoding only, etc. But there is much more: the use of XPT has essentially been a roadblock for innovation at the regulatory authorities all these years.
Therefore, the CDISC Data Exchange Standards Team has developed a modern exchange format, Dataset-JSON, which, as the name states, is based on JSON, currently the most used exchange format worldwide anyway, especially for use with APIs (Application Programming Interfaces) and RESTful Web Services.
The new exchange format is currently being piloted by the FDA, in cooperation with PHUSE and CDISC.

Unlike XPT, Dataset-JSON is really vendor-neutral and much, much easier to implement in software than XPT. This has also resulted in a large number of applications being developed and showcased during the COSA Dataset-JSON Hackathon. There are however many opportunities created by the new format that are not yet well recognized by the regulatory authorities.
XPT is limited to the storage of "tables" in "files", i.e. two-dimensional. JSON however allows data (and metadata) to be represented with many more dimensions and much more depth. This means that, even when Dataset-JSON will at first still be used to exchange "tables", these can be enhanced and extended to also carry audit trails (much wanted by the FDA), source data (e.g. from EHRs, lab transfers) and any type of additional information, at the level of the dataset, the record, as well as the individual data point.
Furthermore, Dataset-JSON will allow images (e.g. X-rays, EMRs) and digital data like ECGs to be embedded in the submission data.
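To show how easy the format is to consume, here is a minimal sketch of reading a Dataset-JSON file into an R data frame, assuming the Dataset-JSON 1.1 layout with top-level "columns" and "rows" (in practice one would rather use the "datasetjson" package mentioned earlier):

# Minimal sketch: read a Dataset-JSON 1.1 file into a data frame,
# assuming the v1.1 layout with top-level "columns" and "rows".
library(jsonlite)

ds <- fromJSON("lb.json", simplifyVector = FALSE)

col_names <- vapply(ds$columns, function(col) col$name, character(1))
rows <- lapply(ds$rows, function(row) {
  row <- lapply(row, function(v) if (is.null(v)) NA else v)  # JSON null -> NA
  as.data.frame(setNames(row, col_names), stringsAsFactors = FALSE)
})
lb <- do.call(rbind, rows)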

The major advantage of using this modern format is however on another level.

Traditionally, submissions to regulatory authorities are only done after database closure, mapping the data to SDTM, SEND and/or ADaM, etc. This essentially means a period of often several months after the clinical study has been finalized, and years after the clinical study was started. In the meantime, many patients may have died or been seriously harmed, as the treatment they need is not yet available. This is what we call "the need for speed".

Dataset-JSON can be a game changer here.

Essentially, partial submission datasets can be generated as soon as the first clinical data are received from the sites. The regulatory authorities are however not used to starting a review as soon as the first clinical data become available, among other reasons due to their technical infrastructure.
JSON is used worldwide especially with APIs and RESTful web services, meaning that even submission data can be exchanged in real time, as soon as they are created. Although JSON can of course be used with and for "files", its real strength is in its use for "services". All other industries have moved from files to SOA, "Service-Oriented Architecture".

What does this mean for regulatory submissions? 

Imagine a "regulatory neutral zone" (one can discuss what "neutral" means) between sponsor and regulatory agency, where the sponsor can submit submission records (not necessarily as "files") as soon as they are created, using an API and e.g. using RESTful Web Services. Using the same API, records can also be updated (or deleted) when necessary, using audit trails. On the other side, reviewers can query the study information from the repository, using the API, not necessarily by downloading "files" (although that remains possible), but by getting answers on questions or requests like "give me all subjects and records of subjects with a systolic blood pressure of more than 130 mmHg that have a BMI higher than 28".
This "regulatory neutral zone" is surely different from the current "Electronic Submissions Gateway" (which is completely file based), but more related to API-governed repositories used in many other (also regulated) industries such as aviation, financial, etc..

Essentially, when all this is in place, a regulatory submission could be started as soon as the first data points become available, and finalized much sooner (even months or years sooner) than is currently the case. This can then save the lives of thousands of patients.

Monday, January 9, 2023

CDISC SDTM codetables, Define-XML ValueLists and Biomedical Concepts

Yesterday, I started an attempt to implement the "CDISC CodeTables" in software to allow even more automation when doing SDTM mapping using our well-known SDTM-ETL software.
As the name says, CDISC has published these as tables, and so far only as Excel worksheets. Unfortunately, this information is not in the CDISC Library yet; otherwise it would only have cost me a relatively simple script to access the CDISC Library API and a few hours to get all the information implemented as Define-XML "ValueLists".

Essentially, I do not really understand (others will probably say "he does not want to understand") why these codetables were not published as Define-XML ValueLists right from the start. Was it that the authors had limited or no Define-XML knowledge (there are CDISC trainings for that ...), or is it still the thinking that Define-XML is something one produces after the SDTM datasets have been produced (often using some "black box" software of a specific vendor), rather than using Define-XML upfront (pre-SDTM-generation) as a "specification" for the SDTM datasets to be produced (the better practice)? Or is it just still the attitude of using Excel for everything ...: "if all you have is Excel, everything is a table".
Now, I do not have anything against tables. I have been teaching relational databases at the university for many years, and these are indeed based on ... tables. The difference however is that in a relational database, the relations are explicit (using foreign keys), whereas in all the CDISC tables (including those for SDTM, SEND and ADaM), the relations are mostly implicit, described in some PDF files.

When I started looking into the Excel files, I immediately had to say "OMG" ...

Each of the Excel files seems to have a somewhat different format, some with and others without empty columns, and with completely different headers. So even when writing software to read out the content, I would still need to adapt the code (or use parameters) for each input file to have at least some chance of success. Although far from ideal, I then wrote such a little program, and could at least produce some raw XML CDISC CodeLists, although the results still require a lot of rework afterwards.

So I started with the DS (Disposition) codetable, which went pretty smooth.

Then I decided to tackle a more complicated one, the codetable for EG (ECG - Electrocardiogram).
I knew this would be a non-trivial one, as the EG domain itself is pretty weird. In contrast to normal CDISC practice, EGTESTCD and EGTEST have two codelists, as can be seen in the CDISC Library Browser:

i.e. one for classic ECGs and one for Holter Monitoring tests.

Personally, I consider this very bad practice. The normal (good) practice is to have a single codelist, and then use Define-XML ValueLists with "subset" codelists for different use cases. This is a practice also followed by CDISC for other domains, e.g. by publishing a subset-codelist for units specifically for Vital Signs tests.

Also, when creating SDTM datasets, we define subset codelists all the time in our define.xml, e.g. based on the category (--CAT variable), but we also generate a subset codelist with only the tests that appear in our CRFs or were defined in the protocol. For example for LB (Laboratory) we will not submit all 2500+ terms for LBTESTCD and LBTEST, but only the ones we used or planned to use.

But maybe the authors of this part of the standard were unaware of define.xml, subset codelists, and especially Define-XML "ValueLists" and the nice possibility to work with "WhereClauses".

So, the codetable for EG, in Excel format, comes with two tabs: "EG_Codetable_Mapping" and "HE_Codetable_Mapping":

 

That the latter is for the "Holter Monitoring" case is not immediately obvious: there is not even a "README" tab explaining the use cases.

As usual (and unfortunately), there are different sets of columns for the different variables the subsets of codes apply to:


This makes it hard to automate anything to use it in software: either one needs to revamp the columns, or do a huge amount of copy-and-paste (as before the CDISC-Library days).

When comparing the contents of the tabs, things get even more complicated.
Some subset codelists appear in both tabs, others, such as the ones for units (for EGSTRESU, depending on the value of EGTESTCD), only in the first. Does this mean the unit subsets are not applicable to the Holter Monitoring use case?

When then comparing the subsets for the value of EGSTRESC (depending on EGTESTCD) in both tabs, some are equal (e.g. for the case of EGTESTCD=AVCOND), others are different, ranging from only one term being different to a larger set of terms being different.

I tried to resolve all this by adapting my software - it didn't work well. So I started doing ... copy and paste ...

This results in subset codelists like:


with some codelists coming in two flavors, one for the normal case and one for the Holter Monitoring case - of course I gave these different OIDs.

For the units, the organization in the worksheet is pretty unfortunate, so e.g. leading to:


stating that for each of EGTESTCD being JTAG, JTCBAG, JTCBSB and JTCFAG, the only allowed unit for EGSTRESU is "msec" (milliseconds).
This is valid for use in Define-XML "ValueLists". The "WhereClause" would then e.g. say:
"Use codelist CL.117762.JTAG.UNIT for EGSTRESU when EGTESTCD=JTAG".

The better way however is to define one codelist, e.g. "ECG_Interval", and a WhereClause stating when it should be used for EGSTRESU. This leads, e.g., to the following Define-XML ValueList and WhereClause:


with the subset item and codelist defined as:

 

and the ValueList of course assigned to EGSTRESU:

 

Essentially, this is all very related to Biomedical Concepts!
For example, the concept "JTAG" (with name "JT Interval, Aggregate") would then have the property that it is an ECG test (and thus related to EGTESTCD/EGTEST in SDTM), with the property that the unit for it can only be "msec", at least when using the "CDISC notation" for the unit. It would however be better to use the UCUM notation, which is "ms" and which is used everywhere in health care except at CDISC ..., and which has the advantage of allowing automated unit conversion, which is not possible with CDISC units.

CDISC has now published its first Biomedical Concepts in the CDISC-Library which can be queried using the Library RESTful API:


For example, for the BC "Aspartate Aminotransferase Measurement", the API response (in JSON) is:

 

As I understand it, CDISC is also working on generating BCs starting from codetables, especially for the oncology domains and codelists, where we have similar dependencies between standardized values (--STRESC) and possible units (--STRESU).

It would then be great if we could see all the codetables published by CDISC also published as BCs, and made available by the CDISC Library through the API. With the SDTM information then added, these would correspond to the ValueLists in the define.xml of our SDTM submission.

But I will start with converting these awful Excel codetables to Define-XML CodeLists and ValueLists (with the corresponding WhereClauses of course) first.

Essentially, it should be forbidden for CDISC to publish standards (and even drafts of them) as Excel files; only a real, standardized machine-readable form, such as one based on XML or JSON, should be allowed. This would finally allow much better QC of the draft standards (instead of visual inspection!) and make the standards immediately usable in systems and software.

I presume many of you will disagree, so your comments are always welcome!