Sunday, September 28, 2025

SDTM-ETL and R: Best of both Worlds?

Some time ago, there was an interesting discussion on LinkedIn (unfortunately I cannot find it anymore) about R versus SAS for generating SDTM datasets. My response was that I considered both inefficient and expensive (in training cost and/or license costs), as there are now software tools on the market that have a graphical user interface (GUI) and use "drag-and-drop", combined with wizards, to automate the mappings and generate the SDTM datasets. Examples of such software are our own SDTM-ETL, but also Ryze by Formedix (now owned by Certara) and XClinical's Marvin-Tabulator. This triggered a discussion about the advantages and disadvantages of such "specialized for SDTM" software versus general-purpose software such as R. One of the arguments by a contributor (not me) was that R is more flexible, whereas SDTM-specialized software such as SDTM-ETL, due to its user-friendliness, will probably be considerably more efficient for 80% of the studies. For very complicated studies, however, and especially for studies with low-quality source data, R would then be the better choice due to its flexibility. Another contributor then asked whether there is "SDTM-specialized" software that has a user-friendly GUI, uses "drag-and-drop", has built-in SDTM knowledge, but uses R as the scripting language.

As far as I know and could find out, there is no such software available.

SDTM-ETL uses "EasyXSL" as its scripting language (80-90% of the scripts are generated automatically), an easy-to-learn language built on top of XSLT, the very powerful XML transformation language. This means that upon execution of the mappings, the EasyXSL scripts are combined (per domain or dataset definition) and translated into XSLT, which is then executed on the source data, generating the SDTM datasets in either SAS-XPT, the modern CDISC Dataset-JSON-1.1 format, or CSV.

So the idea was born: is it possible to have a version of SDTM-ETL, with all its advantages, that uses R as the scripting language instead of EasyXSL? This would mean that the R scripts generated by SDTM-ETL can be fed into an R engine and executed there. Essentially, it would be possible to combine the user-friendliness (and built-in SDTM knowledge) of SDTM-ETL with the flexibility and extensibility of R.
This could then be a "best of both worlds".

So we decided to start developing a prototype (just as a "proof of concept") of such a software. For the moment, we call it "SDTM-ETL-R".

What is important for us is that all existing "drag-and-drop" features, "SDTM knowledge" and "wizards" remain intact, and that the only difference is that the final generated script is now in R instead of EasyXSL. For the moment, the execution is then done in R-Studio.

We started with DM as the first domain, but this was pretty easy, as DM only has 1 record per subject. For example, the mapping for RACE, automatically generated by SDTM-ETL, then looks as follows:


 

and when tested on a set of clinical data (in CDISC ODM format), as exported from an EDC system, using the button "Test - Run in R":

and providing a file with clinical data via a "file chooser" (not shown here), the script is extended and reorganized so that it runs for each subject in the source, and then opens in R-Studio:

 

When then run, the "DataFrame" is generated and displayed in R-Studio:

 
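To give a feel for what such a generated and reorganized script boils down to, here is a minimal hand-written sketch. The file name, the item OID "I_DM_RACE" and the STUDYID constant are purely illustrative; the real generated script looks different and uses our own helper functions:

library(xml2)

# read the ODM export chosen via the file chooser (file name is illustrative)
odm <- read_xml("clinical_data.xml")

# local-name() is used here so the sketch works regardless of the ODM namespace prefix
subjects <- xml_find_all(odm, "//*[local-name()='SubjectData']")

dm <- do.call(rbind, lapply(subjects, function(subj) {
  # pick up the RACE item for this subject (item OID is illustrative)
  race_item <- xml_find_first(subj, ".//*[local-name()='ItemData'][@ItemOID='I_DM_RACE']")
  data.frame(
    STUDYID = "MYSTUDY",                      # illustrative constant
    USUBJID = xml_attr(subj, "SubjectKey"),
    RACE    = xml_attr(race_item, "Value"),
    stringsAsFactors = FALSE
  )
}))

print(dm)   # one record per subject, as expected for DM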

As this was "too easy", we started working on LB (Laboratory), which is always more difficult. 
The output from the EDC system (in ODM) is organized as follows:

  

Do not mind the colors too much for the moment - it just means that the items have annotations attached, already stating where they are supposed to go in SDTM.
We always start with the "topic" variable, which is LBTESTCD in this case. What we need to accomplish is to obtain one record per test (described by LBTESTCD/LBTEST) per visit per subject. As one can easily deduce, there are two tests here: "Erythrocytes" and "Leukocytes" (this is a very simple example study for demo purposes only).
The target SDTM variable is shown in the table on the right side of the screen:

So we drag either "Erythrocytes" or "Leukocytes" from the left to the cell "LB.LBTESTCD". When we drag-and-drop "Erythrocytes", a dialog appears:

It automatically suggests that we take the identifier (Item-OID) from the source, not the value.
But we also want to have an entry in the output for Leukocytes, not only for Erythrocytes, and we want to have that for each visit.
So we select the checkbox "Generalize for all StudyEvents", and also check the checkbox "Generalize for all items", followed by clicking "Only for":

 

A list is then displayed from which we need to choose - we select "Erythrocytes" and "Leukocytes":

 

We do not want to use any of the other items for now; we will need them later to populate e.g. LBSTAT, LBORNRLO, LBORNRHI, LBNAM, etc.

Clicking OK then leads to the next "wizard":

 

The system loads the SDTM Controlled Terminology (CT) for LBTESTCD from the underlying define.xml, and presents all the values in the comboboxes:

 

Again we are lazy and allow the system to find out what the mappings need to be, based on word similarity, if necessary additionally supported by CDISC's "synonym list" or our company synonym list. The latter can e.g. be useful when the lab test names are provided in a language other than English. So, we click "Attempt 1:1 mappings". The system, in all its wisdom, then proposes:

 

which looks fine …
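Under the hood, such an "Attempt 1:1 mappings" step can be thought of as a string-similarity match between the source test labels and the CT submission values plus their synonyms. A rough sketch in plain R (the synonym table is illustrative; the real implementation also uses the CDISC and company synonym lists):

source_labels <- c("Erythrocytes", "Leukocytes")

# a few LBTESTCD submission values with an (illustrative) synonym each
ct <- data.frame(
  code    = c("RBC",          "WBC",        "HGB"),
  synonym = c("Erythrocytes", "Leukocytes", "Hemoglobin"),
  stringsAsFactors = FALSE
)

propose_mapping <- function(label, ct) {
  # normalized edit distance between the source label and each synonym
  d <- adist(tolower(label), tolower(ct$synonym)) /
       pmax(nchar(label), nchar(ct$synonym))
  ct$code[which.min(d)]
}

sapply(source_labels, propose_mapping, ct = ct)
# -> Erythrocytes = "RBC", Leukocytes = "WBC"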

At this stage, we can also generate the subset codelist for LBTESTCD and have it stored in the underlying define.xml.
The reason is that in SDTM-ETL, everything is stored in an underlying define.xml, which also means that at the end of the process we already have a define.xml that only needs a bit of cleanup to become "submission-ready".

Clicking "OK" then generates the mapping script completely automatically:

For those who understand "path-systems" a bit: in line 6, it selects the OIDs (identifiers) in the source data for "I_LB_RBC" and "I_LB_WBC" in the subform with identifier "IG_LB_HEMATOLOGY" in the form "F_LAB", and that for all visits (StudyEvents).
It then executes the mapping from the original item identifier to SDTM-CT, stores it in a temporary variable "NEWCODEDVALUE", and in the last line assigns it to the variable LB.LBTESTCD.
Simple, isn't it? And all that without writing a single line of code.
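For those curious what this corresponds to in plain R, a hand-written approximation for one subject could look like the code below. The OIDs are the ones from the example above; the structure and names are illustrative and not the literal generated code:

library(xml2)

lbtestcd_for_subject <- function(subj) {   # 'subj' is one ODM SubjectData node
  # select the items I_LB_RBC / I_LB_WBC in item group IG_LB_HEMATOLOGY
  # of form F_LAB, and that for all visits (StudyEvents)
  items <- xml_find_all(
    subj,
    paste0(".//*[local-name()='StudyEventData']",
           "//*[local-name()='FormData'][@FormOID='F_LAB']",
           "//*[local-name()='ItemGroupData'][@ItemGroupOID='IG_LB_HEMATOLOGY']",
           "//*[local-name()='ItemData']",
           "[@ItemOID='I_LB_RBC' or @ItemOID='I_LB_WBC']"))
  oids <- xml_attr(items, "ItemOID")

  # map the original item identifier to SDTM controlled terminology
  newcodedvalue <- c(I_LB_RBC = "RBC", I_LB_WBC = "WBC")[oids]

  unname(newcodedvalue)   # -> LB.LBTESTCD, one value per collected test
}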

If then the button "Test - Run in R" is clicked, the script is extended and slightly reorganized to be able to run "standalone" in R-Studio:

and when run, leads to the following DataFrame in R-Studio:

 

It doesn't show any other information yet, like the visit number, the date/time of collection, etc., as this is only "local" testing. Only when the R scripts for each of the variables are combined will the full picture appear.

For LBORRES, we can do exactly the same drag-and-drop, and as the system remembers the previous settings and now knows it needs the value, not the item identifier, this just leads to a one-liner in the mapping script editor. When then clicking "Test - Run in R", this leads to the following in R-Studio:

Good observers have already noticed that the upper part is always (almost) identical: it sets things up and loads a set of functions. In the lower part, the mapping script from the "Mapping editor" is just embedded into an iteration over all subjects. The way we do this may however change in the future.
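In other words, the "Test - Run in R" script roughly follows the template below. Everything in it (file names, OIDs, the mapping function) is a simplified stand-in, not the real generated code:

library(xml2)

# upper part: (almost) identical for every test run - setup and helper functions
odm      <- read_xml("clinical_data.xml")
subjects <- xml_find_all(odm, "//*[local-name()='SubjectData']")

# placeholder for the mapping script taken from the mapping editor (here: LBORRES)
map_lborres <- function(subj) {
  items <- xml_find_all(
    subj, ".//*[local-name()='ItemData'][@ItemOID='I_LB_RBC' or @ItemOID='I_LB_WBC']")
  xml_attr(items, "Value")
}

# lower part: the mapping embedded in an iteration over all subjects
rows <- lapply(subjects, function(subj) {
  values <- map_lborres(subj)
  if (length(values) == 0) return(NULL)
  data.frame(USUBJID = xml_attr(subj, "SubjectKey"),
             LBORRES = values,
             stringsAsFactors = FALSE)
})

lb_lborres <- do.call(rbind, rows)
View(lb_lborres)   # the DataFrame as displayed in R-Studio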

For LBTEST (test name), we could follow the same procedure as for LBTESTCD, as the system already knows what the codelist for LBTEST is. However, in "regular" SDTM-ETL, there is a smarter way: it has a function "decode", which takes the value of a --TESTCD variable, and then "decodes" it to the value of LBTEST.
Although not everything is ready yet, we already made a function for that in R: 

It just accepts the location of the underlying define.xml, the OID of the codelist (which is in the underlying define.xml anyway) and the value of --TESTCD.
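A possible shape for such a decode() function in R, reading the codelist from the underlying define.xml with the xml2 package (a sketch, not the actual implementation):

library(xml2)

decode <- function(define_path, codelist_oid, coded_value) {
  define <- read_xml(define_path)
  # find the CodeListItem with the given coded value and return its decode text
  node <- xml_find_first(
    define,
    sprintf(paste0("//*[local-name()='CodeList'][@OID='%s']",
                   "/*[local-name()='CodeListItem'][@CodedValue='%s']",
                   "/*[local-name()='Decode']",
                   "/*[local-name()='TranslatedText']"),
            codelist_oid, coded_value))
  if (inherits(node, "xml_missing")) return(NA_character_)
  xml_text(node)
}

# e.g. decode("define.xml", "CL.LBTESTCD", "RBC")  ->  "Erythrocytes"
# (the codelist OID "CL.LBTESTCD" is illustrative)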

In the mapping for LBDTC (date/time of collection), the peculiarity is that we need to concatenate the value of the visit date with the visit time:

 

So we need two drag-and-drops.
The peculiarity here is that at the second drop, the path needs to be relative to the first drop. This is also something we solved, leading to the "test" R script in R-Studio:

 
Resulting in the DataFrame:

 

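The concatenation itself is simple ISO 8601 assembly. Conceptually (with illustrative date and time values) it comes down to:

# assembling LBDTC from a visit date and a visit time (ISO 8601)
visit_date <- "2025-09-15"   # e.g. the value of the visit date item
visit_time <- "08:45"        # e.g. the value of the visit time item (relative path)

lbdtc <- if (nzchar(visit_time)) paste0(visit_date, "T", visit_time) else visit_date
lbdtc
# -> "2025-09-15T08:45"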

Some of you may already ask themselves: "What is the relationship with Oak, the CDISC COSA (CDISC Open Source Alliance) project?"

Oak is a great library of R functions for use in developing mappings to SDTM. Oak doesn't have any built-in knowledge of CDISC standards though. As far as I could find out, it doesn't have functions for ODM import (or for selecting items from ODM for mapping); it is still up to the user to import data in any form (usually a "flat" format) and make it suitable for mapping to SDTM. Neither does it have an export function to SAS-XPT (that can however be realized using the xportr package) nor to modern CDISC Dataset-JSON (which can be done using the "datasetjson" package). It also has no functions for e.g. automated codelist-to-codelist mapping based on word similarity, synonyms or company-specific libraries. It does not use or produce define.xml either; in Oak, CDISC codelists are to be read in as CSV files. Oak also doesn't seem to use RESTful Web Services (RWS), doesn't know what the CDISC Library is or does, nor does it have an automated way of generating LBTESTCD, LBTEST, LBMETHOD, etc. starting from the provided LOINC code using the famous "LOINC-CDISC Mapping" services. This is one of the functionalities that is most appreciated by our users in our "regular" SDTM-ETL. Such a function can however also easily be implemented in R, starting from the following code:
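As a sketch, such a lookup could be wired up in R as follows; the endpoint URL, query parameter and response field names below are placeholders for whichever LOINC-CDISC mapping service is used (authentication is omitted):

library(httr2)

loinc_to_lb <- function(loinc_code) {
  resp <- request("https://api.example.org/loinc2sdtm") |>   # placeholder URL
    req_url_query(loinc = loinc_code) |>
    req_headers(Accept = "application/json") |>
    req_perform()
  body <- resp_body_json(resp)
  # placeholder field names: the service returns LBTESTCD, LBTEST, LBSPEC, LBMETHOD, ...
  data.frame(LOINC    = loinc_code,
             LBTESTCD = body$LBTESTCD,
             LBTEST   = body$LBTEST,
             LBSPEC   = body$LBSPEC,
             stringsAsFactors = FALSE)
}

# loinc_to_lb("34183-2")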

and as an example result for LOINC code 34183-2 (Total Beta globulin/Protein in 24 hour Urine):

 

So one cannot see Oak as a "complete solution". It is a set of useful functions for R, i.e. it provides some easy-to-use "building blocks". This of course also provides the flexibility I talked about before.
Some of these Oak functions can however also be used in SDTM-ETL-R, as Oak has a good number of functions for "post-processing", such as assigning --SEQ values, deriving the --LOBXFL (Last Observation before first Exposure Flag), deriving study days (--DY values), moving "non-standard variables" (NSVs) from the original domain to a SUPPxx dataset (unfortunately the information about the NSVs needs to be delivered as … CSV …), and deriving DM.RFXSTDTC (Date/Time of First Study Treatment) and DM.RFXENDTC (Date/Time of Last Study Treatment) from the information in the EX dataset. Such "post-processing" steps are done by XSLT or streaming parsing (SAX) in "regular" SDTM-ETL. What I did miss in Oak, however, and what is present in "regular" SDTM-ETL, is a function for automated assignment of VISITNUM for unscheduled visits. There also do not seem to be any functions in Oak for generating RELREC or SV (Subject Visits). These are easily generated in "regular" SDTM-ETL though.

So yes, we will very probably use Oak in SDTM-ETL-R, especially for some of the "post-processing" steps such as --SEQ assignment, --DY generation, --LOBXFL assignment, and maybe the generation of SUPPxx. For the latter however, we do not want to use CSV files, as in SDTM-ETL the information about what an NSV is (and its properties) is already in the underlying define.xml. This makes more sense.
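To illustrate the kind of post-processing meant here, a plain-R (dplyr) sketch of --SEQ assignment and study-day derivation, deliberately not using Oak's own functions, with illustrative input data:

library(dplyr)

# illustrative input data frames
lb <- data.frame(USUBJID  = c("001", "001", "002"),
                 LBTESTCD = c("RBC", "WBC", "RBC"),
                 LBDTC    = c("2025-09-15", "2025-09-15", "2025-09-16"))
dm <- data.frame(USUBJID = c("001", "002"),
                 RFSTDTC = c("2025-09-10", "2025-09-14"))

lb <- lb |>
  left_join(dm, by = "USUBJID") |>
  arrange(USUBJID, LBDTC, LBTESTCD) |>
  group_by(USUBJID) |>
  mutate(LBSEQ = row_number()) |>            # --SEQ: unique within each subject
  ungroup() |>
  mutate(
    # SDTM study day: day 1 is the reference start date, there is no day 0
    LBDY = as.integer(as.Date(LBDTC) - as.Date(RFSTDTC)) +
           ifelse(as.Date(LBDTC) >= as.Date(RFSTDTC), 1L, 0L)
  ) |>
  select(-RFSTDTC)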

Even when we use these Oak functions, we do not need to change anything in the Graphical User Interface:

 

What we will need to do however is to replace the existing "post-processing" modules with (calls to) Oak R functions.

What are the next steps?

First we need to further adapt our existing mapping functions (we have about 40 of them) to R. This will also enable us to extend our example R-mappings as a "test bed" for how to proceed.
We will also start working on assembling the individual mappings for the variables of a single domain into a single R script. This can be a script that calls (or sources) other R scripts, or one that assembles everything into a single script; we still need to find out what works best. Of course, this "domain" script (or scripts) will call other functions, our own ones as well as Oak ones.
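One possible shape for such a domain-level script is to simply source the per-variable scripts and merge their outputs; the file names and merge keys below are illustrative, and each per-variable script is assumed to return a data frame keyed by subject, visit and test:

# assemble the LB domain from the per-variable mapping scripts (illustrative)
source("LB_LBTESTCD.R")   # -> lb_lbtestcd
source("LB_LBTEST.R")     # -> lb_lbtest
source("LB_LBORRES.R")    # -> lb_lborres
source("LB_LBDTC.R")      # -> lb_lbdtc

keys <- c("USUBJID", "VISITNUM", "LBTESTCD")

lb <- Reduce(function(x, y) merge(x, y, by = keys, all = TRUE),
             list(lb_lbtestcd, lb_lbtest, lb_lborres, lb_lbdtc))

# post-processing (LBSEQ, LBDY, SUPPLB, ...) and export follow here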

We will regularly report about our progress.

Personally, I am still convinced that, due to its user-friendliness, built-in SDTM knowledge, and many functionalities for which there is no equivalent in R nor in Oak, "regular" SDTM-ETL can be the first choice for 80%, if not 90%, of the studies. The use of R, however, offers more flexibility, which can be an advantage for the remaining 10-20%, for example for very complicated studies, or for studies where the data quality is low and the data needs a lot of "rearrangement". The use of R allows one to e.g. change the order in which scripts are executed, or to insert "special" scripts for exceptional cases. This is also possible with the XSLT generated by SDTM-ETL (which most users never see), but there are not so many people who have such a good knowledge of XSLT that they can "adapt" this XSLT. For R however, there is a much larger "specialist" base, especially as R is used a lot in academia, and millions of people are educated in it (even my daughter, who studied "international development studies", had to learn it).

So, a software tool with the user-friendliness and built-in SDTM knowledge of SDTM-ETL, but that uses R and Oak for the scripts, may have a place in the market for SDTM (but also SEND) mapping tools. As such, it may combine the "best of both worlds".

Do you think we should further develop SDTM-ETL-R? Please submit a comment here, or send us an e-mail.