Saturday, July 31, 2021

The CDISC CORE project and open, machine-executable rules

This spring, CDISC announced a new initiative, the CORE project, to develop, in collaboration with Microsoft, open source software and machine-executable conformance rules for validating implementations of a number of its foundational standards.
CORE stands for "CDISC Open Rules Engine". A very good webinar was given last week, explaining the principles and phases of the project. You can find the recording here.

In order to explain the "why" of the project, we need a bit of history ...

Until about 5 years ago, CDISC did not publish conformance rules for its most used standards: SDTM, SEND, ADaM, ODM and Define-XML. As far as I could find out, the first CDISC conformance rules were published in 2017 for SDTMIG-3.2. The probable reason for this is that there was a "leave it to the vendors to implement" policy at CDISC before that time.

Several private initiatives and companies filled the gap, including our own company: already 15 years ago, we developed software to validate ODM files against the standard, the ODM Checker. The software has regularly been updated and is still available at no cost to CDISC members. It has also been extended for Define-XML. These were also integrated into commercial products such as our popular SDTM-ETL mapping software and Define-XML Designer. We did however never publish the rules as "open source", as we did not want to pretend that our interpretation of the ODM/Define-XML standard is the ultimate correct one.

For SDTM and later also for SEND and ADaM, other companies developed software for conformance checking, some first as open source, later more and more as closed source. The conformance rules, with these companies' own interpretations of the Implementation Guides, were published as Excel files at best, and only contained a (sometimes vague) description of each of the rules, or even only the error message that is generated. Some of these rules are even completely wrong, such as "Original Units (--ORRESU) should not be NULL, when Result or Finding in Original Units (--ORRES) is provided". In no case were the rules published as "machine-executable".
One company even managed to sell its solution to the FDA and later also to the PMDA and NMPA, essentially developing the rules on behalf of these regulatory authorities. Although that software is known for its many false positives, it is used by almost every pharma company. Another vendor however managed to find its place in the market by specializing on SEND, later extending to SDTM.

About 5 years ago, CDISC started publishing its own conformance rules, first for the SDTMIG, and later also for the SENDIG and for ADaM. Also here, the conformance rules were published in the form of Excel files, with the rules themselves coming as a text description, i.e. without machine-executable code. Recently, also conformance rules for Define-XML v.2.1 were published by CDISC. Also here, no machine-executable code was published. 

For both Define-XML 2.0 and 2.1, such machine-executable rules do however exist in the form of Schematron, a technology used by several vendors (including ourselves), but these machine-executable rules are proprietary, and are used in different free or commercial software products.

This leads to a discussion about technology for conformance rules (if you don't like technology discussions, you can skip this part). CDISC ODM and Define-XML are based on XML, a worldwide open standard from the World Wide Web Consortium (W3C), which also, together with ISO, took care of developing technologies and standards for checking the conformance of XML files against a given standard: XML-Schema, Schematron and XQuery. The ODM and Define-XML teams have always published their ODM and Define-XML specifications together with an XML-Schema, meaning that 60-80% of the rules are already exactly described in a language that is both human-readable (well, for IT people at least) and machine-executable. The remaining 20-40% can then be implemented using Schematron and/or XQuery, the latter especially when information is in separate files. This too has been done by us and by other vendors.
The problem with this approach however is that these technologies are limited to XML. Although there was a belief in the past that data exchange would become an "XML-only world", this has not come true. On the one hand, the format for submissions to regulatory authorities is still SAS Transport 5, a completely outdated "punch card" format from the IBM mainframe era, mandated by FDA, PMDA and NMPA, even though CDISC has a much better format (which provides a perfect match with Define-XML), known as Dataset-XML. This modern format even allows the development of "smart review tools", such as the open source "Smart Submission Dataset Viewer". On the other hand, other formats have become very popular, especially JSON and, for linked data using RDF, the "Turtle" and JSON-LD formats. One could think that the latter two would not be suitable for CDISC's submission standards (SDTM, SEND, ADaM), as these represent tabular data. One should however not forget that the data in these "tables" are essentially also linked data, but that the relations are implicit, only described in Implementation Guides. And as these relations are implicit, in order to check these relations, one needs ... conformance rules. Regarding JSON, its usage has overtaken that of XML for use with APIs and in RESTful web services.
The still mandated use of SAS Transport 5, a format even discouraged by the US Library of Congress, adds another one to the list of formats used in clinical research.
Given this variety of exchange formats, one can ask whether it is possible to have a single expression language for developing conformance rules that is both human-readable (i.e. really "open") and machine-executable, and that can be used with all these modern formats as well as with the outdated SAS Transport 5 format.
I have been looking for such an expression language for many years, but failed miserably to find one. This will surely be one of the main challenges of the CORE project.

The idea of publishing completely open conformance rules for the CDISC submission standards that are both human-readable and machine-executable is not new - it has even been realized by the "Open Rules for CDISC Standards" initiative.


The rules have been developed using the XQuery language, a W3C standard, and some of them even use the CDISC Library API. So, essentially, they could serve as a candidate for a reference implementation if it were not that XQuery only works for XML (and thus only for CDISC's Dataset-XML), but is useless for use with SAS Transport 5, as well as for modern JSON, which is the first choice for use with APIs.
"Open Rules for CDISC Standards" was developed many years ago with the expectation that FDA would soon replace SAS Transport 5 by CDISC's own Dataset-XML, but this has not happened. With JSON strongly coming up, it has become clear that the sole use of XQuery for describing human-readable rules that also are machine-executable, is not an option anymore.
One of the great starting principles of the "Open Rules for CDISC Standards" was that the rules themselves (in XQuery) are completely separated from any execution engine: implementers can choose between many different computer languages (Java, C#, Python, ...) that read the rules and then execute them. Separation between rules and execution engine will very probably also be one of the major design principles of the CDISC CORE project. 
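
The separation between rules and execution engine can be illustrated with a minimal sketch. It is not how CORE or "Open Rules for CDISC Standards" implement it: the rule structure, the operator names and the function names below are all hypothetical, chosen only to show that a rule written as plain data can be run unchanged on records read from different formats (here JSON and CSV).

```python
import csv
import io
import json

# Hypothetical declarative rule: plain data, no executable code inside.
RULE = {
    "id": "DEMO-001",
    "message": "LBTESTCD must not be longer than 8 characters",
    "variable": "LBTESTCD",
    "operator": "max_length",
    "value": 8,
}

# The engine owns the operators; a rule only names one of them.
OPERATORS = {
    "max_length": lambda actual, limit: len(actual) <= limit,
}


def run_rule(rule, rows):
    """Return the 1-based numbers of the rows that violate the rule."""
    check = OPERATORS[rule["operator"]]
    return [
        number
        for number, row in enumerate(rows, start=1)
        if not check(row.get(rule["variable"], ""), rule["value"])
    ]


def rows_from_json(text):
    """Format adapter: a JSON array of objects becomes a list of dicts."""
    return json.loads(text)


def rows_from_csv(text):
    """Format adapter: CSV with a header line becomes a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


json_data = '[{"LBTESTCD": "GLUC"}, {"LBTESTCD": "LP14992-9"}]'
csv_data = "LBTESTCD\nGLUC\nLP14992-9\n"

print(run_rule(RULE, rows_from_json(json_data)))  # [2]
print(run_rule(RULE, rows_from_csv(csv_data)))    # [2]
```

The rule never sees the file format: only the small adapter functions do. Supporting another format (Dataset-XML, or even SAS Transport 5) would in this sketch mean adding one more adapter, not rewriting the rule.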

For the development of machine-executable conformance rules in CORE, CDISC will start with SDTMIG 3.4 and SDTM 2.0 in the first phase, which will be an "MVP" phase. No, MVP doesn't mean "most valuable player" here, it means "Minimum Viable Product". In later phases, machine-executable conformance rules for the other standards will follow, in the long term maybe even including Therapeutic Area User Guides (TAUGs).

The execution engine will be developed by the CORE team, which is a collaboration between Microsoft, CDISC, and industry, and will run in the cloud using Azure. CDISC members will be able to obtain an evaluation Azure account, which then essentially acts as a private cloud for a CORE implementation. Implementers, such as pharma companies, CROs and service providers, can later choose to use the open source code and spin up instances in other cloud environments. They will also be able to add their own (e.g. company-specific) rules to their execution engine (whether a Microsoft one or another) and/or to develop their own implementation, either open source or closed source. So, one can indeed think of CORE as providing a "reference implementation" that anyone can use, extend, or just use as "the reference", i.e. the outcomes of any conformance checking must be identical to those of the reference implementation, even when completely different technology is used. The CDISC Library will be the single source of truth for both the CDISC standards and the CDISC rules, since the rules will be made available in the Library. The CORE Engine will retrieve the rules and standards from the Library using an API.

This means that in the MVP phase, involvement of vendors will be relatively low, but as of phase 1, it is expected that there will be a lot of involvement from vendors, either working directly together with CDISC and Microsoft (as we envisage doing), or just going their own way, using the open source. As everything will be open source, vendors can also choose between offering products that use a cloud execution engine and creating a solution that uses local production environments (e.g. desktop applications).

This is a big project. There will be quite a lot of CDISC teams involved, for example a QA team, a number of conformance rules development teams (Conformance rules for SDTMIG 3.4 / SDTM 2.0 are expected to be published in Autumn 2021), the CDISC Library team, and a software engineering team, which I presume will consist of a mix of CDISC and Microsoft people. And of course, the overall architecture must be developed and project management must be taken care of by a CORE Leadership team. More details can be found on the slides of the webinar recording.

Executable rules will be metadata driven (of course!), but it has not been decided yet what programming language will be used to make them machine-executable. Personally, I consider this a critical part of the project, as the people developing the rules (standards specialists, e.g. SDTMIG specialists, mostly volunteers) usually are not programmers (no, I do not expect that the rules will be developed in SAS, this would be a major design error, as that would not be vendor-neutral), and the programmers (with good knowledge of Java, C#, Python ...) usually do not have a good knowledge of the SDTM standard and its Implementation Guides. So a lot of communication (and documentation of that communication) will be necessary between these two groups: the last thing we want is that CORE (based) software produces false positive warnings and errors ...

Another critical part will surely be QA and testing: what rule is applicable when, and are all the possible scenarios covered? Testing will require a huge amount of test data, covering every possible use of the standard.

All this work cannot be done by CDISC and Microsoft people alone. Therefore, CDISC is seeking volunteers from the CDISC community for the different teams involved in the project, in the MVP phase probably mostly SDTM specialists for the development of very precise rules: only when a rule is very precisely defined can it be made machine-executable (see the FDA rule on --ORRESU ...). So the webinar also contained a "call for participation". CDISC would like to have volunteers from the community who can spend at least 20% of their time in the next 9 months, and ideally, beyond that period. A kick-off meeting is already planned for September 9th, so there is not much time to lose.
The call for participation can be found on the CORE website under "participate" and contains a list of what teams and roles there are. This will be your starting point if you want to participate.

A very interesting part of every CDISC webinar is the Q&A. In this case, not every question that came in could be answered (but some bundling took place). CDISC however promised that all questions will be answered and that the answers will be posted on the CDISC website. A few highlights:
- the reporting format will be anything the user wants. This is possible as there will be a rich set of API methods available. An Excel report interface will surely be provided.
- it is not decided yet what programming language will be used. Will it be a 3rd or 4th generation language (Java, C#, Python, ...) or a meta-language that can be interpreted and translated to source code in one of these languages? As my own attempts failed to find such a language that is independent of the data format, I am of course very curious ...
Important however is that the rules implementation will be metadata driven, and that it needs to be relatively easy for members of the CDISC community (sponsors, CROs, ...) to develop their own specific rules as machine-executable rules and implement them in exactly the same way as the CDISC rules themselves.
- it is intended to also include rules developed by regulatory authorities (FDA, PMDA, NMPA, ...), so the scope is not limited to CDISC rules alone.
- as the CORE software will be enriched by a large number of API methods, it will be easy to integrate CORE into third-party applications. CDISC will carefully listen to the vendor community to find out which API methods are necessary.
- although there are of course a number of deliverables and timelines, the conformance rules will never be complete as long as new standards are developed. When new standards and versions are published, it is envisaged that they immediately come with their own set of machine-executable CORE rules, which can then be implemented immediately.
- to the question "will the regulatory agencies adopt the CORE rules?", the answer essentially was "we don't know, but we strongly hope so". One should not forget that in some cases, these agencies' rules deviate considerably from CDISC rules, and that currently, some of these agency rules are just wrong (e.g. the FDA rule "Original Units (--ORRESU) should not be NULL, when Result or Finding in Original Units (--ORRES) is provided"), ambiguous, or even incorrectly implemented in software. However, rules from the regulatory agencies are surely within the scope of the project.
- another question was about "how these conformance rules differ from already existing applications and vendors". Peter van Reusel answered it very exactly: "not much". The difference is however that for the first time, the implementation will not only be fully transparent, but also independent of the execution engine. Also, the rules themselves will be maintained by the CDISC community, and with each new version of a standard, the corresponding conformance rules, in an open, machine-readable form, will be immediately available. Venkata Maguluri (Pfizer) of the webinar's panel also added that these rules will not allow any "wiggle room" anymore in their interpretation. Personally, I consider this "wiggle room" (sometimes even "cleft room") as one of the major problems of the way the rules are currently described and published (as "text" in Excel format), especially as the rules published by the regulatory authorities have severe quality problems.
- the period between publication of a new standard version and adoption by regulatory authorities will be used to ensure that the conformance rules also work for the agencies. Also, CDISC will provide the agencies with technical support regarding the implementation of the execution engines. 

Last but not least, what does this all mean for our "Open Rules for CDISC Standards" project?
We are very enthusiastic about the CORE project at CDISC. Although much larger in scope and volume, the basic principles have a lot of overlap with our own "Open Rules for CDISC Standards": in both cases, the rules are fully transparent, human-readable as well as machine-executable, and separate from the execution engine. Both use or will use the CDISC Library as "the single source of truth". The major thinking error we made when starting with "Open Rules for CDISC Standards" many years ago was that we expected that it would soon be an "all-XML world" for data exchange, and especially that FDA (then followed by PMDA/NMPA) would soon move away from SAS Transport 5 and start requesting the modern (CDISC's own) Dataset-XML format for electronic submissions. None of this has come true: FDA still requires the outdated SAS Transport 5, and instead of the world becoming "all-XML", JSON has become a very important player, especially in combination with APIs and RESTful web services (which I consider as the future, also for submissions). RDF has also become important as a methodology that makes implicit SDTM/SEND/ADaM relations explicit instead of completely hidden in PDF or HTML files.
However, I hope "Open Rules for CDISC Standards" becomes a source of inspiration, and maybe even a good number of rules implemented as XQuery can directly be translated into machine-executable CORE rules.

Not everything is clear yet. For example, the expression (machine) language for defining the rules, one that works for any modern data format as well as for the outdated SAS Transport 5, has not been decided yet. This is where I failed miserably myself. Of course, Microsoft has many more resources and a lot of brilliant people to find out what that expression language should be.

Also, I do have some serious concerns that in the MVP phase, for SDTM, one will limit oneself to an implementation that only works for SAS Transport 5, and that implementation for other formats will only be done at a later stage. That would send the wrong signal to the FDA, namely not to look for any alternative, modern format, and would provide them with the excuse that they cannot move to a modern format, as CORE only supports SAS Transport 5, although that is essentially not true. In my opinion, interfaces and APIs can take care that a large range of import formats can be supported, and that should already go into the requirements, even in the MVP phase. Maybe I can provide some input or technical support for that part of the project: we should not forget that developing APIs and implementing them as RESTful web services is relatively easy for JSON and XML, but that no RESTful web service has ever been developed yet that supports SAS Transport 5. I expect that to be more difficult than for JSON and XML.

Another concern regarding possible (Microsoft) vendor lock-in has already been taken away during the web conference: it will be possible to use any other cloud provider or to develop local (e.g. desktop) applications.

CORE is a very ambitious project. It is even considerably larger than the CDISC Library project. The huge success of the latter however makes me confident that also CORE will be a huge success.

And please, do not forget to watch the recording of the webinar if you did not attend the webinar already: you can find it here.

Monday, March 1, 2021

LOINC-SDTM mapping for Drug and Toxicology Lab Test

This week I started working on a mapping between LOINC codes for Drug and Toxicology lab tests (LOINC class "DRUG/TOX") and the CDISC SDTM LB domain and controlled terminology (CT) for it.
This work is not only important for sponsors and CROs who obtain lab results accompanied by the LOINC code (which should be the routine nowadays) and need to generate SDTM datasets, but also for being able to use "Real World Data" (RWD), e.g. from Electronic Health Records (EHRs). It is also of utmost importance for being able to (semi-)automatically generate CDISC Biomedical Concepts (BCs) from LOINC panel codes (groups of LOINC codes for tests that logically belong together), a topic on which I will speak (and perform a demo) at the European CDISC Interchange 2021 in April.

The task is however, at first sight, enormous: this class contains 8314 LOINC codes (LOINC v.2.69) with 2605 distinct values for the analyte (LOINC "Component"). The published CDISC-LB mapping only contains mappings for 852 DRUG/TOX LOINC codes, so there are still 1800 "to go". Some of the work can however be automated, but it still remains a lot of work...

I first retrieved all the DRUG/TOX LOINC codes with their attributes from my local install of the LOINC database, and generated 2 worksheets (yes, I sometimes do use Excel), one with all the codes that have more than one target CDISC specimen type (LBSPEC), like for LOINC System = "Ser/Plas" ("Serum or Plasma"), as these require more than 1 mapping row in the final database. E.g. for "Ser/Plas", this will lead to 3 rows, one with LBSPEC="SERUM" (NCI code C13325), one with LBSPEC="PLASMA" (NCI code C13356) and one with LBSPEC="SERUM OR PLASMA" (NCI code C105706). The second worksheet then contains all the DRUG/TOX LOINC codes where a 1:1 mapping between the LOINC "System" and LBSPEC is expected.
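
As an illustration of the "one LOINC System, several mapping rows" idea, here is a minimal sketch. Only the three "Ser/Plas" targets and their NCI codes are taken from the text above. The function name, the row layout, the placeholder LOINC code "XXXXX-X" and the "Urine" entry (whose NCI code is deliberately left empty) are assumptions for the example, not the structure of the actual mapping database.

```python
# LOINC "System" values with more than one target CDISC specimen type.
# The "Ser/Plas" entry and its NCI codes come from the text above.
MULTI_SPECIMEN = {
    "Ser/Plas": [
        ("SERUM", "C13325"),
        ("PLASMA", "C13356"),
        ("SERUM OR PLASMA", "C105706"),
    ],
}

# LOINC "System" values with a 1:1 mapping to LBSPEC.
# Illustrative entry only; the NCI code is not filled in here.
ONE_TO_ONE = {
    "Urine": ("URINE", None),
}


def mapping_rows(loinc_code, system):
    """Return one mapping row per target specimen type for a LOINC code."""
    targets = MULTI_SPECIMEN.get(system) or [ONE_TO_ONE[system]]
    return [
        {
            "LOINC": loinc_code,
            "SYSTEM": system,
            "LBSPEC": lbspec,
            "LBSPEC_NCI": nci_code,
        }
        for lbspec, nci_code in targets
    ]


# "XXXXX-X" is a placeholder, not a real LOINC code.
for row in mapping_rows("XXXXX-X", "Ser/Plas"):
    print(row["LBSPEC"], row["LBSPEC_NCI"])
```

For "Ser/Plas" this yields the three rows described above; for a System in the 1:1 table it yields exactly one row.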

Some of the work can be automated. For most of the LOINC "System" values, a mapping to LBSPEC already exists and can easily be reused. Some additional work may have to be done for the mapping between the LOINC "Method" and LBMETHOD. Attention also has to be paid to fasting statuses and "challenges" and "post-dose" entries (if any). But most of the manual work is on mapping the analyte (LOINC "Component") to LBTESTCD/LBTEST, as this is essentially the meaning of the LBTESTCD/LBTEST pair: it represents the analyte, i.e. the compound that is measured.
What is represented by --TESTCD/--TEST pairs in SDTM differs between domains. For example, in Vital Signs (VS), VSTESTCD/VSTEST represents the property that is measured (e.g. a blood pressure). The property that is measured is not directly represented by a variable in LB. For example, if a concentration is measured, this can in LB only be seen from the actual values and units. In LOINC however, the "Property" is an essential part of the concept (one of the 5/6 "dimensions" of LOINC). In the LOINC-LB mapping published by CDISC, this has been solved by adding some "Non-Standard Variables" (NSVs) which then go into the SUPPLB dataset.

Then I started the huge work ...

For generating the mapping between the LOINC "Component" (i.e. the analyte) and LBTESTCD and LBTEST, I used the CDISC Library Browser which was of great help because it also displays "similar" ways of writing a term as well as synonyms. It also allows me to immediately add the CDISC-NCI code of LBTESTCD/LBTEST to the mapping, which is of utmost importance for connecting to other coding systems used in healthcare (like SNOMED-CT), e.g. using the Unified Medical Language System UMLS and its API and RESTful web services.

Here is a picture of a few rows of the mapping:

As I soon found out, the coverage of test codes for drug and toxicology lab testing in the CDISC-CT for LBTESTCD/LBTEST is very poor. After one day of mapping work, I estimated the coverage to be between 5 and 10%. This also means that for 100 drug/toxicology lab tests, we would need to submit 90-95 "new term requests" to CDISC for a LBTESTCD/LBTEST. Considering the 1800 codes not covered yet by the original LOINC-LB mapping, this would mean something like 1600 to 1700 "new term requests". I guess the CDISC-CT team will "not be amused" ...

This urged me to rethink the problem.

Mapping is "bad" - personally I think it should be the last resort if nothing else works. 1:1 mapping can still be acceptable (but requires a large amount of work), but we are in deep trouble when such a 1:1 mapping is not possible.

Each unique LOINC "component" (i.e. the analyte) has a code itself: the "LOINC Part Code" (LP-codes). For example, the LP code for "Albumin" is LP6118-6. The LP code for Glucose is LP14635-4. The LP code for Doxycycline (one of the many not covered by CDISC-CT) is LP14992-9. This brought me to the idea "Why not use the 'LOINC Part Code' for LBTESTCD?".

Similarly, one could then use the "LOINC Part Name" for LBTEST. 

There are a few major objections against this, some of them having to do with the use of the outdated SAS Transport 5 format for submissions, as mandated by the FDA.
The first is that LBTESTCD may not be longer than 8 characters. "LP14992-9" has 9. Also the "LOINC Part Name" sometimes has more than 40 characters. Even if we drop the "LP" from the code, we still have a problem. For example, for "LP14992-9" this would reduce the code to "14992-9", but the SDTM rules (for the sake of SAS Transport 5) state that "Values of --TESTCD must be limited to eight characters and cannot start with a number, nor may they contain characters other than letters, numbers, or underscores". So even the dash "-" is not allowed ... Dropping the dash and the check digit is in my opinion not a good idea, as the check digit is an important measure against typing errors. Note that the rules for --TESTCD/--TEST are based on making "transposal" possible in XPT datasets.

So, what we see once again, is that the SAS Transport 5 format is a "show stopper" for any "out of the box thinking".

The second thing I found out is that, with extremely few exceptions, every one of the LOINC "Component" values, i.e. the analyte, has a SNOMED-CT code. For example, the SNOMED-CT code for Doxycycline is 372478003.

So, why not use the SNOMED-CT code for the analyte as LBTESTCD, with the SNOMED-CT name for LBTEST?

OK. Same problem: SNOMED codes are often longer than 8 characters, and do start with a number, so they cannot be used for LBTESTCD due to this (stupid?) SDTM rule that is only there to satisfy the outdated SAS Transport 5 format. Using "LOINC Parts" and "SNOMED-CT" for test codes would also have the advantage that it provides links to other codes and terms. After all, both are "hierarchical" and "network" coding systems. CDISC-CT just consists of ... lists.
For example, medicinal products containing Doxycycline are characterized by the SNOMED-CT code 10504007. And a "parent" code of it is "Substance with antimalarial mechanism of action" with SNOMED-CT code 373287002.

Here is a nice diagram taken from the "SNOMED-CT browser":

Can one do something similar with CDISC-CT? No way ...

So, why isn't CDISC using SNOMED-CT at all (except in the SDTM Trial Summary (TS) domain)?

An explanation is found on the CDISC website in the "knowledge base":

The first argument (SNOMED license) is not entirely correct. It should say "most governments". Even in Europe, where we are far behind the US in using SNOMED-CT, there is almost no country anymore that does not have a country license. Even then, the "Knowledge base" applies double standards: MedDRA is not free at all for anyone, one needs to have a (rather expensive) license. So arguing that some (a minority) would have to pay to use SNOMED-CT and at the same time mentioning that MedDRA is mandated by regulatory agencies, for which one always has to pay, is in my opinion not correct, to say the least.

Also the second argument, that SNOMED-CT does not have "definitions" is entirely incorrect. Every SNOMED-CT term does have a definition.
Furthermore, the "network" properties of SNOMED-CT are not mentioned at all. They should.

Please do note that I do not plead for replacing all CDISC-CT by SNOMED-CT. There are many cases where this doesn't make sense. What we should however do is start discussing the use of LOINC codes, LOINC parts for tests and possibly for post-coordination of test parts (where SNOMED-CT also does a better job), and LOINC answers for standardized results, start discussing the better use of SNOMED-CT within CDISC and especially within submission standards, and stop trying to keep LOINC and SNOMED-CT "out of the door". It is also to the advantage of pharma sponsors to use these terminologies, and I strongly think that especially sponsors who want to start using "real world data" should push CDISC harder to embrace LOINC and SNOMED-CT, by providing webinars, trainings, implementation guides, etc.

CDISC is a founding member of the "Joint Initiative Council for Global Health Informatics Standardization" (JIC), together with LOINC and SNOMED, but this seems to be reflected in our work only marginally. That is really a pity.

And, we should not forget, clinical research is only less than 5% of healthcare, and that other 95% is using SNOMED-CT and LOINC all the way ...

Reactions are as always very welcome!
And if you also feel that CDISC should take LOINC, UCUM, and SNOMED-CT more seriously, don't tell me, tell CDISC (e.g. the CSO).

Sunday, December 13, 2020

Modernizing the CDISC SDTMIG: making the IG more "transport format neutral"

A few weeks ago, I had a long discussion, first by web conference, with follow-up by e-mail, with the CDISC Standards direction, concerning the use of LOINC, SNOMED-CT, and UCUM, especially their absence in Therapeutic Area User Guides (TAUGs). We also had a long discussion about why the outdated SAS Transport 5 format is still used (and required by regulatory authorities) and why CDISC could not convince FDA, PMDA and NMPA to move to a modern format.

One of the statements that came in per E-mail and that struck me is the following (I cite):

"In the meantime the SAS v5 limitations are being used against CDISC as antiquated and something from the past".

I answered that I very well understand that this happens.
I however need to explain why I believe this is so, and especially why I believe (personal opinion) that this is justified.

If we take a look at the SDTM Implementation Guide (further abbreviated as SDTMIG; latest version: 3.3, November 2018), then we see that it has many statements and rules that are only there because of (the limitations of) SAS Transport 5. It looks as if it never occurred to the authors that other formats could be possible. For example, users of the SDTMIG who do not submit to the FDA, PMDA or NMPA do NOT use SAS Transport 5 (SAS v5). CDISC promotes SDTM for use in academic studies (with a reasonable amount of success), but academics really do not use SAS Transport 5. Furthermore, there is a good number of mapping tools on the market that do not use SAS Transport 5 for SDTM generation, but only "export" to SAS Transport 5 in the very last step. Behind the curtain, they use either XML, JSON, or "modern SAS".
I have seen quite a number of such studies (both academic and non-submission) using either CSV (comma-separated values) or XML for storing and exchanging SDTM datasets and studies. So, SAS Transport 5 (also named "XPT") should only be one of the use cases, but the SDTMIG is written as if it were the only use case.

Essentially, and ideally, "semantic" standards like the SDTMIG should be independent from the transport format used.
HL7-FHIR nicely demonstrates this: The FHIR specification is completely neutral towards any transport format. Examples are provided for 3 (modern) formats: JSON, XML, and RDF. People could however use FHIR with any other transport format (even CSV).

The SDTMIG specification is however written as if SAS Transport 5 were the only possible transport format, which is simply not true.

I do very well understand that submission to FDA and other regulatory authorities (who still require SAS Transport 5) is a major use case of SDTM, but it is not the only one.
As I want to make a positive contribution, I will make a few proposals here on how the SDTMIG could be made more "transport format neutral", without losing the use case of XPT submissions. This could then counteract the statement "the SAS v5 limitations are being used against CDISC as antiquated and something from the past".

Let us start with section 4.2.1: Variable-naming conventions. The text is:
"Values of --TESTCD must be limited to eight characters and cannot start with a number, nor can they contain characters other than letters, numbers, or underscores. This is to avoid possible incompatibility with SAS v5 Transport files. This limitation will be in effect until the use of other formats (such as Dataset-XML) becomes acceptable to regulatory authorities".

I propose to change this into something like:
"In the case of the use of SAS v5 transport files, values of --TESTCD must be limited to eight characters, and may not contain characters other than letters, numbers, or underscores. In the case of the use of other formats such as CSV, XML or JSON, this limitation does not apply".

The next one (in the same section) is:

"Variable descriptive names (labels), up to 40 characters, should be provided as data variable labels for all variables, including Supplemental Qualifier variables".

My proposal is to update this into something like:

"Variable descriptive names (labels), not using more than 40 bytes when using SAS v5 transport files, must be provided as data variable labels for all variables, including Supplemental Qualifier variables".

Two major remarks here: first, the use of the wording "must" instead of "should", as the latter represents an expectation in non-US English, and secondly, stating "40 bytes" instead of "40 characters". The reason is that PMDA and NMPA have started requiring labels in Japanese / Chinese for certain datasets. These characters require up to 3 bytes each, meaning that for SAS Transport 5, labels cannot be longer than 13 Japanese / Chinese characters.
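To make the difference between "40 characters" and "40 bytes" concrete, here is a minimal sketch in Python. The function names are hypothetical and only serve as illustration, UTF-8 is assumed as the encoding, and the CJK test strings are synthetic (one character repeated), not real labels.

```python
def encoded_length(text: str, encoding: str = "utf-8") -> int:
    """Return the number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))


def fits_in_bytes(text: str, limit: int, encoding: str = "utf-8") -> bool:
    """Check a value against a byte limit, e.g. 40 for a label in SAS v5 Transport."""
    return encoded_length(text, encoding) <= limit


english_label = "Reported Name of Drug, Med, or Therapy"
cjk_13 = "\u836f" * 13  # 13 CJK characters, 3 bytes each in UTF-8
cjk_14 = "\u836f" * 14  # 14 CJK characters

for label in (english_label, cjk_13, cjk_14):
    print(len(label), "characters,", encoded_length(label), "bytes,",
          "fits" if fits_in_bytes(label, 40) else "does not fit")
```

For the ASCII label, the number of characters and the number of bytes are identical, so the "40 characters" wording happens to work. For the CJK strings, 13 characters still fit within 40 bytes, whereas 14 characters do not.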

I hope to be allowed to explain this further during a presentation at the next CDISC Japanese Interchange (I have submitted an abstract). I have already written about these issues here and here.

Another example where the SDTMIG implicitly assumes XPT, in Section 4.5.3:

"Sponsors may have test descriptions (--TEST) longer than 40 characters in their operational database. Since the --TEST variable is meant to serve as a label for a --TESTCD when a Findings dataset is transposed to a more horizontal format, the length of --TEST is limited to 40 characters (except as noted below) to conform to the limitations of the SAS v5 Transport format currently used for submission datasets. Therefore, sponsors have the choice to either insert the first 40 characters or a text string abbreviated to 40 characters in --TEST. Sponsors should include the full description for these variables in the study metadata in one of two ways: ..."

My proposal to make this more "transport format neutral":
"Sponsors may have test descriptions (--TEST) longer than 40 characters in their operational database. Since the --TEST variable is meant to serve as a label for a --TESTCD when a Findings dataset is transposed to a more horizontal format, the value of --TEST may not exceed 40 bytes in case the SAS v5 Transport format is used. In case another format such as CSV, XML or JSON is used, this limitation does not apply.
Therefore, but only in case the SAS v5 Transport format is used, sponsors have the choice to either insert the characters for the first 40 bytes or a text string abbreviated to take no more than 40 bytes in --TEST. ..."

Also note that in the define.xml (as it is XML), there is no limitation on the length of the labels (neither in bytes nor in number of characters). HL7-FHIR has shown us that values can be thousands of characters long, in any language...

In Section "Text Strings Greater than 200 Characters in Other Variables", the SDTMIG states:
"Some sponsors may collect data values longer than 200 characters for some variables. Because of the current requirement for the SAS v5 Transport file format, it is not possible to store the long text strings using only one variable. Therefore, the SDTMIG has defined conventions for storing long text string using multiple variables. For general-observation-class variables and supplemental qualifiers (i.e., non-standard variables), the conventions are as follows: ..."

I first propose to change the title of the section into "Use of SAS v5 Transport and text strings taking more than 200 bytes". The text can then be:
"Some sponsors may collect data values that take more than 200 bytes. In the case of the use of SAS v5 Transport, it is not possible to store the long text strings using only one variable. Therefore, the SDTMIG has defined conventions for storing long text strings using multiple variables when SAS v5 Transport is used. For general-observation-class variables and supplemental qualifiers (i.e., non-standard variables), the conventions are as follows: ..."

So, by changing the text slightly, it is possible to accommodate both the use of non-ASCII characters (taking up to 3 or 4 bytes per character) and the use of other formats such as CSV, XML and JSON.
Also note that subsequent text snippets like "The first 200 characters of text should..." must then be changed into something like "The characters for the first 200 bytes of text must ...". The reason is that the SAS-XPT limitation is not 200 characters, it is 200 bytes. Only in the case of ASCII can 1 character be stored in 1 byte.
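What "the characters for the first 200 bytes" means in practice can be sketched as follows. This is only an illustration under my own assumptions: the function name is hypothetical, UTF-8 is assumed, and the sketch only enforces the byte limit without ever cutting through a multi-byte character. It does not implement the further SDTMIG conventions on where a long text should be split.

```python
from typing import List


def split_by_bytes(text: str, limit: int = 200, encoding: str = "utf-8") -> List[str]:
    """Split text into chunks that each take at most `limit` bytes,
    never breaking in the middle of a character."""
    chunks: List[str] = []
    current = ""
    current_size = 0
    for char in text:
        size = len(char.encode(encoding))
        if size > limit:
            raise ValueError("a single character does not fit in the byte limit")
        if current_size + size > limit:
            # The next character would overflow the chunk: start a new one
            chunks.append(current)
            current, current_size = "", 0
        current += char
        current_size += size
    if current:
        chunks.append(current)
    return chunks


ascii_text = "a" * 450      # 450 bytes in UTF-8
cjk_text = "\u836f" * 100   # 100 CJK characters, 300 bytes in UTF-8

print([len(chunk) for chunk in split_by_bytes(ascii_text)])
print([len(chunk) for chunk in split_by_bytes(cjk_text)])
```

For ASCII text the chunks hold 200 characters each, but for the 3-byte characters a chunk can hold only 66 characters (198 bytes), as a 67th character would exceed the 200-byte limit.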

I will not try to list here every one of the hundreds of cases where XPT is implicitly assumed. One example is section 5.1 "Comments", with statements such as (but not limited to):
"When the comment text is longer than 200 characters, the first 200 characters of the comment will be in COVAL, ..."

to be replaced by something like:
"In the case of the use of SAS v5 Transport, when the comment text requires more than 200 bytes, then the characters for the first 200 bytes of the comment will be in COVAL, ...".

In the tables for the domains, we can then replace each instance of "The value in ... cannot be longer than 8 characters" and "The value in ... cannot be longer than 40 characters" with:
"In the case of the use of SAS v5 Transport, the value in ... cannot require more than 8 bytes" and "In the case of the use of SAS v5 Transport, the value in ... cannot require more than 40 bytes".
If that is not clear enough, one could even add: "In the case of other transport formats, this requirement does not apply".

Note that with such updates / replacements, we "get two for the price of one": we take into account the new requirements of PMDA and NMPA for the use of "Asian" characters in some datasets, and we broaden the scope of SDTM, making it more popular in the academic world and for non-FDA/PMDA/NMPA submissions.

I hope these proposals can also lead to making other (modern) formats acceptable by regulatory authorities, even beyond FDA/PMDA/NMPA, as many are thinking that there is a 1:1 relationship between SDTMIG and XPT.

After having done so, nobody will be justified anymore to say that "CDISC is antiquated and something from the past" just because of the SAS v5 Transport format!

Reactions are of course always welcome!


Tuesday, October 27, 2020

SAS Transport 5, CDISC NMPA Submissions, and Chinese characters


A number of weeks ago, I was pointed to a publication of the NMPA, the Chinese regulatory authorities, about new guidelines for CDISC submissions. Although I have not mastered the Chinese language, I could find a statement that drew my attention.

Essentially, it states: "It is recommended to use XPT version 5 (XPT V5 for short) or similar as the data submission format", followed by "The sponsor should explain the encoding used (such as utf-8, euc-cn, etc.) to avoid garbled codes in the submitted data set".

When I read this, I was pretty shocked. The reason is that SAS Transport 5 (SAS-XPT) is a thirty-year-old format from the IBM mainframe era, and that it only supports US-ASCII encoding.

So I asked some colleagues whether they could provide me a translation of the full guidance, which I received, and which confirmed my first impression.

The text also states that all labels should be in the Chinese language, and that important information like the "adverse event term", or medication names should be in the Chinese language. 

So, what is problematic about this all?

This requires some explanation about encodings, i.e. the way characters are stored as bits and bytes. There are very many encodings, but the most used nowadays is "UTF-8", as it allows for "Unicode", i.e. covering all written languages in the world. Depending on the character to be stored, UTF-8 uses 1 to 4 bytes for a single character.
US-ASCII, usually simply designated as "ASCII", is a very old encoding, only supporting "English" characters. It uses 1 byte per character. Essentially, ASCII is a subset of UTF-8.

UTF-8 is a "variable-width encoding", meaning that either 1 byte is used, or several bytes are used, depending on the character. That makes it an extremely efficient encoding.

- 1 byte: ASCII characters
- 2 bytes: Latin letters with diacritics, and other alphabets: Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic
- 3 bytes: Chinese, Japanese, Korean (CJK)
- 4 bytes: less common CJK, historic scripts, mathematical symbols, emoji

So, for Chinese characters in UTF-8 (recommended by the NMPA), one needs 3 bytes for a single character.
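The byte counts in the list above can easily be verified. The following is a small sketch in Python. The helper function name and the choice of sample characters are mine, and the characters are written as Unicode escapes to avoid any ambiguity.

```python
def utf8_width(char: str) -> int:
    """Number of bytes that UTF-8 needs for a single character."""
    return len(char.encode("utf-8"))


samples = [
    ("A", "Latin capital letter A (ASCII)"),
    ("\u00e9", "Latin small letter e with acute accent"),
    ("\u03a9", "Greek capital letter omega"),
    ("\u836f", "CJK ideograph"),
    ("\U0001F600", "emoji"),
]

for char, description in samples:
    print(f"{description}: {utf8_width(char)} byte(s)")

# ASCII text is stored byte-for-byte identically in ASCII and in UTF-8
assert "AETERM".encode("ascii") == "AETERM".encode("utf-8")
```

The last line illustrates why ASCII can be regarded as a subset of UTF-8: a text containing only ASCII characters yields exactly the same bytes in both encodings.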

The EUC-CN encoding, which is also mentioned, uses 2 bytes per character, but it is not supported by many systems, and it is limited in the number of Chinese characters it supports ("Simplified Chinese").

What are the consequences of using Chinese characters in SAS-XPT for SDTM/SEND/ADaM?

The SDTM specification states that variable names and test codes may not be longer than 8 characters, and labels and test names not longer than 40 characters. This limitation is due to the SAS-XPT format mandated by the FDA, and is not entirely correct. The real statement should be that variable names and test codes may not occupy more than 8 bytes. The statement "8 characters" is entirely based on the assumption that ASCII encoding is used.

So, for Chinese characters encoded as UTF-8, this means that variable names may not be longer than int(8/3) = 2 Chinese characters, and labels not longer than int(40/3) = 13 Chinese characters.
For variable values, the limit becomes int(200/3) = 66 Chinese characters.
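The arithmetic can be written down in a few lines of Python. This is only a sketch: the function name and the dictionary are mine, and it assumes that every character takes the same number of bytes (3 for common Chinese characters in UTF-8, 2 in EUC-CN).

```python
def max_characters(byte_limit: int, bytes_per_char: int) -> int:
    """Largest number of fixed-width characters that fits in byte_limit bytes."""
    return byte_limit // bytes_per_char


# Byte limits of SAS Transport 5 as discussed in the text
xpt_limits = {
    "variable name / test code": 8,
    "label / test name": 40,
    "variable value": 200,
}

for item, limit in xpt_limits.items():
    print(f"{item}: {max_characters(limit, 3)} characters in UTF-8, "
          f"{max_characters(limit, 2)} characters in EUC-CN")
```

The integer division discards the remainder: a 40-byte label has room for 13 characters of 3 bytes each (39 bytes), and the last byte remains unused.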
For variable names and test codes, there is no real problem, as NMPA probably accepts that the variable names and test codes are "English", i.e. use ASCII encoding.
A huge problem however occurs for variable labels.

For example, when I translate the label "Reported Name of Drug, Med, or Therapy", I get "报告药品报告药品名称,药物治" which is ... 14 Chinese characters. Or: "Dictionary-Derived Term for the Healthcare Encounter", which translates as "词典中针对医疗保健遇到的术", also being 14 characters, i.e. taking 42 bytes, which is beyond the 40-byte limit. For CDISC test names, I haven't tried yet, but I suppose that some of them will, when translated to Chinese, be longer than 13 Chinese characters, and thus take more than 40 bytes. The reason for the 40-character limit (when using English, and supposing ASCII encoding) is that a test code becomes a column label when transposing, and labels may not be ...

Also questionable is what to do when a variable value takes more than 66 Chinese characters. How would one then need to split it? In any case, such limitations are outdated: they were acceptable in the time of the punch cards, but we are living in the 21st century now.

Another huge problem is that there are currently no viewers at all that support non-ASCII encoding in SAS-XPT files. The often used "SASViewer" (currently not available from SAS anymore) and the "SAS Universal Viewer" do not support any encodings other than ASCII for XPT files. This was confirmed to me by SAS Support. For example, when I load an XPT file with Chinese characters, I get:

Some stated that using SAS Transport Version 8 would take away all of the above objections and limitations. This is however not entirely true: SAS Transport 8 also assumes that all characters are encoded as ASCII, and does not support non-ASCII encodings like UTF-8 or EUC-CN. This also means that, just like for version 5, there are no viewers: if I were to load a SAS Transport 8 file with Chinese characters into the SASViewer or SAS Universal Viewer, I would get the same result. The viewer would not recognize the Chinese characters (as it assumes ASCII encoding), and would not display them correctly.

So, why did NMPA choose SAS Transport 5?

Well, I asked them, but did not get an answer. So I put the question to people who have good connections to the NMPA, like some members of the "China CDISC Coordination Committee" (C3C). From the discussions with these excellent colleagues, I got the strong impression that NMPA just wants to more or less copy the requirements of the FDA, except for the use of the language. NMPA did not seem to have thought about (or not understood) what the consequences of using SAS-XPT are.

Are there better solutions?

Of course, there are - we are living in the 21st century! Already in 2014, CDISC published the "Dataset-XML" standard, which was exactly meant as a replacement for SAS-XPT. It is based on XML, a modern, worldwide, really open standard (in my opinion, SAS-XPT is semi-proprietary), completely vendor-neutral (SAS-XPT isn't), and used in every industry, so not only in healthcare or clinical research (clinical research is the only industry still using XPT). XML supports any encoding, with the default encoding being ... UTF-8. XML does not have any of the limitations of SAS-XPT. Furthermore, Dataset-XML can be written and read by any modern software, including SAS and R statistical packages. Other CDISC standards such as Define-XML and ODM also use the XML format. Dataset-XML is even based on both ODM and Define-XML, making it an "end-to-end" solution. That is also why the combination of Define-XML with Dataset-XML is often called "a marriage blessed in heaven".

So I also asked my Chinese colleagues why NMPA is not recommending the CDISC Dataset-XML format. The question that came back was whether FDA and PMDA already accept Dataset-XML. When I then explained about the Dataset-XML FDA pilot, and that the introduction of Dataset-XML has been put on ice, I got the answer (I cite): "If FDA/PMDA adopt XPT only, it will be difficult for NMPA (who has just joined ICH) to be the first agency to adopt dataset-xml. We may have to wait and see what decision other agencies to make".
What this has to do with ICH, I do not understand, as ICH does not mandate the SAS-XPT format. On the contrary, ICH's eCTD (electronic Common Technical Document) is based on ... XML.

For me it is clear that NMPA believes that by adopting/mandating SAS-XPT, it avoids risks. However, just the opposite is true: in my opinion, the use of SAS-XPT with Chinese characters will lead to huge problems at both the NMPA and at sponsors.

Some people asked me about my ideas about alternatives like using UTF-8 encoded CSV (comma separated values). So, I tested this and even added it to the list of supported formats in our famous SDTM-ETL mapping software. Such a CSV file then looks like (visualized in NotePad++):

Even such an extremely simple format would be a considerably better choice than SAS-XPT. When asked for a ranking for "suitability for Chinese characters in SDTM", I made the following table:

Transport format | Suitability Score
UTF-8 encoded CSV | [score not shown]
SAS Transport 8 | [score not shown]
SAS Transport 5 | [score not shown]


The SAS Transport 5 (SAS-XPT) format is the worst possible choice as a transport format for CDISC submissions with Chinese characters to the NMPA. It was developed for IBM mainframes (IBM mainframes did not support Chinese characters), and was never meant for anything other than "English" characters and ASCII encoding. It is not suitable at all for UTF-8 encoding and was never developed for that use case.

Several customers have come to me with questions about the new guidance of the NMPA and how to deal with it. My advice to them has been to negotiate the submission of their data sets in CDISC Dataset-XML format. If that is refused, they should propose UTF-8 encoded CSV, as that does not have any of the XPT limitations, is a simple format, and is still easily read by software packages such as SAS and R.
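To illustrate how little is needed to write and read UTF-8 encoded CSV, here is a minimal sketch using only the Python standard library. It is not how our software does it: the function names are hypothetical, and the column selection and the values (including the Chinese medication name) are invented for illustration only.

```python
import csv
import io
from typing import List

# Illustrative rows only: a header line followed by one record
rows = [
    ["STUDYID", "USUBJID", "CMTRT"],
    ["STUDY01", "STUDY01-001", "\u963f\u53f8\u5339\u6797"],
]


def to_utf8_csv(table: List[List[str]]) -> bytes:
    """Serialize rows to CSV and encode the result as UTF-8 bytes."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(table)
    return buffer.getvalue().encode("utf-8")


def from_utf8_csv(data: bytes) -> List[List[str]]:
    """Decode UTF-8 bytes and parse them back into rows."""
    text = data.decode("utf-8")
    return list(csv.reader(io.StringIO(text, newline="")))


data = to_utf8_csv(rows)
print(len(data), "bytes written")
print(from_utf8_csv(data))
```

The Chinese value takes 3 bytes per character in the file, but there is no limit on the length of a field, and the rows come back unchanged after reading.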