Sunday, December 13, 2020

Modernizing the CDISC SDTMIG: making the IG more "transport format neutral"

A few weeks ago, I had a long discussion, first by web conference and then by follow-up e-mail, with the CDISC standards management about the use of LOINC, SNOMED-CT, and UCUM, and especially their absence from the Therapeutic Area User Guides (TAUGs). We also discussed at length why the outdated SAS Transport 5 format is still used (and required by regulatory authorities), and why CDISC has not been able to convince FDA, PMDA and NMPA to move to a modern format.

One of the statements made by e-mail that struck me is the following (I quote):

"In the meantime the SAS v5 limitations are being used against CDISC as antiquated and something from the past".

I answered that I understand very well that this happens.
I do, however, need to explain why I believe this is so, and especially why I believe (personal opinion) that it is justified.

If we take a look at the SDTM Implementation Guide (further abbreviated as SDTMIG; latest version: 3.3, November 2018), we see that it contains many statements and rules that are only there because of (the limitations of) SAS Transport 5. It looks as if it never occurred to the authors that other formats could be possible. For example, users of the SDTMIG who do not submit to the FDA, PMDA or NMPA do NOT use SAS Transport 5 (SAS v5). CDISC promotes SDTM for use in academic studies (with a reasonable amount of success), but academics really do not use SAS Transport 5. Furthermore, a good number of mapping tools on the market do not use SAS Transport 5 for SDTM generation, but only "export" to SAS Transport 5 in the very last step. Behind the curtain, they use either XML, JSON, or "modern SAS".
I have seen quite a number of such studies (both academic and non-submission) using either CSV (comma-separated values) or XML for storing and exchanging SDTM datasets and studies. So, SAS Transport 5 (also named "XPT") should be only one of the use cases, but the SDTMIG is written as if it were the only use case.

Essentially, and ideally, "semantic" standards like the SDTMIG should be independent of the transport format used.
HL7-FHIR nicely demonstrates this: The FHIR specification is completely neutral towards any transport format. Examples are provided for 3 (modern) formats: JSON, XML, and RDF. People could however use FHIR with any other transport format (even CSV).

The SDTMIG specification is, however, written as if SAS Transport 5 were the only possible transport format, which is simply not true.

I do very well understand that submission to FDA and other regulatory authorities (who still require SAS Transport 5) is a major use case of SDTM, but it is not the only one.
As I want to make a positive contribution, I will make a few proposals here on how the SDTMIG could be made more "transport format neutral", without losing the use case of XPT submissions. This could then counteract the statement "the SAS v5 limitations are being used against CDISC as antiquated and something from the past".

Let us start with section 4.2.1: Variable-naming conventions. The text is:
"Values of --TESTCD must be limited to eight characters and cannot start with a number, nor can they contain characters other than letters, numbers, or underscores. This is to avoid possible incompatibility with SAS v5 Transport files. This limitation will be in effect until the use of other formats (such as Dataset-XML) becomes acceptable to regulatory authorities".

I propose to change this into something like:
"In the case of the use of SAS v5 transport files, values of --TESTCD must be limited to eight characters and may not contain characters other than letters, numbers, or underscores. In the case of the use of other formats such as CSV, XML or JSON, this limitation does not apply".
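
To illustrate what such a format-dependent rule could look like in a validation tool, here is a minimal sketch in Python. The function name `is_valid_testcd` and the format keyword "xpt" are made up for this illustration, and the XPT branch follows the current SDTMIG wording quoted above (including "cannot start with a number"). The rule for the other formats (any non-empty value) is purely my assumption, as the SDTMIG does not define one.

```python
import re

# Rule quoted from SDTMIG section 4.2.1, applied only for SAS v5 Transport:
# at most 8 characters, only letters, digits and underscores, no leading digit.
_XPT_TESTCD = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,7}")


def is_valid_testcd(value: str, transport_format: str) -> bool:
    """Check a --TESTCD value, applying the XPT limits only when XPT is used."""
    if transport_format.lower() == "xpt":
        return _XPT_TESTCD.fullmatch(value) is not None
    # Assumption for this sketch: other formats (CSV, XML, JSON, ...)
    # only require a non-empty value.
    return len(value) > 0


print(is_valid_testcd("SYSBP", "xpt"))        # True
print(is_valid_testcd("SYSTOLICBP", "xpt"))   # False: 10 characters
print(is_valid_testcd("SYSTOLICBP", "json"))  # True: limit does not apply
```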

The next one (in the same section) is:

"Variable descriptive names (labels), up to 40 characters, should be provided as data variable labels for all variables, including Supplemental Qualifier variables".

My proposal is to update this into something like:

"Variable descriptive names (labels), not using more than 40 bytes when using SAS v5 transport files, must be provided as data variable labels for all variables, including Supplemental Qualifier variables".

Two major remarks here. First, I use the wording "must" instead of "should", as the latter merely expresses an expectation (rather than a requirement) in non-US English. Second, I state "40 bytes" instead of "40 characters". The reason is that PMDA and NMPA have started requiring labels in Japanese / Chinese for certain datasets. These characters require up to 3 bytes each, meaning that for SAS Transport 5, labels cannot be longer than 13 Japanese / Chinese characters.
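
The difference between counting characters and counting bytes is easy to demonstrate. The following is a small sketch in Python, assuming the labels are encoded as UTF-8 (in which Japanese and Chinese characters typically take 3 bytes each). The function name `fits_xpt_label` is hypothetical and only serves the illustration.

```python
def fits_xpt_label(label: str, limit: int = 40, encoding: str = "utf-8") -> bool:
    """Return True if the encoded label takes no more than `limit` bytes."""
    return len(label.encode(encoding)) <= limit


ascii_label = "A" * 40      # 40 characters, 40 bytes
kanji_label_13 = "検" * 13  # 13 characters, 39 bytes in UTF-8
kanji_label_14 = "検" * 14  # 14 characters, 42 bytes in UTF-8

print(fits_xpt_label(ascii_label))     # True
print(fits_xpt_label(kanji_label_13))  # True
print(fits_xpt_label(kanji_label_14))  # False, although only 14 characters
print(40 // 3)                         # 13 characters of 3 bytes fit in 40 bytes
```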

I hope to be allowed to explain this further during a presentation at the next CDISC Japanese Interchange (I have submitted an abstract). I have already written about these issues here and here.

Another example where the SDTMIG implicitly assumes XPT can be found in Section 4.5.3:

"Sponsors may have test descriptions (--TEST) longer than 40 characters in their operational database. Since the --TEST variable is meant to serve as a label for a --TESTCD when a Findings dataset is transposed to a more horizontal format, the length of --TEST is limited to 40 characters (except as noted below) to conform to the limitations of the SAS v5 Transport format currently used for submission datasets. Therefore, sponsors have the choice to either insert the first 40 characters or a text string abbreviated to 40 characters in --TEST. Sponsors should include the full description for these variables in the study metadata in one of two ways: ..."

My proposal to make this more "transport format neutral":
"Sponsors may have test descriptions (--TEST) longer than 40 characters in their operational database. Since the --TEST variable is meant to serve as a label for a --TESTCD when a Findings dataset is transposed to a more horizontal format, the value of --TEST may not exceed 40 bytes when SAS v5 Transport is used. When another format such as CSV, XML or JSON is used, this limitation does not apply.
Therefore, but only when SAS v5 Transport is used, sponsors have the choice to insert in --TEST either the characters for the first 40 bytes or an abbreviated text string taking no more than 40 bytes. ..."

Also note that in the define.xml (as it is XML), there is no limitation on the length of the labels (neither in bytes nor in number of characters). HL7-FHIR has shown us that values can be thousands of characters long, in any language...

In Section 4.5.3.2 "Text Strings Greater than 200 Characters in Other Variables", the SDTMIG states:
"Some sponsors may collect data values longer than 200 characters for some variables. Because of the current requirement for the SAS v5 Transport file format, it is not possible to store the long text strings using only one variable. Therefore, the SDTMIG has defined conventions for storing long text string using multiple variables. For general-observation-class variables and supplemental qualifiers (i.e., non-standard variables), the conventions are as follows: ..."

I first propose to change the title of the section into "Use of SAS v5 Transport and text strings taking more than 200 bytes". The text can then be:
"Some sponsors may collect data values that take more than 200 bytes. In the case of the use of SAS v5 Transport, it is not possible to store the long text strings using only one variable. Therefore, the SDTMIG has defined conventions for storing long text strings using multiple variables when SAS v5 Transport is used. For general-observation-class variables and supplemental qualifiers (i.e., non-standard variables), the conventions are as follows: ..."

So, by changing the text slightly, it is possible to accommodate both the use of non-ASCII characters (taking up to 3 or 4 bytes per character) and the use of other formats such as CSV, XML and JSON.
Also note that subsequent text snippets like "The first 200 characters of text should..." must then be changed into something like "The characters for the first 200 bytes of text must ...". The reason is that the SAS-XPT limitation is not 200 characters, it is 200 bytes. Only in the case of ASCII can 1 character be stored in 1 byte.
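
What "the characters for the first 200 bytes" means in practice can be sketched as follows in Python. This is only an illustration under my own assumptions: UTF-8 encoding, a hypothetical function name `split_by_bytes`, and a simple cut whenever the byte limit would be exceeded. It does not implement the SDTMIG's own splitting conventions (which are not reproduced in full above); it only shows that the cut must be made on byte counts without breaking a multi-byte character in two.

```python
from typing import List


def split_by_bytes(text: str, limit: int = 200, encoding: str = "utf-8") -> List[str]:
    """Split text into pieces that each take at most `limit` bytes when encoded.

    A character is never split across two pieces.
    """
    chunks: List[str] = []
    current = ""
    current_size = 0
    for ch in text:
        size = len(ch.encode(encoding))
        if size > limit:
            raise ValueError("a single character does not fit in the byte limit")
        if current_size + size > limit:
            # Adding this character would exceed the limit: start a new piece.
            chunks.append(current)
            current, current_size = "", 0
        current += ch
        current_size += size
    if current:
        chunks.append(current)
    return chunks


ascii_text = "x" * 450     # 450 bytes
japanese_text = "あ" * 100  # 100 characters, 300 bytes in UTF-8

print([len(c) for c in split_by_bytes(ascii_text)])     # [200, 200, 50]
print([len(c) for c in split_by_bytes(japanese_text)])  # [66, 34]
```

With ASCII text, each piece holds 200 characters. With the Japanese text, the first piece holds only 66 characters (198 bytes), as a 67th character would exceed the 200-byte limit.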

I will not try to list here every one of the hundreds of cases where XPT is implicitly assumed. One example (among many) can be found in Section 5.1 "Comments":
"When the comment text is longer than 200 characters, the first 200 characters of the comment will be in COVAL, ..."

to be replaced by something like:
"In the case of the use of SAS v5 Transport, when the comment text requires more than 200 bytes, then the characters for the first 200 bytes of the comment will be in COVAL, ...".

In the tables for the domains, we can then replace each instance of "The value in ... cannot be longer than 8 characters" and "The value in ... cannot be longer than 40 characters" with:
"In the case of the use of SAS v5 Transport, the value in ... cannot take more than 8 bytes" and "In the case of the use of SAS v5 Transport, the value in ... cannot take more than 40 bytes".
If that is not clear enough, one could even add: "In the case of other transport formats, this requirement does not apply".

Note that with such updates / replacements, we "get two for the price of one": we take into account the new requirements of PMDA and NMPA for the use of "Asian" characters in some datasets, and we broaden the scope of SDTM, making it more popular in the academic world as well as for non-FDA/PMDA/NMPA submissions.

I hope these proposals can also lead to other (modern) formats becoming acceptable to regulatory authorities, even beyond FDA/PMDA/NMPA, as many people think that there is a 1:1 relationship between SDTMIG and XPT.

Once this has been done, nobody will be justified any longer in saying that "CDISC is antiquated and something from the past" just because of the SAS v5 Transport format!

Reactions are of course always welcome!