Tuesday, October 27, 2020

SAS Transport 5, CDISC NMPA Submissions, and Chinese characters

 

A number of weeks ago, I was pointed to a publication of the NMPA, the Chinese regulatory authority, about new guidelines for CDISC submissions. Although I have not mastered the Chinese language, I could find a statement in it that drew my attention.

Essentially, it states: "It is recommended to use XPT version 5 (XPT V5 for short) or similar as the data submission format", followed by "The sponsor should explain the encoding used (such as utf-8, euc-cn, etc.) to avoid garbled codes in the submitted data set".

When I read this, I was pretty shocked. The reason is that SAS Transport 5 (SAS-XPT) is a thirty-year-old format from the IBM mainframe era, and that it only supports US-ASCII encoding.

So I asked some colleagues whether they could provide me with a translation of the full guidance, which I received, and which confirmed my first impression.

The text also states that all labels should be in the Chinese language, and that important information like the "adverse event term", or medication names should be in the Chinese language. 

So, what is problematic about this all?

This requires some explanation about encodings, i.e. the way characters are stored as bits and bytes. There are very many encodings, but the most used nowadays is "UTF-8", as it allows for "Unicode", i.e. covering all written languages in the world. Depending on the character to be stored, UTF-8 uses 1 to 4 bytes for a single character.
US-ASCII, usually simply designated as "ASCII", is a very old encoding, only supporting "English" characters. It uses 1 byte per character. Essentially, ASCII is a subset of UTF-8.

UTF-8 is a "variable-width encoding", meaning that either 1 byte is used, or several bytes are used, depending on the character. That makes it an extremely efficient encoding.

- 1 byte: ASCII characters
- 2 bytes: accented Latin letters and other alphabets: Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic
- 3 bytes: Chinese, Japanese, Korean (CJK)
- 4 bytes: less common CJK, historic scripts, some mathematical symbols, emoji

So, for Chinese characters in UTF-8 (recommended by the NMPA), one needs 3 bytes for a single character.
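The byte counts listed above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `utf8_len` and the sample characters are my own choices for illustration and do not come from the NMPA guidance.

```python
# Sample characters, one per UTF-8 width class (illustrative choices).
samples = {
    "A": "Latin letter (ASCII)",
    "\u00e9": "accented Latin letter (e with acute)",
    "Ω": "Greek letter",
    "药": "Chinese character",
    "😀": "emoji",
}


def utf8_len(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


for char, description in samples.items():
    print(f"{char!r} ({description}): {utf8_len(char)} byte(s)")
```

The number of characters (`len(text)`) and the number of bytes (`utf8_len(text)`) only coincide for pure ASCII text.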

The EUC-CN encoding, which the guidance also mentions, uses 2 bytes per Chinese character, but is not supported by many systems, and is limited in the number of Chinese characters it supports ("Simplified Chinese").
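As a small illustration of the difference between the two encodings, the sketch below compares the byte length of one simplified character in UTF-8 and EUC-CN, and checks whether a traditional character can be encoded at all. It assumes Python's `euc_cn` codec name (an alias of its GB2312 codec). The helper names and the two sample characters are my own.

```python
def encoded_len(text: str, encoding: str) -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


def can_encode(text: str, encoding: str) -> bool:
    """True if every character of the text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


simplified = "药"   # simplified form of the character for "medicine"
traditional = "藥"  # traditional form of the same character

print(encoded_len(simplified, "utf-8"))   # bytes needed in UTF-8
print(encoded_len(simplified, "euc_cn"))  # bytes needed in EUC-CN
print(can_encode(traditional, "utf-8"))   # UTF-8 covers all of Unicode
print(can_encode(traditional, "euc_cn"))  # EUC-CN is limited to its simplified set
```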

What are the consequences of using Chinese characters in SAS-XPT for SDTM/SEND/ADaM?

The SDTM specification states that variable names and test codes may not be longer than 8 characters, and labels and test names not longer than 40 characters. This limitation stems from the SAS-XPT format mandated by the FDA, and its wording is not entirely correct. The real statement should be that variable names and test codes may not occupy more than 8 bytes, and labels and test names not more than 40 bytes. The statement "8 characters" is entirely based on the assumption that ASCII encoding is used.

So, for Chinese characters encoded as UTF-8, this means that variable names may not be longer than int(8/3) = 2 Chinese characters, and labels not longer than int(40/3) = 13 Chinese characters.
For variable values, the limit becomes int(200/3) = 66 Chinese characters.
For variable names and test codes, there is no real problem, as NMPA probably accepts that the variable names and test codes are "English", i.e. use ASCII encoding.
A huge problem however occurs for variable labels.
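The arithmetic above can be written down as a short sketch. The byte limits (8, 40, 200) and the 3 bytes per Chinese character are the ones used in this post. The function names and the sample label (one character repeated) are hypothetical and only serve as illustration.

```python
# Byte limits as discussed in this post.
XPT_LIMITS_BYTES = {"variable name": 8, "label": 40, "value": 200}
BYTES_PER_CHINESE_CHAR_UTF8 = 3


def max_chars(limit_bytes: int, bytes_per_char: int = BYTES_PER_CHINESE_CHAR_UTF8) -> int:
    """Largest number of whole characters that fit in limit_bytes."""
    return limit_bytes // bytes_per_char


def fits(text: str, limit_bytes: int, encoding: str = "utf-8") -> bool:
    """True if the encoded text occupies at most limit_bytes bytes."""
    return len(text.encode(encoding)) <= limit_bytes


for kind, limit in XPT_LIMITS_BYTES.items():
    print(f"{kind}: {limit} bytes -> at most {max_chars(limit)} Chinese characters")

# Hypothetical labels: the same character repeated 13 and 14 times.
print(fits("药" * 13, XPT_LIMITS_BYTES["label"]))  # 39 bytes
print(fits("药" * 14, XPT_LIMITS_BYTES["label"]))  # 42 bytes
```

A check like `fits` counts bytes rather than characters, which is what the format limit actually constrains.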

For example, when I translate the label "Reported Name of Drug, Med, or Therapy", I get "报告药品报告药品名称,药物治" which is ... 14 Chinese characters. Or: "Dictionary-Derived Term for the Healthcare Encounter", which translates as "词典中针对医疗保健遇到的术", also being 14 characters, i.e. taking 42 bytes, which is beyond the 40 byte limit. For CDISC test names, I haven't tried yet, but I suppose that some of them will, when translated to Chinese, be longer than 13 Chinese characters, and thus take more than 40 bytes. The reason for the 40-character limit (when using English, and supposing ASCII encoding) is that a test name becomes a column label when transposing, and labels may not be ...

It is also unclear what to do when a variable value takes more than 66 Chinese characters. How would one then need to split it? In any case, such limitations are outdated: they were acceptable in the time of punch cards, but we are living in the 21st century now.
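One conceivable way to split such a value is to cut it into pieces of at most 200 bytes without ever cutting through a multi-byte character. The sketch below is only an illustration of that idea. It is not a method prescribed by the NMPA or by CDISC, the function name `split_by_bytes` is hypothetical, and it ignores word or sentence boundaries entirely.

```python
def split_by_bytes(text: str, limit: int = 200, encoding: str = "utf-8") -> list[str]:
    """Split text into chunks whose encoded size is at most `limit` bytes,
    never breaking a character in the middle."""
    chunks: list[str] = []
    current = ""
    current_size = 0
    for char in text:
        size = len(char.encode(encoding))
        if size > limit:
            raise ValueError("a single character does not fit in the limit")
        if current_size + size > limit:
            # The next character would overflow: close the current chunk.
            chunks.append(current)
            current, current_size = "", 0
        current += char
        current_size += size
    if current:
        chunks.append(current)
    return chunks


# Hypothetical value of 100 Chinese characters (300 bytes in UTF-8).
value = "药" * 100
parts = split_by_bytes(value)
print([len(part) for part in parts])
print([len(part.encode("utf-8")) for part in parts])
```

Note that a naive cut of the raw bytes at position 200 would land in the middle of the 67th character and produce an invalid byte sequence.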

Another huge problem is that there are currently no viewers at all that support non-ASCII encodings in SAS-XPT files. The often used "SASViewer" (currently not available from SAS anymore) and the "SAS Universal Viewer" [https://support.sas.com/downloads/package.htm?pid=667] do not support any encoding other than ASCII for XPT files. This was confirmed to me by SAS Support. For example, when I load an XPT file with Chinese characters, the characters are not displayed correctly.


Some have stated that using SAS Transport Version 8 would take away all of the above objections and limitations. This is however not entirely true: SAS Transport 8 also assumes that all characters are encoded as ASCII, and does not support non-ASCII encodings like UTF-8 or EUC-CN. This also means that, just like for version 5, there are no viewers: if I loaded a SAS Transport 8 file with Chinese characters into the SASViewer or SAS Universal Viewer, I would get the same result. The viewer would not recognize the Chinese characters (as it assumes ASCII encoding), and they would not display correctly.
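Why text becomes garbled when software assumes a single-byte encoding can be illustrated with a short sketch. I do not know how the SAS viewers interpret the bytes internally. The use of Latin-1 below is purely an assumption for demonstration, as are the helper names and the sample text.

```python
def decodes_as_ascii(raw: bytes) -> bool:
    """True if the bytes are valid ASCII."""
    try:
        raw.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


text = "药物"               # two Chinese characters
raw = text.encode("utf-8")  # six bytes, all outside the ASCII range

# A strict ASCII reader cannot interpret these bytes at all.
print(decodes_as_ascii(raw))

# A reader assuming one byte per character (here: Latin-1, an assumption)
# turns the six bytes into six unrelated characters.
garbled = raw.decode("latin-1")
print(len(text), len(garbled))
print(repr(garbled))

# Only a reader that knows the real encoding recovers the original text.
print(raw.decode("utf-8") == text)
```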

So, why did NMPA choose SAS Transport 5?

Well, I asked them, but did not get an answer. So I put the question to people who have good connections to the NMPA, like some members of the "China CDISC Coordination Committee" (C3C). From the discussions with these excellent colleagues, I got the strong impression that NMPA just wants to more or less copy the requirements of the FDA, except for the language used. NMPA does not seem to have thought about (or understood) what the consequences of using SAS-XPT are.

Are there better solutions?

Of course, there are - we are living in the 21st century! Already in 2014, CDISC published the "Dataset-XML" standard, which was meant exactly as a replacement for SAS-XPT. It is based on XML, a modern, truly open worldwide standard (in my opinion, SAS-XPT is semi-proprietary), completely vendor-neutral (SAS-XPT isn't), and used in every industry, so not only in healthcare or clinical research (clinical research is the only industry still using XPT). XML supports any encoding, with the default encoding being ... UTF-8. XML does not have any of the limitations of SAS-XPT. Furthermore, Dataset-XML can be written and read by any modern software, including SAS and R. Other CDISC standards such as Define-XML and ODM also use the XML format. Dataset-XML is in fact based on both ODM and Define-XML, making it an "end-to-end" solution. That is also why the combination of Define-XML with Dataset-XML is often called "a marriage blessed in heaven".
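That XML carries Chinese text without any special measures can be shown with Python's standard library. The fragment below is only loosely modeled on Dataset-XML: it omits the ODM root element, namespaces and everything else a valid file needs. The OIDs and the value (the Chinese word for aspirin) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Build a tiny, simplified record (not a schema-valid Dataset-XML file).
record = ET.Element("ItemGroupData", {"ItemGroupOID": "IG.CM"})
ET.SubElement(record, "ItemData", {"ItemOID": "IT.CM.CMTRT", "Value": "阿司匹林"})

# Serialize as UTF-8 bytes, as they would be stored in a file.
xml_bytes = ET.tostring(record, encoding="utf-8")
print(xml_bytes.decode("utf-8"))

# Read the bytes back and extract the value again.
parsed = ET.fromstring(xml_bytes)
value = parsed.find("ItemData").get("Value")
print(value, len(value), "characters,", len(value.encode("utf-8")), "bytes")
```

There is no 8, 40 or 200 byte limit involved anywhere: the value is simply as long as it needs to be.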

I also asked my Chinese colleagues why NMPA is not recommending the CDISC Dataset-XML format. The question that came back was whether FDA and PMDA already accept Dataset-XML. When I then explained about the Dataset-XML FDA pilot, and that the introduction of Dataset-XML has been put on ice, I got the answer (I cite): "If FDA/PMDA adopt XPT only, it will be difficult for NMPA (who has just joined ICH) to be the first agency to adopt dataset-xml. We may have to wait and see what decision other agencies to make".
What this has to do with ICH, I do not understand, as ICH does not mandate the SAS-XPT format. On the contrary, ICH's eCTD (electronic Common Technical Document) is based on ... XML.

For me it is clear that NMPA believes that by adopting/mandating SAS-XPT, it avoids risks. However, just the opposite is true: in my opinion, the use of SAS-XPT with Chinese characters will lead to huge problems at both the NMPA and at sponsors.

Some people asked me what I think about alternatives such as UTF-8 encoded CSV (comma-separated values). So I tested this and even added it to the list of supported formats in our famous SDTM-ETL mapping software. Such a CSV file can be viewed directly in a text editor such as Notepad++.

Even such an extremely simple format would be a considerably better choice than SAS-XPT. When asked for a ranking for "suitability for Chinese characters in SDTM", I made the following table:

Transport format      Suitability score
CDISC Dataset-XML     100
UTF-8 encoded CSV     50
SAS Transport 8       20
SAS Transport 5       10

Conclusions

The SAS Transport 5 (SAS-XPT) format is the worst possible choice as a transport format for CDISC submissions with Chinese characters to the NMPA. It was developed for IBM mainframes (which did not support Chinese characters), and was never meant for anything other than "English" characters and ASCII encoding. It is not suitable at all for UTF-8 encoding and was never developed for that use case.

Several customers have come to me with questions about the new guidance of the NMPA and how to deal with it. My advice to them has been to negotiate the submission of their data sets in CDISC Dataset-XML format. If that is refused, they should propose UTF-8 encoded CSV, as that does not have any of the XPT limitations, is a simple format, and can still be read easily by software packages such as SAS and R.
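To illustrate how little is needed for the UTF-8 encoded CSV fallback, here is a minimal round-trip sketch using only Python's standard library. The column names and the single row are invented for the example, and the data is kept in memory. When writing a real file, one would pass `encoding="utf-8"` to `open()`.

```python
import csv
import io

# Invented example data: one record with a Chinese medication name.
rows = [{"USUBJID": "001", "CMTRT": "阿司匹林"}]


def write_csv(records: list[dict]) -> bytes:
    """Serialize the records as UTF-8 encoded CSV bytes."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue().encode("utf-8")


def read_csv(data: bytes) -> list[dict]:
    """Parse UTF-8 encoded CSV bytes back into a list of records."""
    buffer = io.StringIO(data.decode("utf-8"), newline="")
    return list(csv.DictReader(buffer))


csv_bytes = write_csv(rows)
print(csv_bytes.decode("utf-8"))
print(read_csv(csv_bytes) == rows)
```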