Saturday, November 24, 2018

New FDA Validation Rules - October 2018

The FDA has recently updated its SDTM and SEND validation rules, which can be downloaded as an Excel file (!) from the FDA website. Immediately after, Pinnacle21 published an overview (also in Excel) of all the changes relative to the prior version (February 2017). So I wonder whether these rules were developed by the FDA itself, or whether the work was outsourced to a specific software company.

So I started evaluating this set of rules, and especially the changes as published by Pinnacle21.

Unfortunately, once again, the set of rules has been published as a worksheet, which is machine-readable in some way, but not machine-interpretable. The worksheet does not even contain pseudo-code for the execution of the rules, as was e.g. published by the SDTM team when they published their "SDTMIG Conformance Rules" in early 2017.

As in prior versions, one must sometimes just guess what is being tested anyway. For example rule SD1321/FDAB001:

What is being tested here?
The "business rule" is identical to the one of P21 rule SD1097, so what is the difference? One can only guess from the "FDA Validator" message, which has, in brackets: "Missing SUPPAE". So, if this is indeed what is being tested, the rule should say: "SUPPAE must be present in the submission". It doesn't.

Also interesting is that the line with "SD1321" is marked as "new" (in yellow). As the line with "SD1097" isn't, but states that it is the same FDA rule, it looks as if this is just an additional validation test for the "same", already existing (February 2017) rule. The same two lines also appear in the FDA worksheet. But what really is or must be tested remains in the dark. Probably only one company knows this …

Another example of a new rule that is (i.m.o.) completely unclear is rule SD2006/FDAB003:

What is being tested here? Is it being tested whether a MedDRA code appears in a SUPPQUAL domain? If so, how can that be tested anyway? By the "QNAME"? Or by information in the define.xml? No indication at all is provided.

Interesting is that it is now also indicated when the rule is also applicable to SEND, and to which version of SEND. That is of course a good thing.
Worse is that some rules present in the February 2017 version that are nonsense have not been removed. For example the rule SD0026/FDAB012:

Stating that a unit MUST be provided in --ORRESU when --ORRES is populated.


I know thousands of tests in the given domains (EG, LB, VS, …) for which there is no unit. For example, what is the unit for "NEGATIVE"? Even for tests where the result is a number, like pH, there is no unit. In the prior version of the P21 validation software, an attempt was made to add some "logic" for when a unit is necessary, but this often ended in disaster, with a lot of false positives.
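To make the problem concrete, here is a minimal Python sketch of the rule as it is literally worded. The record layout and the function name `check_sd0026_literal` are my own assumptions for illustration; this is not how any actual validator implements the rule.

```python
# Illustrative only: the rule read literally
# ("--ORRESU must be populated when --ORRES is populated").
# The record layout and function name are assumptions, not a real validator API.

def check_sd0026_literal(record, prefix="LB"):
    """Return True when the record violates the literally-read rule."""
    result = (record.get(prefix + "ORRES") or "").strip()
    unit = (record.get(prefix + "ORRESU") or "").strip()
    return bool(result) and not unit

records = [
    {"LBTESTCD": "GLUC", "LBORRES": "5.4", "LBORRESU": "mmol/L"},
    {"LBTESTCD": "PH", "LBORRES": "6.5", "LBORRESU": ""},            # pH has no unit
    {"LBTESTCD": "KETONES", "LBORRES": "NEGATIVE", "LBORRESU": ""},  # qualitative result
]

# Both unitless records are flagged, although neither is wrong.
flagged = [r["LBTESTCD"] for r in records if check_sd0026_literal(r)]
print(flagged)
```

Without additional knowledge about the test itself, the check cannot distinguish a forgotten unit from a result that simply has none.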

At least for LB and VS, whether a test has a unit or not could be retrieved from the LOINC code e.g. using a RESTful web service, as has been made available by us, the National Library of Medicine, and LOINC itself. LOINC coding is however still not a requirement in SDTM. Also, the validation software used by the FDA does not use any RESTful web services at all.

There are unfortunately a lot of such rules in the new version, where the lessons of the past have been ignored, and the "nonsense", "un-understandable" or "expectation" rule has just been copied from the prior version.

Also interesting is that the FDA worksheet now contains the CDISC rules as well, assigns a number/code to them, and even provides error messages for them, although none of this is within the competence of the FDA at all. CDISC is of course also to blame for this, as it never made a serious attempt to provide or publish a "reference implementation", i.e. an implementation of the CDISC rules to which all vendors must comply, meaning that vendors' implementations must provide exactly the same results as the reference implementation. This is e.g. how Java library implementations are developed. More about "reference implementations" in the next section.

We suppose however that adding the CDISC rules to the FDA rules (with new IDs) is a further attempt by the company that we suppose created the rules on behalf of the FDA to "monopolize" the implementation of the rules: the "application logic" in "pseudo code" as published by CDISC does not even appear in the FDA worksheet. They will simply use their own interpretation.

A recent article from Pinnacle21 also reveals some interesting information regarding the future of the "Community" version of the validation software. It states "Last 5 columns, highlight new rules in the next release of Community" and, more importantly: "If you are a Pinnacle 21 Enterprise user, all FDA Validator Rules are already available to you. For Community users, these new rules will become available in 3.0 release scheduled for January." So it looks as if there WILL be a new "Community" version. It was promised to us for November (2017), but nothing happened. I was already afraid that there would never be a new "Community" version anymore, as Pinnacle21 is pushing companies very hard to purchase the (expensive) "Enterprise" version (which I do understand - they are not a charity).

Reference implementations

When a standard is being established, it has become usual to also provide a "reference implementation". This is an implementation in software that fulfills the minimal requirements to adhere to the standard and to which all other (e.g. third party) implementations must comply, i.e. give the same results for all test cases as the reference implementation.
For example, the Java "Jersey" library for RESTful web services is the reference implementation of the standardized JAX-RS API to which all third party libraries implementing RESTful web services must comply.

Unfortunately, CDISC did not provide reference implementations for SDTM/SEND rules in the past, as it did not want to go into software development. The problem was that it did not think about formalizing "rules" from the SDTM-IG in the first place! It is the merit of OpenCDISC (later renamed to Pinnacle21 as CDISC did not want its name to be used) to have developed such rules, however as their own interpretation of the SDTM-IG. This made it possible to validate SDTM (and later also SEND and ADaM) submissions. No wonder that the FDA was highly interested in this. However, we still must take into account that what the FDA is using is an interpretation of the SDTM-IG by a single company, and not the CDISC interpretation. CDISC recognized pretty late that it had lost control of its own "rules", and finally also started publishing validation rules, first for SDTM, and later also for ADaM, containing "pseudo code" for the validation checks. Real "reference implementations" to which vendors must comply have however not been published by CDISC.

Such a "reference implementation" for CDISC/FDA/PMDA rules could be the current "Open Rules for CDISC Standards" project within CDISC. It aims to provide rules in a machine-readable (and -executable) language that is at the same time very well human-readable, so they are completely open. Such rules are NOT embedded in software, they come independent of any software, but can be used by software in a vendor-neutral way, i.e. any vendor can use these machine-readable rules in its software, or develop software that generates exactly the same results for each possible test case. These rules can be implemented using Java, C#, Python … software languages, so they are really independent of language and operating system. A few CDISC volunteers have already developed a good number of such machine-readable rules. You can find their results here.


The new version of the FDA rules for SDTM and SEND is completely along the lines that were followed in the past. No corrections were made for rules that are completely wrong (i.e. nonsense), can essentially not be implemented in software, or are expectations rather than rules. The rules have been developed and published in such a way that only one vendor (probably the company that supposedly developed the rules) can implement them in software (not vendor-neutral). Including the CDISC-owned rules in the "FDA rules" publication gives a good overview, but gives the impression that one wants to establish a monopoly on the software implementation of all the rules, either from FDA or from CDISC. The way these rules have been developed and published i.m.o. violates the requirement that it must be vendor-neutral. But we are already used to that. Also the XPT format is not really vendor-neutral (the TS-140 "specification" is very hard to implement in non-SAS software), and i.m.o. also discriminates against Spanish-speaking patients in the US (it does not support "Spanish" characters).

A good alternative can be the future results of the "Open Rules for CDISC Standards" project (though already a lot has been done), but this would require the FDA to switch to a modern transport format (like XML or JSON).

Tuesday, September 18, 2018

CDISC-CT: we can do better

The song of Matt Simmons "we can do better" is at the moment very popular here in western Europe. Every time I hear it, I must think about CDISC Controlled Terminology (CDISC-CT). I will explain why.
Traditionally, CDISC-CT is published as lists, usually in Excel format, nowadays also as CDISC ODM (XML) files. The latter was a great improvement. Even more important was that each list and each item ("term") in the list received an identifier by NCI, the "NCI code".

Version management

One of the things we can however do considerably better is version management.
CDISC-CT is published every 3 months. One of the Excel files published is "Terminology Changes". In this file, we find all changes with respect to previous versions, mostly additions, but also a number of "updates", like "update codelist name", "update NCI preferred term", and even "update CDISC definition".

These are highly problematic. Any other controlled terminology in healthcare that I know has the policy to also change the identifier when a term or its definition is changed. The old identifier is then "deprecated", and users are requested to not use it anymore in new applications. For example, the newest LOINC version (2.64) contains 2920 deprecated terms (for historical reasons), and comes with a (database) table with the mapping between the old and the new code (the "mapto" table). The latest CDISC-CT (2018-06-29) lists 185 "updates", of which 44 are definition updates, and 1 is a submission value update. An example of such a definition update is the submission term "DXA SCAN" (NCI code C48789) in the "METHOD" codelist:

As both the "submission value" and the "NCI code" remain the same, this is highly problematic, because the meaning of the value or code then depends on the version of the codelist. In the earlier versions (before 2018-06-29) the code/term means ANY technique for scanning bone density, whereas in the new version it is redefined as a technique in which X-rays at different energy levels are applied to ANY anatomical location. That this change was necessary is probably due to a problem with quality management, which might be a consequence of the very short publication cycles (every quarter) of CDISC-CT. In this case, I presume that the wording "bone" in the original definition was the problem as it essentially belongs to "SPEC" (specimen), not to "method".

So, we now have the problem that the same submission value with the same NCI code is used for i.m.o. two completely different things. Fortunately, the Define-XML team recognized this problem and will, as of version 2.1, make it possible to indicate which CDISC-CT version was used for each individual variable/domain. Will reviewers use this information to distinguish between "measuring bone density" and "measuring density by X-Rays of ANY anatomical location"? They probably won't.
It would have been much, much better to deprecate "DXA SCAN" and "C48789" and generate a new term and code for the X-Ray case of any anatomical location.
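How such a deprecation could be consumed by software is sketched below in Python, in the spirit of a "map to" table. This is only an illustration: the table layout and the function `resolve_code` are my own assumptions, and "C99999X" is a made-up placeholder, not a real NCI code.

```python
# Hypothetical deprecation table, in the spirit of a "map to" table.
# "C99999X" is a made-up placeholder, NOT a real NCI code, and the
# deprecation of C48789 is the proposal from the text, not a published fact.
DEPRECATED = {
    "C48789": {"replaced_by": "C99999X", "deprecated_in": "2018-06-29"},
}

def resolve_code(code, table=DEPRECATED):
    """Follow replacement links until a non-deprecated code is reached."""
    seen = set()
    while code in table:
        if code in seen:
            raise ValueError("circular deprecation chain at " + code)
        seen.add(code)
        code = table[code]["replaced_by"]
    return code

print(resolve_code("C48789"))   # the placeholder replacement code
print(resolve_code("C00000X"))  # not in the table, so returned unchanged
```

With such a table, the meaning of a code no longer depends on the codelist version: an old code always keeps its old definition, and software can look up its successor.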

Post-coordinated versus Pre-coordinated

CDISC-CT is "post-coordinated", meaning that concepts are kept broad and separate, and are selected and joined in the process of searching. The history (within CDISC) behind this is that at sites, either no coding was used (terms were just written on paper), or each site had different coding systems, e.g. for lab terms. With post-coordination, one can then still categorize and map such captured information to a simple term and then combine it in a useful way with other post-coordinated terms to come to meaningful data entry.

It is only in the last 10 years that healthcare has started using pre-coordinated systems, such as LOINC (for tests), SNOMED-CT, ATC (for active ingredients), and even ICD-10 (for diagnoses). Pre-coordination is very useful, also in clinical research, as it e.g. makes it possible to define exactly which tests need to be performed already in the protocol. This is still very rarely done however. In protocols, we still often read e.g. "measure glucose in urine", leading (at submission time) to a multitude of different glucose tests (qualitative, quantitative, ordinal), with different test methods and having different units, making results often incomparable, even between sites.

If possible, providing the pre-coordinated code for a test or property is always better. It avoids the interpretation and translation / categorization step. Even when the pre-coordinated code is known, CDISC standards still require us to provide the post-coordinated terms too. If a laboratory apparatus provides us the LOINC code, why should we then still try to translate that into a LBTESTCD, LBTEST, LBSPEC, LBMETHOD? This is error prone. So, when LBLOINC can be provided (without derivation), it should be allowed (or even mandated) to discard variable values for LBTESTCD, LBTEST, LBSPEC, LBMETHOD. Only LBLOINC is then identifying. However, there is still a lot of resistance against LOINC within the CDISC-CT and SDTM teams: it was the FDA who recently mandated the use of LOINC coding for lab tests (when the LOINC code is available), not CDISC.

Semantic versus pragmatic CT

Another problem comes with lists being incomplete or overcomplete for practical purposes. A typical example is "OTHER" and "MULTIPLE". In the "RACE" codelist (C74457), they are absent, although they are valid values for "RACE" in the "Demographics" SDTM domain. The (in itself correct) argument of the CDISC-CT team has been that "other" and "multiple" are not races. But how does a computer program know this? If it compares a CDISC submission against the published codelist, it will find neither "OTHER" nor "MULTIPLE" and thus needs to throw an error. In order to know that "OTHER" and "MULTIPLE" are allowed, a HUMAN must manually dig into the SDTM-IG and might ultimately find the section "Assumptions for the Demographics Domain Model" (p. 65 in SDTM-IG 3.2) and there then ultimately find the following paragraph:

As the SDTM-IG is still not machine-readable, this is highly problematic. Software implementors of the SDTM-IG must over and over again try to find out this kind of information by digging into the text, and then program this into their software, which is tedious, error prone and often open to different interpretations.
Even then, CDISC-CT is not always consistent in this matter. For example, in the codelist "Acute Coronary Syndrome Presentation Category" (ACSPCAT, C101865) we find a submission value "OTHER":


"Other" is semantically surely NOT a category of an acute coronary syndrome presentation. It is just the text of a checkbox in the CRF.
A good number of CDISC codelists contain the term "UNKNOWN". Examples are the "Action Taken with study treatment" codelist (ACN, C66767), the "Cardiac Procedure Indication" codelist (CVPRCIND, C101859), the "Coronary Vessel Disease Extent" codelist (CVSLDEXT, C17998), and even the "Ethnic group" codelist (ETHNIC, C66790).
"Unknown" is surely semantically NOT an ethnic group, it is again the text on a checkbox in a CRF.

So, depending on the codelist (or CT subteam?), it looks as if different criteria are used to determine whether a term may belong to that list or not. It thus seems that there is a lack of a systematic approach in the development and maintenance of the CDISC codelists. This probably is due to the history of CDISC-CT: in some cases, the lists were developed by collecting every possible value that was found as a checkbox on CRFs, others were developed using a semantic approach.

How do other controlled terminologies handle this? ICD-10 for example follows a pragmatic approach: almost every group/class in ICD-10 also has a term "other XYZ". For example, for "cerebrovascular diseases", we find both "other nontraumatic intracranial hemorrhage" (I62) and "other cerebral infarction" (I63.8). So, ICD-10 does not seem to have any problem with "other" not being a semantically valid disease.

One can have discussions for many many hours about principles of developing controlled terminology, but there is an easy practical solution for this: mark those terms that are semantically belonging to the parent term (i.e. the codelist), and mark those separately that are (additionally) allowed values, and DO this in a machine-readable form, so that implementors do not need to start digging into a non-machine-readable SDTM-IG.
This might be something like: 

Or any other construct that indicates that "MULTIPLE" and "OTHER" are not real races but are also allowed submission values in the context of the SDTM-IG. One could of course further extend this with information about the version of the SDTM-IG, the applicable domain/variable, any applicable rules (preferably machine-readable) etc., but this is beyond the context of the current blog entry.
For the "LBTESTCD" codelist this could look like:

Indicating that "LBALL" is also an allowed value as a test code in the case of an SDTM submission.
Please don't shoot me over how exactly I formatted this. This is just one possibility of many. It is just about the principle of finding a compromise between keeping a controlled terminology list semantically pure (if desired) and still making it practical for its use in SDTM (or any other implementation) in a machine-readable way.
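To show how software could use such a distinction, here is a minimal Python sketch. It is not the construct shown above, just one possible in-memory form: the class `Codelist`, its field names and the context label "SDTM:DM.RACE" are assumptions for illustration, and the list of races is abbreviated.

```python
from dataclasses import dataclass, field

@dataclass
class Codelist:
    """Hypothetical structure separating 'semantic' terms from extra allowed values."""
    name: str
    nci_code: str
    semantic_terms: set                                       # terms that truly belong to the concept
    additionally_allowed: dict = field(default_factory=dict)  # value -> set of contexts

    def is_allowed(self, value, context=None):
        """A value is valid if it is a semantic term, or allowed in the given context."""
        if value in self.semantic_terms:
            return True
        return context in self.additionally_allowed.get(value, set())

race = Codelist(
    name="RACE",
    nci_code="C74457",
    semantic_terms={"ASIAN", "WHITE", "BLACK OR AFRICAN AMERICAN"},  # abbreviated
    additionally_allowed={
        "MULTIPLE": {"SDTM:DM.RACE"},  # context label is an assumption
        "OTHER": {"SDTM:DM.RACE"},
    },
)

print(race.is_allowed("OTHER"))                  # False: not a race as such
print(race.is_allowed("OTHER", "SDTM:DM.RACE"))  # True: allowed for the DM variable
```

The codelist itself stays semantically pure, while a validator can still decide, without a human reading the SDTM-IG, that "OTHER" is acceptable for the RACE variable.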

In the context of practicability, I had a customer last week asking me whether "U" ("Unknown") is an allowed value for OCCUR variables. In the SDTM-IG, the "NY" codelist is indicated, e.g. for MHOCCUR:

referencing the "NY" codelist which, upon (manual) inspection, contains the values "Y" ("yes"), "N" ("no"), "NA" ("not applicable") and "U" ("Unknown"). So, this would mean: "Yes, U is allowed for MHOCCUR". However, manually digging into the non-machine-readable IG (p.56) also reveals:

Which would mean "No, U is not allowed for MHOCCUR".
However, we are not done yet! Further digging also reveals (on p. 40):

Essentially stating that "U" is allowed for MHOCCUR if there was a checkbox for it on the CRF.
Remember that all this regulation comes in a non-machine-readable way, making it extremely difficult to implement this in software, not to speak about differences in interpretation.
Also, these kinds of rules could be embedded in a machine-readable construct as proposed above.
The same applies to the SDTM BLFL ("baseline") variables where only "Y" is allowed, even though the SDTM-IG references the "NY" codelist. In this matter, it is remarkable that the CDISC-CT team has always refused to publish a "Yes Only" codelist (though requested several times).

Lists and relations

The main problem of the current CDISC-CT however is that it is just a set of lists.
This is probably again related to the fact that some of the CDISC-CT was originally developed for representing checkboxes on CRFs. Even now however, we still stick to lists with little or no relationships between terms. We are thus missing enormous opportunities - let me explain with a few examples.
The codelist "VSTESTCD" (vital signs test code) contains over 40 codes, from "ABSKNF" (Abdominal Skinfold Thickness) to "WSTCIR" (Waist Circumference): 

Let me remark here that, at each new version, I regenerate the "TESTCD" codelists to also contain the "TEST" value (as a "decode"), as unfortunately the CDISC-CT stopped doing this for one reason or another. In doing so, the relation between the test code and test name becomes directly visible (which I think is much more friendly for software that uses it), instead of needing to do additional joins (using the NCI code) within the software itself.
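The join described here can be sketched in a few lines of Python. The dictionaries below are a simplified stand-in for the published codelists, the function name `add_decodes` is my own, and the NCI codes "C0001" and "C0002" are made-up placeholders.

```python
# Simplified stand-ins for the VSTESTCD and VSTEST codelists.
# "C0001"/"C0002" are placeholder codes for illustration, not real NCI codes.
vstestcd = [
    {"nci_code": "C0001", "submission_value": "SYSBP"},
    {"nci_code": "C0002", "submission_value": "DIABP"},
]
vstest = [
    {"nci_code": "C0002", "submission_value": "Diastolic Blood Pressure"},
    {"nci_code": "C0001", "submission_value": "Systolic Blood Pressure"},
]

def add_decodes(testcd_terms, test_terms):
    """Attach the test name as a decode to each test code, joined on the NCI code."""
    names = {t["nci_code"]: t["submission_value"] for t in test_terms}
    return [{**t, "decode": names.get(t["nci_code"])} for t in testcd_terms]

for term in add_decodes(vstestcd, vstest):
    print(term["submission_value"], "=", term["decode"])
```

Doing this once at publication time would spare every implementor from repeating the same join.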
So, the VSTESTCD codelist as published by the CDISC-CT team is just a list. Of course it also contains synonyms, definitions, …:

But it does not state anything about (semantic) relationships between the terms within the list itself, nor with terms in any other lists.
For example, it does not state that "systolic blood pressure" and "diastolic blood pressure" are highly related (both are cardiovascular properties), but that there is little or no relationship between "body length" and "systolic blood pressure" except that both are considered as "vital signs". Similarly, the CDISC-CT does not state (at least not in a machine-readable way) that "BMI" and the combination of "body weight" and "body height" are highly related. It would even be easily possible to express this relationship (a formula) within the CDISC-CT publications in a machine-readable way.
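As a rough illustration of what such a machine-readable relationship could look like, here is a Python sketch. The `RELATIONS` structure and the `derive` function are assumptions of mine; a real publication would need a declarative, language-neutral expression of the formula rather than a Python lambda.

```python
# Hypothetical machine-readable relation between vital signs test codes.
# BMI = weight [kg] / (height [m])^2. The structure itself is an assumption;
# a published form would use a declarative expression, not a Python lambda.
RELATIONS = {
    "BMI": {
        "derived_from": ["WEIGHT", "HEIGHT"],
        "units": {"WEIGHT": "kg", "HEIGHT": "m", "BMI": "kg/m2"},
        "formula": lambda v: v["WEIGHT"] / v["HEIGHT"] ** 2,
    },
}

def derive(testcd, values):
    """Compute a derived test from its related tests, if all inputs are present."""
    relation = RELATIONS[testcd]
    missing = [code for code in relation["derived_from"] if code not in values]
    if missing:
        raise KeyError("missing input(s): " + ", ".join(missing))
    return relation["formula"](values)

bmi = derive("BMI", {"WEIGHT": 70.0, "HEIGHT": 1.75})
print(round(bmi, 1))
```

With the relation available, software could also run the check in the other direction, e.g. verify that a submitted BMI value is consistent with the submitted weight and height.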

Another example is "ALBUMIN" in the "LBTESTCD" (Laboratory Test Code) codelist. Note that "Albumin" is essentially not a lab test, it is the analyte that is measured in a lab test. So TESTCD has different meanings in SDTM, depending on the domain: in VS it is the property measured, in LB it is the analyte that is measured. But that is essentially an SDTM problem, not a CDISC-CT problem.
Why is albumin tested in healthcare and in clinical research? What is it related to? CDISC-CT does not tell us at all. CDISC-CT does not even tell us that the "albumin test" is usually part of the "liver test panel". Another (and better) coding system for lab tests (precoordinated however), LOINC, does at least provide us this information:

And even providing the information which other lab tests belong to the same panel, e.g. telling us that a "bilirubin" test is highly related to our "albumin" test. CDISC-CT does not provide us such information, partially because "LBTESTCD" is post-coordinated and essentially is about analytes, not about individual tests (as LOINC does).
Also note that LOINC provides us the "usually used" corresponding units in UCUM format, a format that is still mostly ignored by CDISC-CT, and still not accepted by SDTM, although it is THE format used by electronic health record systems worldwide (CDA, FHIR, …). This is a huge problem in an era in which much of our data is coming from EHRs or hospital information systems anyway. SDTM should at least accept UCUM format, besides existing CDISC-CT for units. The further development of the latter should be stopped; it does not make sense anymore.

Fortunately, we now have UMLS (Unified Medical Language System). UMLS is trying to cover most of the coding and controlled terminology systems used in medicine, and to provide relations between them. Each term or code in a healthcare coding system like LOINC, SNOMED-CT and also CDISC-CT also has a code in UMLS. This makes it possible to generate "networks of knowledge", e.g. connecting the CDISC-CT term "albumin" with the LOINC code 1951-7 (Albumin [Mass/volume] in Serum/Plasma) with the "hepatic function panel", with the organ "liver" (SNOMED-CT) and with the disease "hepatitis" (SNOMED-CT, ICD-10), etc.
UMLS even makes RESTful web services available allowing to develop applications to build such "knowledge networks". One of my students is currently developing an application (for his master project) to interactively generate and display such "knowledge networks". I will later report on this in a separate blog.
However, where possible and useful, we should not leave it to UMLS to do the work on relations between terms for us. 

Public reviews

Another problem is that the CDISC-CT team keeps publishing drafts for public review only as Excel files. Excel is not a vendor-neutral format, nor is it an open standard. This makes it difficult to do QA and impact analysis on changes relative to earlier versions. If the CDISC-CT team does not do such an analysis itself, it should at least be made easy for the public to do so during the review period. In my personal opinion, it would also be much better to have a longer cycle (e.g. once a year) and have a longer public review period so that a broad discussion is possible prior to final publication.

What we can do better

Here are a few proposals (my opinion of course) how we can do better in the development of CDISC-CT:

  • Start better version management. For example, we could agree that the January 2019 version is used as the "base version" for the future. If something for a controlled term is changed later (such as the definition), deprecate that term and NCI code, and provide the changed term with a new "submission value" as well as a new NCI code. 
  • Improve quality management. Many of the "definition changes" are probably due to a somewhat overhasty inclusion of new requested terms. Also, there does not seem to be any "impact analysis" done on changes. Some bright people have developed methods and tools for this. These can and should also be used by the CDISC-CT team.
  • Stop further development of codelists for which there are better alternatives, but still allow their terms to be used in submissions. The "UNIT" codelist is the most notorious of them. Further development of the LBTESTCD codelist does not make sense in my opinion either. And why do we need our own lists of microorganisms when real specialists have already developed complete ontologies?
  • Allow alternative codelists that do the job better. For example, for LBTESTCD (which is not representing a test but an analyte), we may as well use the "component" part of the LOINC code. This would however require moving away from the 8-character limitation of TESTCD variables. The users could then choose between using the "old" CDISC-CT codelist and using the "component" from LOINC for LBTESTCD.
  • If pre-coordinated codelists can be used, do so. For example, for lab tests (but the same applies to vital signs), if the LOINC code is available, e.g. from an EHR, use it (LBLOINC, VSLOINC) and in such a case, do NOT (manually?) populate TESTCD, TEST, SPEC, METHOD, as this is redundant information and usually becomes error prone.
  • Be consistent and pragmatic. The example of "OTHER" being excluded or included in different codelists clearly demonstrates that the development principles differ between codelists (and CT subteams?). This is due to the history of CDISC-CT. Take a step back and rethink the development principles, which then need to be followed for each codelist. Best (i.m.o.) is to use a pragmatic approach in which terms that "natively" semantically belong to the codelist are marked separately from terms that are allowed under certain conditions (such as "MULTIPLE", "OTHER", "UNKNOWN"). Extremely important is that this is done in a machine-readable way. One could even express such conditions in a machine-readable way within the codelist itself. And no, this is NOT the responsibility of other teams (such as the SDTM team) alone, it is ALSO the responsibility of the CDISC-CT team.
  • Where possible and useful, provide relations between terms. For example, in "vital signs", there is a clear relationship, even described by a formula, between "BMI" and "height" and "weight". This formula can even be incorporated in the CT itself. Common "parent" terms can also be added, e.g. providing the relation between "systolic blood pressure" and "diastolic blood pressure". We should not leave it to UMLS to define relationships between our terms. 
  • In order to improve the quality of CDISC-CT, make the release cycles longer again. Twice a year should be more than sufficient. Do not include new requested terms that were not well quality-reviewed and for which no impact analysis was performed. Publish public review packages in a modern, vendor-neutral format, so that future implementors can do an impact analysis on their systems. Also provide more time for such reviews. If a new term is heavily discussed, delay its publication until consensus (also with the broader community) is reached, or do not add it at all.