Traditionally, CDISC-CT is published as lists, usually in
Excel format, nowadays also as CDISC ODM (XML) files. The latter was a great
improvement. Even more important was that each list and each item
("term") in the list received an identifier from NCI, the "NCI
code".
Version management
One thing we can do considerably better, however, is
version management.
CDISC-CT is published every 3 months. One of the Excel files
published is "Terminology Changes". In this file, we find all changes
with respect to previous versions, mostly additions, but also a number of
"updates", like "update codelist name", "update NCI
preferred term", and even "update CDISC definition".
These are highly problematic. Every other controlled
terminology in healthcare that I know of has the policy of also changing the
identifier when a term or its definition is changed. The old identifier is then
"deprecated", and users are requested not to use it anymore in new
applications. For example, the newest
LOINC version (2.64) contains 2920
deprecated terms (for historical reasons), and comes with a (database) table
with the mapping between the old and the new code (the "mapto"
table). The
latest CDISC-CT (2018-06-29) lists 185 "updates", of
which 44 are definition updates, and 1 is a submission value update. An example
of such a definition update is the submission term "DXA SCAN" (NCI
code C48789) in the "METHOD" codelist:
As both the "submission value" and the "NCI code" remain the same, this is
highly problematic, because the meaning of the value or code then depends on
the version of the codelist. In the earlier versions (before 2018-06-29), the
code/term means ANY technique for scanning bone density, whereas in the new
version it is redefined as a technique in which X-rays at different energy
levels are applied to ANY anatomical location. That this change was necessary
is probably due to a problem with quality management, which might be a
consequence of the very short publication cycles (every quarter) of CDISC-CT.
In this case, I presume that the word "bone" in the original definition was
the problem, as it essentially belongs to "--SPEC" (specimen), not to
"method".
So, we now have the problem that the same submission value
with the same NCI code is used for what are, in my opinion, two completely
different things. Fortunately, the Define-XML team recognized this problem and
will, as of version 2.1, allow indicating which CDISC-CT version was used for
each individual variable/domain. Will reviewers use this information to
distinguish between "measuring bone density" and "measuring density by X-rays
of ANY anatomical location"? They probably won't.
It would have been much better to deprecate "DXA SCAN" and
"C48789" and generate a new term and code for the X-ray case of any
anatomical location.
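Just to illustrate the principle, here is a minimal Python sketch of how software could handle such a deprecation, assuming a "map-to" table similar in spirit to the one LOINC provides. Please note that the replacement code "C99999" and its submission value are invented placeholders (no such NCI code or CDISC term is implied), and that the definitions are only paraphrased from the discussion above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Term:
    code: str                # NCI code
    submission_value: str
    definition: str
    deprecated: bool = False


# Illustrative content only: "C99999" and its submission value are
# hypothetical placeholders; definitions paraphrase the text above.
TERMS = {
    "C48789": Term("C48789", "DXA SCAN",
                   "Any technique for scanning bone density.",
                   deprecated=True),
    "C99999": Term("C99999", "DXA SCAN ANY LOCATION",
                   "X-rays at different energy levels, any anatomical location."),
}

# "Map-to" table: deprecated code -> replacement code.
MAP_TO = {"C48789": "C99999"}


def resolve(code: str) -> Optional[Term]:
    """Follow the map-to table until a non-deprecated term is reached."""
    seen = set()  # guards against accidental cycles in the mapping
    while code in TERMS and code not in seen:
        term = TERMS[code]
        if not term.deprecated:
            return term
        seen.add(code)
        code = MAP_TO.get(code, "")
    return None  # unknown code, or deprecated without a replacement
```

With such a scheme, the old code keeps its old meaning forever, and a reviewer's software can still find the current replacement automatically.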
Post-coordinated versus Pre-coordinated
CDISC-CT is "post-coordinated", meaning that
concepts are kept broad and separate, and are selected and joined in the
process of searching. The history (within CDISC) behind this is that at sites,
either no coding was used (terms were just written on paper), or each site had
a different coding system, e.g. for lab terms. With post-coordination, one can
then still categorize and map such captured information to a simple term and
then combine it in a useful way with other post-coordinated terms to arrive at
a meaningful data entry.
It is only in the last 10 years that healthcare has started
using pre-coordinated systems, such as LOINC (for tests), SNOMED-CT, ATC (for
active ingredients), and even ICD-10 (for diagnoses). Pre-coordination is very
useful, also in clinical research, as it allows, for example, defining exactly
which tests need to be performed already in the protocol. This is still very
rarely done, however. In protocols, we still often read e.g. "measure glucose
in urine", leading (at submission time) to a multitude of different glucose
tests (qualitative, quantitative, ordinal), with different test methods and
different units, often making results incomparable, even between sites.
If possible, providing the pre-coordinated code for a test or property is
always better. It avoids the interpretation and translation / categorization
step. Yet even when the pre-coordinated code is known, CDISC standards still
require us to provide the post-coordinated terms too. If a laboratory apparatus
provides us with the LOINC code, why should we then still try to translate that
into an LBTESTCD, LBTEST, LBSPEC, LBMETHOD? This is error prone. So, when
LBLOINC can be provided (without derivation), it should be allowed (or even
mandated) to leave out the values for LBTESTCD, LBTEST, LBSPEC, LBMETHOD. Only
LBLOINC is then identifying. However, there is still a lot of resistance to
LOINC within the CDISC-CT and SDTM teams: it was the FDA who recently mandated
the use of LOINC coding for lab tests (when the LOINC code is available), not
CDISC.
Semantic versus pragmatic CT
Another problem comes with lists being incomplete or
overcomplete for practical purposes. A typical example is "OTHER" and
"MULTIPLE". In the "RACE" codelist (C74457), they are
absent, although they are valid values for "RACE" in the
"Demographics" SDTM domain. The (otherwise correct) argument of the
CDISC-CT team has been that "other" and "multiple" are not
races. But how does a computer program know this? If it compares a CDISC
submission against the published codelist, it will find neither "OTHER"
nor "MULTIPLE" and thus needs to throw an error. In order to know
that "OTHER" and "MULTIPLE" are allowed, a HUMAN must
manually dig into the SDTM-IG and might ultimately find the section
"Assumptions for the Demographics Domain Model" (p. 65 in SDTM-IG
3.2), and there find the following paragraph:

As the SDTM-IG is still not machine-readable, this is highly
problematic. Software implementors of the SDTM-IG must over and over again try
to find this kind of information by digging into the text, and then program
it into their software, which is tedious, error prone and often open to
different interpretations.
Even then, CDISC-CT is not always consistent in this matter.
For example, in the codelist "Acute Coronary Syndrome Presentation
Category" (ACSPCAT, C101865) we find a submission value "OTHER":
"Other" is semantically surely NOT a category of
an acute coronary syndrome presentation. It is just the text of a checkbox in
the CRF.
A good number of CDISC codelists contain the term
"UNKNOWN". Examples are the "Action Taken with study
treatment" codelist (ACN, C66767), the "Cardiac Procedure Indication"
codelist (CVPRCIND, C101859), the "Coronary Vessel Disease Extent"
codelist (CVSLDEXT, C17998), and even the "Ethnic group" codelist
(ETHNIC, C66790).
"Unknown" is surely semantically NOT an ethnic group, it is again the
text on a checkbox in a CRF.
So, depending on the codelist (or CT subteam?), it looks as if
different criteria are used to determine whether a term may belong to that list
or not. It thus seems that a systematic approach is lacking in the
development and maintenance of the CDISC codelists. This is probably due to the
history of CDISC-CT: in some cases, the lists were developed by collecting every
possible value that was found as a checkbox on CRFs, whereas others were
developed using a semantic approach.
How do other controlled terminologies handle this? ICD-10
for example follows a pragmatic approach: almost every group/class in ICD-10
also has a term "other XYZ". For example, for "cerebrovascular
diseases", we find both "other nontraumatic intracranial hemorrhage"
(I62) and "other cerebral infarction" (I63.8). So, ICD-10 does
not seem to have any problem with "other" not being a semantically
valid disease.
One can discuss the principles of developing controlled terminology
for many hours, but there is an easy practical
solution for this: mark those terms that semantically belong to the
parent term (i.e. the codelist), and mark separately those that are
(additionally) allowed values, and DO this in a machine-readable form, so that
implementors do not need to start digging into a non-machine-readable SDTM-IG.
This might be something like:
Or any other construct that indicates that
"MULTIPLE" and "OTHER" are not real races but are also
allowed submission values in the context of the SDTM-IG. One could of course
further extend this with information about the version of the SDTM-IG, the
applicable domain/variable, any applicable rules (preferably machine-readable),
etc., but this is beyond the scope of the current blog entry.
For the "LBTESTCD" codelist this could look like:
Indicating that "LBALL" is also an allowed value
as a test code in the case of an SDTM submission.
Please don't shoot me over exactly how I formatted this.
This is just one possibility of many. It is just about the principle of finding
a compromise between keeping a controlled terminology list semantically pure
(if desired) and still making it practical for its use in SDTM (or any other
implementation) in a machine-readable way.
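On the software side, consuming such a marked-up codelist becomes almost trivial. The following Python sketch is only one possibility, under my own assumptions: the class and attribute names, the context string "SDTM-IG:DM.RACE", and the abbreviated list of races are all invented for this illustration and are not taken from any CDISC publication.

```python
from dataclasses import dataclass, field


@dataclass
class Codelist:
    name: str
    # Terms that semantically belong to the codelist.
    semantic_terms: set
    # Additionally allowed values: value -> context in which it is allowed.
    also_allowed: dict = field(default_factory=dict)

    def is_valid(self, value: str, context: str = "") -> bool:
        """A value is valid if it is a semantic member of the list, or if
        it is explicitly allowed in the given implementation context."""
        if value in self.semantic_terms:
            return True
        return self.also_allowed.get(value) == context


# Abbreviated, illustrative subset of the RACE codelist (C74457).
RACE = Codelist(
    name="RACE",
    semantic_terms={"ASIAN", "BLACK OR AFRICAN AMERICAN", "WHITE"},
    also_allowed={
        "OTHER": "SDTM-IG:DM.RACE",      # hypothetical context identifier
        "MULTIPLE": "SDTM-IG:DM.RACE",
    },
)
```

A validator can then call `RACE.is_valid("OTHER", context="SDTM-IG:DM.RACE")` without anyone having to read the "Assumptions" section of the IG, while the list of "real" races stays semantically pure.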
On the subject of practicality, a customer asked me last week
whether "U" ("Unknown") is an allowed value for --OCCUR variables. In the SDTM-IG, the
"NY" codelist is indicated, e.g. for MHOCCUR:
referencing the "NY" codelist which, upon (manual)
inspection, contains the values "Y" ("yes"), "N"
("no"), "NA" ("not applicable") and "U"
("Unknown"). So, this would mean: "Yes, U is allowed for
MHOCCUR". However, manually digging into the non-machine-readable IG
(p.56) also reveals:
Which would mean "No, U is not allowed for
MHOCCUR".
However, we are not done yet! Further digging also reveals
(on p. 40):
Essentially stating that "U" is allowed for
MHOCCUR if there was a checkbox for it on the CRF.
Remember that all these rules come in a non-machine-readable
form, making it extremely difficult to implement them in software, not to
mention the differences in interpretation.
Rules of this kind could also be embedded in a machine-readable construct as
proposed above.
The same applies to the SDTM --BLFL
("baseline") variables, where only "Y" is allowed, even
though the SDTM-IG references the "NY" codelist. In this matter, it
is remarkable that the CDISC-CT team has always refused to publish a "Yes
Only" codelist (though this was requested several times).
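To show that such variable-level restrictions are easy to express in a machine-readable way, here is a small Python sketch. The structure of the rules table, the condition name "collected_on_crf" and the choice of LBBLFL as the example baseline flag are my own assumptions for this illustration; the MHOCCUR rule simply encodes my reading of the IG passages discussed above.

```python
# Values of the "NY" codelist as listed above.
NY = {"Y", "N", "NA", "U"}

# Hypothetical variable-level rules: the subset of the codelist that is
# always allowed, plus values that are only allowed under a condition.
RULES = {
    "LBBLFL": {"allowed": {"Y"}, "conditional": {}},
    "MHOCCUR": {
        "allowed": {"Y", "N"},
        "conditional": {"U": "collected_on_crf"},
    },
}


def is_allowed(variable, value, facts=frozenset()):
    """Check a value against the codelist and the variable-level rule.

    `facts` is the set of conditions known to hold for the study,
    e.g. {"collected_on_crf"}.
    """
    rule = RULES[variable]
    if value not in NY:
        return False
    if value in rule["allowed"]:
        return True
    condition = rule["conditional"].get(value)
    return condition is not None and condition in facts
```

The customer's question then has a computable answer: `is_allowed("MHOCCUR", "U")` is false, but `is_allowed("MHOCCUR", "U", {"collected_on_crf"})` is true.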
Lists and relations
The main problem of the current CDISC-CT, however, is that it is
just a set of … lists.
This is probably again related to the fact that some of the CDISC-CT was
originally developed for representing checkboxes on CRFs. Even now, however,
we still stick to lists with few or no relationships between terms. We are
thus missing enormous opportunities - let me explain with a few examples.
The codelist "VSTESTCD" (vital signs test code)
contains over 40 codes, from "ABSKNF" (Abdominal Skinfold Thickness)
to "WSTCIR" (Waist Circumference):
Let me remark here that, at each new version, I regenerate
the "--TESTCD" codelists to also contain the "--TEST"
value (as a "decode"), as unfortunately the CDISC-CT team stopped doing
this for one reason or another. In doing so, the relation between the
test code and test name becomes directly visible (which I think is much
friendlier for software that uses it), instead of requiring additional joins
(using the NCI code) within the software itself.
So, the VSTESTCD codelist as published by the CDISC-CT team
is just a list. Of course it also contains synonyms, definitions, …:
But it does not state anything about (semantic)
relationships between the terms within the list itself, nor with terms in any
other lists.
For example, it does not state that "systolic blood pressure" and
"diastolic blood pressure" are highly related (both are
cardiovascular properties), but that there is little or no relationship between
"body length" and "systolic blood pressure" except that both
are considered as "vital signs". Similarly, the CDISC-CT does not
state (at least not in a machine-readable way) that "BMI" and the
combination of "body weight" and "body height" are highly
related. It would even be easily possible to express this relationship (a
formula) within the CDISC-CT publications in a machine-readable way.
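As an illustration of what such machine-readable relations could look like once loaded into software, here is a small Python sketch. The test codes are existing VSTESTCD values, but the structure, the parent term "BLOOD PRESSURE" and the assumed units (weight in kg, height in m) are my own assumptions and not part of any CDISC-CT publication.

```python
# Hypothetical relations between vital signs test codes.
RELATIONS = {
    "BMI": {
        "derived_from": ("WEIGHT", "HEIGHT"),
        # Assumed units: weight in kg, height in m, BMI in kg/m2.
        "formula": lambda weight, height: weight / height ** 2,
    },
    "SYSBP": {"parent": "BLOOD PRESSURE"},
    "DIABP": {"parent": "BLOOD PRESSURE"},
}


def derive(testcd, values):
    """Compute a derived test from its source values, if a formula is known."""
    relation = RELATIONS.get(testcd, {})
    if "formula" not in relation:
        raise KeyError(f"no formula known for {testcd}")
    args = [values[source] for source in relation["derived_from"]]
    return relation["formula"](*args)


def related(a, b):
    """Two tests are related if they share a parent term or if one of
    them is derived from the other."""
    ra, rb = RELATIONS.get(a, {}), RELATIONS.get(b, {})
    same_parent = "parent" in ra and ra.get("parent") == rb.get("parent")
    derived = b in ra.get("derived_from", ()) or a in rb.get("derived_from", ())
    return same_parent or derived
```

With this, `related("SYSBP", "DIABP")` is true, `related("HEIGHT", "SYSBP")` is false, and `derive("BMI", {"WEIGHT": 70.0, "HEIGHT": 1.75})` can be used to cross-check a submitted BMI value.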
Another example is "ALBUMIN" in the
"LBTESTCD" (Laboratory Test Code) codelist. Note that
"Albumin" is essentially not a lab test; it is the analyte that is
measured in a lab test. So --TESTCD has different meanings in SDTM, depending
on the domain: in VS it is the property measured, in LB it is the
analyte that is measured. But that is essentially an SDTM problem, not a
CDISC-CT problem.
Why is albumin tested in healthcare and in clinical research? What is it
related to? CDISC-CT does not tell us at all. CDISC-CT does not even tell us
that the "albumin test" is usually part of the "liver test
panel". Another (and better) coding system for lab tests (pre-coordinated,
however), LOINC, does at least provide us with this information:

It even provides the information about which other lab
tests belong to the same panel, e.g. telling us that a "bilirubin"
test is highly related to our "albumin" test. CDISC-CT does not
provide us with such information, partially because "LBTESTCD" is
post-coordinated and essentially is about analytes, not about individual tests
(as LOINC is).
Also note that LOINC provides us with the "usually used"
corresponding units in UCUM format, a format that is still mostly ignored by
CDISC-CT, and still not accepted by SDTM, although it is THE format used by
electronic health record systems worldwide (CDA, FHIR, …). This is a huge
problem in an era in which much of our data is coming from EHRs or hospital
information systems anyway. SDTM should at least accept UCUM format, besides
the existing CDISC-CT for units. The further development of the latter should
be stopped - it does not make sense anymore.
Fortunately, we now have UMLS (Unified Medical Language
System). UMLS is trying to cover most of the coding and controlled terminology
systems used in medicine, and to provide relations between them. Each term or
code in a healthcare coding system like LOINC, SNOMED-CT and also CDISC-CT
also has a code in UMLS. This makes it possible to generate "networks of
knowledge", e.g. connecting the CDISC-CT term "albumin" with the
LOINC code 1951-7 (Albumin [Mass/volume] in Serum/Plasma), with the
"hepatic function panel", with the organ "liver"
(SNOMED-CT) and with the disease "hepatitis" (SNOMED-CT, ICD-10),
etc.
UMLS even makes RESTful web services available that allow developing applications to build such "knowledge networks".
One of my students is currently developing an application (for his master's
project) to interactively generate and display such "knowledge
networks". I will report on this later in a separate blog entry.
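To give an idea of what such a "knowledge network" looks like inside software, here is a minimal Python sketch that hard-codes the albumin example from above as a small graph and searches it. It deliberately does not call the UMLS web services; the node labels and the data structure are my own assumptions for this illustration, and a real application would retrieve the relations from UMLS instead.

```python
from collections import deque

# Relations taken from the albumin example above; node labels are
# informal and only meant for illustration.
EDGES = [
    ("CDISC-CT:albumin", "LOINC:1951-7"),
    ("LOINC:1951-7", "panel:hepatic function panel"),
    ("test:bilirubin", "panel:hepatic function panel"),
    ("panel:hepatic function panel", "organ:liver"),
    ("organ:liver", "disease:hepatitis"),
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of node pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes or None."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(graph[node]):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


GRAPH = build_graph(EDGES)
```

Asking for `shortest_path(GRAPH, "CDISC-CT:albumin", "disease:hepatitis")` then walks from the CDISC-CT term via the LOINC code, the panel and the organ to the disease, which is exactly the kind of connection a flat list cannot give us.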
However, where possible and useful, we should not leave it
to UMLS to do the work on relations between terms for us.
Public reviews
Another problem is that the CDISC-CT team keeps publishing
drafts for public review only as Excel files. Excel is not a vendor-neutral
format, nor is it an open standard. This makes it difficult to do QA and impact
analysis on changes relative to earlier versions. If the CDISC-CT team does not
do such an analysis itself, it should at least make it easy for the public to
do so during the review period. In my personal opinion, it would also be much
better to have a longer cycle (e.g. once a year) and a longer public review
period so that a broad discussion is possible prior to final publication.
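Once the review packages are available in a machine-readable format, a first impact analysis is only a few lines of code. The Python sketch below compares two versions of a codelist keyed by NCI code. The data structure, the placeholder code "C00001" and the paraphrased definitions are my own assumptions for this illustration; loading the real files (e.g. from ODM-XML) is left out.

```python
def diff_codelist(old, new):
    """Compare two versions of a codelist.

    Both arguments map an NCI code to a dict with the keys
    'submission_value' and 'definition'. Returns a sorted list of
    (code, change) tuples per kind of change.
    """
    changes = []
    for code in sorted(new.keys() - old.keys()):
        changes.append((code, "added"))
    for code in sorted(old.keys() - new.keys()):
        changes.append((code, "removed"))
    for code in sorted(old.keys() & new.keys()):
        for attr in ("submission_value", "definition"):
            if old[code][attr] != new[code][attr]:
                changes.append((code, f"update {attr}"))
    return changes


# Illustrative data: definitions are paraphrased, "C00001" is a placeholder.
OLD = {
    "C48789": {"submission_value": "DXA SCAN",
               "definition": "Any technique for scanning bone density."},
}
NEW = {
    "C48789": {"submission_value": "DXA SCAN",
               "definition": "X-rays at different energy levels, any location."},
    "C00001": {"submission_value": "NEW TERM",
               "definition": "A newly added term."},
}

CHANGES = diff_codelist(OLD, NEW)
```

Every "update definition" entry in the result is a candidate for the kind of silent change in meaning discussed in the section on version management, and can be reviewed before the final publication rather than after it.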
What we can do better
Here are a few proposals (my opinion of course) for how we can
do better in the development of CDISC-CT:
- Start better version management. For example, we could agree
that the January 2019 version is used as the "base version" for the
future. If something for a controlled term is changed later (such as the
definition), deprecate that term and NCI code, and provide the changed term
with a new "submission value" as well as a new NCI code.
- Improve quality management. Many of the "definition
changes" are probably due to a somewhat overhasty inclusion of newly
requested terms. Also, there does not seem to be any "impact
analysis" done on changes. Some bright people have developed methods and
tools for this. These can and should also be used by the CDISC-CT team.
- Stop further development of codelists for which there are
better alternatives, but still allow their terms to be used in submissions. The
"UNIT" codelist is the most notorious of them. Further development
of the LBTESTCD codelist does not make sense in my opinion either. And why do
we need our own lists of microorganisms when real specialists have already
developed complete ontologies?
- Allow alternative codelists that do the job better. For
example, for LBTESTCD (which does not represent a test but an analyte), we may
as well use the "component" part of the LOINC code. This would
however require moving away from the 8-character limitation of --TESTCD variables. The users could then choose
between using the "old" CDISC-CT codelist and using the
"component" from LOINC for LBTESTCD.
- If pre-coordinated codelists can be used, do so. For
example, for lab tests (but the same applies to vital signs), if the LOINC code
is available, e.g. from an EHR, use it (LBLOINC, VSLOINC) and in such a case,
do NOT (manually?) populate TESTCD, TEST, SPEC, METHOD, as this information is redundant and populating it is usually
error prone.
- Be consistent and pragmatic. The example of
"OTHER" being excluded or included in different codelists clearly
demonstrates that the development principles differ between codelists (and CT
subteams?). This is due to the history of CDISC-CT. Take a step back and rethink
the development principles, which then need to be followed for each
codelist. Best (in my opinion) is to use a pragmatic approach in which terms
that "natively" semantically belong to the codelist are marked separately
from terms that are allowed under certain conditions (such as
"MULTIPLE", "OTHER", "UNKNOWN"). It is extremely
important that this is done in a machine-readable way. One could even
express such conditions in a machine-readable way within the codelist itself.
And no, this is NOT the responsibility of other teams (such as the SDTM team)
alone, it is ALSO the responsibility of the CDISC-CT team.
- Where possible and useful, provide relations between terms.
For example, in "vital signs", there is a clear relationship, even
described by a formula, between "BMI" and "height" and
"weight". This formula can even be incorporated in the CT itself.
Common "parent" terms can also be added, e.g. providing the relation
between "systolic blood pressure" and "diastolic blood
pressure". We should not leave it to UMLS to define relationships between
our terms.
- In order to improve the quality of CDISC-CT, make the
release cycles longer again. Twice a year should be more than sufficient. Do
not include newly requested terms that were not well quality-reviewed and for
which no impact analysis was performed. Publish public review packages in a modern,
vendor-neutral format, so that future implementors can do an impact analysis on
their systems. Also provide more time for such reviews. If a new term is
heavily discussed, delay its publication until consensus (also with the
broader community) is reached, or do not add it at all.