Saturday, December 30, 2017

CDISC-CT 2017-12-22: more madness

In my previous posts, I reported about the madness that goes on in the development of CDISC (especially) lab test codes, and the problems related to the CDISC approach. For example, the CDISC approach does not allow to define tests like "maximum in the last 24 hours" or "average over the last 24 hours" (e.g. for vital signs temperature, blood pressure, or concentration of a substance in blood or urine). Such definitions are however an integral part of the LOINC coding system, over the "time aspect".

Now that the FDA has mandated the use of LOINC coding for laboratory tests, it would be expected that CDISC stops the development of an alternative system for lab tests. The latest CDISC controlled terminology (dated 2017-12-22) however again contains over 40 new lab test codes.


There are several reasons for this.

First of all, we need to take into account that CDISC lab test codes are NOT lab test codes, they only specify "what" is measured. This corresponds to the "analyte/component part" in LOINC. So, for example, the CDISC "GLUC" ("glucose") "test code" essentially represents hundreds of different tests where glucose is somehow (presence, qualitatively or quantitatively) measured. So, CDISC-CT is "post-coordinated", meaning that it needs to be combined with content from other variables to uniquely describe a test. In practice, however, this does not work: with the CDISC system, reviewers can never find out whether test A in one study of one sponsor is the same as test B in another study from another sponsor: only the LOINC code can do this, and this is exactly why the FDA started requiring LOINC coding for lab tests.
If we read the latest "Recommendations for submissions of LOINC codes", published by FDA, CDISC and the Regenstrief Institute, we read that even when LOINC codes are submitted, it is still mandatory to populate the CDISC-CT "lab test code" (which it isn't), and all other "classic" "identifying variables" such as the specimen, the method, etc.. I.m.o. this is stupid, as it adds redundancy to the record. For example, if the provided LOINC code has contents that deviate from the contents of LBTESTCD, LBSPEC, LBMETHOD, which of both then contains the truth? The LOINC code or the CDISC test code? I.m.o., this testifies that CDISC is still not ready for giving up it's own system (which is not a system, but just a list based on tradition), but needed to accept the decision of the FDA, though with displeasure. 
One of the arguments of CDISC for their "postcoordination" approach has always been that "research is unique", "does not dictate any tests" and that for many lab tests in research, there is no LOINC code. The latter is essentially not correct, as I have found out in the recent years. I estimate that for over 80% (if not over 90%) of the "test" codes published, there is at least one LOINC code (often many more) in the LOINC system. As I stated, LBTESTCD essentially corresponds to the "analyte/component" part of the LOINC system, and my conservative estimate is that for over 98% of the CDISC "test codes", there is an entry in the "analyte/component" list of LOINC (the latter can be obtained as part of a separarte database from the LOINC website).
The real reason for CDISC not giving up their system is probably (besides "not invented here") that CDISC is sticking to the 8-character limit for LBTESTCD. The "analyte/component" part in LOINC does not have this limitation.

What we see in the newest (2017-12-22) version of the CDISC-CT for lab tests is that for almost each of the new terms (when looking at the "CDISC definition", a corresponding entry in the "analyte/component" part can be found. The only major difference is that CDISC then additionally assigns a <8-character code to it. So, we are seeing the CDISC LBTESTCD values evolving into an <8-character representation of the "analyte/component" part of LOINC - if it wasn't that yet.

In the next months, I want to try to do some research on how "equal" LBTESTCD/LBTEST is with the "analyte/component" part, using a quantitative approach, for example by text comparison techniques like by calculating the "Levenshtein distance" between the value of LBTEST (or the CDISC definition) and the "analyte/component" part of LOINC.
The hypothesis of my research will be that LBLOINC is nothing else than a copy of the "analyte/component" part of LOINC, but then restricted to 8-characters.

If the hypothesis is found to be true, we might as well replace LBTESTCD/LBTEST with the "analyte/component" part of LOINC if we do want to keep a "post-coordinated" approach for SDTM (which I doubt we really need). This essentially would mostly correspond to what I proposed a few years ago in my article "An Alternative CDISC-Submission Domain for Laboratory Data (LB) for Use with Electronic Health Record Data", which i.m.o. combines "best of both worlds".

In order to have such a "best of both worlds" approach (my article can just be a starting point), we however need to remove the 8-character limitation on xxTESTCD, which is there for historical reasons only, and not for any technical reasons anymore. The SDTM team however seems not to be prepared to change anything there.

In my next blog entry, I will probably write something about the more than 230 "PK units" that have been added to the newest CT version, although there is a UCUM notation for each of them. 
Unfortunately, the title of that post will probably also need to contain the wording "CDISC-CT madness" ...

Sunday, December 17, 2017

The future of SDTM

Today, I looked into the newly published SDTM v.1.6 and the new SEND-IG-DART)

This new version is solely meant for SEND-DART (non-clinical sumission datasets: Developmental and Reproductive Toxicology). When going through both new standards, there were quite a number of very disturbing things (at least for me):

  • There are no machine-readable files. The SDTM v.1.6 comes as HTML, the SEND-IG-DART as a PDF. Essentially, this means a lot of frustrating copy-and-paste for those who want to implement these standards into their systems and software.
  • As there is no machine-readable version, all the "rules" and "assumptions" are not machine-readable anyway, thus leaving them open for different interpretations. It is then also foreseeable that a certain company that is working for the FDA will "highjack" the interpretation of the rules and use it for commercial purposes.
  • This version of SDTM is solely meant for SEND-DART. This is very worrying. SDTM which was named "SDS" (Submission Data Standard) in earlier days and has always be meant to be a "universal" model for both SDTM (human trials) as for SEND (non-clinical / preclinical studies). Here is the "big picture" (copied from the CDISC website):

    When starting naming different "flavors" of the "universal" SDTM standard "versions", we are doing something really wrong. "Standards Versions" should be subsequent, a newer version replacing the older one. Unfortunately, this is not the case anymore.
  • We see more and more that some SDTM variables are only allowed/meant to be used in a single domain. Also here, some new veriables have been added which can only be used in one or only a few domains. For me, this evolution demostrates the failure of the SDTM model anyway.
  • The model is again tightly coupled to the outdated SAS-XPT format: variable names and test codes not longer than 8 characters, labels not longer than 40 characters and values not longer than 200 characters. Only US-ASCII is allowed. Such a direct coupling between a model and a transport format is nowadays as an "absolute no-go" in modern informatics.
  • As in prior versions, model and IG contain a lot of derived variables. As SDTM is essentially about "captured data", derived variables should not be in SDTM
  • Also this version of SDTM sticks to "tables" (2-dimensional). Now, there is nothing against tables, but in order to guarantee high quality data, there should be explicit relations between the tables, without any data redundancy. This is what relational databases are based on.
    SDTM however breaks almost every rule of a good relational database, with lots of data redundancy (inevitable leading to reduced data quality), with many unnecessary variables added "for the sake of ease of review" (sic), but essentially ruining the model.
So, what can be done? How should the future of SDTM look like? Let us make a "5-year plan". We can divide this short term, middle term and long term actions.

Short term actions
  • In case FDA and PMDA cannot guarantee that they will accept Dataset-XML very soon, replace SAS-XPT by a simple, easy-to-use, vendor-neutral format that does not overstrain FDA and PMDA.
    It is now clear that FDA and PMDA do not have the capability (or do not want) to switch from XPT to the modern Dataset-XML format. Concerns about file sizes ("a submission might not fit on a memory stick") and inexperience with XML anyway look to be the current "show stoppers".
    As a temporary solution (the "better than nothing solution") but already solving a lot of the limitations of SAS-XPT, a simple "bar-delimited" (also named "pipe-delimited") but UTF-8 encoded simple text files can be used ("HL7-v2 like"). For example (LB dataset):

    Such datasets are very compact, on an average take only 25% of the corresponding SAS-XPT file size, and are easy to import into any software. There is no 8-, 40-, or 200-character limit, and can easily handle non-ASCII characters such as Spanish (USA) and Japanese (Japan) characters.
    The metadata is all in the define.xml, but if also this is a problem for the systems at the regulatroy authorities, the first row can contain the variable names.
    Acceptance for this format could (technically) easily be established in a period of 6 months or less. However, this step can be skipped when FDA and PMDA implement Dataset-XML within a reasonable (<2 years) time.
    Once this done, we are at least freed from the "hostage" of the SAS-XPT format limitations, allowing us to take the next steps (SAS-XPT is currently the "show stopper" for any innovation). The acceptance of XPT should then be stopped within 2-3 years by the FDA and PMDA, to allow sponsors to adapt, this although "bar-delimited" and Dataset-XML files can easily be generated from XPT files.
  • Stop developing controlled terminology for which there is considerably better controlled terminology in the healthcare world. This comprises controlled terminology for LBTESTCD, LBTEST and UNIT. Investigate whether this should also apply to other CDISC controlled terminology (e.g. microorganisms?).
    This step does not mean that the use of the already developed terms is not allowed anymore, but it means that no effort is wasted in developing new terms anymore. Also remark that this may mean that some subteams are put on hold.
Mid term actions

Once we are "freed" from SAS-XPT, we can take the next steps
  • Decide which controlled terminology should be deprecated. For example, I expect the "COMPONENT" part of LOINC to be a better alternative for LBTESTCD/LBTEST. Databases for these are already available. Remark that "COMPONENT" in LOINC is limited to 255 characters in length, so considerably more than the ridiculous 8 characters in LBTESTCD. But that is not a problem as the transport format has no length limitations for fields at all.
    The "deprecation time" in which the old terminology is faded out can then be agree e.g. to be 5 years. For UCUM, I think the case is clear: we can no longer afford to disconnect from e-healthcare
  • Considerably improve our relationships with other SDOs (HL7, Regenstrief, NLM) in healthcare, not considering them as "the enemy" anymore, but being prepared to learn from them, even deprecating some of our standards in favor of well-established ones in healthcare.
  • As SDTM is not fit for e-Source, develop new Findings domains that are fit for e-Source, probably using LOINC and other modern coding systems as identfiers for tests. As long as not everything is e-Source, these domains will probably be in parallel with the existing domains. This time do it good: do not allow for derived and data-redundant variables.
    This step is not as easy as it looks: it would mean that for these domains, SDTM becomes a real relational database, which also has the consequence that the data-redundant variables that were introduced "for ease of review" will not be present anymore (leading to higher data quality), and that review tools at the regulatory authorities will needed to be adapted, i.e. that they will need to implement "JOINS" between tables (for relational databases, this might mean creating "VIEW" tables).
    This step will require a change in mentality for both CDISC and the regulatory authorities: for CDISC from "we do everything the FDA/PMDA (reviewers) ask us", to a real partnership, where CDISC helps the FDA and PMDA implementing these new, improved domains. This may mean that CDISC's own consultants work at FDA and PMDA for some time to help adapting their systems. This looks more difficult as it is, as it essentially reduces in implementing foreign keys and creating "VIEW"s on tables. Essentially, it would also mean that CDISC and FDA work together on the validation rules and their technical implementation, so that high quality validation rules and implementations of them become available, very probably as "real open source" (the current validation software used by the FDA and PMDA is less than suboptimal and based on the own interpretation of the SDTM-IGs by a commercial company)
  • Switch from a simple transport to a modern transport format (which might, but must not be Dataset-XML), allowing for modern review using RESTful web services, as e.g. delived by the National Library of Medicine and others, allowing "Artificial Intelligence" for considerably higher quality (and speed) of review.
  • Start thinking of the SDTM of the future. Must it be tables (the world is not flat, neither is clinical data - Armando Oliva, 2009)? Execute pilots with submissions of "biomedical concepts", and "linked data" using a transport format that is independent from the model.
Whereas the "short term" can be limited to something like 6 months, the "middle term" will probably take something like 2-3 years. This step will surely require "bye-in" from FDA and PMDA. Within CDISC, there is so much expertise which is currently not used: we do have a good number of brilliant volunteers (some of we lost to other organizations such as Phuse) who can help bring ourselves and the regulatory authorities to the next level of quality in review.

Long term actions
  • SDTM is highly probably not the ideal way to submit information to the regulatory authorities. Even when "cleaned", removing unnecessary and redundant information, a set of tables should not be the "model", it should only be one of the many possible "views" on the data. At this moment, only the "table view" is essentially used, or it must be that some (but not all) reviewers have their own "trick box" (own tools) to get more out of the data.
  • In the "middle term" period, we should already start looking into using "biomedical concepts" for submission, following ideas already developed by some of our volunteers and a number of companies. We might even already do pilots with the regulatory authorities at this point
  • In the "long term" we must come to a better way of submitting information, part of which will be in the form of "biomedical concepts". When looking at HL7-FHIR, I see that their "resources" and "profiles" are extremely successful and very near to what we need in clinical research, also for submissions.
  • Work together with other organizations to come to a single model for care and research in the medical world. With the upcome of wearables, site-less studies, interoperable electronic health records in many countries, we can no longer afford to work in isolation (or even claim that clinical research is "special").
Personally, but who am I, I would not be surprised that we make it happen that e.g. FHIR and CDISC standards evolve into a single standard in 10 years from now.

Saturday, December 2, 2017

An e-Protocol Annotation Tool

As part of my professorship in medical informatics at the Institute of e-Health at the University of Applied Sciences FH Joanneum, I also have a little bit of time to do some more "basic" research. This research is often not funded, as "applied sciences" universities in Austria only get money from the state for teaching activities.

In the last days, I started working on a clinical research protocol tool.
It is still extremely primitive, but I want to share my first results with you anywhere.

The tool makes extreme usage of RESTful web services, for example the NLM RESTful web services, web services from HIPAA (require an account and token), UMLS RESTful web services, and of course, our own RESTful web services.

CDISC-SDTM annotation

Much of the information from the protocol finally goes in the SDTM submission to the FDA or PMDA. For example, there is a lot of information that goes in the SDTM "TS" (trial summary) dataset. The protocol can be annotated with the information about where the information needs to go into the TS dataset and with which parameter name.
The same info also goes into clinical trial registry submissions, ideally using the CDISC CTR-XML standard.
Here is a short demo about how the annotation works:

... and so on ...
As one can see, the user cannot only annotate the part where the code should be assigned to (yellow), but also the value of the code or parameter (green).
This one of course is a "easy pray" for an artificial intelligence program. So in my opinion, assigning and retrieving such "trial summary parameters" can easily be automated.

LOINC annotation

With this tool, annotating laboratory tests with their LOINC code becomes very easy. A simple demonstration is shown here:

SNOMED-CT annotation

For SNOMED-CT annotation, I used the UMLS RESTful web services API. Please remark that these require a UMLS account and API token, and possibly a (country) SNOMED-CT license. A short demo is shown here:

If you do not have a UMLS account and API token, you can of course always a "Google Search" which can be started from within the tool.

Other types of annotations that can currently be used are UMLS, ICD-10 (WHO) and the ATC (Anatomic, Therapeutic, Chemical) classification system for therapeutic drugs.