Saturday, December 2, 2017

An e-Protocol Annotation Tool

As part of my professorship in medical informatics at the Institute of e-Health at the University of Applied Sciences FH Joanneum, I also have a little bit of time to do some more "basic" research. This research is often not funded, as "applied sciences" universities in Austria only get money from the state for teaching activities.

In the last few days, I started working on a clinical research protocol annotation tool.
It is still extremely primitive, but I want to share my first results with you anyway.

The tool makes extensive use of RESTful web services: for example, the NLM RESTful web services, the HIPAA web services (which require an account and token), the UMLS RESTful web services, and of course our own RESTful web services.

CDISC-SDTM annotation

Much of the information from the protocol ultimately goes into the SDTM submission to the FDA or PMDA. For example, a lot of information ends up in the SDTM "TS" (trial summary) dataset. The protocol can be annotated with the information about where each piece of information needs to go in the TS dataset and under which parameter name.
The same information also goes into clinical trial registry submissions, ideally using the CDISC CTR-XML standard.
Here is a short demo about how the annotation works:



... and so on ...
As one can see, the user can not only annotate the part to which the code should be assigned (yellow), but also the value of the code or parameter (green).
This is of course "easy prey" for an artificial intelligence program, so in my opinion, assigning and retrieving such "trial summary parameters" can easily be automated.
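To illustrate, the mapping of such annotations into the TS dataset can be sketched in code. The variable names (TSPARMCD, TSPARM, TSVAL, ...) follow the SDTM Trial Summary structure; the study values themselves are hypothetical:

```python
# Sketch: representing annotated protocol information as SDTM TS
# (Trial Summary) records. The parameter code (yellow annotation) and
# the value (green annotation) become TSPARMCD and TSVAL; the study
# values below are hypothetical.

def make_ts_record(studyid, seq, parmcd, parm, value):
    """Build one Trial Summary record as a plain dictionary."""
    return {
        "STUDYID": studyid,
        "DOMAIN": "TS",
        "TSSEQ": seq,
        "TSPARMCD": parmcd,  # short parameter code (yellow)
        "TSPARM": parm,      # parameter name
        "TSVAL": value,      # the annotated value (green)
    }

ts = [
    make_ts_record("MYSTUDY", 1, "TITLE", "Trial Title",
                   "A Phase II study of drug X"),
    make_ts_record("MYSTUDY", 1, "PLANSUB",
                   "Planned Number of Subjects", "120"),
]
```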


LOINC annotation

With this tool, annotating laboratory tests with their LOINC code becomes very easy. A simple demonstration is shown here:



SNOMED-CT annotation

For SNOMED-CT annotation, I used the UMLS RESTful web services API. Please note that these require a UMLS account and API token, and possibly a (country) SNOMED-CT license. A short demo is shown here:
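For illustration, such a lookup could look as follows. The endpoint and the response layout below reflect my understanding of the UMLS UTS REST search API and should be treated as assumptions; a real call requires the API key from your UMLS account:

```python
import json
from urllib.parse import urlencode

# Sketch of a SNOMED-CT lookup via the UMLS UTS REST search API.
# Endpoint and response layout are assumptions based on the public
# UTS documentation; an API key from a UMLS account is required.
BASE = "https://uts-ws.nlm.nih.gov/rest/search/current"

def build_search_url(term, api_key, sabs="SNOMEDCT_US"):
    """Build a UTS search URL restricted to one source vocabulary."""
    query = urlencode({"string": term, "sabs": sabs, "apiKey": api_key})
    return f"{BASE}?{query}"

def extract_hits(response_text):
    """Pull (concept id, name) pairs out of a UTS search response."""
    data = json.loads(response_text)
    return [(r["ui"], r["name"]) for r in data["result"]["results"]]

# Hypothetical response fragment, for illustration only:
sample = '{"result": {"results": [{"ui": "C0004238", "name": "Atrial fibrillation"}]}}'
url = build_search_url("atrial fibrillation", "MY-API-KEY")
```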



If you do not have a UMLS account and API token, you can of course always start a "Google Search" from within the tool.


Other types of annotations that can currently be used are UMLS, ICD-10 (WHO) and the ATC (Anatomical Therapeutic Chemical) classification system for therapeutic drugs.

Thursday, November 16, 2017

CDISC-CT madness: a solution

In my previous post, I reported on the ongoing madness in the development of new CDISC controlled terminology (CT) for laboratory tests, microbiology and a few other areas, where CDISC developed its own CT although much better systems (not mere lists) already exist.

Yesterday, I found another gem showing where this madness leads. It concerns the development of the "Ebola Therapeutic Area User Guide", and how the observations "highest temperature in the last 24 hours", "highest pulse in the last 24 hours" and "lowest blood pressure in the last 24 hours" should be represented in SDTM. Here is a small part of the very long discussion from the CDISC wiki:

In the SDTM Vital Signs dataset, systolic and diastolic blood pressure must be identified by VSTESTCD=SYSBP and VSTESTCD=DIABP. So, in the CT philosophy, "highest value in 24 hours" must be added through an additional variable. But which one? We have other such variables, such as VSPOS (position), VSLOC (location) and VSLAT (laterality), but there is none for expressing the "maximum value".
In the post on the SDTM wiki, it is suggested that VSTSDTL (Vital Signs Examination Detail) be added and used. Its contents would essentially be free text, which is not well understandable by machines.

However, there is a very simple solution!

"Maximum in the period of 24 hours" is defined by LOINC. So I was not surprised that I could find exact LOINC codes for the four cases mentioned. A simple search (<5 minutes) in the LOINC database, led to following results:

  • Highest temperature in 24 hours:
    LOINC Code: 8315-4
    LOINC (short) Name: Body temperature:Temp:24H^max:XXX:Qn
    LOINC Long (common) Name: Body temperature 24 hour maximum
  • Highest pulse/heart rate in 24 hours:
    LOINC Code: 8873-2
    LOINC (short) Name: Heart rate:NRat:24H^max:XXX:Qn
    LOINC Long (common) Name: Heart rate 24 hour maximum
  • Lowest diastolic blood pressure in 24 hours:
    LOINC Code: 8477-2
    LOINC (short) Name: Intravascular diastolic:Pres:24H^min:XXX:Qn
    LOINC Long (common) Name: Diastolic blood pressure 24 hour minimum
  • Lowest systolic blood pressure in 24 hours:
    LOINC Code: 8495-4
    LOINC (short) Name: Intravascular systolic:Pres:24H^min:XXX:Qn
    LOINC Long (common) Name: Systolic blood pressure 24 hour minimum
"Voila!" - problem solved! While the SDTM people are discussing for hours, we could solve the problem in 5 minutes by just using a much better coding system than the CDISC-CT system.
All we then need to do is to add these LOINC codes in VSLOINC, and we are ready. If it were not mandated by the SDTM-IG, we could even leave VSTESTCD and VSTEST out, as the value of VSLOINC containing the LOINC code has the full and 100% correct and exact description of the test that was performed. The details of what the code means can be easily retrieved by any modern tool by e.g. using RESTful web services.

It is really time that the SDTM and CDISC-CT teams start rethinking a few things...

===== Addition =====

As a little "encore", here is how you can easily find such codes using our RESTful web services:


What you see is a snapshot of an application developed by one of my (bright) students, Ms. Eva Winter, as part of her Bachelor thesis. It allows a dynamic search: each time a character is entered (or after each word, by entering a space or "Enter"), the RESTful web service is called, and the list of proposed LOINC codes (ordered by relevance) is updated. The selected LOINC information can then be saved as XML or as JSON.


===== Addition 2 =====

You can of course also use the excellent LOINC RESTful web services developed by the US NLM (National Library of Medicine). For example, in order to get the 10 most relevant LOINC codes for the search term "systolic blood pressure maximum", use the request "https://clin-table-search.lhc.nlm.nih.gov/api/loinc_items/v3/search?type=question&terms=systolic%20blood%20pressure%20maximum&ef=LONG_COMMON_NAME&maxList=10", and you will obtain similar results in JSON format.
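A minimal sketch of calling this service from code follows. The four-element response layout (total count, list of codes, extra fields, display strings) is my understanding of the service and should be verified against the NLM documentation:

```python
import json
from urllib.parse import urlencode

# Sketch: query the NLM clinical table search service for LOINC codes,
# using the same request as in the example URL above.
BASE = "https://clin-table-search.lhc.nlm.nih.gov/api/loinc_items/v3/search"

def build_request(terms, max_list=10):
    """Build the search URL for the given search terms."""
    query = urlencode({
        "type": "question",
        "terms": terms,
        "ef": "LONG_COMMON_NAME",
        "maxList": max_list,
    })
    return f"{BASE}?{query}"

def parse_response(text):
    """The service returns a JSON array; element 1 holds the LOINC
    codes and element 2 the requested extra fields (assumed layout)."""
    total, codes, extra, _display = json.loads(text)
    return list(zip(codes, extra["LONG_COMMON_NAME"]))

# Hypothetical response fragment, for illustration only:
sample = ('[1, ["8495-4"], {"LONG_COMMON_NAME": '
          '["Systolic blood pressure 24 hour maximum"]}, '
          '[["Systolic blood pressure 24 hour maximum"]]]')
url = build_request("systolic blood pressure maximum")
```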

Sunday, November 12, 2017

Excel and other flat file data to CDISC ODM

I have been so naive ...

After 15 years of the CDISC ODM standard, I believed that its use was widespread and that every EDC system supports it (or at least has the capability to export as ODM). The reality is quite different ...

I recently was in the clinical research center of a large hospital (no, I won't say where it was). On the wall there was a list of all the clinical studies that were running (over 100 studies). Some of them run over the whole country. The table also contained a column "EDC system". To my surprise, in about 60% of the studies, the EDC system listed was ... MS Excel. Even one of the largest (national) studies, with thousands of patients, was using Excel as "the EDC system".

I recently had a potential customer for my SDTM-ETL mapping software. I gave two web conference demonstrations, as the customer has offices in several parts of the world. One of the questions I received was "can it read data from Excel worksheets?". I was astonished ...

The SDTM-ETL mapping software uses CDISC ODM as input. Metadata is read from the "Study" element of an ODM file, allowing drag-and-drop for many of the SDTM variables. The mappings are stored in a define.xml file. Upon execution, the clinical data is read from an ODM file with clinical data.

When Excel is the data source, however, what is the metadata? At best, one has one or a few rows containing a label or an ID.

This all made me rethink ...

So I developed a new piece of software to transform exports from Excel as CSV files, as well as other character-delimited "flat" files, into CDISC ODM. I gave the software the name "ODM Generator".
With this software, clinical data centers that still use Excel, or that receive data in any other flat-file format (often the case for laboratory data), can now easily transform these data into CDISC ODM and then combine them with metadata and clinical data from other sources.

Here is a snapshot of the main window:


In this case, the user has already selected a "flat" data file, and the system automatically recognized that the vertical bar "|" was used as the delimiter. You can then set whether strings are embedded in single or double quotes (usually the software will detect this itself - which is especially important when the delimiter is a comma). Also, if the first line of your file contains the column headers, you should check the checkbox "First line contains column names". This will not only remove the first line from the loaded data, but also later propose the column names as the names for the ODM items.
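This kind of delimiter and header detection can be approximated with Python's standard csv.Sniffer. This is a sketch of the idea, not the tool's actual implementation:

```python
import csv

# Sketch: auto-detect the delimiter and quote character of a flat
# file, as the tool does when a file is loaded.
def detect_dialect(sample_text):
    dialect = csv.Sniffer().sniff(sample_text, delimiters=",;|\t")
    return dialect.delimiter, dialect.quotechar

def looks_like_header(sample_text):
    """Heuristic corresponding to the
    'First line contains column names' checkbox."""
    return csv.Sniffer().has_header(sample_text)

sample = "STUDYID|SITEID|SEX\nMY-01|001|M\nMY-01|002|F\n"
delimiter, quotechar = detect_dialect(sample)
```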
In most cases, you will want to view the data as a table to see whether the import went correctly. In order to do so, click the button "Show file as table". A new window then opens and shows all the data as a table:


Each column simply obtains a simple identifier: "F1", "F2", and so on. In case the first line contains the column headers, these are shown as a tooltip on the column:


One can then start generating the ODM metadata by clicking the button "Start generating ODM metadata". The system will then analyze the data in the dataset and make a first proposal for the data types, the maximum length and other "ODM Item" metadata:
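The proposal step can be imagined as follows. This is a sketch under the assumption that the tool inspects each column's values; the actual heuristics of the ODM Generator may differ:

```python
# Sketch: propose ODM item metadata (data type, maximum length)
# from the values found in one column of the loaded file.
def _is_int(v):
    try:
        int(v)
        return True
    except ValueError:
        return False

def _is_float(v):
    try:
        float(v)
        return True
    except ValueError:
        return False

def propose_metadata(values):
    non_empty = [v for v in values if v != ""]
    max_length = max((len(v) for v in non_empty), default=0)
    if non_empty and all(_is_int(v) for v in non_empty):
        data_type = "integer"
    elif non_empty and all(_is_float(v) for v in non_empty):
        data_type = "float"
    else:
        data_type = "text"
    return {"DataType": data_type, "Length": max_length}

# A column with no data at all gets Length 0 - exactly the situation
# that the validation later flags for fields 12, 13 and 18.
```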


In case the first line contains the column names, the result is slightly different:


i.e. the column names will be proposed to become the "Item Name". One can now start completing the proposed metadata that will be used to generate the ODM.
Using the "Validate" button (near the bottom) is always a good idea, as some of the fields are mandatory or optional depending on the data type:

A result of such a validation may for example be:

For fields 12, 13, and 18, a maximum length of 0 was set. In this case, this was due to the complete absence of data for these fields in the source data.

In the next step, we want to assign the fields containing the subject ID, the visits, the forms, and the groups of items.
Based on the metadata, we see that the field "Subject ID or Identifier" (F12) is a very good candidate. However, we also see (from the table with the actual data) that it is never populated. So we will use field 11 (Screen ID) as the subject identifier.
In order to do so, we use the dropdown for "Field for Subject ID", and select "F11":


The second level we need is the visit. Here field 20 "Visit ID or number" is a very good candidate:


and select "F20" from the dropdown for "Field for Visit (StudyEvent)":


We also need to define the ID of the form ("Field for Form"). However, we don't find a suitable field. We probably want all these laboratory data to go into a single form anyway, so we don't select anything in the dropdown "Field for Form". If no selection has been made in one of the dropdowns at the top of the window, the system will always assume that there is a single one, and generate it itself. So the top of our window looks like:


In many cases, we will not want all the data to go into the ODM. For example, we could limit the data that goes into the ODM to the data we later want to map to SDTM using the SDTM-ETL software. So we only select those fields that are of interest for further processing. For example:


We include fields 5 (Study ID), 8 (Site ID), 9 (Investigator ID), 10 (Investigator Name), 15 (Subject Sex) and a good number of others (not visible in the above image). Note that in most cases, we will not select fields for which there is no data. In this case, we can easily detect them, as the suggested value for "Length" is 0. This will not always be the case, as we might have a file where some data is missing while we expect it in later files.
Also note that field 11 has disappeared from the list: we selected it for the "Field for Subject ID", so it is included automatically.
When we now do a validation (using the "Validate" button) we see that only the selected fields are validated. If we still find validation issues, we might want to change our selection.

It is now time to save our mappings!
We do so using the button "Save mappings" which we find near the bottom of the window:


A file chooser will pop up, allowing you to select a file name and location. The generated file is a simple text file, with content looking like:


When we later want to continue our work, or transform another file with similar data, we can simply load this mappings file from within the main window using the "Load Prior Mappings" button:




This mappings file can also later be used when doing batch processing of files with data (software currently in development).

In our "mappings" window, we can now start generating the ODM metadata and data using the button "Export as ODM":



A dialog is displayed:


You can either generate only the metadata (this will generate an ODM "Study" file), only the clinical data (this will generate an ODM "ClinicalData" file), or a file with both the metadata (study definition) and the clinical data.
The second option will usually be used when you have already generated the metadata before and have loaded earlier developed mappings.

If we want to have our ODM study metadata and the clinical data in separate files, we first select "Metadata only" and do an export, then select "Clinical data only" and do a separate export for the clinical data.
In our case, we want to have both metadata and clinical data, so we select "Metadata + clinical data". After clicking the "OK" button, a new dialog is displayed asking us to provide a study identifier (in ODM: StudyOID). We use "MyStudy":

   

After "OK" again, a new dialog is displayed asking us for the study name, description and protocol title. These will go into the ODM elements "StudyName", "StudyDescription" and "ProtocolName": 


Again after "OK", a file selector is displayed allowing us to define where the file needs to be created and with which file name.
The ODM file that is then generated looks like:






with 3 "StudyEvents" generated (from the data itself), a single "default" form (as we selected nothing in the dropdown for "Form"), and a single "default" ItemGroup:
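As the screenshots are not reproduced here, the shape of such a generated file can be sketched in code. The element names follow ODM 1.3; the OIDs and the item value are hypothetical stand-ins for what the tool generates:

```python
import xml.etree.ElementTree as ET

# Sketch: generate a minimal ODM ClinicalData skeleton with one
# "default" form and one "default" item group, as the tool does when
# no form field was selected. OIDs and values are hypothetical.
ODM_NS = "http://www.cdisc.org/ns/odm/v1.3"

def build_odm(study_oid, subject_keys):
    ET.register_namespace("", ODM_NS)
    odm = ET.Element(f"{{{ODM_NS}}}ODM")
    clinical = ET.SubElement(odm, f"{{{ODM_NS}}}ClinicalData",
                             StudyOID=study_oid,
                             MetaDataVersionOID="MDV.1")
    for key in subject_keys:
        subject = ET.SubElement(clinical, f"{{{ODM_NS}}}SubjectData",
                                SubjectKey=key)
        event = ET.SubElement(subject, f"{{{ODM_NS}}}StudyEventData",
                              StudyEventOID="SE.1")
        form = ET.SubElement(event, f"{{{ODM_NS}}}FormData",
                             FormOID="F.DEFAULT")
        group = ET.SubElement(form, f"{{{ODM_NS}}}ItemGroupData",
                              ItemGroupOID="IG.DEFAULT")
        ET.SubElement(group, f"{{{ODM_NS}}}ItemData",
                      ItemOID="IT.F15", Value="M")  # e.g. Subject Sex
    return ET.tostring(odm, encoding="unicode")

xml_text = build_odm("MyStudy", ["001", "002", "003"])
```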



For each of the selected fields, we also see the metadata generated:




and with clinical data having:


When collapsing each of the "SubjectData", we see that clinical data for 3 subjects (taken from the source file) have been generated:


If we have saved the study metadata as a separate file, we can start refining it, e.g. using the "ODM Study Designer" software: add the "question" for each data point definition (optional), add codelists and their references (e.g. for "Subject Sex"), and even add SDTM mapping information using the "Alias" mechanism. We can validate our ODM metadata and data using the "ODM Checker" (free for CDISC members), and generate and execute mappings to SDTM using the SDTM-ETL software.

In the near future, we will extend the features of the "ODM Generator". A few ideas we already have:
  • batch execution using saved mappings
  • different ItemGroups; for lab files, for example, this can be based on the "test category"
  • further refinement of the automatically generated proposals in the metadata
With this new product, I hope that the data of the 60% of studies that use MS Excel as "the EDC system" - like those in the clinical research center of the large hospital I recently visited - can now easily be transformed to CDISC ODM, the worldwide standard for the exchange of clinical research data and metadata.

Sunday, October 22, 2017

CDISC-CT: the madness goes on

I recently installed the newest CDISC controlled terminology (CDISC-CT) version 2017-09-29 in my databases for use with our CDISC RESTful web services.

When doing so, I noticed that more than 30 new lab test codes (LBTESTCD/LBTEST) have been added, even though the FDA has mandated the use of LOINC coding (in the variable LBLOINC) as of March 15th, 2018. The SDTM-IG still states that LBLOINC is the "Dictionary-derived LOINC Code for LBTEST", but in most real cases it is just the other way around: LBTESTCD and LBTEST are derived from the LOINC code, as that is what is delivered (or should be delivered) by the central or hospital labs. So essentially, LBLOINC should be the "topic variable", not LBTESTCD.

More than 2 years ago, I published an article, "An alternative CDISC-Submission Domain for Laboratory Data (LB) for Use with Electronic Health Record Data", which was well received - except within CDISC. I had hoped it would become a starting point for a discussion within CDISC about how to avoid "reinventing the wheel" in SDTM and to better connect to the controlled terminology and coding systems that are used worldwide in healthcare. It looks, however, like the CDISC-CT team refuses to correct its course, and goes on developing "lists of terms" that have no connection with what is used in healthcare and in science in general.

Apart from terms for lab tests, CDISC has also developed a list of ... microorganisms (codelist MICROORG, NCI code C85491), in the latest version containing 1506 terms.



When I recently discussed this codelist with people from a tropical medicine institute, they asked me about the systematics and taxonomy of this codelist. Unfortunately, I had to admit that the codelist "MICROORG" has neither a system nor a taxonomy - it is just a list. They then asked me why CDISC is not using a worldwide-used system that has a taxonomy and relations between organisms, such as the NCBI taxonomy of cellular organisms.

The CDISC-CT codelist "MICROORG" contains the term "Absidia", without any information about its relationship with other organisms. Only in the "CDISC Definition" column, it states (as narrative text, i.e. unstructured) that it is a fungus. Such narrative texts are barely machine-interpretable, and thus also unsuitable for use in e.g. artificial intelligence systems.
Just for fun, I entered "Absidia" in the NCBI taxonomy browser. This is the result I got:



It not only shows me that there are many species of "Absidia", it also shows me that the genus belongs to the family "Cunninghamellaceae" and the order "Mucorales", which is in the subphylum "Mucoromycotina", and so on - i.e. we can easily retrieve the whole taxonomy. Through the taxonomy ID (4828), we could easily use RESTful web services to have our own systems find information about this organism and e.g. build "networks of knowledge" (I haven't checked yet whether such a RESTful service is provided by NCBI - one is surely provided by UMLS).
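For what it's worth, NCBI's E-utilities appear to provide exactly this kind of access. A sketch of how the taxonomy record could be requested (the endpoint layout follows the public E-utilities documentation; the details should be verified):

```python
from urllib.parse import urlencode

# Sketch: build the request for the NCBI taxonomy record of Absidia
# (taxonomy ID 4828) via the NCBI E-utilities efetch endpoint.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def taxonomy_url(tax_id):
    """URL fetching the taxonomy record (XML) for one taxonomy ID."""
    return f"{EUTILS}?{urlencode({'db': 'taxonomy', 'id': tax_id})}"

url = taxonomy_url(4828)
```

The returned XML contains the lineage (family, order, subphylum, ...), which a program could then use to build the "networks of knowledge" mentioned above.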

Does CDISC-CT provide this functionality? Not at all. It does not even provide any information about how the CDISC term (which is surely not used in laboratories) can be generated from the usually used term, such as the NCBI term.

So, also for microorganisms, it does not make sense for CDISC to "reinvent the wheel" and develop and maintain yet another codelist.
In my opinion, CDISC should stop developing codelists for which better, internationally recognized systems or nomenclatures already exist. Examples are LOINC for lab tests, UCUM for units, and the NCBI taxonomy for microorganisms. CDISC should deprecate its own codelists ("lists of terms") when such a better, internationally recognized system exists.
Some people will immediately object that this will lead to extra columns in the SDTM tables and undermines the SDTM systematics of "test code / test name / test result", where for each domain only one (CDISC) codelist is allowed for the test code.
My proposal is that when such a better system exists, there would not only be a column "test code", but also a column "codelist system", containing either the CDISC codelist name or the name of the international coding system. For the latter, we can orient ourselves to the code systems used in FHIR:



If there is only 1 code system used within a table, it can just be listed as an "ExternalCodeList" in the define.xml.
This also means that in many of the "Findings" SDTM tables, only those "record qualifiers" that are really necessary need to be maintained. For example, when using LOINC for lab tests, LBCAT, LBSCAT, LBSPEC and LBMETHOD become obsolete, as they are already provided by the LOINC code itself, and can easily be retrieved (together with a lot of even more useful information) by any modern application through the use of RESTful web services.
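That specimen- and method-like information is already inside the LOINC name can be seen by splitting the (short) name into its six standard parts:

```python
# Sketch: decompose a LOINC short name into its six parts
# (Component:Property:Timing:System:Scale:Method). The Method part
# is often empty, as in this example.
PART_NAMES = ["Component", "Property", "Timing", "System", "Scale", "Method"]

def split_loinc_name(short_name):
    parts = short_name.split(":")
    parts += [""] * (len(PART_NAMES) - len(parts))  # pad missing trailing parts
    return dict(zip(PART_NAMES, parts))

parts = split_loinc_name("Body temperature:Temp:24H^max:XXX:Qn")
# The System part is the kind of information that would otherwise go
# into a record qualifier such as LBSPEC; Method maps to LBMETHOD.
```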

A major roadblock on the way to a considerably better SDTM, with "biomedical concepts" instead of "rows in tables", is still the SAS Transport format. It does not allow us to have codes longer than 8 characters (oh my God, what age are we living in?), and it does not even allow us to have a compact format for test results like in FHIR:

or even to provide different codes in different code systems:


So in order to make the next "quantum leap" and move SDTM out of the 20th century, we must not only start to use the internationally recognized code systems (instead of developing and maintaining our own "reinvention of the wheel" codelists), we must also finally get rid of SAS Transport 5 and move to a modern XML / JSON / RDF representation of SDTM data. For the latter, we need the cooperation of the FDA, which still mandates the use of this 30-year-old, completely outdated format.

For the CDISC controlled terminology, we need a change in mentality in the CDISC-CT development teams. If that doesn't work or doesn't happen, it is time that the CDISC board takes action:
please STOP THIS MADNESS!



Saturday, September 2, 2017

Alexa, can I put mm[Hg] in VSORRESU?



This blog entry is an extension to the discussions on the Pinnacle21 forum and on the LinkedIn "SDTM Experts" forum regarding the standardization of original result units, the controversial CDISC UNIT codelist, and the use of UCUM notation.

We live in an era where artificial intelligence (AI) and machine learning (ML) are booming. But in order to apply them to CDISC SDTM, we need an information basis. This basis should probably be the SDTM-IG. Unfortunately, the SDTM-IGs are all published as PDF documents, which are essentially not machine-interpretable. This makes the information very hard to implement in software and, e.g., to use for AI.

For example, when I work with electronic health records (EHRs), from which I get blood pressure readings with the unit mm[Hg] (UCUM notation, THE international standard for unit notation), and I want to know whether I should copy mm[Hg] into VSORRESU or replace it with the CDISC notation mmHg from the CDISC UNIT codelist, I must (as a human) fight my way through the SDTM-IG and hope to find an answer there.


If we look at the different versions of the SDTM-IG, we observe that the published (PDF) document grows considerably with each new version. For example, SDTM-IG v.3.1.2 (2008) has 298 pages, whereas version 3.2 (2013) has 398 pages, excluding the IGs for medical devices (an additional 60 pages) and for associated persons (an additional 31 pages). When I first took the 2-day SDTM course a number of years ago, the trainer (Peter VR) could explain everything (including all domains and the "assumptions") in these 2 days. I recently asked him whether this is still possible nowadays, and he answered that it is not: the training can only treat the principles, and not even all the domains. The complexity has also increased, and in my opinion the "learning curve" has steepened considerably with each new version of the IG, making it very hard for beginners to start working with SDTM. I sometimes think about asking our university to set up a master's degree in CDISC-SDTM, with an additional study year added for each newly published version of the SDTM-IG ...


At the same time, the SDTM validation software used by the FDA (developed by a company with no ties to CDISC) has contributed to the confusion by defining validation rules that are over-interpretations of the SDTM-IG, are simply wrong, or throw a large number of false positives.
The "SDTM-IG Conformance Rules" published by CDISC itself were a great step forward, but essentially came too late - they should have been published together with the SDTM-IG itself, or even better, as part of the IG. These rules have also been implemented under the "OpenRules for CDISC Standards" initiative in a modern, completely open format.


SDTM and artificial intelligence (AI)
 
Instead of needing a "master's degree in SDTM", wouldn't it be better if we (or even better, our computer programs) could just ask something like "Alexa, can I put mm[Hg] in VSORRESU?"? Alexa would then probably answer something like "No, Jozef, you need to replace mm[Hg] by mmHg: both are semantically the same, but the latter is part of CDISC controlled terminology and the former is not, and CDISC unfortunately does not allow UCUM notation yet".

What would such an SDTM-AI system need in order to answer such questions? First, it needs to know that "mm[Hg]" and "mmHg" are semantically identical. However, "mm[Hg]" is not even listed in the CDISC controlled terminology as a "synonym", although it is the term from the internationally standardized notation used in science, healthcare, engineering, and business. Secondly, it would require that the SDTM-IG be in a machine-readable format (PDF isn't), with clearly defined rules (if possible machine-interpretable, as in the CDISC "SDTM-IG Conformance Rules").
 
A first simple proposal for a machine-readable SDTM-IG was made in the past, but it seems to have gone almost unnoticed by the SDTM team (SDTM team members: please correct me if I am wrong!), and the next version of the IG (likely to have 500 pages or more?) will again be published as ... PDF. A request to also publish the next SDTM-IG as XML has unfortunately been turned down by the CDISC SDTM team.

In order for "SDTM-Alexa" (an SDTM-AI system) to answer the question whether "mm[Hg]" (from the EHR) can be put into VSORRESU, the system needs to find the guidance in the SDTM-IG. Here it is (with many thanks to Carlo R for looking it up and bringing it up in the discussion), from "Assumption 7" in the LB ("laboratory") section of the SDTM-IG:


It essentially states that one should first check whether the own term/unit (in this case "mm[Hg]") is listed as a "synonym" for "something else" in the CDISC controlled terminology, and if it is not found, a "new term request" should be submitted to CDISC.
Honestly, the latter is not a real option, as this process usually takes 6 months or more, and if the request is turned down, zero progress is made.


In our earlier proposed prototype of an SDTM-IG in XML, each Assumption is its own XML element instance, for example:




Although structured, this does not make it machine-executable, nor suitable for AI. In order to make it usable for AI or ML, we need a machine-executable expression or an algorithm, which could look like:

a) Submit the "suspected synonym" to a web service (or other system) that checks whether the value ("mm[Hg]" in this case) has been published by CDISC-CT as a synonym, and if so, of what it is a synonym.
b) If the answer is "no" (or "null"), automatically make a new term request to the NCI.

As the latter is not really an option, b) could be replaced by:

c) Extend the “UNIT” codelist in the define.xml with the own term, and put the “own term” in VSORRESU.
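Steps a) and c) together could be sketched as follows. The response format of the synonym web service is an assumption; here the service is modeled as a lookup function returning the CDISC term or None:

```python
# Sketch of the synonym-lookup algorithm: step a) ask a lookup service
# whether the own unit is a published CDISC synonym; step c) if it is
# not, keep the own term and note that the UNIT codelist must be
# extended in the define.xml. The lookup function is a stand-in for
# the RESTful web service call.
def resolve_unit(own_unit, lookup):
    cdisc_term = lookup(own_unit)      # step a): synonym query
    if cdisc_term is not None:
        return cdisc_term, False       # replace by the CDISC term
    return own_unit, True              # step c): extend the codelist

# Hypothetical synonym table standing in for the web service:
SYNONYMS = {"mol/m3": "mmol/L"}

term, extend_codelist = resolve_unit("mm[Hg]", SYNONYMS.get)
```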

For step a), I created a RESTful web service this morning, which is documented at:
http://xml4pharmaserver.com/WebServices/#sdtmtermsynonym
If one submits our example mm[Hg] to the RESTful web service (http://www.xml4pharmaserver.com:8080/CDISCCTService/rest/SDTMTermFromSynonym/mm%5BHg%5D), one obtains:
 


This contains an empty response, meaning that "mm[Hg]" is not found to be a "synonym" of anything. This is rather strange, as it is the mandatory notation in EHRs, but this does not seem to be honored yet by the CDISC-CT team.

Another example: suppose we have measured a concentration in "mol/m3", for which we then submit a request to the RESTful web service, with the result:

stating that there IS a synonym for mol/m3 and that we need to replace it by mmol/L in LBORRESU.


So, all that would be needed is to express such an algorithm in a machine-readable form, and add it as such (probably as a child element) to the "Assumption" element in the XML version of the SDTM-IG.


Some other rules or assumptions could also easily be implemented in a machine-readable SDTM-IG and then be used for AI - for example, that "--DY" values are not allowed to be 0, so that a system could answer questions like "Alexa, can VSDY be 0?".
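Such a rule could be given an executable form. The rule itself comes from the SDTM study-day convention (day -1 is followed directly by day 1); the representation below is just one possible sketch:

```python
# Sketch: a machine-executable form of the rule that --DY values
# may never be 0 (the study day runs from -1 directly to 1).
def check_dy(variable, value):
    """Return an error message, or None if the value is acceptable."""
    if variable.endswith("DY") and value == 0:
        return f"{variable} must not be 0: study day -1 is followed by 1"
    return None

# "Alexa, can VSDY be 0?"
answer = check_dy("VSDY", 0)
```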
 
This is just one of my first ideas for coming first to a machine-readable SDTM-IG and then to smart SDTM systems using artificial intelligence or machine learning. This would not only greatly help flatten the very steep learning curve (which becomes steeper with each IG version), but would also make it possible to automate mapping steps that are now done manually and, maybe even more important, help avoid the many different interpretations of the SDTM-IG.


It does, however, require that the SDTM development team move away from generating the SDTM-IG from Word documents towards highly structured content (CDISC-JIRA and a database may be of help here), and that the team allow specialists from other domains (XML, AI, ...) to work with them and have a voice in the development.

We have self-driving cars, but for SDTM we still rely on 30-year-old technology. It is high time we do something about this.