Monday, March 9, 2020

Extending the LOINC-to-SDTM-LB Mapping

Introduction and background

The CDISC controlled terminology team has done a great job developing a mapping between the pre-coordinated LOINC terminology and the post-coordinated SDTM-LB variables. The mapping is currently in public review.

The mapping has been published as an Excel file, but can, with some effort, be converted into a relational database table. It currently contains approximately 2400 mappings for about 1500 LOINC codes. This already indicates that in some cases there is more than one mapping per code, e.g. when the LOINC "system" is "Ser/Plas", which can be translated to an SDTM LBSPEC of "SERUM", "PLASMA", or "SERUM OR PLASMA".

These mappings make it possible to automate the generation of LB datasets from other sources, such as hospital HL7-v2 messages and, especially usefully, electronic health records, as we recently demonstrated.

These 1400 LOINC codes surely cover a good part of the most common tests, but what to do when one gets test results for which the LOINC code is not among the 1400 mapped ones? Revert to manual work?

When one considers, however, that the mapping must follow some systematics (which were partially published by CDISC), one can try to extend it to cases and codes that it does not cover. For example, the LOINC code "1750-9" (Albumin [Mass/Volume] in Semen) is not in the mapping, but it differs in only one of the 5 parts defining the LOINC code from LOINC code "1751-7" (Albumin [Mass/Volume] in Serum or Plasma), and "Semen" IS in the CDISC controlled terminology for "Specimen Type", to be used in LBSPEC. The NCI code for "Semen" is C13277.
The most important parts for these codes would then be:


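| Variable | LOINC 1751-7: Albumin [Mass/Volume] in Serum or Plasma | LOINC 1750-9: Albumin [Mass/Volume] in Semen |
|----------|--------------------------------------------------------|----------------------------------------------|
| LBTESTCD | ALB | ALB |
| LBTEST | Albumin | Albumin |
| LBSPEC | either "SERUM", "PLASMA" or "SERUM OR PLASMA" | SEMEN |
| LBCAT | CHEMISTRY | CHEMISTRY |
| LBMETHOD | null | null |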




So, replacing the value of one of the 5 or 6 "parts" of a mapped LOINC code (as reflected in what we usually call the LOINC "short name") with any other value for which a mapping is available would allow us to extend the LOINC-to-SDTM-LB mapping with a good number of entries.
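To make the "differs in only one part" idea concrete, here is a minimal sketch that compares two LOINC codes part by part. The dictionary keys follow the column names of the LOINC table ("method_typ" is assumed here for the method column), and the part values for the two albumin codes are typed in by hand for illustration rather than read from the LOINC database.

```python
# Names of the six LOINC parts; "method_typ" is an assumed column name.
PART_NAMES = ("component", "property", "time_aspct",
              "system", "scale_typ", "method_typ")


def differing_parts(a: dict, b: dict) -> list:
    """Return the names of the parts in which two LOINC codes differ.

    A missing part (e.g. no method) is treated as an empty string.
    """
    return [p for p in PART_NAMES if a.get(p, "") != b.get(p, "")]


# Hand-typed illustration data for 1751-7 and 1750-9.
serum_plasma = {"component": "Albumin", "property": "MCnc",
                "time_aspct": "Pt", "system": "Ser/Plas", "scale_typ": "Qn"}
semen = {"component": "Albumin", "property": "MCnc",
         "time_aspct": "Pt", "system": "Semen", "scale_typ": "Qn"}

print(differing_parts(serum_plasma, semen))  # ['system']
```

A candidate for an extended mapping is then any unmapped code for which this list has exactly one element when compared with an already mapped code.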

Methods

Some of the systematics can easily be retrieved when one knows and understands both LOINC and SDTM-LB, others will be harder. The easy ones are:
  • There is an almost 1:1 mapping between the LOINC "component" and the CDISC "LBTESTCD".
  • There is a 1:1 relationship between SDTM "LBTESTCD" and "LBTEST".
  • Except for the case "Ser/Plas", there could well be a 1:1 relationship between the LOINC "system" (the body or anatomical system) and LBSPEC.
  • In principle, there should be a 1:1 relationship between the LOINC "example UCUM units" and the CDISC "example LBORRESU".
  • There probably is a 1:1 relationship between the LOINC "method" and the CDISC "LBMETHOD", where it is expected that the former has many more terms.
For the latter, for example, when we look up the LOINC method and the CDISC LBMETHOD in the database and order by "method", we can nicely see how close the relationship between the two is:


In a good number of cases (229, as I found out), the values of "method" and LBMETHOD are even identical except for the casing. In other cases they are very similar, e.g. "Calculated" versus "CALCULATION".
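A count like the one above can be obtained with a simple case-insensitive comparison. The following is only a sketch: the function name is made up, and of the two sample pairs only "Calculated"/"CALCULATION" comes from this post, the other one is an assumed example of a case-only difference.

```python
from collections import Counter


def compare_method(loinc_method: str, lbmethod: str) -> str:
    """Classify how a LOINC method relates to a CDISC LBMETHOD value."""
    if loinc_method == lbmethod:
        return "identical"
    if loinc_method.casefold() == lbmethod.casefold():
        return "case-only"   # same text, only the casing differs
    return "different"       # needs a lookup or manual curation


# Illustrative (method, LBMETHOD) pairs; the first pair is an assumption.
pairs = [("Electrophoresis", "ELECTROPHORESIS"),
         ("Calculated", "CALCULATION")]

counts = Counter(compare_method(m, l) for m, l in pairs)
print(counts["case-only"], counts["different"])  # 1 1
```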

These "systematics" could easily be used to considerably extend the mappings, and even to automate the generation of the extended dataset. One technology that could be used for this is Machine Learning (ML), but as I am not an expert at all in this area, I needed something else (P.S. the experts in ML can see this as a nice challenge). So I looked further and came up with the following algorithm:
  • Iterate over all the LOINC codes in the existing mapping published by CDISC
  • For each entry, take the 5 parts from the "LOINC short name" ("method" will not very often be populated)
  • Iterate over the 5 parts, and use only 4 of them to look up all "similar" LOINC codes in the complete LOINC database (which has over 92,000 rows), i.e. codes which differ in only one of the parts
  • This generates a good number of new LOINC codes, all with the same values for 4 of the 5 parts, and a different value for the 5th part.
  • Using the existing LOINC-to-SDTM-LB mapping, try to find a mapping for the 5th part
  • If successful, generate a new mapping, and store it in a separate table that has the same structure as the LOINC-to-SDTM-LB table
For our example, extending the mapping of LOINC code "1751-7" (Albumin [Mass/Volume] in Serum or Plasma) to other specimen types ("system" column in LOINC), this would result in an SQL query:
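The lookup step of this algorithm can be sketched as a small query builder that fixes 4 parts and leaves one free. This is not the program that was actually used; it is a minimal illustration using an in-memory SQLite database, with an assumed table "loinc", an assumed key column "loinc_num", and four hand-typed rows standing in for the complete LOINC database.

```python
import sqlite3

# The five base parts; the names double as (assumed) column names.
PARTS = ("component", "property", "time_aspct", "system", "scale_typ")


def find_variants(conn, entry, free_part):
    """Return (loinc_num, value) rows that match `entry` on every part
    except `free_part`, which must differ from the value in `entry`."""
    if free_part not in PARTS:  # column names cannot be bound as parameters
        raise ValueError(f"unknown part: {free_part}")
    fixed = [p for p in PARTS if p != free_part]
    where = " AND ".join(f"{p} = ?" for p in fixed)
    sql = (f"SELECT loinc_num, {free_part} FROM loinc "
           f"WHERE {where} AND {free_part} != ? ORDER BY loinc_num")
    params = [entry[p] for p in fixed] + [entry[free_part]]
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loinc (loinc_num TEXT, component TEXT, "
             "property TEXT, time_aspct TEXT, system TEXT, scale_typ TEXT)")
conn.executemany("INSERT INTO loinc VALUES (?, ?, ?, ?, ?, ?)", [
    ("1751-7", "Albumin", "MCnc", "Pt", "Ser/Plas", "Qn"),
    ("1750-9", "Albumin", "MCnc", "Pt", "Semen", "Qn"),
    ("1754-1", "Albumin", "MCnc", "Pt", "Urine", "Qn"),
    ("54347-0", "Albumin", "SCnc", "Pt", "Ser/Plas", "Qn"),
])

entry = {"component": "Albumin", "property": "MCnc", "time_aspct": "Pt",
         "system": "Ser/Plas", "scale_typ": "Qn"}
variants = find_variants(conn, entry, "system")
print(variants)  # [('1750-9', 'Semen'), ('1754-1', 'Urine')]
```

Calling the same function with "property" as the free part returns the [Moles/volume] code 54347-0 from the sample rows.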

SELECT * FROM loincdb_267.loinc WHERE component='Albumin' AND property='MCnc' AND time_aspct='Pt' AND scale_typ='Qn' AND system != 'Ser/Plas';
This leads to 25 hits, i.e.:



For each of the "system" values, we then do a lookup in the existing LOINC-to-SDTM-LB table, and for a good number of them (specimen / anatomical system) we find a mapping, e.g.:

| LOINC system | CDISC LBSPEC |
|--------------|--------------|
| Urine | URINE |
| Amnio fld | AMNIOTIC FLUID |
| CSF | CEREBROSPINAL FLUID |
| Body fld | FLUID |
| Plr fld | PLEURAL FLUID |
| Semen | SEMEN |
| Synv fld | SYNOVIAL FLUID |

and so on. 
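Such a lookup can be derived directly from the existing mapping rows. The sketch below assumes the rows have already been read as ("system", "LBSPEC") pairs; the sample pairs are taken from the table above plus the three "Ser/Plas" mappings, and "Unknown system" is a made-up value that has no mapping.

```python
from collections import defaultdict


def build_spec_lookup(mapping_rows):
    """Collect all LBSPEC values seen for each LOINC system.

    A set is used because one system can map to several LBSPEC values
    (the "Ser/Plas" case).
    """
    lookup = defaultdict(set)
    for system, lbspec in mapping_rows:
        lookup[system].add(lbspec)
    return lookup


def lbspec_for(lookup, system):
    """Return the sorted LBSPEC values for a system, or [] if unmapped."""
    return sorted(lookup.get(system, set()))


existing = [("Ser/Plas", "SERUM"), ("Ser/Plas", "PLASMA"),
            ("Ser/Plas", "SERUM OR PLASMA"), ("Urine", "URINE"),
            ("Semen", "SEMEN"), ("CSF", "CEREBROSPINAL FLUID")]
lookup = build_spec_lookup(existing)

print(lbspec_for(lookup, "Semen"))           # ['SEMEN']
print(lbspec_for(lookup, "Ser/Plas"))        # ['PLASMA', 'SERUM', 'SERUM OR PLASMA']
print(lbspec_for(lookup, "Unknown system"))  # []
```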

For each of these 25 hits, we can now easily generate a mapping, as many things are the same. We need to take some care with differing values for "method" (and thus also "LBMETHOD") and for the units, but we can handle "method" in the same way, by finding a suitable LBMETHOD value in the original LOINC-to-SDTM-LB table. The same goes for the units, where we can opt to copy the "UCUM unit" value into "example LBORRESU" when no mapping is found.

In order to later facilitate curation of the results, all our actions are logged into a log file.

But there is an even easier scenario: of the 2400 mappings, only about 680 have a "method" attached. In LOINC, "method" is only provided when it is absolutely necessary to distinguish the test (i.e. by different expectation values) from the usual test with no "method" mentioned. So, if for each of the already existing mappings we look for all LOINC codes in the complete LOINC database that have the same values for the 5 base "parts" but a different value for the "method", and then try to map the newly found "method" to LBMETHOD by searching for method-LBMETHOD pairs, this could lead to a considerable number of new mappings.

Six scenarios were defined. In each of them, one of the 6 "parts" was left free, i.e. we look for all LOINC codes in the LOINC database where this part can have any value (except for the one from the mapping entry), while all others are fixed to the values of the mapping entry. As "method" in LOINC is not always provided, we never fixed it, but always allowed it to vary. This also means that for each additional "method" found, an attempt was made to map it to one of the existing LBMETHOD values in the mapping dataset. In case nothing was found, we decided to generate a new LBMETHOD value by "uppercasing" the value from the LOINC method. I think this is reasonable, as the CDISC "Method" codelist (C85492) is extensible anyway, and the value in LBMETHOD contains important information for distinguishing the test from other tests. In such a case, this is also documented in the log file.
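The "look up, else uppercase and log" rule for LBMETHOD can be sketched as follows. The function and logger names are made up for this illustration, the dictionary of known pairs contains only the "Calculated"/"CALCULATION" example mentioned earlier, and "Some new method" is a placeholder rather than a real LOINC method.

```python
import logging

# Assumed logger name; logging.basicConfig(filename=...) would send
# these messages to a log file for later curation.
log = logging.getLogger("loinc_extension")


def resolve_lbmethod(loinc_method, known_methods):
    """Map a LOINC method to an LBMETHOD value.

    Returns None when the LOINC code has no method, the existing
    LBMETHOD when a method-LBMETHOD pair is known, and otherwise the
    uppercased LOINC method (logged, as it is a newly generated term).
    """
    if not loinc_method:
        return None
    if loinc_method in known_methods:
        return known_methods[loinc_method]
    new_value = loinc_method.upper()
    log.warning("No LBMETHOD found for LOINC method %r; generated %r",
                loinc_method, new_value)
    return new_value


known = {"Calculated": "CALCULATION"}
print(resolve_lbmethod("Calculated", known))       # CALCULATION
print(resolve_lbmethod("Some new method", known))  # SOME NEW METHOD
print(resolve_lbmethod("", known))                 # None
```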

When leaving the LOINC "component" part free, we only stored a mapping when the new component did not force us to generate a new LBTESTCD/LBTEST pair, i.e. we only stored it when a mapping between "component" and LBTESTCD/LBTEST was already available in the original mappings developed by CDISC. Generating new values for LBTESTCD (which is allowed, as the associated codelist is extensible) would have been difficult anyway, as there is this stupid rule that LBTESTCD values may not be longer than 8 characters, and many of the LOINC "component" values are considerably longer than 8 characters.
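In code, this restriction amounts to a plain dictionary lookup that rejects unknown components instead of inventing a test code. The sketch below uses hypothetical helper names; the two known pairs (ALB, GLUC) are the ones mentioned in this post, and the length check only illustrates why naively uppercasing a component would often violate the 8-character limit.

```python
MAX_TESTCD_LEN = 8  # maximum LBTESTCD length


def lookup_test(component, known_tests):
    """Return the (LBTESTCD, LBTEST) pair for an already mapped
    component, or None so that the candidate mapping is skipped."""
    return known_tests.get(component)


def fits_testcd(candidate):
    """Check whether a candidate test code respects the length limit."""
    return len(candidate) <= MAX_TESTCD_LEN


known_tests = {"Albumin": ("ALB", "Albumin"),
               "Glucose": ("GLUC", "Glucose")}

print(lookup_test("Albumin", known_tests))  # ('ALB', 'Albumin')
print(lookup_test("Cortisol^pre 250 ug corticotropin IM", known_tests))  # None
print(fits_testcd("Barbiturates".upper()))  # False, 12 characters
```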

For the "example LBORRESU", we opted to always start from the "UCUM Unit" provided by the LOINC database, and to map it using existing entries in the mapping database. If no existing "CDISC unit" could be obtained, we copied the UCUM unit into the "example LBORRESU" and documented this in the log file.
From the LOINC database, only entries for which the status is "active" were used. This excludes entries that are "deprecated" or "discouraged".
All this was done automatically, i.e. executed by a software program, without manual intervention. This also means that curation to further fine-tune the mapping may be necessary.

Results

The following table contains an overview of the results, especially indicating the number of new mappings obtained.
| Scenario Number | Scenario Description | Number of new Mappings |
|-----------------|----------------------|------------------------|
| 1 | Take LOINC code from CDISC mapping, fix "Component", "Property", "Time aspect", "System" and "Scale", and look for other entries in the LOINC database that have a different value for "Method" | 616 |
| 2 | "Component", "Property", "System", "Scale" are fixed. Look for other entries in the LOINC database that have a different value for the "Time aspect" | 204 |
| 3 | "Component", "System", "Scale" and "Time aspect" are fixed. Look for other entries in the LOINC database that have a different value for "Property" | 497 |
| 4 | "Component", "Property", "Scale" and "Time aspect" are fixed. Look for entries in the LOINC database that have a different value for "System" | 2444 |
| 5 | "Component", "Property", "Time aspect" and "System" are fixed. Look for entries in the LOINC database that have a different value for "Scale" | 22 |
| 6 | "Property", "Time aspect", "System" and "Scale" are fixed. Look for entries in the LOINC database that have a different value for "Component" | 3056 |

Some of the new mappings obtained starting from the LOINC code 1751-7 (Albumin [Mass/Volume] in Serum or Plasma) in the LOINC-to-SDTM-LB mapping are:

Scenario 1: "method" is different:
  • LOINC 61151-7: Albumin [Mass/volume] in Serum or Plasma by Bromocresol green (BCG) dye binding method
  • LOINC 61152-5: Albumin [Mass/volume] in Serum or Plasma by Bromocresol purple (BCP) dye binding method
Scenario 2: "Time aspect" is different:

No additional mappings

Scenario 3: "Property" is different:
  • LOINC 54347-0: Albumin [Moles/volume] in Serum or Plasma
  • LOINC 62234-0: Albumin [Moles/volume] in Serum or Plasma by Bromocresol purple (BCP) dye binding method
  • LOINC 62235-7: Albumin [Moles/volume] in Serum or Plasma by Bromocresol green (BCG) dye binding method
Scenario 4: "System" is different:
  • LOINC 1745-9: Albumin [Mass/volume] in Amniotic fluid
  • LOINC 1748-3: Albumin [Mass/volume] in Pleural fluid
  • LOINC 1749-1: Albumin [Mass/volume] in Peritoneal fluid
  • LOINC 1750-9: Albumin [Mass/volume] in Semen
  • LOINC 1752-5: Albumin [Mass/volume] in Synovial fluid
  • LOINC 1754-1: Albumin [Mass/volume] in Urine
  • LOINC 32293-3: Albumin [Mass/volume] in Unspecified specimen
  • LOINC 40599-3: Albumin [Mass/volume] in Peritoneal dialysis fluid
  • LOINC 51693-0: Albumin [Mass/volume] in Pericardial fluid
  • LOINC 54346-2: Albumin [Mass/volume] in Stool
  • LOINC 61195-4: Albumin [Mass/volume] in Serum or Plasma from Fetus
  • LOINC 61196-2: Albumin [Mass/volume] in Urine from Fetus
  • LOINC 2861-3: Albumin [Mass/volume] in Cerebral spinal fluid by Electrophoresis
  • LOINC 2863-9: Albumin [Mass/volume] in Synovial fluid by Electrophoresis
  • LOINC 43212-0: Albumin [Mass/volume] in Body fluid by Electrophoresis
Scenario 5: "Scale" is different:

No additional mappings

Scenario 6: "Component" is different:

148 new codes, e.g. LOINC 10338-2, "Barbiturates [Mass/Volume] in Serum or Plasma"

So, based on the mapping for LOINC code 1751-7 "Albumin [Mass/Volume] in Serum or Plasma", we could derive 168 new mappings (2 + 3 + 15 + 148 from scenarios 1, 3, 4 and 6). Remember that none of these mappings was present in the original dataset: the software tests for this during execution and excludes any candidate that already exists.

Additional work 

For scenario 6, we realize that our decision to only include "component" values for which a suitable LBTESTCD/LBTEST pair is already available in the original mapping may lead to some "missed" new mappings. For example, the LOINC code 10332-5 "Cortisol [Mass/volume] in Serum or Plasma --pre 250 ug corticotropin IM" is rejected, as its "component" value "Cortisol^pre 250 ug corticotropin IM" has no equivalent LBTESTCD/LBTEST. However, there is a mapping for the first subpart "Cortisol", so we could essentially add it to the mappings if we find a way to accommodate the second subpart "pre 250 ug corticotropin IM" in an LB variable. Probably, this should go into LBTPT. This is surely something we want to look into in the near future.
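A first step in that direction could be to split the "component" at the "^" delimiter and to retry the lookup with the first subpart only. The sketch below is an idea, not something that has been implemented: the function names and the "LBTPT_candidate" key are made up, and whether the second subpart really belongs in LBTPT is still an open question.

```python
def split_component(component):
    """Split a LOINC component at the first "^" into (base, qualifier).

    The qualifier is None when there is no "^"; if there are several
    "^" delimiters, everything after the first one stays in the qualifier.
    """
    base, sep, qualifier = component.partition("^")
    return base, (qualifier if sep else None)


def propose_mapping(component, known_components):
    """Return a candidate for curation, or None if even the base
    component has no existing LBTESTCD/LBTEST mapping."""
    base, qualifier = split_component(component)
    if base not in known_components:
        return None
    return {"component": base, "LBTPT_candidate": qualifier}


known_components = {"Albumin", "Cortisol"}
candidate = propose_mapping("Cortisol^pre 250 ug corticotropin IM",
                            known_components)
print(candidate)
# {'component': 'Cortisol', 'LBTPT_candidate': 'pre 250 ug corticotropin IM'}
```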

The next step is the curation of the new mappings. This is necessary as not everything can be fully automated. For example, as explained by the CDISC team in the Excel worksheet, the LOINC "Time aspect" either maps to the CDISC variable LBTPT, or to one of the supplemental qualifiers LBPTFL ("Point in Time Flag") or LBPDUR ("Planned Duration"). It is not yet clear whether a general rule for this can be formulated.
Another example is the assignment of LBFAST ("Fasting Flag"). Usually, this can be derived from the LOINC "Component" part having two sub-parts, delimited by the "^" character, such as in:
LOINC 14771-0 (Fasting glucose [Moles/volume] in Serum or Plasma - Component="Glucose^post CFst"), which maps to LBTESTCD=GLUC, LBTEST=Glucose, LBSPEC="SERUM OR PLASMA", LBFAST=Y.
It might be that such mappings can also be generated automatically, but we are not sure about that.
Furthermore, we want to exclude duplicates, except for 1:N mappings, such as for the "Ser/Plas" case, for which the original mapping database contains 3 mappings, one for LBSPEC="PLASMA", one for LBSPEC="SERUM" and one for LBSPEC="SERUM OR PLASMA".
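One way to remove duplicates while keeping legitimate 1:N mappings is to treat a row as a duplicate only when all of its values are identical. The following sketch assumes the mappings are available as dictionaries with the same keys; the key names and the sample rows are illustrative.

```python
def drop_duplicates(rows):
    """Remove rows that are identical in every field, keeping the
    first occurrence and the original order.

    Rows for the same LOINC code that differ in any field (e.g. the
    three LBSPEC values for "Ser/Plas") are all kept.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"loinc": "1751-7", "LBTESTCD": "ALB", "LBSPEC": "SERUM"},
    {"loinc": "1751-7", "LBTESTCD": "ALB", "LBSPEC": "PLASMA"},
    {"loinc": "1751-7", "LBTESTCD": "ALB", "LBSPEC": "SERUM OR PLASMA"},
    {"loinc": "1751-7", "LBTESTCD": "ALB", "LBSPEC": "SERUM"},  # exact duplicate
]
unique = drop_duplicates(rows)
print(len(unique))  # 3
```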

And finally, we will need to repeat everything when CDISC publishes the final LOINC-to-SDTM-LB mapping, store the result in a separate database table, and make these additional mappings available through a RESTful web service, as we already did for the draft LOINC-to-SDTM-LB mapping.
As these extra mappings have not been developed by CDISC, and have not been subject to the same quality assurance, we do not want to mix them up with the ones developed by CDISC, so our RESTful web service will surely have a parameter to state that the "extended" mappings also need to be searched.

Something we also want to work on in the future is a mapping between the LOINC "vital sign" codes and the CDISC VS domain and its variables. The background of this is that electronic health records do not use CDISC coding for vital signs; they use LOINC coding.