Thursday, May 14, 2020

Generating SDTM datasets directly from HL7-FHIR for Covid-19 clinical trials. Part 1: methodology


Introduction

The Covid-19 pandemic made two things clear for clinical research: first, that remote clinical trials must become the standard, and second, that it takes too long to develop a new treatment or vaccine. For the former, a lot is already being done, such as the IMI project "Trials@Home", funded by the European Union and the European pharmaceutical industry (EFPIA), for which I am a technical advisor (I acquired the project for the University of Applied Sciences FH Joanneum in Graz when I was still a professor there).

For the latter, it has also become obvious that we need to use much more data from electronic health records (also called "Real World Data"), not only to be able to have many more subjects in the study (e.g. through eligibility criteria checking), but also to build a control group for comparison with patients treated with the study drug or therapy. Especially in times of Covid-19, it becomes ethically questionable to have a placebo group with subjects who do not get any treatment. Of course, this is not new: in oncology, "standard treatment" is usually used for the control group.

In the last 10 years, a lot of progress was made in using EHRs for retrieval of clinical research data. In most cases however, this meant that the content of the CRFs was mapped to the content of the EHRs, which is very tedious and must be repeated for each individual study. It also does not avoid the difficult step of mapping and converting the CRF data, even when stored in an EDC system, to the CDISC SDTM submission standard, which is required by FDA and PMDA for submissions to obtain a marketing authorization. After all, the FDA and PMDA are not directly interested in the captured source data themselves: they want to obtain the information categorized into so-called CDISC domain datasets, in tabular format, and with lots of derived data points.
Another caveat so far was that EHR systems differ considerably, so an interface had to be written for each EHR system.

HL7-FHIR as a game changer

All that changed when more and more EHR systems started providing an HL7-FHIR interface "out of the box". Some of these systems are even open source, such as the HAPI-FHIR server. One of the advantages of the HL7-FHIR standard is that it offers a highly standardized API for use with RESTful web services. As such, anyone with a minimal knowledge of how to write a RESTful client can develop applications that use data from the EHR system. Therefore, FHIR has become a real "game changer", and thousands of applications already exist that use the FHIR API.
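To give an idea of how little client code is needed, here is a minimal sketch in Python. It only builds a FHIR search URL and reads the resources out of a search-result Bundle; the actual HTTP call is left out. The base URL, the patient ID and the sample Bundle are made up for illustration, and the helper names are my own, not part of any FHIR library.

```python
from urllib.parse import urlencode


def build_search_url(base_url, resource_type, **params):
    """Build a FHIR search URL of the form [base]/[type]?name=value&..."""
    url = f"{base_url.rstrip('/')}/{resource_type}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


def resources_from_bundle(bundle):
    """Return the resources carried in the entries of a FHIR Bundle."""
    return [e["resource"] for e in bundle.get("entry", []) if "resource" in e]


# Hand-written stand-in for the JSON a FHIR server would return (not real data).
SAMPLE_BUNDLE = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {"resource": {"resourceType": "Observation", "id": "obs-1"}},
    ],
}

# Hypothetical server address and patient ID.
url = build_search_url(
    "https://fhir.example.org/baseR4",
    "Observation",
    patient="123",
    category="laboratory",
)
print(url)
for resource in resources_from_bundle(SAMPLE_BUNDLE):
    print(resource["resourceType"], resource["id"])
```

In a real client, the URL would be fetched with an HTTP GET (with an "Accept: application/fhir+json" header), and the JSON response would take the place of the sample Bundle.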

Use of FHIR in clinical research

Some have seen FHIR as a serious competitor to CDISC standards such as ODM and SDTM, but this is not correct. Information captured in FHIR resources is usually "event driven", i.e. the patient comes to the doctor or hospital with a problem, and the "care plan" is regularly adapted, depending on findings from examinations and e.g. laboratory information. In clinical research, the "plan" is governed by the study protocol and translated into "visits", with "forms" and "questions" or "data points", mostly baked into an EDC system. Exchange of information is then usually done in CDISC ODM format. FDA and PMDA are however not interested in ODM exports: they want to have all data categorized and submitted as SDTM datasets and, after analysis, also as ADaM datasets, all in tabular format.
FHIR however is not tabular at all. FHIR is about "linked data", in this case about "linked resources". So, with the current preference of the regulatory authorities for categorized data in tabular format, it is pretty unlikely that FDA and PMDA will one day require FHIR as a format for submissions.

What we can however already do right now is to deliver the FHIR source data point together with the SDTM record, as has been demonstrated several times before. This would however require FDA and PMDA to finally move from the outdated SAS-XPT transport format to a modern transport format such as CDISC Dataset-XML or a JSON-based format. For SDTM, SEND and ADaM submissions in Dataset-XML format, there is even an open source viewer that can display embedded source records in FHIR format.

But what if we could generate the SDTM datasets required by the regulatory authorities directly from a FHIR-EHR repository? This would already save considerable time for "control group" datasets. But even for patients in a controlled clinical study, HL7 has developed a few special resources, such as the "ResearchStudy" and "ResearchSubject" resources. The former defines all the major characteristics of the study and also links to a "PlanDefinition", which can be seen as the electronic version of the study protocol. The "ResearchSubject" resource contains information about the subject, such as the start and end of participation, a reference to the electronically confirmed consent document, the trial arm the subject is assigned to, and a patient identifier that allows linking to all other medical information resources, such as medications, findings and medical history.
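As an illustration of what such a "ResearchSubject" instance offers, the following sketch pulls the items mentioned above out of a resource in JSON form. It assumes the FHIR R4 element names (period, study, individual, assignedArm, actualArm, consent); other FHIR versions name some of these differently. The sample resource and the function name are invented for this example.

```python
def summarize_research_subject(research_subject):
    """Collect the study-related items from a FHIR R4 ResearchSubject (as a dict)."""
    period = research_subject.get("period", {})
    return {
        "study": research_subject.get("study", {}).get("reference"),
        "patient": research_subject.get("individual", {}).get("reference"),
        "start": period.get("start"),
        "end": period.get("end"),
        "assigned_arm": research_subject.get("assignedArm"),
        "actual_arm": research_subject.get("actualArm"),
        "consent": research_subject.get("consent", {}).get("reference"),
    }


# Hand-written example resource (R4 element names assumed, values made up).
SAMPLE_RESEARCH_SUBJECT = {
    "resourceType": "ResearchSubject",
    "id": "rs-1",
    "status": "on-study",
    "period": {"start": "2020-04-01"},
    "study": {"reference": "ResearchStudy/covid-demo"},
    "individual": {"reference": "Patient/123"},
    "assignedArm": "Standard of care",
    "consent": {"reference": "Consent/c-1"},
}

summary = summarize_research_subject(SAMPLE_RESEARCH_SUBJECT)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The "patient" reference is the link that allows retrieving all other resources (medications, observations, conditions) for that subject.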

So, essentially, if all the medical information of a clinical research subject is in the EHR system and can be accessed using the FHIR API, it should in principle be possible to generate almost all SDTM datasets fully automatically and "on the fly".

Connecting FHIR to SDTM

FHIR essentially consists of "resources", which can be compared to CDISC SDTM "domains". Well, sort of. Whereas some resources map pretty well 1:1 to SDTM domains, such as the Patient resource (to the SDTM DM "demographics" domain), the MedicationAdministration resource (to the CM "concomitant medications" domain) and the AdverseEvent resource (to the AE "adverse events" domain), there are also a lot of differences.

The major problem lies in mapping the FHIR Observation resource. It corresponds to the full set of SDTM "Findings" domains. In SDTMIG v.3.3, there are 28 "Findings" domains, for which there is only one FHIR resource, "Observation". As "Findings" datasets make up the bulk of an electronic regulatory submission, how can we know which "Observation" data point must go into which SDTM domain dataset? This categorization step is already now the most difficult step when generating SDTM datasets: the most asked question in several of the public forums on SDTM is: "where, in which domain (and how) do I put data point XYZ?". FHIR Observation instances do not include that information. Sometimes one can use the "category" information, but it is not always populated, and even when it is, it does not always settle the question. Also, we see that SDTM domains are often split between versions of the Implementation Guide, and the instructions about what to put where keep changing. For example, COVID-19 lab test results that measure the presence or amount of viral RNA are now to be put into MB (microbiology) datasets, whereas in the past, they were envisaged to go into LB (laboratory).
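The limits of the "category" approach can be sketched as follows. The code reads the category codes from an Observation (using the standard HL7 observation-category code system) and looks up candidate SDTM domains. The hint table is my own assumption and deliberately coarse: it is not a CDISC or HL7 mapping.

```python
OBSERVATION_CATEGORY_SYSTEM = (
    "http://terminology.hl7.org/CodeSystem/observation-category"
)

# Assumed, illustrative hints only: a category narrows the choice, nothing more.
CATEGORY_HINTS = {
    "vital-signs": ["VS"],
    "laboratory": ["LB", "MB"],  # category alone cannot separate LB from MB
}


def observation_categories(observation):
    """Return the HL7 observation-category codes present on an Observation."""
    codes = []
    for concept in observation.get("category", []):
        for coding in concept.get("coding", []):
            if coding.get("system") == OBSERVATION_CATEGORY_SYSTEM:
                codes.append(coding.get("code"))
    return codes


def candidate_domains(observation):
    """Return candidate SDTM domains; an empty list means 'no hint available'."""
    domains = []
    for code in observation_categories(observation):
        for domain in CATEGORY_HINTS.get(code, []):
            if domain not in domains:
                domains.append(domain)
    return domains


# Made-up example instances.
lab_observation = {
    "resourceType": "Observation",
    "category": [
        {"coding": [{"system": OBSERVATION_CATEGORY_SYSTEM, "code": "laboratory"}]}
    ],
}
uncategorized_observation = {"resourceType": "Observation"}

print(candidate_domains(lab_observation))
print(candidate_domains(uncategorized_observation))
```

For the "laboratory" example, more than one domain remains, and for the instance without a category there is no hint at all, which is exactly the problem described above.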

FHIR observations and LOINC codes

How does FHIR then distinguish between the type and/or specifics of individual observations? Although FHIR does not oblige the use of LOINC codes, almost all EHR systems identify the type of an observation by the LOINC code of the test that leads to the observation.
LOINC is a worldwide standard for defining tests in healthcare in a very exact way, and assigns a unique code to each of them. The latest version (2.67) has over 92,000 tests defined. Many of them are lab tests (including virology), but there are also many test codes for vital signs, for ECG, for standardized questionnaires, etc. LOINC is much more specific than CDISC TESTCD codes, the major difference being that LOINC is pre-coordinated whereas CDISC-CT is post-coordinated. So, in order to find out to which SDTM domain a FHIR "Observation" data point belongs, one would need to map all LOINC codes to CDISC controlled terminology, i.e. decide for each of them to which SDTM domain it maps, and then assign CDISC-CT values for --TESTCD, --TEST, --SPEC, --LOC, --LAT, --EVINT, etc.
As there are over 92,000 LOINC codes, one can already see that this would be a gigantic task, to be repeated each time a new SDTM-IG version (or even CDISC-CT version) is published, due to the further "specialization" of SDTM domains with each new SDTM-IG version.
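In code, such a mapping boils down to a lookup keyed on the LOINC code. The sketch below shows the idea with a handful of entries that I filled in by hand for illustration; it is not the CDISC mapping, and the test code for the SARS-CoV-2 entry is intentionally left empty rather than guessed.

```python
LOINC_SYSTEM = "http://loinc.org"

# Hand-made illustration, not an official mapping. A full table would also
# carry --TEST, --SPEC, --LOC, --LAT etc. for each LOINC code.
LOINC_TO_SDTM = {
    "2345-7": {"domain": "LB", "testcd": "GLUC"},  # glucose, serum/plasma
    "718-7": {"domain": "LB", "testcd": "HGB"},    # hemoglobin, blood
    "8867-4": {"domain": "VS", "testcd": "HR"},    # heart rate
    "94500-6": {"domain": "MB", "testcd": None},   # SARS-CoV-2 RNA; code not filled in
}


def loinc_code(observation):
    """Return the first LOINC code found in Observation.code, or None."""
    for coding in observation.get("code", {}).get("coding", []):
        if coding.get("system") == LOINC_SYSTEM:
            return coding.get("code")
    return None


def sdtm_target(observation):
    """Look up domain and test code; None means the LOINC code is unmapped."""
    return LOINC_TO_SDTM.get(loinc_code(observation))


# Made-up example instance.
glucose_observation = {
    "resourceType": "Observation",
    "code": {"coding": [{"system": LOINC_SYSTEM, "code": "2345-7"}]},
}

print(sdtm_target(glucose_observation))
```

Every Observation for which the lookup returns nothing still needs a human decision, which is why the size and maintenance of the table is the real issue, not the code.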

Now, and very unfortunately, most people and organizations in the CDISC community still see LOINC codes as a burden rather than an opportunity, only to be looked at when the FDA requires the LOINC code in a submission, as has recently become the case for the LB domain.
In reaction to this FDA requirement, CDISC developed a mapping between the 1,400 most popular LOINC codes and the LB domain variables and controlled terminology.
This mapping already helps (limited to these 1,400 codes) to automatically generate SDTM-LB datasets from a FHIR repository for all patients in a clinical trial. This was already demonstrated during the last CDISC European Interchange.

But a clinical trial submission is of course much more than just lab data. Our investigations show that it is already easy to generate an almost complete DM (Demographics) dataset. The main exceptions are race and ethnicity data, which are not always present in the FHIR records, and information about the planned and actual arm. The arm information can however in future be retrieved from the FHIR ResearchSubject resource instances.
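A minimal sketch of such a DM record, built from a FHIR Patient resource, could look as follows. The way USUBJID is composed, the study identifier and the sample patient are assumptions for illustration; race, ethnicity and arm are left empty, as discussed above.

```python
# FHIR administrative gender -> CDISC SEX; other values are left blank here.
SEX_MAP = {"male": "M", "female": "F", "unknown": "U"}


def patient_to_dm(patient, studyid):
    """Build a partial SDTM DM record (as a dict) from a FHIR Patient resource."""
    subjid = patient.get("id", "")
    return {
        "STUDYID": studyid,
        "DOMAIN": "DM",
        "USUBJID": f"{studyid}-{subjid}",  # assumed convention, study-specific
        "SUBJID": subjid,
        "BRTHDTC": patient.get("birthDate", ""),
        "SEX": SEX_MAP.get(patient.get("gender"), ""),
        "RACE": "",     # not always present in the FHIR records
        "ETHNIC": "",   # not always present in the FHIR records
        "ARM": "",      # could come from ResearchSubject
        "ACTARM": "",   # could come from ResearchSubject
    }


# Made-up example patient and hypothetical study identifier.
SAMPLE_PATIENT = {
    "resourceType": "Patient",
    "id": "123",
    "gender": "female",
    "birthDate": "1975-08-21",
}

dm_record = patient_to_dm(SAMPLE_PATIENT, "COVID-DEMO")
print(dm_record)
```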
So, how can we extend this (limited) mapping between LOINC codes and CDISC domains and controlled terminology? I will say more about our current efforts in that field in the next blog post.

Extending the mapping for new SARS-CoV-2 LOINC tests and codes

In view of the Covid-19 pandemic and crisis, LOINC recently started developing and publishing special "prerelease" codes for tests related to SARS-CoV-2. When discussing these with the team that created the CDISC "Interim User Guide for COVID-19", we became aware that these newly developed codes need to be mapped to the MB domain, and not to the LB domain. We got a lot of help from the team, for which we are very grateful!
For these codes too, a RESTful web service was developed and made available. The mapping is continuously being extended, as LOINC regularly adds new codes for SARS-CoV-2 tests.



In the next blog post, I will show some of our results, such as automatically generated SDTM DM, MB and LB datasets for 131 Covid-19 patients from the SmileCDR synthetic Covid-19 FHIR repository. The goal of this effort is to create code for generating as many SDTM datasets as possible completely automatically from the FHIR repository. If we make good progress, we will apply to present these results at the International CDISC Interchange in October.

 




