Sunday, November 29, 2015

SDTM - Moving away from files

Reviewers at the FDA always complain about file sizes. Until now, they haven't embraced the new CDISC Dataset-XML standard mostly because the file sizes are usually (but not always) higher than for SAS-XPT. On the other hand, they do not allow us to use zipped Dataset-XML files, although these can be read by several tools (like the open-source "Smart Dataset-XML Viewer") without the need of unzipping first. Even worse, each time a new version of the SDTM-IG comes out, a number of unnecessary derived variables (like EPOCH) is added on request of the FDA, further increasing the file sizes. So they are to blame themselves ...

The first time I was re-thinking this "problem" was during the development of the "Smart Dataset-XML Viewer". During testing the tool with large SDTM files (like QS and LB), I was wondering how reviewers could ever work efficiently when they work with (ten)thousands of rows in any kind of viewer. Even though we added a number of smart features (like one-click jumping to the corresponding row in the DM dataset - try that with the SASViewer...), the amount of information is overwhelming. So we added filtering features ...

Essentially files are very inefficient for large amounts of information. If you want to find a particular piece of information, you first need to read the complete file into your tool...
Large amounts of information should reside in databases (relational or XML or mixed). Databases can easily be indexed for query speed, and tools need only to load the minimal amount of information that is required to do the task. However, only a minor part of the FDA reviewers use a database (like the Janus Clinical Trials Registry), all the other use ... files.

So what are files good for? In first instance, you need them to store computer programms. Also you need them for unstructured information. And you usually need them for transport of information between computers (although (RESTful) web services can do the same). As each SDTM submission (but also ADaM and SEND submissions) needs to be submitted to the FDA as a set of files (using the eCTD folder structure), the first thing the FDA should do (and I think they do) is to load the submission in databases, which can be the Janus-CTR.

As of that point, reviewers should be forbidden to use the submission files as files.
They should only be allowed to retrieve the information they need from the databases or CTR. That would also make their work more efficient, so that patients get safer new drugs faster.

This would also once and for all end the discussion about file sizes.

The SDTM is completely based on the concept of tables and files. SAS-XPT is still required for electronic submissions. SDTM contains large amounts of unnecessary and redundant information. An example is the "test name" (--TEST) which has a 1:1 relation with "test code" (--TESTCD). Test names can however be looked up e.g. using RESTful web services, or by a simple lookup in a database (or even in the define.xml). We urgently need to start trimming the SDTM standard, and remove all redundant and unnecessary variables, as these lead to errors in the data. We urgently need to move away from SAS-XPT for the transport. And the FDA should forbid its reviewers to use "files", and only allow them to use submission data that is in databases.

Friday, November 6, 2015

Making CDISC ODM fit for RESTful web services

ODM exists for about 12 years now, the last version (1.3.1) being published in 2010 which was essentially a minor update of the 1.3.0 version that was published almost 10 years ago (2006).
A lot has changed in the world of informatics since then. In 2006, we were still using SOAP web services, and the very-hard-to-learn HL7-CDA (an implementation of HL7-v3) was just published. It seems like ages ...

Although HL7-CDA has an extremely steep learning curve (I do know - I teach it at the university, and some of my students suffer), it has been the way to exchange electronic health records between different systems from different organizations). But the price was high...

A few years ago, some developers were so dissatisfied with HL7-v3 that they started something rather different. Unexpectedly, their effort was blessed by HL7: HL7-FHIR was born.

When I look at CDISC ODM, it see that it has some things in common with FHIR: reuse of building blocks. In FHIR, you define a patient (resource "Patient") or a health care provider (resource "Practitioner") once, and can reference it many times. Just like the "ref-def" mechanism in ODM.
HL7-CDA doesn't have this at all due to it's very tight binding to the RIM.

There is however also a distinct difference: FHIR has been developed for use with RESTful web services: you can reference a resource that is somewhere else out there, maybe on another machine, maybe at the other end of the world. You just use an HTTP request and get the information. To guarantee privacy and security, you can use OAuth.
In ODM, you can import information from other sources, using the "Include" element and mechanism. However, the latter just tells the system which prior study design must be included (by Study-OID and MetaDataVersion-OID), but not where it is and how that should be done.
In ODM, we e.g. define an ItemDef once (giving it an ID by using the OID attribute) and can reference it several times. The corresponding ItemDef must be within the same "MetaDataVersion", or be included through the "Include" mechanism. The match is made over the OID. For example:

<ItemGroupDef OID="IG.DEMOG" Name="Demographics" Repeating="No">
    <ItemRef ItemOID="IT.BIRTHDATE" Mandatory="Yes"/>
<ItemDef OID="IT.BIRTHDATE" Name="Date of birth" DataType="date" ...

Now, wouldn't it be nice if we could just see an "ItemDef" as a building block that "is somewhere out there" and that we can get from a web service (like an FHIR "resource"). Something like:

<ItemRef ItemRefWS="" Mandatory="Yes" />

When the system encounters an "ItemRefWS" it just triggers a RESTful web service, and obtains an ItemDef object back (this can be an XML snippet).

Let's see this in the context of SHARE. Couldn't we just retrieve a codelist using a web service from SHARE? Something like:

<ItemDef OID="IT.SEX" Name="Sex" DataType="text" ...>
    <CodeListRef CodeListRefWS="" />

where "CodeListRefWS" triggers a web service and retrieves the complete codelist (version 2015-06-26) from the SHARE repository.

Does this make sense? Comments are once again very welcome...