Thursday, December 31, 2015

SDTM labels: freedom or slavery?

Everything below applies as well to CDISC SEND and ADaM.

On the last day of this year, I was working on a new software offering for the generation and validation of define.xml files. It is meant for those people and organizations that do not use SDTM-ETL or the SAS Clinical Standards Toolkit or similar, i.e. for people that need to generate their define.xml "post-SDTM-generation". The latter is far from ideal, but is still common in (too) many organizations. The tool is also meant for people who are sick and tired of generating define.xml from an Excel worksheet through a "black box" tool for which there is no documentation, no user-friendly graphical user interface, and produces "garbage define.xml" in many cases. Those who use this tool know what I mean ...

So, when working on the tool, I was confronted with an issue that bothers me already for a number of years: SDTM labels.

In the SDTM-IG, one finds a label for each variable that is at least recommended for the specific domain. Additional standard variables can be added, and for these, one needs to do a lookup in the "Study Data Tabulation Model" document, look up the (prototype) variable there, and take the label from there.
So far so good. Then run the usual validation tool (also used by the FDA) on the define.xml , and you might get a good number of errors "Error: Variable Label in standard domain must match the variable label described in corresponding CDISC implementation guide" (remark that the error message is itself in principle wrong because it is also applied to standard variables that are not in the SDTM-IG but are in the SDTM document). Let the puzzling start!!!

Some of my customers spend hours on such error messages, finally finding out that there was an "uppercase-lowercase" mismatch in a single character in the label. For example "Dosing Frequency Per Interval" versus "Dosing Frequency per Interval" (this is an easy one). Or they found out that the difference was a dot, like in "Character Result/Finding in Std. Format" versus "Character Result/Finding in Std Format".

Let us do the following imaginary experiment: add a dot at the end of each of your labels in your define.xml and then submit it to the FDA. You will probably get your submission back with the message that it did not pass the "data fitness" programm. Is your submission really of bad quality because of the dot at the end of each variable? Did that dot change the meaning of your variables?
I also had customers complaining about that they got errors even when making very minor changes to the SDTM labels because the minor change much better explained the reviewers what the variable was about. Although the minor change considerably improved the quality of their variable description and so of their submission, the validation tool that is used by the FDA said it the other way around.
They (in my opinion correctly) argued that the validation software should not give an error, as the SDTM-IG nowhere states that labels must be provided exactly as given in the IG (I haven't been able to find such a statement either...).

In my opinion, validation tools should be based on "risk evaluation" or "risk assessment". I do not know yet how that would need to work in the case of SDTM and define.xml. I do already see however how we can take a few steps in that direction. One of the things I would like to see for SDTM variable labels is that there is an "label equality assessment" done between the "expected" label (from the SDTM-IG or SDTM standard), and the "actual" label, quantified by a "label equality percentage". So in my new software, when validating the SDTM labels in the define.xml for compliance, I implemented the following stages:
  • the "expected" and "actual" label are (case sensitive) identical. No problem => full compliance
  • the "expected" and "actual" label are (case insensitive) identical, meaning that there is only an "uppercase-lowercase" problem (so semantically, they are equal) => an appropriate warning is generated
  • the "expected" and "actual" label are different => calculate the "likeness" or "equality" number
In the latter case, there might have been a very good reason why the sponsor has changed the label.
So, how do I quantify "how much" the label was changed? I did a bit of research and found the "Levenshtein distance between two Strings": it counts the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one sentence into the other. This (positive integer) number can then easily be transformed into an "equality number" or "likeness" using the following formula:

equality number = 1.0 - (LD / max(string-length(s1),string-length(s2)))

where LD = Levenshtein distance
s1 = first string (i.e. "actual" label)
s2 = second string (i.e. "expected" label)

and is a number between 0.0 (completely different labels) and 1.0 (completely identical labels).

A few examples are given below:

SDTM variable Expected Label Actual Label Equality Number
LBTESTCD Lab Test or Examination Name Laboratory Test or Examination Name 0.80
MBSTRESC Character Result/Finding in Std Format Character Result/Finding in Std. Format 0.97
TADTC Date/Time of Accountability Assessment Date/Time of Drug Accountability Assessment 0.88
MBSTRESC Character Result/Finding in Std Format The quick brown fox jumps over the lazy dog 0.12

As one can see, this "equality number" or "likeness number" much better quantifies how "much alike" the provided label is with respect to the (from the SDTM-IG or SDTM) expected label. Much better than the one implemented in the by the FDA used validation software where the outcome can only be "0" (not identical) or "1" (identical, case sensitive).

And this is the way I implemented it in my software.

Another small step in coming to "smart" validation software tools...

Sunday, November 29, 2015

SDTM - Moving away from files

Reviewers at the FDA always complain about file sizes. Until now, they haven't embraced the new CDISC Dataset-XML standard mostly because the file sizes are usually (but not always) higher than for SAS-XPT. On the other hand, they do not allow us to use zipped Dataset-XML files, although these can be read by several tools (like the open-source "Smart Dataset-XML Viewer") without the need of unzipping first. Even worse, each time a new version of the SDTM-IG comes out, a number of unnecessary derived variables (like EPOCH) is added on request of the FDA, further increasing the file sizes. So they are to blame themselves ...

The first time I was re-thinking this "problem" was during the development of the "Smart Dataset-XML Viewer". During testing the tool with large SDTM files (like QS and LB), I was wondering how reviewers could ever work efficiently when they work with (ten)thousands of rows in any kind of viewer. Even though we added a number of smart features (like one-click jumping to the corresponding row in the DM dataset - try that with the SASViewer...), the amount of information is overwhelming. So we added filtering features ...

Essentially files are very inefficient for large amounts of information. If you want to find a particular piece of information, you first need to read the complete file into your tool...
Large amounts of information should reside in databases (relational or XML or mixed). Databases can easily be indexed for query speed, and tools need only to load the minimal amount of information that is required to do the task. However, only a minor part of the FDA reviewers use a database (like the Janus Clinical Trials Registry), all the other use ... files.

So what are files good for? In first instance, you need them to store computer programms. Also you need them for unstructured information. And you usually need them for transport of information between computers (although (RESTful) web services can do the same). As each SDTM submission (but also ADaM and SEND submissions) needs to be submitted to the FDA as a set of files (using the eCTD folder structure), the first thing the FDA should do (and I think they do) is to load the submission in databases, which can be the Janus-CTR.

As of that point, reviewers should be forbidden to use the submission files as files.
They should only be allowed to retrieve the information they need from the databases or CTR. That would also make their work more efficient, so that patients get safer new drugs faster.

This would also once and for all end the discussion about file sizes.

The SDTM is completely based on the concept of tables and files. SAS-XPT is still required for electronic submissions. SDTM contains large amounts of unnecessary and redundant information. An example is the "test name" (--TEST) which has a 1:1 relation with "test code" (--TESTCD). Test names can however be looked up e.g. using RESTful web services, or by a simple lookup in a database (or even in the define.xml). We urgently need to start trimming the SDTM standard, and remove all redundant and unnecessary variables, as these lead to errors in the data. We urgently need to move away from SAS-XPT for the transport. And the FDA should forbid its reviewers to use "files", and only allow them to use submission data that is in databases.

Friday, November 6, 2015

Making CDISC ODM fit for RESTful web services

ODM exists for about 12 years now, the last version (1.3.1) being published in 2010 which was essentially a minor update of the 1.3.0 version that was published almost 10 years ago (2006).
A lot has changed in the world of informatics since then. In 2006, we were still using SOAP web services, and the very-hard-to-learn HL7-CDA (an implementation of HL7-v3) was just published. It seems like ages ...

Although HL7-CDA has an extremely steep learning curve (I do know - I teach it at the university, and some of my students suffer), it has been the way to exchange electronic health records between different systems from different organizations). But the price was high...

A few years ago, some developers were so dissatisfied with HL7-v3 that they started something rather different. Unexpectedly, their effort was blessed by HL7: HL7-FHIR was born.

When I look at CDISC ODM, it see that it has some things in common with FHIR: reuse of building blocks. In FHIR, you define a patient (resource "Patient") or a health care provider (resource "Practitioner") once, and can reference it many times. Just like the "ref-def" mechanism in ODM.
HL7-CDA doesn't have this at all due to it's very tight binding to the RIM.

There is however also a distinct difference: FHIR has been developed for use with RESTful web services: you can reference a resource that is somewhere else out there, maybe on another machine, maybe at the other end of the world. You just use an HTTP request and get the information. To guarantee privacy and security, you can use OAuth.
In ODM, you can import information from other sources, using the "Include" element and mechanism. However, the latter just tells the system which prior study design must be included (by Study-OID and MetaDataVersion-OID), but not where it is and how that should be done.
In ODM, we e.g. define an ItemDef once (giving it an ID by using the OID attribute) and can reference it several times. The corresponding ItemDef must be within the same "MetaDataVersion", or be included through the "Include" mechanism. The match is made over the OID. For example:

<ItemGroupDef OID="IG.DEMOG" Name="Demographics" Repeating="No">
    <ItemRef ItemOID="IT.BIRTHDATE" Mandatory="Yes"/>
<ItemDef OID="IT.BIRTHDATE" Name="Date of birth" DataType="date" ...

Now, wouldn't it be nice if we could just see an "ItemDef" as a building block that "is somewhere out there" and that we can get from a web service (like an FHIR "resource"). Something like:

<ItemRef ItemRefWS="" Mandatory="Yes" />

When the system encounters an "ItemRefWS" it just triggers a RESTful web service, and obtains an ItemDef object back (this can be an XML snippet).

Let's see this in the context of SHARE. Couldn't we just retrieve a codelist using a web service from SHARE? Something like:

<ItemDef OID="IT.SEX" Name="Sex" DataType="text" ...>
    <CodeListRef CodeListRefWS="" />

where "CodeListRefWS" triggers a web service and retrieves the complete codelist (version 2015-06-26) from the SHARE repository.

Does this make sense? Comments are once again very welcome...

Thursday, August 20, 2015

EPOCH expected in most SDTM domains - why that is nonsense

The FDA is over and over again complaining about the file sizes of SDTM submissions, and forcing sponsors to squeeze these files as much as possible. However, the FDA does not allow us to submit zipped files, but they cannot give us a reasonable explanation why not either.
On the other hand, we see that each time a new SDTM-IG is published, additional derived variables have been added - all on request of the FDA. So no wonder files become so large ...
Some time ago, a lot of new --DY (study day) variables were added and made "expected" and it is now also expected that the "the EPOCH variable is included for every clinical subject-level observation".

Now both --DY as EPOCH are derived. --DY (study day) can easily be calculated from the --DTC (date/time of collection or observation) and the RFSTDTC (reference start date/time) in the DM (demographics) domain. Similarly, EPOCH can easily be calculated from the --DTC and the records for that subject in the SE (Subject Elements) domain.

So why does the FDA then still insist that as well --DY as EPOCH is delivered for each observation although it blows up the size of the datasets? Can't the tools of the FDA reviewers calculate the --DY values and the EPOCH "on the fly"?

Some time ago, we developed the "Smart Dataset-XML Viewer", a free and open-source tool for inspecting SDTM, SEND and ADaM files in the new CDISC Dataset-XML format. The Viewer has a good number of great features for exactly doing what the FDA's tools cannot accomplish. It then was a big surprise to us when we were told that most reviewers choose not to use the viewer during the "FDA Dataset-XML pilot". Most of them preferred the (statistical analysis) tools that they always have been working with.

I recently added the newest feature to the "Smart Dataset-XML viewer" which is the display of the (on-the-fly) lookup of the EPOCH and the ELEMENT on each --DTC value. It took me just two evenings to implement that. What the tool is doing is for each --DTC value, it picks up the USUBJID and then compares the date/time with the SESTDTC (Start Date/Time of Element) and SEENDTC (End Date/Time of Element). When the --DTC value falls between the start and end date, the element code (ETCD) is retrieved as well as the value for EPOCH.

Here is a screenshot of a result (here for an observation in the LB domain):

with a detail snapshot:

showing that for the date/time of specimen collection "2014-01-16T13:17" (January 16th 2014 at 1:17 pm) the study day (LBDY) is 15 (15th day of the study for this subject) and the corresponding Element is "PBO" and the corresponding EPOCH is the "blinded treatment" (other epochs are "SCREENING" and "FOLLOW-UP). These values were NOT taken from the dataset, they were calculated "on-the-fly" from the RFSTDTC and the records for this subject in the SE dataset:

This shows that the obligation of the FDA to add --DY and EPOCH to each record is nonsense.
It can easily be done by viewing or analysis tools. The "on-the-fly" calculation even considerably improves data quality.

This nice little feature (again, programmed in two short evenings time) has some further implications. For that, let us have a look at the AE (adverse events). The originally captured dates are AESTDTC (start date/time of adverse event) and AEENDTC (end date/time of adverse event). That's it.
The FDA also requires to submit AESTDY (study day of start of adverse event) and AEENDY (study day of end of adverse event), unnecessarily blowing up the file size again, and even worse, introducing data redundancy. But what about the EPOCH?
Which EPOCH? That for the start of the adverse event? Of the end of it? Or maybe of the date/time of the collection of the AE? It is not possible to add 3 extra columns with EPOCH, or should we add new variables in SUPPAE, for example STAEPOCH, ENDEPOCH? The FDA doesn't tell us.
But of course, for each of them, the "Smart Dataset-XML Viewer" can calculate them "on-the-fly". Here are some screenshots:

First for AESTDTC:


and for AEDTC:

stating that the AE started on study day 3 (element PBO, epoch BLINDED TREATMENT) and ended on study day 199 (element FOLO, epoch FOLLOW-UP) and was captured on day 23 (element PBO, Epoch BLINDED TREATMENT).

So, with this feature, reviewers can easily find out in what study day and in which element and epoch the adverse event started, ended, and was captured in the EDC system, WITHOUT having the --DY and EPOCH variables in the dataset itself.

I will soon upload the updated "Smart Dataset-XML Viewer" programm and source code to the Sourceforge website from which anyone can download them.

Tuesday, August 4, 2015

Define.xml and stylesheets

I have doubted a long time whether I should write this blog entry or not. The trigger then to do it came from an entry in the OpenCDISC forum stating "Can the indentation of items in the sidebar be controlled for all font sizes? Indentations appear normally for small fonts, but become irregular for larger fonts. Technical note: I am viewing define.xml files in Internet Explorer."

First of all, this has nothing to do with OpenCDISC. Probably however, the writer used an OpenCDISC tool to generate the define.xml after having generated the SDTM files (which I consider bad practice) and then viewed the result using Internet Explorer.
The writer of that entry doesn't even seem to realize that  a stylesheet is used for representing the define as HTML in the browser. For him/her, define.xml is what is seen in the browser.

Define.xml is however much more, it contains the metadata of your submission in XML, and not in HTML. So it can and should be used to validate the submission data themselves. Unfortunately, most validators even don't do that. The argument (sic): "Unfortunately the industry compliance with define.XML standard is not high enough to rely on user-provided metadata".

But today I want to discuss a somewhat different topic: stylesheets.

The define.xml specification comes with a sample XSLT stylesheet that was developed by Lex Jansen (SAS), member of the CDISC define.xml development team. It is one of the best stylesheets I have ever seen. Even though, we regularly read complaints from people that they want it ... different. They do not seem to realize (or don't want to) that this is just a sample stylesheet, and that providing a stylesheet (not necessarily the one developed by Lex) is their own responsibility when submitting data to the FDA. So if they want to have changes to the stylesheet, they should make them themselves.

Now, what is an XSLT stylesheet?
A stylesheet transforms the XML into something else. The "T" in XSLT stands for "Transformation" isn't it? In many cases, the transformation is done to HTML (as in the define-stylesheet), but stylesheets can also transform XML into PDF, CSV, text files, SQL, or other XML...
So what the user sees in the browser when (thinking) he/she is opening the define.xml, is principally not the define.xml, but it is the visualization of the HTML that is generated by the stylesheet starting from the information in the define.xml.
So, essentially (and don't misunderstand me), what-you-see-is-not-what-you-have.

Now, Lex's stylesheet is an extremely good one, and it makes most information that is in the define.xml XML file display in a very-user friendly way - you can trust Lex and the define.xml team.

Transformation means manipulation (in the good sense of the word). One can however also use stylesheets to manipulate data in the bad sense of the word. Now, we are all honest people, and we would never never think about changing the define-stylesheet so that the information seen in the browser does not correctly represent what is in the define.xml XML file itself.

That is where the devil in me starts to speak ...

Let us look at a simple example: the "Key" column that is seen in the table where the variables for each dataset are defined. It looks like (here for the DS domain):

The XSLT for it in the define-stylesheet is:

        <xsl:for-each select="./odm:ItemRef">
        <td class="number"><xsl:value-of select="@KeySequence"/></td>

Let us now make a small change to the stylesheet:

        <!-- added J.Aerts -->
        <xsl:variable name="MAXKEYSEQUENCE" select="max(./odm:ItemRef/@KeySequence)"/>
        <xsl:variable name="MAXKEYSEQUENCEPLUSONE" select="$MAXKEYSEQUENCE+1"/>
        <!-- end of addition J.Aerts -->
        <xsl:for-each select="./odm:ItemRef">
<!-- <td class="number"><xsl:value-of select="@KeySequence"/></td> -->
            <xsl:when test="@KeySequence != ''">
                <td class="number"><xsl:value-of select="$MAXKEYSEQUENCEPLUSONE - @KeySequence"/></td>

And what you then see in the browser is:

Do you see the difference? The values for the "Key" have been reversed! I.e. the lowest key number has become the highest and the highest has become the lowest!
But we did not change anything in the define.xml file itself isn't it? We only made a minor change to the stylesheet. Although this is a pretty harmlous example, it demonstrates that the result of a stylesheet does not necessarily represent the source XML data.
Again, we are honest people, and we would never never do something like this, and especially not when submitting data to the FDA.

So what do we learn from this?

- stylesheets should be validated. Does a stylesheet really truly visualize the data from the define.xml?
- it is the sponsor's responsibility (and not the one of Lex or of CDISC) to provide a stylesheet that truly visualizes what is in the define.xml
- the FDA should use its own stylesheets
- what you see in the browser (when a stylesheet is used), is not the define.xml
- the define.xml is a machine readable XML file defining the metadata for a submission and should be used as such
- what you see in the browser is just a human-friendly representation of what is in the define.xml - decisions should not be based on this "view"
- people should stop thinking about define.xml being a replacement for define.pdf
- in submission teams at sponsor companies, there should be at least 1-2 persons with good XML knowledge (it's easy, my students learn it in just 2 x 1.5 hours)

Comments are as always extremely welcome!

Thursday, February 12, 2015

Rule FDAC084 is just damned wrong

The FDA has recently published a set of "SDTM rules", unfortunately in Excel format, which is not machine-executable. So I started working on an XQuery representation, which will soon be available through a set of web services. You can already find some examples in my previous blog entries.

When working on these rules, I found that:
  • about 10% of them is just damned wrong
  • another about 10% is ununderstandable, even for people with a lot of experience in SDTM and define.xml
 The most notorious of the "just damned wrong" rules is surely rule FDAC0154: "Missing value for --ORRESU, when --ORRES is provided".
As we all know, there are so many tests for which there are no units for the results, just to name a few:
  • pH is dimensionless
  • all qualitative tests have no units. For example: presence of Ketones in Urine by Test strip (LOINC 24356-8)
 Today, I want to discuss another rule however, which is a consequence of the SDTM myth that the combination of LBTESTCD, LBCAT, LBSPEC, and LBMETHOD uniquely describes a lab test.

The rule FDAC084 sounds: "Standard Units (--STRESU) must be consistent for all records with same Short Name of Measurement, Test or Examination (--TESTCD), Category (--CAT), Specimen Type (--SPEC) and Method of Test or Examination (--METHOD)"

A quick search trough the LOINC database shows that this rule is just damned wrong.

Just take the following combination:

One quickly finds following tests for this combination:
  • LOINC=25428-4  Glucose [Presence] in Urine by test strip
    and the designation "ordinal".
    So this test has no units
  • LOINC=50555-2 Glucose [Presence] in Urine by automated test strip
    with typical values being 1+ to 4+ and "negative".
    So again: no units
  • LOINC=5792-7  Glucose [Mass/​volume] in Urine by test strip
    with as typical unit: mg/dL
  • LOINC=22705-8 Glucose [Moles/volume] in Urine by test strip
    with as typical unit: mmol/L
 So essentially 4 different tests, all with the same combination of LBTESTCD, LBCAT, LBSPEC and  LBMETHOD.
2 of these tests have no units at all (25284-4 and 50555-2), with the two others having different units.

So what can we learn from these examples?
  • The combination of CDISC LBTESTCD, LBCAT, LBSPEC and  LBMETHOD does not uniquely describe a lab test
  • Even worse, when looking at other LB variables that are at least expected, there is no combination that can possibly ever uniquely describe lab tests
  • The rule FDAC084 is nonsense
  • The only method to uniquely identify lab tests is the LOINC code. Unfortunately LBLOINC is "permissible", with the consequence that you almost never find it in real submissions, and putting the LOINC code itself in LBTESTCD is not allowed.