Tuesday, August 23, 2016

SDTM derails: new derived variables

The "Study Data Tabulation Model v.1.5" has recently been published as part of the new SEND standard v.3.1. The SDTM Implementation Guide (SDTM-IG), describing how the SDTM model v.1.5 should be implemented in the case of human studies, will probably be released for public review in the coming weeks.

A quick look at the "Changes from v.1.4 to v.1.5" reveals that some new variables have been added to the model, including some "derived" ones, and some that essentially contain metadata.
However, according to its own principles, SDTM should not contain derived data, and metadata should go into the define.xml, not into the datasets themselves.

The most obvious new variable is --LOBXFL (Last Observation Before Exposure Flag), which can only have the value Y or null. Its definition is: "Operationally-derived indicator used to identify the last non-missing value prior to RFXSTDTC" (the latter being the date/time of first study drug/treatment exposure).
This variable is clearly "derived" and should not be in SDTM. So why is it there?
The answer is found in the latest version of the FDA "Study Data Technical Conformance Guide v.3.1" (July 2016), which states: "Baseline flags (e.g., last non-missing value prior to first dose) for Laboratory results, Vital Signs, ECG, Pharmacokinetic Concentrations, and Microbiology results. Currently, for SDTM, baseline flags should be submitted if the data were collected or can be derived".
The SDTM development team seems to have taken the occasion to make this a new variable, opening the possibility to phase out the --BLFL variable, which was not well defined.

In my opinion, derived variables (such as EPOCH, --DY, etc.) should be calculated by the review tools at the FDA, and not be submitted by sponsors. The reason is that such variables jeopardize the model (data redundancy) and lead to errors. For example, I have seen submissions where up to 40% of the --DY values were incorrect! I expect that the same will happen for --LOBXFL in future submissions. This may be highly problematic, as reviewers will rely on data that is possibly erroneous due to derivation problems, instead of on their own "on-the-fly" derivation (trust is good, control is better).

For example, suppose I am testing a new blood-pressure-lowering agent, and have the following values: 140/95, 120/80 and 122/82, and the second one is erroneously assigned by the sponsor as the "last non-missing value prior to dose" (VSLOBXFL=Y) instead of the first one. Can you imagine what can happen?

I haven't tried yet, but I guess that I can add a feature to the "Smart Dataset-XML Viewer" that highlights the records containing the last value before exposure by finding it "on the fly". As on other occasions, I think I can program that in maybe 1-2 evenings (see here). Now, I am not a super-programmer, so I wonder why the FDA (with many more resources than I have) has not been able to realize such simple features in its tools in the last 20 years.
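Such an "on the fly" derivation is indeed simple. A minimal sketch in Python (the viewer itself is written in Java, so this is only an illustration; the record structure and function name are my own, not from the viewer's code):

```python
from datetime import datetime

def flag_last_obs_before_exposure(records, rfxstdtc):
    """Return the index of the record that should get --LOBXFL=Y, i.e. the
    last non-missing observation before first exposure (RFXSTDTC), or None
    if there is no such observation. 'records' is a list of
    (ISO 8601 --DTC, result) tuples, not necessarily in order."""
    first_exposure = datetime.fromisoformat(rfxstdtc)
    candidates = [i for i, (dtc, result) in enumerate(records)
                  if result is not None
                  and datetime.fromisoformat(dtc) < first_exposure]
    if not candidates:
        return None
    # "last" = the candidate with the latest collection date/time
    return max(candidates, key=lambda i: datetime.fromisoformat(records[i][0]))
```

A viewer could run this per subject and per test code, and simply highlight the returned row - no submitted flag needed.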

The following variables have also been added: --ORREF (Reference Result in Original Units), --STREFC (Reference Result in Standard Format), and --STREFN (Numeric Reference Result in Std Units).
I presume the "origin" in these cases can be "assigned" (but then it is metadata which, in my opinion, belongs in the define.xml) or "derived". The document gives the following example: "value from predicted normal value in spirometry tests".
I worked in this area for some time, and know that such values are usually derived from the age and sex of the subject (see e.g. https://vitalograph.co.uk/resources/ers-normal-values), or sometimes using a few more variables (additionally height, weight, … - see e.g. http://dynamicmt.com/dataform3.html). In such a case, it would be better if reviewers could generate these reference values themselves (and not trust that the sponsor has provided the correct value), e.g. by using a RESTful web service. We have already developed such a RESTful web service for LOINC codes and implemented it in the "Smart Dataset-XML Viewer", and I guess it would also be very simple to generate similar RESTful web services for normal values in spirometry.
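To illustrate what such a derivation looks like: below is a Python sketch using one commonly cited form of the ECCS/ERS 1993 reference equation for FEV1 in adult males. The coefficients are reproduced from memory and must be verified against the published tables before any real use.

```python
def predicted_fev1_male(height_m, age_years):
    """Predicted FEV1 in litres for adult males, using one commonly cited
    form of the ECCS/ERS 1993 reference equation:
    FEV1 = 4.30 * height - 0.029 * age - 2.49  (height in metres).
    Illustrative only - verify the coefficients before any real use."""
    return 4.30 * height_m - 0.029 * age_years - 2.49
```

A RESTful web service wrapping a set of such equations would allow the reviewer to re-derive --ORREF from DM and VS data instead of trusting the submitted value.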

In case such a reference value is independent of the subject (e.g. a fixed value for the specific test), I think it should be considered metadata, and should go into the define.xml. I realize that the define.xml needs to be extended for that, based on the "ReferenceData" element in the core ODM.

I will try to add the new "highlight last observation before exposure" feature to the "Smart Dataset-XML Viewer" next week (first taking a few days of vacation…).

Sunday, May 8, 2016

def:WhereClause in XQuery

Today, I worked on the PMDA rule SDTM-CT2005: "Variable should be populated with terms from its CDISC controlled terminology codelist, when its value level condition is met"

This is a rule about value-level data and associated codelists, for example that when VSTESTCD=FRMSIZE, VSORRES must be one of "SMALL", "MEDIUM" and "LARGE".
Or that VSORRESU must be "cm" when VSTESTCD=HEIGHT and DM.COUNTRY is CAN, MEX, GER, or FRA, and must be "inch" when VSTESTCD=HEIGHT and DM.COUNTRY=USA.
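In essence, such a value-level check is a conditional codelist lookup. A minimal Python sketch (the codelist table and data structures are illustrative, not from the actual rule implementation; in reality they would be read from the define.xml):

```python
# Hypothetical value-level codelist table:
# (condition variable, condition value, constrained variable) -> allowed terms.
VALUE_LEVEL_CODELISTS = {
    ("VSTESTCD", "FRMSIZE", "VSORRES"): {"SMALL", "MEDIUM", "LARGE"},
}

def check_value_level(record, codelists=VALUE_LEVEL_CODELISTS):
    """Return error messages for one record (a dict: variable name -> value)."""
    errors = []
    for (cond_var, cond_val, target_var), allowed in codelists.items():
        # The value-level condition is met, but the constrained
        # variable's value is not in its codelist:
        if record.get(cond_var) == cond_val and record.get(target_var) not in allowed:
            errors.append("%s=%r is not in the codelist for %s=%s"
                          % (target_var, record.get(target_var), cond_var, cond_val))
    return errors
```

The cross-domain case (VSORRESU depending on DM.COUNTRY) only requires that the condition variables may come from another dataset - trivial once everything lives in one database.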

The latter is of course special, as it crosses the boundaries of domains, and thus of files. However, when you have all your submissions in a native XML database (as I recommended to the FDA, but no reaction at all so far...), this rule shouldn't be too hard to implement.

We are currently implementing all validation rules of CDISC, the FDA and the PMDA in XQuery, the open, vendor-neutral W3C standard for querying XML documents and databases, and thus also Dataset-XML submissions.

The challenge in this rule is that one needs to translate the contents of the def:WhereClauseDef elements in the define.xml, like:

with this "WhereClauseDef" referenced from a def:ValueListDef:

applicable to the variable SCORRES as defined by:

So, how do we translate the "def:WhereClauseDef" into an XQuery statement? Of course, the XQuery script can read the "def:WhereClauseDef" and the "RangeCheck" element in it, but what is needed is a "where" clause in a FLWOR expression, like:

where $itemgroupdata/odm:ItemData[@ItemOID="IT.SC.SCTESTCD" and @Value="MARISTAT"]

So I wrote an XQuery function that accepts a def:ValueListDef node and returns a string which essentially is an XPath. Here it is:

The function is not perfect yet: it works well for the simple case where there is only one "RangeCheck" within the "def:WhereClauseDef", the "Comparator" is "EQ" or "NE", and the check value is a string (the most common case). It doesn't work yet for more complicated cases - but I am working on it...
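For readers who prefer a non-XQuery illustration, the core of such a translation can be sketched in a few lines of Python (the names and data structures are mine, not the actual function from our validation suite; the RangeChecks are assumed to already be parsed into tuples):

```python
def where_clause_to_predicate(range_checks):
    """Build an XPath predicate on odm:ItemData from the RangeCheck
    elements of a def:WhereClauseDef, given here as a list of
    (item_oid, comparator, values) tuples. Only the simple EQ/NE,
    single-check-value case is handled, as described above."""
    parts = []
    for item_oid, comparator, values in range_checks:
        op = {"EQ": "=", "NE": "!="}[comparator]  # simple comparators only
        parts.append('odm:ItemData[@ItemOID="%s" and @Value%s"%s"]'
                     % (item_oid, op, values[0]))
    # multiple RangeChecks within one WhereClauseDef are AND-ed
    return " and ".join(parts)
```

For the MARISTAT example, this produces exactly the predicate shown in the "where" clause above.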

The function returns a string which is essentially XQuery source code, but what we need is executable code. Fortunately, there is the "util:eval" function (xdmp:eval in MarkLogic, xquery:eval in BaseX) which takes a string of XQuery code as an argument and evaluates it. In our validation script this looks like:

This code snippet iterates over all "ItemRef" child elements of a def:ValueListDef element, picks up the corresponding "def:WhereClauseDef" element, translates it into an XQuery snippet, and evaluates that on the current "SCORRES". If the XQuery returns an answer (in this case an "ItemData" element), this means that the condition applies to the current record.

In the next step, it is checked whether there is a codelist for SCORRES at the value level. For example, if SCTESTCD=MARISTAT, then the ItemDef "IT.SC.SCTESTCD" is applicable, for which there is an associated CodeList "CL.MARISTAT" with allowed values "DIVORCED", "DOMESTIC PARTNER", "LEGALLY SEPARATED", "MARRIED", "NEVER MARRIED" and "WIDOWED".
If the actual value of the current data point is not in the value-level codelist (and there is such a codelist), an error is produced:

The complete code for this rule PMDA-CT2005 can be found at:


I will of course further refine this XQuery function, especially for multiple "RangeCheck" child elements and for the "IN" and "NOTIN" comparators. When finished, I will publish that code as well.

If you would like to contribute to the development of validation rules in the vendor-neutral XQuery language, please let me know.

Wednesday, January 20, 2016

Airport security, toothpaste and validation rules

Last week, I traveled to Geneva for an E3C (CDISC European Coordination Committee) meeting to prepare the upcoming CDISC Interchange in Vienna.
At the airport, my almost-empty tube of toothpaste was confiscated. Reason: although there was only about 10ml of toothpaste left in it, the tube was marked as 125ml, which was 25ml of non-existent volume too much.

Conclusion: "10 ml of toothpaste in a 125ml tube is much more dangerous than 99ml of nitroglycerin in an 100ml tube". Logic, isn't it? I also had a 250g pack of filled (with liquid) chocolate in my hand luggage, but that was not confiscated.

I checked the rules, and indeed, you may not take an (even empty) 125ml tube through security.

Why such illogical rules? The only thing I can imagine is that such (stupid) rules can also be handled by stupid people. "Switch off your brain"! Do not force security people to think, and certainly do not allow them to do risk assessment...

So, what does this all have to do with CDISC validation rules? Everything.

Yesterday, I validated an SDTM dataset that I consider completely valid (it was published by CDISC and used in the second FDA SDTM pilot). For SUPPLB, I got:

For each record, I got 2 errors: one saying that there are unexpected characters, and one saying that the value of IDVARVAL does not point to a record in LB.xpt.

Imagine that you show these error messages to your regulatory people - I think they would go crazy...

So, what happened?

When looking into the validation results details, I found e.g.:

I presume that IDVARVAL was originally programmed as an integer, and then transformed to "char", as this is a requirement for IDVARVAL. During this transformation, the text was left-padded with blanks instead of right-padded (note that when using Dataset-XML, there would be no padding at all). So the validation tool finds a blank as the first character, and marks this as an error.
In the SAS Viewer, however, it is presented as:

Everything seems to be OK (the viewer seems to be smarter than the validation tool...).

Even worse, the consequence is that the validator also marks this as an "invalid referenced record", because it isn't smart enough to understand that e.g. "       6" is semantically the same as "6".
So, due to a stupid rule (FDAC216), implemented in a stupid (strict) way, I got 64403 warnings and 64403 errors. Thank you!

Back to our toothpaste and risk assessment. What is the risk that modern software cannot connect the SUPPLB records to their corresponding LB records because of this padding? Almost zero, isn't it? In modern software, the first thing that is done when reading a string from a file is trimming it (removing blanks at the start and the end). At least, I always do so in my software. I presume SAS software, or any other modern review software, does exactly the same.
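The "smart" comparison is literally a one-liner. A Python sketch (function name is mine):

```python
def same_record_key(a, b):
    """True if two IDVARVAL-style keys refer to the same record, ignoring
    the left- or right-padding that the XPT conversion may have introduced."""
    return a.strip() == b.strip()
```

With this, "       6" in SUPPLB and "6" in LB match, and the 64403 false errors disappear.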

What I mean is that we should start thinking about validation as risk assessment. It is of course much easier to develop stupid rules, assuming that the people doing clinical research are stupid and that their software tools are dumb. Do we accept such insults?

Thursday, December 31, 2015

SDTM labels: freedom or slavery?

Everything below applies as well to CDISC SEND and ADaM.

On the last day of this year, I was working on a new software offering for the generation and validation of define.xml files. It is meant for those people and organizations that do not use SDTM-ETL, the SAS Clinical Standards Toolkit or similar tools, i.e. for people who need to generate their define.xml "post-SDTM-generation". The latter is far from ideal, but still common in (too) many organizations. The tool is also meant for people who are sick and tired of generating define.xml from an Excel worksheet through a "black box" tool for which there is no documentation and no user-friendly graphical user interface, and which produces "garbage define.xml" in many cases. Those who use such a tool know what I mean...

So, when working on the tool, I was confronted with an issue that has bothered me for a number of years already: SDTM labels.

In the SDTM-IG, one finds a label for each variable that is at least recommended for the specific domain. Additional standard variables can be added, and for these, one needs to do a lookup in the "Study Data Tabulation Model" document, look up the (prototype) variable there, and take the label from there.
So far so good. Then run the usual validation tool (also used by the FDA) on the define.xml, and you might get a good number of errors: "Error: Variable Label in standard domain must match the variable label described in corresponding CDISC implementation guide" (note that the error message is itself in principle wrong, as it is also applied to standard variables that are not in the SDTM-IG but are in the SDTM document). Let the puzzling start!

Some of my customers spend hours on such error messages, finally finding out that there was an "uppercase-lowercase" mismatch in a single character in the label. For example "Dosing Frequency Per Interval" versus "Dosing Frequency per Interval" (this is an easy one). Or they found out that the difference was a dot, like in "Character Result/Finding in Std. Format" versus "Character Result/Finding in Std Format".

Let us do the following thought experiment: add a dot at the end of each of the labels in your define.xml and then submit it to the FDA. You will probably get your submission back with the message that it did not pass the "data fitness" program. Is your submission really of bad quality because of the dot at the end of each variable label? Did that dot change the meaning of your variables?
I also had customers complaining that they got errors even when making very minor changes to the SDTM labels, although the minor change explained much better to the reviewers what the variable was about. So although the change considerably improved the quality of their variable descriptions, and thus of their submission, the validation tool used by the FDA claimed the opposite.
They argued (in my opinion correctly) that the validation software should not produce an error, as the SDTM-IG nowhere states that labels must be provided exactly as given in the IG (I haven't been able to find such a statement either...).

In my opinion, validation tools should be based on "risk evaluation" or "risk assessment". I do not know yet how that would need to work in the case of SDTM and define.xml, but I do already see how we can take a few steps in that direction. One of the things I would like to see for SDTM variable labels is a "label equality assessment" between the "expected" label (from the SDTM-IG or SDTM standard) and the "actual" label, quantified by a "label equality percentage". So in my new software, when validating the SDTM labels in the define.xml for compliance, I implemented the following stages:
  • the "expected" and "actual" label are (case sensitive) identical. No problem => full compliance
  • the "expected" and "actual" label are (case insensitive) identical, meaning that there is only an "uppercase-lowercase" problem (so semantically, they are equal) => an appropriate warning is generated
  • the "expected" and "actual" label are different => calculate the "likeness" or "equality" number
In the latter case, there might have been a very good reason why the sponsor has changed the label.
So, how do I quantify "how much" the label was changed? I did a bit of research and found the "Levenshtein distance" between two strings: it counts the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one string into the other. This (non-negative integer) number can then easily be transformed into an "equality number" or "likeness" using the following formula:

equality number = 1.0 - (LD / max(string-length(s1),string-length(s2)))

where LD = Levenshtein distance
s1 = first string (i.e. "actual" label)
s2 = second string (i.e. "expected" label)

and is a number between 0.0 (completely different labels) and 1.0 (identical labels).

A few examples are given below:

SDTM variable | Expected Label                         | Actual Label                                | Equality Number
LBTESTCD      | Lab Test or Examination Name           | Laboratory Test or Examination Name         | 0.80
MBSTRESC      | Character Result/Finding in Std Format | Character Result/Finding in Std. Format     | 0.97
TADTC         | Date/Time of Accountability Assessment | Date/Time of Drug Accountability Assessment | 0.88
MBSTRESC      | Character Result/Finding in Std Format | The quick brown fox jumps over the lazy dog | 0.12
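For those who want to reproduce these numbers, here is a straightforward Python implementation of the distance and of the formula above (my own sketch, not the code from the new software):

```python
def levenshtein(s1, s2):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn s1 into s2 (dynamic programming,
    keeping only the previous row of the edit matrix)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (c1 != c2)))  # substitution
        prev = curr
    return prev[-1]

def equality_number(actual, expected):
    """1.0 - LD / max(string lengths), as in the formula above."""
    return 1.0 - levenshtein(actual, expected) / max(len(actual), len(expected))
```

For example, "Lab Test or Examination Name" versus "Laboratory Test or Examination Name" needs 7 insertions ("oratory"), giving 1.0 - 7/35 = 0.80, as in the first row of the table.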

As one can see, this "equality number" or "likeness number" quantifies much better how similar the provided label is to the expected label (from the SDTM-IG or SDTM). Much better, at least, than the validation software used by the FDA, where the outcome can only be "0" (not identical) or "1" (identical, case-sensitive).

And this is the way I implemented it in my software.

Another small step in coming to "smart" validation software tools...

Sunday, November 29, 2015

SDTM - Moving away from files

Reviewers at the FDA always complain about file sizes. Until now, they haven't embraced the new CDISC Dataset-XML standard, mostly because the file sizes are usually (but not always) larger than for SAS-XPT. On the other hand, they do not allow us to use zipped Dataset-XML files, although these can be read by several tools (like the open-source "Smart Dataset-XML Viewer") without the need to unzip them first. Even worse, each time a new version of the SDTM-IG comes out, a number of unnecessary derived variables (like EPOCH) is added at the request of the FDA, further increasing the file sizes. So they have only themselves to blame...

The first time I re-thought this "problem" was during the development of the "Smart Dataset-XML Viewer". While testing the tool with large SDTM files (like QS and LB), I wondered how reviewers could ever work efficiently with (tens of) thousands of rows in any kind of viewer. Even though we added a number of smart features (like one-click jumping to the corresponding row in the DM dataset - try that with the SAS Viewer...), the amount of information is overwhelming. So we added filtering features...

Essentially, files are very inefficient for large amounts of information. If you want to find a particular piece of information, you first need to read the complete file into your tool...
Large amounts of information should reside in databases (relational, XML or mixed). Databases can easily be indexed for query speed, and tools then need to load only the minimal amount of information required for the task. However, only a minority of FDA reviewers use a database (like the Janus Clinical Trials Registry); all the others use... files.

So what are files good for? First of all, you need them to store computer programs. You also need them for unstructured information. And you usually need them to transport information between computers (although (RESTful) web services can do the same). As each SDTM submission (but also each ADaM and SEND submission) needs to be submitted to the FDA as a set of files (using the eCTD folder structure), the first thing the FDA should do (and I think they do) is to load the submission into databases, such as the Janus-CTR.

From that point on, reviewers should be forbidden to use the submission files as files.
They should only be allowed to retrieve the information they need from the databases or the CTR. That would also make their work more efficient, so that patients get safer new drugs faster.

This would also once and for all end the discussion about file sizes.

The SDTM is completely based on the concept of tables and files. SAS-XPT is still required for electronic submissions. SDTM contains large amounts of unnecessary and redundant information. An example is the "test name" (--TEST), which has a 1:1 relationship with the "test code" (--TESTCD). Test names can, however, easily be looked up, e.g. using RESTful web services, or by a simple lookup in a database (or even in the define.xml). We urgently need to start trimming the SDTM standard and remove all redundant and unnecessary variables, as these lead to errors in the data. We urgently need to move away from SAS-XPT for transport. And the FDA should forbid its reviewers to use "files", and only allow them to use submission data that is in databases.