Thursday, November 20, 2014

FDA publishes Study Data Validation Rules

The FDA recently published its "Study Data Validation Rules" ( for SDTM and SEND.
Unfortunately the rules come as a set of Excel files, so not vendor neutral (Excel is a product of the company Microsoft) and the rules themselves are unfortunately not machine-readable nor machine-executable.

A snapshot from the Excel file shows how the rules are defined:

Rule #67 saying that the value of "ARMCD" in the DM, TA and TV dataset should not exceed 20 characters in length.

This one is clear, but other of the over 300 rules are harder to interprete. What about:

"<Variable Label> (<Variable Name>) variable values should be populated with terms found in '<Codelist Name>' (<NCI Code>) CDISC controlled terminology codelist. New terms can be added as long as they are not duplicates, synonyms or subsets of existing standard terms."?

Anyway - not machine readable nor machine executable.

Now, many of you will say: "Wait a minute Jozef, we cannot expect the FDA to provide validation source code for different languages like Java, C#, etc.".

This is where XML comes in. Dataset-XML was recently developed to replace SAS-XPT so that we can take advantage of what XML offers us.
Now there is a W3C language for validating information in XML files, named Schematron. Schematron is an open, vendor-neutral standard, and very easy to implement. Unfortunately, it cannot (yet) - as far as I know - validate files that need information from other files, such as from the define.xml file. If you would "copy" the define.xml file into each Dataset-XML for the same submission, we could use Schematron. So as soon as Dataset-XML is accepted by the FDA, we could challenge them to provide us their rules for SDTM and SEND in a Schematron file.

Another possibility is to use XQuery. XQuery is another W3C open standard and is a query language for XML documents and e.g. used a lot to query native XML databases.

Now consider the rule: "the value of 'ARMCD' in the DM dataset should not exceed 20 characters in length". How would this be written in XQuery?
Here is the rule in machine-executable XQuery:

(: Rule FDAC067 :)
declare namespace def = "";
declare namespace odm="";
declare namespace data="";
(: get the OID for ARMCD :)
for $s in doc('/db/fda_submissions/cdiscpilot01/define_2_0.xml')//odm:ItemDef[@Name='ARMCD'][1]
let $oid := $s/@OID
(: select the ARMCD data points :)
for $armrecord in doc('/db/fda_submissions/cdiscpilot01/DM.xml')//odm:ItemGroupData/odm:ItemData[@ItemOID=$oid]
(: get the record number :)
let $recnum := $armrecord/../@data:ItemGroupDataSeq
(: check the string length of the ARMCD value :)
where string-length($armrecord/@Value) > 20
return <error recordnumber="{$recnum}" rule="Rule FDAC067">Invalid value for ARMCD {$armrecord/@Value} - it has more than 20 characters</error>

The first three lines declare the namespaces used in Dataset-XML and define.xml
The third line takes the define.xml file and extracts the "ItemDef" node for which the "Name" attribute has the value "ARMCD". This is the SDTM variable we are looking for.
The next line then extracts the OID of the "ARMCD" variable which we need in the Dataset-XML file.
The following lines ("for" line and "where" line) then iterates over all the "ItemData" elements in the DM.xml file that have the OID retrieved in the previous line: so all the "ARMCD" data points.
The next line then whether the length of the ARMCD value is larger than 20 (characters) and if so, returns an error message in XML format.

Now again, I didn't test this completely yet, but given the resources the FDA has (2014 budget is $4.7 billion), I would expect that it would be not too difficult for the FDA to publish their SDTM and SEND rules as either Schematron or XQuery.

If there are no such plans, maybe they can sponsor a project at our university. It would also make a nice master thesis...

1 comment:

  1. i was just browsing along and came upon your blog. just wanted to say good blog and this article really helped me.
    Clean Email List