Tuesday, August 4, 2015

Define.xml and stylesheets

I hesitated for a long time over whether I should write this blog entry. What finally triggered it was an entry in the OpenCDISC forum stating "Can the indentation of items in the sidebar be controlled for all font sizes? Indentations appear normally for small fonts, but become irregular for larger fonts. Technical note: I am viewing define.xml files in Internet Explorer."

First of all, this has nothing to do with OpenCDISC. Probably, however, the writer used an OpenCDISC tool to generate the define.xml after having generated the SDTM files (which I consider bad practice) and then viewed the result using Internet Explorer.
The writer of that entry doesn't even seem to realize that a stylesheet is used for representing the define.xml as HTML in the browser. For him/her, define.xml is what is seen in the browser.

Define.xml is, however, much more: it contains the metadata of your submission in XML, not in HTML. So it can and should be used to validate the submission data themselves. Unfortunately, most validators don't even do that. The argument (sic): "Unfortunately the industry compliance with define.XML standard is not high enough to rely on user-provided metadata".

But today I want to discuss a somewhat different topic: stylesheets.

The define.xml specification comes with a sample XSLT stylesheet that was developed by Lex Jansen (SAS), member of the CDISC define.xml development team. It is one of the best stylesheets I have ever seen. Even so, we regularly read complaints from people that they want it ... different. They do not seem to realize (or don't want to) that this is just a sample stylesheet, and that providing a stylesheet (not necessarily the one developed by Lex) is their own responsibility when submitting data to the FDA. So if they want changes to the stylesheet, they should make them themselves.

Now, what is an XSLT stylesheet?
A stylesheet transforms the XML into something else. The "T" in XSLT stands for "Transformation", doesn't it? In many cases, the transformation is done to HTML (as in the define-stylesheet), but stylesheets can also transform XML into PDF, CSV, text files, SQL, or other XML...
So what the user sees in the browser when he/she (thinks he/she) is opening the define.xml is, in principle, not the define.xml itself, but the visualization of the HTML that the stylesheet generates from the information in the define.xml.
So, essentially (and don't misunderstand me), what-you-see-is-not-what-you-have.

Now, Lex's stylesheet is an extremely good one, and it displays most of the information that is in the define.xml XML file in a very user-friendly way - you can trust Lex and the define.xml team.

Transformation means manipulation (in the good sense of the word). One can however also use stylesheets to manipulate data in the bad sense of the word. Now, we are all honest people, and we would never never think about changing the define-stylesheet so that the information seen in the browser does not correctly represent what is in the define.xml XML file itself.

That is where the devil in me starts to speak ...

Let us look at a simple example: the "Key" column that is seen in the table where the variables for each dataset are defined. It looks like (here for the DS domain):

The XSLT for it in the define-stylesheet is:

        <xsl:for-each select="./odm:ItemRef">
        <td class="number"><xsl:value-of select="@KeySequence"/></td>

Let us now make a small change to the stylesheet:

        <!-- added J.Aerts (note: max() is an XPath 2.0 function) -->
        <xsl:variable name="MAXKEYSEQUENCE" select="max(./odm:ItemRef/@KeySequence)"/>
        <xsl:variable name="MAXKEYSEQUENCEPLUSONE" select="$MAXKEYSEQUENCE + 1"/>
        <!-- end of addition J.Aerts -->
        <xsl:for-each select="./odm:ItemRef">
        <!-- <td class="number"><xsl:value-of select="@KeySequence"/></td> -->
        <xsl:choose>
            <xsl:when test="@KeySequence != ''">
                <!-- show the reversed key instead of the real one -->
                <td class="number"><xsl:value-of select="$MAXKEYSEQUENCEPLUSONE - @KeySequence"/></td>
            </xsl:when>
            <xsl:otherwise><td class="number"/></xsl:otherwise>
        </xsl:choose>

And what you then see in the browser is:

Do you see the difference? The values for the "Key" have been reversed! I.e. the lowest key number has become the highest and the highest has become the lowest!
But we did not change anything in the define.xml file itself, did we? We only made a minor change to the stylesheet. Although this is a pretty harmless example, it demonstrates that the result of a stylesheet does not necessarily represent the source XML data.
Again, we are honest people, and we would never never do something like this, and especially not when submitting data to the FDA.
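
The effect of that small stylesheet change can be reproduced outside the browser. The following is only a minimal sketch in Python: the ItemGroupDef fragment and its OIDs are invented for illustration, and the arithmetic simply mirrors the altered stylesheet (maximum KeySequence plus one, minus the real KeySequence).

```python
import xml.etree.ElementTree as ET

ODM = "http://www.cdisc.org/ns/odm/v1.3"

# Hypothetical, heavily trimmed ItemGroupDef; the OIDs are invented for illustration.
DEFINE_FRAGMENT = f"""
<ItemGroupDef xmlns="{ODM}" OID="IG.DS" Name="DS">
  <ItemRef ItemOID="IT.STUDYID" KeySequence="1"/>
  <ItemRef ItemOID="IT.DOMAIN"/>
  <ItemRef ItemOID="IT.USUBJID" KeySequence="2"/>
  <ItemRef ItemOID="IT.DSDECOD" KeySequence="3"/>
</ItemGroupDef>
""".strip()


def key_columns(xml_text):
    """Return (item OID, key stored in the XML, key as the altered stylesheet shows it)."""
    group = ET.fromstring(xml_text)
    refs = group.findall(f"{{{ODM}}}ItemRef")
    keys = [int(r.get("KeySequence")) for r in refs if r.get("KeySequence")]
    max_plus_one = max(keys) + 1
    rows = []
    for ref in refs:
        raw = ref.get("KeySequence")
        # Same arithmetic as the manipulated stylesheet; empty keys stay empty
        shown = str(max_plus_one - int(raw)) if raw else ""
        rows.append((ref.get("ItemOID"), raw or "", shown))
    return rows


rows = key_columns(DEFINE_FRAGMENT)
for oid, in_xml, displayed in rows:
    print(f"{oid:12} in define.xml: {in_xml:2} displayed: {displayed}")
```

A comparison like this, run over every displayed column, is one possible starting point for validating a stylesheet: the value shown must equal the value stored.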

So what do we learn from this?

- stylesheets should be validated. Does a stylesheet really truly visualize the data from the define.xml?
- it is the sponsor's responsibility (and not the one of Lex or of CDISC) to provide a stylesheet that truly visualizes what is in the define.xml
- the FDA should use its own stylesheets
- what you see in the browser (when a stylesheet is used), is not the define.xml
- the define.xml is a machine readable XML file defining the metadata for a submission and should be used as such
- what you see in the browser is just a human-friendly representation of what is in the define.xml - decisions should not be based on this "view"
- people should stop thinking about define.xml being a replacement for define.pdf
- in submission teams at sponsor companies, there should be at least 1-2 persons with good XML knowledge (it's easy, my students learn it in just 2 x 1.5 hours)

Comments are as always extremely welcome!

Thursday, February 12, 2015

Rule FDAC084 is just damned wrong

The FDA has recently published a set of "SDTM rules", unfortunately in Excel format, which is not machine-executable. So I started working on an XQuery representation, which will soon be available through a set of web services. You can already find some examples in my previous blog entries.

When working on these rules, I found that:
  • about 10% of them are just damned wrong
  • about another 10% are incomprehensible, even for people with a lot of experience in SDTM and define.xml
The most notorious of the "just damned wrong" rules is surely rule FDAC0154: "Missing value for --ORRESU, when --ORRES is provided".
As we all know, there are so many tests for which there are no units for the results, just to name a few:
  • pH is dimensionless
  • all qualitative tests have no units. For example: presence of Ketones in Urine by Test strip (LOINC 24356-8)
Today, however, I want to discuss another rule, which is a consequence of the SDTM myth that the combination of LBTESTCD, LBCAT, LBSPEC, and LBMETHOD uniquely describes a lab test.

Rule FDAC084 reads: "Standard Units (--STRESU) must be consistent for all records with same Short Name of Measurement, Test or Examination (--TESTCD), Category (--CAT), Specimen Type (--SPEC) and Method of Test or Examination (--METHOD)"

A quick search through the LOINC database shows that this rule is just damned wrong.

Just take the following combination:

One quickly finds the following tests for this combination:
  • LOINC=25428-4  Glucose [Presence] in Urine by test strip
    and the designation "ordinal".
    So this test has no units
  • LOINC=50555-2 Glucose [Presence] in Urine by automated test strip
    with typical values being 1+ to 4+ and "negative".
    So again: no units
  • LOINC=5792-7  Glucose [Mass/volume] in Urine by test strip
    with as typical unit: mg/dL
  • LOINC=22705-8 Glucose [Moles/volume] in Urine by test strip
    with as typical unit: mmol/L
So essentially 4 different tests, all with the same combination of LBTESTCD, LBCAT, LBSPEC and LBMETHOD.
2 of these tests have no units at all (25428-4 and 50555-2), with the two others having different units.
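
The same observation can be written as a small check. This is only a sketch in Python: the LOINC codes and units are the four listed above, while the LBTESTCD, LBCAT, LBSPEC and LBMETHOD values are assumed for illustration (and, as in the text, the automated test strip is treated as the same method).

```python
from collections import defaultdict

# LOINC codes and units as listed above; an empty string means "no units".
RECORDS = [
    {"LBLOINC": "25428-4", "LBSTRESU": ""},
    {"LBLOINC": "50555-2", "LBSTRESU": ""},
    {"LBLOINC": "5792-7", "LBSTRESU": "mg/dL"},
    {"LBLOINC": "22705-8", "LBSTRESU": "mmol/L"},
]
# Assumed values: every record shares the same four SDTM identifiers.
for record in RECORDS:
    record.update(LBTESTCD="GLUC", LBCAT="URINALYSIS", LBSPEC="URINE", LBMETHOD="TEST STRIP")

FDAC084_KEYS = ("LBTESTCD", "LBCAT", "LBSPEC", "LBMETHOD")


def inconsistent_units(records, keys):
    """Group records by the given key variables and keep groups with more than one unit."""
    units = defaultdict(set)
    for rec in records:
        units[tuple(rec[k] for k in keys)].add(rec["LBSTRESU"])
    return {combo: found for combo, found in units.items() if len(found) > 1}


by_sdtm_keys = inconsistent_units(RECORDS, FDAC084_KEYS)
by_loinc = inconsistent_units(RECORDS, ("LBLOINC",))
print(by_sdtm_keys)  # one group holding three different "units"
print(by_loinc)      # no inconsistent group when grouping by LOINC code
```

Grouped by the four SDTM variables, these perfectly valid records violate the rule; grouped by LOINC code, each test has exactly one unit.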

So what can we learn from these examples?
  • The combination of CDISC LBTESTCD, LBCAT, LBSPEC and LBMETHOD does not uniquely describe a lab test
  • Even worse, when looking at other LB variables that are at least expected, there is no combination that can possibly ever uniquely describe lab tests
  • The rule FDAC084 is nonsense
  • The only method to uniquely identify lab tests is the LOINC code. Unfortunately LBLOINC is "permissible", with the consequence that you almost never find it in real submissions, and putting the LOINC code itself in LBTESTCD is not allowed.

Thursday, December 4, 2014

Machine executable FDA rules for SDTM

In my previous posts "FDA publishes Study Validation Rules" and "Follow up to 'FDA publishes Study Validation Rules'" I showed how these rules can be expressed in XQuery, an open W3C standard query language for XML documents and XML databases.

I made some good progress in the last few days, and could already implement and test about 1/6th of the rules. My "rule writing pace" even increases, as I get more and more experience with the XQuery language, which was also pretty new for me.

So I wonder why the FDA (with considerably more resources than I have) did not publish these rules as machine-executable rules.

One of the great things about XQuery is that one can easily do cross-document querying.
Another example is given below (I hope it displays well). It is the XQuery for rule FDAC049, requiring that there are no EX records for subjects that were not assigned to an arm (ARMCD='NOTASSGN' in DM). It took me about 15 minutes to develop and test this rule. Here it is:

(: Rule FDAC049: EX record is present, when subject is not assigned to an arm: Subjects that have withdrawn from a trial before assignment to an Arm (ARMCD='NOTASSGN') should not have any Exposure records :)
xquery version "3.0";
declare namespace def = "http://www.cdisc.org/ns/def/v2.0";
declare namespace odm="http://www.cdisc.org/ns/odm/v1.3";
declare namespace data="http://www.cdisc.org/ns/Dataset-XML/v1.0";
declare namespace xlink="http://www.w3.org/1999/xlink";
let $base := '/db/fda_submissions/cdisc01/'
let $define := 'define2-0-0-example-sdtm.xml'
(: Get the DM dataset :)
let $dmdatasetname := doc(concat($base,$define))//odm:ItemGroupDef[@Name='DM']/def:leaf/@xlink:href
let $dmdataset := concat($base,$dmdatasetname)
(: Get the EX dataset :)
let $exdatasetname := doc(concat($base,$define))//odm:ItemGroupDef[@Name='EX']/def:leaf/@xlink:href
let $exdataset := concat($base,$exdatasetname)
(: get the OID for the ARMCD variable in the DM dataset :)
let $armcdoid := doc(concat($base,$define))//odm:ItemDef[@Name='ARMCD']/@OID  (: supposing there is only one :)
(: and the OID of USUBJID - which is the third variable :)
let $usubjidoid := doc(concat($base,$define))//odm:ItemGroupDef[@Name='DM']/odm:ItemRef[3]/@ItemOID
(: we also need the OID of the USUBJID in the EX dataset :)
let $exusubjidoid := doc(concat($base,$define))//odm:ItemGroupDef[@Name='EX']/odm:ItemRef[3]/@ItemOID
(: in the DM dataset, select the subjects which have ARMCD='NOTASSGN' :)
for $rec in doc($dmdataset)//odm:ItemGroupData[odm:ItemData[@ItemOID=$armcdoid and @Value='NOTASSGN']]
    let $usubjidvalue := $rec/odm:ItemData[@ItemOID=$usubjidoid]/@Value
    (: and the record number for which ARMCD='NOTASSGN' :)
    let $recnum := $rec/@data:ItemGroupDataSeq
    (: and now check whether there is a record in the EX dataset :)
    let $count := count(doc($exdataset)//odm:ItemGroupData[odm:ItemData[@ItemOID=$exusubjidoid and @Value=$usubjidvalue]])
    where $count > 0  (: at least one EX record was found :)
    return <warning rule="FDAC049" recordnumber="{data($recnum)}">{data($count)} EX records were found in EX dataset {data($exdatasetname)} for USUBJID={data($usubjidvalue)} although subject has not been assigned to an arm (ARMCD='NOTASSGN') in DM dataset {data($dmdatasetname)}</warning>

Comments are enclosed in "(:" and ":)", as in (: this is a comment line :)

And here is a snapshot of the test result:

I guess that I would not have been able to develop and test this rule in Java in less than 15 minutes...

The advantage of using an open standard like XQuery is that everyone is using the same rule, and that there is no room for different interpretations, unlike in a Java program, which essentially is a "black box" implementation. As such, these rules in XQuery can function as a "reference implementation", meaning that any software application (such as a Java program) needs to give the same results as the reference implementation does.
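
To illustrate what "giving the same results as the reference implementation" could mean in practice, here is a sketch of the same FDAC049 check in Python. It is not the reference implementation: the miniature DM and EX documents are invented, and the OIDs are passed in directly (and assumed identical in both datasets) rather than looked up in the define.xml.

```python
import xml.etree.ElementTree as ET

ODM = "{http://www.cdisc.org/ns/odm/v1.3}"

# Invented miniature Dataset-XML content; OIDs and values are illustrative only.
DM_XML = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"><ClinicalData>
  <ItemGroupData ItemGroupOID="IG.DM">
    <ItemData ItemOID="IT.USUBJID" Value="001"/><ItemData ItemOID="IT.ARMCD" Value="NOTASSGN"/>
  </ItemGroupData>
  <ItemGroupData ItemGroupOID="IG.DM">
    <ItemData ItemOID="IT.USUBJID" Value="002"/><ItemData ItemOID="IT.ARMCD" Value="PLACEBO"/>
  </ItemGroupData>
  <ItemGroupData ItemGroupOID="IG.DM">
    <ItemData ItemOID="IT.USUBJID" Value="003"/><ItemData ItemOID="IT.ARMCD" Value="NOTASSGN"/>
  </ItemGroupData>
</ClinicalData></ODM>"""

EX_XML = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"><ClinicalData>
  <ItemGroupData ItemGroupOID="IG.EX"><ItemData ItemOID="IT.USUBJID" Value="001"/></ItemGroupData>
  <ItemGroupData ItemGroupOID="IG.EX"><ItemData ItemOID="IT.USUBJID" Value="001"/></ItemGroupData>
  <ItemGroupData ItemGroupOID="IG.EX"><ItemData ItemOID="IT.USUBJID" Value="002"/></ItemGroupData>
</ClinicalData></ODM>"""


def value_of(record, item_oid):
    """Return the Value of the ItemData with the given OID in one record, or None."""
    for item in record.iter(ODM + "ItemData"):
        if item.get("ItemOID") == item_oid:
            return item.get("Value")
    return None


def exposure_without_arm(dm_xml, ex_xml, usubjid_oid="IT.USUBJID", armcd_oid="IT.ARMCD"):
    """List (USUBJID, number of EX records) for subjects with ARMCD='NOTASSGN'."""
    ex_counts = {}
    for rec in ET.fromstring(ex_xml).iter(ODM + "ItemGroupData"):
        subject = value_of(rec, usubjid_oid)
        ex_counts[subject] = ex_counts.get(subject, 0) + 1
    findings = []
    for rec in ET.fromstring(dm_xml).iter(ODM + "ItemGroupData"):
        subject = value_of(rec, usubjid_oid)
        if value_of(rec, armcd_oid) == "NOTASSGN" and ex_counts.get(subject, 0) > 0:
            findings.append((subject, ex_counts[subject]))
    return findings


findings = exposure_without_arm(DM_XML, EX_XML)
print(findings)
```

Any such implementation would have to be run against the same test files as the XQuery and report the same subjects and counts.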

Monday, November 24, 2014

Follow up to "FDA publishes Study Data Validation Rules"

My good friend and colleague at CDISC Sam Hume picked this up, corrected my code and tested it on real Dataset-XML files. Here is his code:

declare namespace def = "http://www.cdisc.org/ns/def/v2.0";
declare namespace odm="http://www.cdisc.org/ns/odm/v1.3";
for $s in doc('file:/c:/path-here/define.xml')//odm:ItemDef[@Name='ARMCD'] 
    let $oid := $s/@OID
    for $armvalue in doc('DM.xml')//odm:ItemGroupData//odm:ItemData[@ItemOID=$oid]
        where string-length($armvalue/@Value) > 20
            return <error>Invalid value for ARMCD {$armvalue} - it has more than 20 characters</error>

He used oXygen XML Editor and ran the XQuery on a file rather than on a native XML database (I use eXist).

So I tried another one: rule #175: "Missing value for --STAT, when --REASND is provided" with: "Completion Status (--STAT) should be set to 'NOT DONE', when Reason Not Done (--REASND) is populated". Here is my XQuery (running against the eXist native XML database where I loaded the test files):

(: Rule FDAC175 :)
declare namespace def = "http://www.cdisc.org/ns/def/v2.0";
declare namespace odm="http://www.cdisc.org/ns/odm/v1.3";
declare namespace data="http://www.cdisc.org/ns/Dataset-XML/v1.0";
(: get the OID for VSSTAT :)
for $s in doc('/db/fda_submissions/cdisc01/define2-0-0-example-sdtm.xml')//odm:ItemDef[@Name='VSSTAT'][1]
let $vsstatoid := $s/@OID
(: get the OID for VSREASND :)
let $vsreasndoid := $s/../odm:ItemDef[@Name='VSREASND']/@OID
(: select the VSREASND data points  :)
for $record in doc('/db/fda_submissions/cdisc01/vs.xml')//odm:ItemGroupData/odm:ItemData[@ItemOID=$vsreasndoid]
(: get the record number :)
let $recnum := $record/../@data:ItemGroupDataSeq
(: and check whether there is a corresponding VSSTAT :)
let $vsstat := $record/../odm:ItemData[@ItemOID=$vsstatoid]
where empty($vsstat)  (: VSSTAT is missing :)
return <error recordnumber="{$recnum}" rule="FDAC175">Missing value for VSSTAT when VSREASND is provided - VSREASND = {$record/@Value}</error> 

I added some comments so that the code is self-explanatory.
Essentially, the FDA rule is not one rule but two. So I still need to adapt the code somewhat so that it also checks for the presence of "NOT DONE" for VSSTAT. Here is the corrected part:

where empty($vsstat) or data($vsstat/@Value) != 'NOT DONE'
return <error recordnumber="{$recnum}" rule="FDAC175">Missing or invalid value for VSSTAT when VSREASND is provided - VSREASND = {$record/@Value}</error>

The data() function is important to retrieve the value from the attribute instead of getting the attribute as a node.

In the next few weeks, I will publish more about this nice way of defining the FDA rules extremely precisely (no room for different interpretations) and in a machine-executable way.
If we can get this done, everybody will be playing by the same rules ... Isn't that wonderful?

Thursday, November 20, 2014

FDA publishes Study Data Validation Rules

The FDA recently published its "Study Data Validation Rules" (http://www.fda.gov/forindustry/datastandards/studydatastandards/default.htm) for SDTM and SEND.
Unfortunately the rules come as a set of Excel files, so they are not vendor-neutral (Excel is a product of the company Microsoft), and the rules themselves are neither machine-readable nor machine-executable.

A snapshot from the Excel file shows how the rules are defined:

Rule #67 saying that the value of "ARMCD" in the DM, TA and TV dataset should not exceed 20 characters in length.

This one is clear, but others among the more than 300 rules are harder to interpret. What about:

"<Variable Label> (<Variable Name>) variable values should be populated with terms found in '<Codelist Name>' (<NCI Code>) CDISC controlled terminology codelist. New terms can be added as long as they are not duplicates, synonyms or subsets of existing standard terms."?

Anyway - not machine readable nor machine executable.

Now, many of you will say: "Wait a minute Jozef, we cannot expect the FDA to provide validation source code for different languages like Java, C#, etc.".

This is where XML comes in. Dataset-XML was recently developed to replace SAS-XPT so that we can take advantage of what XML offers us.
Now there is a language for validating information in XML files, named Schematron. Schematron is an open, vendor-neutral ISO standard (ISO/IEC 19757-3), and very easy to implement. Unfortunately, it cannot (yet) - as far as I know - validate files that need information from other files, such as from the define.xml file. If you would "copy" the define.xml file into each Dataset-XML for the same submission, we could use Schematron. So as soon as Dataset-XML is accepted by the FDA, we could challenge them to provide us their rules for SDTM and SEND in a Schematron file.

Another possibility is to use XQuery. XQuery is an open W3C standard: a query language for XML documents that is used a lot, for example, to query native XML databases.

Now consider the rule: "the value of 'ARMCD' in the DM dataset should not exceed 20 characters in length". How would this be written in XQuery?
Here is the rule in machine-executable XQuery:

(: Rule FDAC067 :)
declare namespace def = "http://www.cdisc.org/ns/def/v2.0";
declare namespace odm="http://www.cdisc.org/ns/odm/v1.3";
declare namespace data="http://www.cdisc.org/ns/Dataset-XML/v1.0";
(: get the OID for ARMCD :)
for $s in doc('/db/fda_submissions/cdiscpilot01/define_2_0.xml')//odm:ItemDef[@Name='ARMCD'][1]
let $oid := $s/@OID
(: select the ARMCD data points :)
for $armrecord in doc('/db/fda_submissions/cdiscpilot01/DM.xml')//odm:ItemGroupData/odm:ItemData[@ItemOID=$oid]
(: get the record number :)
let $recnum := $armrecord/../@data:ItemGroupDataSeq
(: check the string length of the ARMCD value :)
where string-length($armrecord/@Value) > 20
return <error recordnumber="{$recnum}" rule="Rule FDAC067">Invalid value for ARMCD {$armrecord/@Value} - it has more than 20 characters</error>

The first three "declare" lines declare the namespaces used in Dataset-XML and define.xml.
The first "for" line takes the define.xml file and extracts the "ItemDef" node for which the "Name" attribute has the value "ARMCD". This is the SDTM variable we are looking for.
The next line then extracts the OID of the "ARMCD" variable, which we need in the Dataset-XML file.
The second "for" line then iterates over all the "ItemData" elements in the DM.xml file that have the OID retrieved in the previous step: so all the "ARMCD" data points. The "let" line after it retrieves the record number.
The "where" line then checks whether the length of the ARMCD value is larger than 20 (characters), and if so, the "return" line produces an error message in XML format.
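
For readers who do not know XQuery, the same steps (look up the OID in the define.xml, then scan the data file) can be sketched in Python with the standard library. This is an illustration only: both documents below are invented miniatures, and the OID "IT.DM.ARMCD" and the ARMCD values are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "odm": "http://www.cdisc.org/ns/odm/v1.3",
    "data": "http://www.cdisc.org/ns/Dataset-XML/v1.0",
}

# Invented miniature define.xml and DM Dataset-XML; OIDs and values are hypothetical.
DEFINE_XML = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"><Study OID="S"><MetaDataVersion OID="MDV">
  <ItemDef OID="IT.DM.ARMCD" Name="ARMCD" DataType="text"/>
</MetaDataVersion></Study></ODM>"""

DM_XML = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
     xmlns:data="http://www.cdisc.org/ns/Dataset-XML/v1.0"><ClinicalData>
  <ItemGroupData ItemGroupOID="IG.DM" data:ItemGroupDataSeq="1">
    <ItemData ItemOID="IT.DM.ARMCD" Value="PLACEBO"/>
  </ItemGroupData>
  <ItemGroupData ItemGroupOID="IG.DM" data:ItemGroupDataSeq="2">
    <ItemData ItemOID="IT.DM.ARMCD" Value="HIGH_DOSE_THEN_LOW_DOSE"/>
  </ItemGroupData>
</ClinicalData></ODM>"""


def armcd_too_long(define_xml, dm_xml, max_length=20):
    """Return (record number, value) for every ARMCD value longer than max_length."""
    # Step 1: find the ItemDef named ARMCD in the define.xml and take its OID
    item_def = ET.fromstring(define_xml).find(".//odm:ItemDef[@Name='ARMCD']", NS)
    oid = item_def.get("OID")
    # Step 2: walk over the records and test every data point carrying that OID
    seq_attr = "{%s}ItemGroupDataSeq" % NS["data"]
    findings = []
    for record in ET.fromstring(dm_xml).iterfind(".//odm:ItemGroupData", NS):
        for item in record.iterfind("odm:ItemData", NS):
            value = item.get("Value", "")
            if item.get("ItemOID") == oid and len(value) > max_length:
                findings.append((record.get(seq_attr), value))
    return findings


findings = armcd_too_long(DEFINE_XML, DM_XML)
print(findings)
```

The XQuery version remains the more compact of the two, which is rather the point of expressing such rules in a query language.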

Now again, I didn't test this completely yet, but given the resources the FDA has (2014 budget is $4.7 billion), I would expect that it would be not too difficult for the FDA to publish their SDTM and SEND rules as either Schematron or XQuery.

If there are no such plans, maybe they can sponsor a project at our university. It would also make a nice master's thesis...

Saturday, November 1, 2014

No to "Null Flavors"

Last week, I attended (part of) the CDISC webinar about the upcoming "public review" batch of the SDTM-IG (v.3.3 - batch 2). It gave me good and bad news. First the bad news:
- even more new domains and many new variables. I am afraid that the CDISC SDTM trainings will soon need to be extended to 3 days instead of the 2 days right now.

The good news is that the SDTM team now proposes that "non-standard" variables (which until now have been "banished" to SUPPXX data sets) may be kept in the parent domain (where they belong) and be marked in the define.xml by Role="Non-Standard Identifier" or Role="Non-Standard Qualifier" or Role="Non-Standard Timing".
This is something many of us have been requesting for years, essentially since define.xml 1.0 was published. You can read more about this in my prior blog entries "Why SUPPQUAL sucks" and "SDTM and non-standard variables".

Very recently, there was also a webinar given by Diane Wold about the use of "Null Flavors" in CDISC. Now, Diane is one of the persons in CDISC that I highly appreciate, but in my personal opinion, she is completely wrong in this case: in my opinion, "Null Flavors" are evil.

Let me explain. "Null Flavors" were developed by HL7 in HL7-v3 as a mechanism for the case where a value is not known, or cannot be represented by the HL7-v3 framework.
"Null flavors" are highly contested, even within HL7, e.g. see the blog "Smells like I dunno" of Keith Boone, one of the few "HL7-v3 gurus" and author of the best book about HL7-v3 and CDA.
One of the things I have against the "null flavors" is that they force people to categorize the reason why a data point is missing (or not representable in the HL7 framework). This categorization is extremely arbitrary, so it is of essentially no help when comparing data points. In my opinion, one should just write the reason as an extra data point (like --REASND in SDTM) as free text.
Another reason is that they encompass values that DEFINITELY are not null. Examples are "TRC" ("trace" - which is definitely not null), "QS" ("Quantity Sufficient") meaning "a bulk/large amount of material sufficient to fill up until a certain level" (can a large amount be "null"?), "PINF" ("positive infinity") and "NINF" ("negative infinity"), two amounts that every final-year primary school student knows are not null. Even worse, CDISC is abusing "PINF" in the trial design datasets to state "there is no upper limit" (on the number of participants). A very strange way to define this: first state that the maximum number of participants is NULL, and then add a "flavor" saying that it is unlimited. My school math teacher is probably turning in his grave now ...

In Austria, our national Electronic Health Record system is based on HL7-v3 and CDA. But we ONLY allow two "null flavors", which are really about nulls: one expresses that a patient has no Austrian social security number (e.g. tourists), the other that the patient does have an Austrian social security number, but we do not know it, e.g. because he/she forgot to bring the SSN card.
All other 13 "null flavors" are forbidden in the Austrian EHR.

My opinion is clear: we should not copy the errors the HL7 organization made.

Sunday, July 13, 2014

Why SDTM should NOT contain --TEST as a variable

All the findings domains in the SDTM have both --TESTCD (test code) and --TEST (test name) variables. There is a pure 1:1 relation between --TESTCD and --TEST: for each unique value of --TESTCD there is a single unique value of --TEST. For example for LBTESTCD=GLUC, only LBTEST=Glucose is allowed.

Here is a view from a sample SDTM submission:

Nice isn't it? But did you notice that there is an error? There is an LBTESTCD="FRUCT" with LBTEST="Glucose". Would a reviewer really notice? Can a machine easily find out that this is an error?

Although there is a 1:1 relation between both, CDISC published codelists for both separately. So there is a codelist for LBTESTCD and another for LBTEST. A bit strange isn't it?
The 1:1 relation between individual terms is then established by the "NCI code". For example, both "GLUC" (in codelist LBTESTCD) as well as "Glucose" (in codelist LBTEST) have the same "NCI code" which is "C105585". So if one wants to validate whether a test code and test name really fit together, then one needs to go over the NCI code, and that is essentially what OpenCDISC is doing.
Of course, this leads to trouble when extending a codelist with own terms for --TESTCD and --TEST. Using the "CDISC way" there is no good way in define.xml to state that an added --TESTCD belongs together with an added --TEST value. An example could e.g. be "FRUCTO" and "Fructose" which are currently not in the CDISC-CT, and which need to be added to two separate codelists.
It was also found by my colleague Dave Iberson-Hurst that this approach (linking over the NCI code) has led to versioning issues, i.e. terms changing names between versions without notice!
Also, we found a few cases where this linking mechanism leads to false/wrong values for --TEST.

Although seeing the test name for each given test code is a nice feature for the reviewer, one should ask oneself whether --TEST should really be submitted in the datasets themselves, as this not only violates the third normal form for good database design (see my previous posts) but also blows up the sizes of the data sets themselves. I estimate that data sets could be approximately 20% smaller if --TEST were not submitted.

There are two major types of solutions for resolving these issues.

The first is to recognize that this 1:1 relation exists, and that --TEST is essentially metadata (data about data) and that the codelists for --TESTCD and --TEST are essentially one codelist, meaning that they should be merged. This can be done using the classic CodeListItem mechanism in define.xml with a "CodedValue" and a "Decode". For example:

A viewer can then retrieve the "Decode" value from the define.xml and display it in a column --TEST that is generated by the viewer itself (so --TEST is NOT in the submitted data set). In databases, this corresponds to a JOIN between two tables (one with data and one with metadata).
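
As a rough sketch of that JOIN, the following Python snippet reads a merged codelist and lets the "viewer" add the LBTEST column itself. The codelist content is an assumption made for illustration (in particular the decode given for "FRUCT"), and the records are invented.

```python
import xml.etree.ElementTree as ET

ODM = "http://www.cdisc.org/ns/odm/v1.3"
NS = {"odm": ODM}

# Hypothetical merged codelist: CodedValue carries --TESTCD, Decode carries --TEST.
CODELIST_XML = f"""<CodeList xmlns="{ODM}" OID="CL.LBTESTCD" Name="Laboratory Test Code" DataType="text">
  <CodeListItem CodedValue="GLUC"><Decode><TranslatedText>Glucose</TranslatedText></Decode></CodeListItem>
  <CodeListItem CodedValue="FRUCT"><Decode><TranslatedText>Fructosamine</TranslatedText></Decode></CodeListItem>
</CodeList>"""


def load_decodes(codelist_xml):
    """Map each CodedValue to the text of its Decode."""
    decodes = {}
    for item in ET.fromstring(codelist_xml).iterfind("odm:CodeListItem", NS):
        decodes[item.get("CodedValue")] = item.findtext(
            "odm:Decode/odm:TranslatedText", namespaces=NS
        )
    return decodes


def with_test_names(records, decodes):
    """Add an LBTEST column generated from the metadata (the JOIN)."""
    return [dict(rec, LBTEST=decodes.get(rec["LBTESTCD"], "")) for rec in records]


# Submitted data contain only the test code, not the test name.
submitted = [
    {"LBTESTCD": "GLUC", "LBORRES": "5.4"},
    {"LBTESTCD": "FRUCT", "LBORRES": "250"},
]
decodes = load_decodes(CODELIST_XML)
view = with_test_names(submitted, decodes)
for row in view:
    print(row)
```

Because the test name is always derived from the code, a mismatched pair such as LBTESTCD="FRUCT" with LBTEST="Glucose" cannot appear in the view.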

If a company sticks to published CDISC-CT, then a second solution comes into play: web services, i.e. CDISC publishes the controlled terminology and makes it available as a web service, e.g. using REST or SOAP (this could be done through SHARE). A viewer tool (or any other software) then retrieves the value of the submitted --TESTCD (e.g. GLUC) and looks up the value of the corresponding --TEST (test name) using the web service.
One of our students, Mr. Wolfgang Hof has designed and implemented such a web service on a local server at the university. He also implemented it (client side) in our "Smart Dataset-XML viewer": when the user hovers the mouse over a --TESTCD value, the web service is triggered, a remote database queried, and the details about the test are displayed as a tooltip:

Both approaches essentially correspond to the idea that tools should retrieve metadata for data, and that metadata are kept separated from the data themselves (as is also done in good database design).
An error such as in the above example can then not occur anymore...
So if we follow this "good practice" principle, we do not need --TEST anymore.

Let's take the next step and throw --TEST out of the SDTM. It is metadata, not data!