Working on and with CDISC Standards: Why SUPPQUAL sucks

Following up on my previous post on NSVs (Non-Standard Variables) which are banned to SUPPQUAL datasets in SDTM, I was wondering why it would be so hard (for the FDA) to recombine such NSVs with their parent record (or group of records) in the parent domain.
So I tried it myself.

We are currently working hard on an XML based format for SDTM (and ADaM and SEND) submissions. It is based on ODM "ClinicalData" and "ReferenceData". You can have a sneak preview here. We (one of my Masters student and I) am also working on a "smart" viewer for such data. I will report on the "Smart SDTM Viewer" later.
In the new XML format, we support that NSVs remain in their parent record, i.e. there will just be an "ItemData" for the NSV within the parent original "ItemGroupData" which contains also all the standard variables. In the define.xml, the corresponding metadata "ItemDef" then just has a flag - we use Role="SUPPLEMENTAL QUALIFIER".
Also the "Smart SDTM Viewer" supports this: such columns are simply colored differently.

As we needed test files for new format and the viewer, I took the recently published "Updated Pilot Submission Package" SAS Transport files (available from the "members only" part of the CDISC website) for the famous LZZT study. I converted them into our XML format and then tried to recombine SUPPXX datasets with their parent dataset. For doing so, I wrote an XSLT-2 stylesheet allowing two XML files to be recombined into a single one.

Everything went well for the SUPPDM, SUPPDS and SUPPAE datasets. Trouble started with SUPPLB.
The latter has over 64000 records and has a size of 55MB. The parent LB.xpt dataset has almost 60000 records and has a size of 33MB. This means that there is somewhat more than 1 NSV per LB record.
The fact alone that the size of SUPPLB is larger than that of the parent dataset shows that this format is not very efficient.
Now I started to run the transformation and waited, waited ... waited. I went to bed late in the evening and was surprised the following morning that the transformtion was still running.
Ultimately, it took over 2 days to recombine SUPPLB with LB.

So I looked into my recombination algorithm and tried to improve it. I soonly recognized there was not much room for improvement.

The problem came up again when working on the "Smart SDTM Viewer". It has a feature that when clicking a record in a SUPPXX "tab" or a record in the "RELREC" or "CO" "tab" in the viewer, one can be navigated to the parent record with a few mouseclicks only (try this using the SASViewer...). For FDA reviewers, I presume this will be a "killer feature".
Now I wanted to do the opposite, i.e. each record in the parent dataset that has one or more corresponding records in either a SUPPQUAL, CO or RELREC dataset would be marked (e.g. using another background color). In order to do so, a lookup must be done for each record in SUPPXX, RELREC and CO datasets.
What is the problem?

If you look at the structure of SUPPQUAL (CO and RELREC are very similar), you see that there is an IDVAR ("identifying variable") and IDVARVAL ("identifying variable value"). IDVAR can in principal be any variable from the parent dataset. Usually one chooses for --SEQ, but other popular identifiying variables are --SPID, --CAT, or even --VISIT or --VISITNUM.
Now when starting searching from the SUPPQUAL record, it is easy, as you know (thanks to IDVAR and IDVARVAL) what to look for in the parent domain.
When starting searching from the parent dataset and record however, you do not know (and the computer either) what to look for. You will probably start with USUBJID, and select all records in the SUPPQUAL dataset for that USUBJID. Then for each of these, you read what IDVAR and IDVARVAL are and check whether these correspond with the selected source record in the parent dataset. If there is a match, you create the NSV "ItemData" and add it as the last variable in the parent record (ItemGroupData). If you have 60000 records in the LB dataset, and 64000 in the SUPPLB dataset, all these lookups take a lot of time.

Conclusion: the way SUPPQUAL was developed easily allows to do a lookup from the SUPPQUAL record for the parent record. The other way around however is much more difficult, and will always take a lot of computer time if it has to be done for each record in large datasets such as the LB dataset.
So it is no wonder at all that FDA reviewers are unable to view NSVs together with the parent data in their standard review tools.

I do not know what was in the mind of the developers of SUPPQUAL at that time, but I would now consider it as a major design error.
There could have better choices, which I will explain in a future blog.

Working on and with CDISC Standards

Sunday, April 21, 2013

Why SUPPQUAL sucks

No comments:

Post a Comment