Sunday, April 21, 2013

Alternatives for SAS Transport 5 - 20 years ago

I have been often criticized that I criticize SAS Transport 5 without proposing a better alternative.
This blog contribution is a response to that.

Many many years ago (I haven't been able to find out exactly how many years ago - If you know it, please let me know), the FDA decided for SAS Transport 5 (".xpt" format) for submission of datasets. It is not clear to me how and why they came to this decision. I discussed it with several people that were already around at that time, but their statements are not unambiguous.
It is currently however clear that this was not a good choice, but that is an easy statement many years later.

First of all, is the choice for transport format important?
Essentially, SDTM is a semantic standard: it is about content.
SAS XPT, XML, CSV and others are transport formats, you can transport any kind of data with them.
So, one could transport SDTM data using any transport format, as the two are essentially independant. Stating e.g. that HL7-v3 is a better format for transporting SDTM data as it allows integration with healthcare is nonsense! It is the same as stating that "using the same truck for transporting cows and for transporting oranges allows us to breed cows that produce orange juice".
However, we cannot expect from the FDA that it accepts SDTM (and SEND and ADaM) data in any transport format - a choice (only one, or maybe a few) must be selected.

So what could have been the alternatives 20 years ago?
First we must take into account that at that time we did not have define.xml for the metadata (XML was still in it's early stages and now well known): metadata was usually submitted as a text file or PDF.

Let us look at few alternatives ...

A. CSV
Comma-separated (or similar) format has the advantage that it is very compact. It is for example (using a vertical bar as field separator) used in HL7-v2 messages with a lot of success.
I recently generated some CSV files from SDTM XPT datasets and for example found that the LB dataset from the FDA Pilot "LZZT" study was reduced by a factor 6 in file size. "Great" you would say when you think about the "file size complaints" of the FDA. However CSV also has some major disadvantages. First of all: how to declare the variable names and labels? That can be solved by adding two extra lines at the top of the document containing them. But what about associated codelists for coded variables (not to speak about valuelists)? That is much more difficult. Of course one can add extra CSV datasets for the coded values (enumerations) too, but how can one make the link between both?
I think it can be done somewhat artificially, but it is not ideal.

B. SQL
SQL is a data definition and query language for relational databases. It is standardized as ANSI-SQL and already very long in place and very much used. In a previous blog I wrote that SDTM is not to be considered as a database, but as a "View" on a database. So instead of obtaining the database itself, the FDA has decided that we must submit "Views" on the database. This has many disadvantages, as in a "View" it is much more difficult to detect data inconsistencies, and it is difficult (if not impossible) to 100% reconstruct the original database. The choice for the "View" surely was surely related to the choice for SAS Transport 5.
So why not submit the database instead of the view on the database? It would have many advantages, as well for the submittors as for the FDA itself.
How can we exchange a database in a vendor-neutral way (one of the requirements)? The answer is "ANSI SQL". The design and metadata of the database could be exchanged using ANSI-SQL "Create Table" statements, with primary key and foreign keys and everything, so also defining the relations between the tables. The actual data could then be exchanged using "SQL Insert" statements.
What would be the advantages? First of all we see that SDTM is in each new version containing new and more derived variables. For example, RFXSTDTC (Date/Time of first Study Treatment) is also already present in the EX (Exposure) dataset. RFXSTDTC was only added to DM because the reviewers are unable to combine both datasets (they just use the viewer which is not "understanding" SDTM). Using ANSI-SQL we could (and we MUST - database "normal forms" require this) get rid of redundant information, which would mean that our data is much more consistent.
Once the FDA would have received the "SQL Create Table" and "SQL Insert" datasets, it could easily import them into a database (the first step in using a data warehouse by the way), with consistencies being checked over the foreign keys. They could then create their OWN views on the data (ANSI-SQL "Create View", such as views very similar to the current SDTM datasets.
This would mean that much less data would need to be submitted (I guess a 50% reduction or more) and that the FDA could easily create any possible views on the data, something they currently cannot do.

Remark that it is still possible to keep the "vertical" table structure as in the SDTM (in the database world, this is called the "entity-attribute-value" (EAV) model. It would also allow to keep the supplemental qualifiers within their parent dataset (a major problem in the current SDTM implementation).

C. Any other?
Would you have had another choice for a vendor-neutral transport format for dataset submission 20 years ago? Please let me know or write a comment.


My personal choice:
If I would have been the FDA twenty (or so) years ago, I would have chosen for ANSI-SQL as a vendor-neutral data submission format. But I am not the FDA, and I was not working in clinical research yet at that time. I do not know whether it has ever been considered (if you know, let me know).

So why not ANSI-SQL as the new transport format today?
Data exchange between computers and especially databases is usually not done using SQL nowadays. The reason is that we have a better alternative, namely XML. XML is not limited to representing flat, two-dimensional data, but has much more features. In a world "that is not flat", it can be used for representing many-dimensional data (though SDTM currently remains "flat" - will it after SHARE?) and it can also be used to e.g. visualize data (using stylesheets) something that cannot be done using the SQL format.

Your comments are welcome, as usual!


Why SUPPQUAL sucks

Following up on my previous post on NSVs (Non-Standard Variables) which are banned to SUPPQUAL datasets in SDTM, I was wondering why it would be so hard (for the FDA) to recombine such NSVs with their parent record (or group of records) in the parent domain.
So I tried it myself.

We are currently working hard on an XML based format for SDTM (and ADaM and SEND) submissions. It is based on ODM "ClinicalData" and "ReferenceData". You can have a sneak preview here. We (one of my Masters student and I) am also working on a "smart" viewer for such data. I will report on the "Smart SDTM Viewer" later.
In the new XML format, we support that NSVs remain in their parent record, i.e. there will just be an "ItemData" for the NSV within the parent original "ItemGroupData" which contains also all the standard variables. In the define.xml, the corresponding metadata "ItemDef" then just has a flag - we use Role="SUPPLEMENTAL QUALIFIER".
Also the "Smart SDTM Viewer" supports this: such columns are simply colored differently.

As we needed test files for new format and the viewer, I took the recently published "Updated Pilot Submission Package" SAS Transport files (available from the "members only" part of the CDISC website) for the famous LZZT study. I converted them into our XML format and then tried to recombine SUPPXX datasets with their parent dataset. For doing so, I wrote an XSLT-2 stylesheet allowing two XML files to be recombined into a single one.

Everything went well for the SUPPDM, SUPPDS and SUPPAE datasets. Trouble started with SUPPLB.
The latter has over 64000 records and has a size of 55MB. The parent LB.xpt dataset has almost 60000 records and has a size of 33MB. This means that there is somewhat more than 1 NSV per LB record.
The fact alone that the size of SUPPLB is larger than that of the parent dataset shows that this format is not very efficient.
Now I started to run the transformation and waited, waited ... waited. I went to bed late in the evening and was surprised the following morning that the transformtion was still running.
Ultimately, it took over 2 days to recombine SUPPLB with LB.

So I looked into my recombination algorithm and tried to improve it. I soonly recognized there was not much room for improvement.

The problem came up again when working on the "Smart SDTM Viewer". It has a feature that when clicking a record in a SUPPXX "tab" or a record in the "RELREC" or "CO" "tab" in the viewer, one can be navigated to the parent record with a few mouseclicks only (try this using the SASViewer...). For FDA reviewers, I presume this will be a "killer feature".
Now I wanted to do the opposite, i.e. each record in the parent dataset that has one or more corresponding records in either a SUPPQUAL, CO or RELREC dataset would be marked (e.g. using another background color). In order to do so, a lookup must be done for each record in SUPPXX, RELREC and CO datasets.
What is the problem?

If you look at the structure of SUPPQUAL (CO and RELREC are very similar), you see that there is an IDVAR ("identifying variable") and IDVARVAL ("identifying variable value"). IDVAR can in principal be any variable from the parent dataset. Usually one chooses for --SEQ, but other popular identifiying variables are --SPID, --CAT, or even --VISIT or --VISITNUM.
Now when starting searching from the SUPPQUAL record, it is easy, as you know (thanks to IDVAR and IDVARVAL) what to look for in the parent domain.
When starting searching from the parent dataset and record however, you do not know (and the computer either) what to look for. You will probably start with USUBJID, and select all records in the SUPPQUAL dataset for that USUBJID. Then for each of these, you read what IDVAR and IDVARVAL are and check whether these correspond with the selected source record in the parent dataset. If there is a match, you create the NSV "ItemData" and add it as the last variable in the parent record (ItemGroupData). If you have 60000 records in the LB dataset, and 64000 in the SUPPLB dataset, all these lookups take a lot of time.

Conclusion: the way SUPPQUAL was developed easily allows to do a lookup from the SUPPQUAL record for the parent record. The other way around however is much more difficult, and will always take a lot of computer time if it has to be done for each record in large datasets such as the LB dataset.
So it is no wonder at all that FDA reviewers are unable to view NSVs together with the parent data in their standard review tools.

I do not know what was in the mind of the developers of SUPPQUAL at that time, but I would now consider it as a major design error.
There could have better choices, which I will explain in a future blog.

Saturday, February 2, 2013

SDTM and non-standard variables

The new draft SDTM-IG 1.4 package 2 contains an interesting statement in the "SDS Proposal for alternate Handling of Supplemental Qualifiers" (file: SDS_Proposal-Alternate_Handling_of_Supplemental_Qualifiers.pdf) document. I cite:

"The Supplemental Qualifiers structure was created to address the need to represent NSVs in the SDTM. It consists of a normalized data structure designed to allow for standard representation of what often is a wide variety of sponsor-specific variables.
The vision at the time the Supplemental Qualifiers structure was created was that standard review tools used at the FDA would automatically display the NSVs together with the parent data in tabular views. The reality, however, is the representation of NSVs in separate SUPP-- datasets has resulted in increased effort by reviewers to prepare for and perform reviews of SDTM datasets.
The end result is that a data structure that was created to provide a standard method for representing NSVs has not been the best structure for the viewing and analysis of SDTM data by FDA reviewers".

Where "NSV" means "Non-standard variables".

Essentially the above statement says that reviewers are not able to combine variables in SUPP-- domains with the parent domain and bring them "back" to the original domain.

I am not surprised.

In conversations with FDA representatives, I heard that SDTM data is viewed by reviewers "as is". Usually there is no attempt to generate a database from the submission datasets. Without that, it is not easy to recombine SUPP-- datasets with the parent domain.
Obviously, there is not much database knowledge at the FDA. I have always regarded their attempt to come to a datawarehouse as "a bridge too far". If they are not able to generate a database, how would they be able to generate a datawarehouse?

The proposed solution that is provided in the above mentioned document is not new: the ODM team already proposed something very similar five years ago, but the teams ideas for an XML-ODM based replacement of SAS Transport 5 were vetoed by the FDA.
I did already write about that in the blog "SDTM Supplemental Qualifiers".

Generating a database from the submission datasets is not made easy by the structure of the SDTM. I already wrote a previous blog on this topic titled "Is SDTM a database and if so - is it a good one".

At the beginning, SDTM was developed to contain "collected data only", and only a minimal amount of "derived data" would be present. The idea was that derived data would go into ADaM.
It however soon was found out that derived data would need to go into the datasets themselves. A simple example is "AGE" in the Demographics (DM) dataset.
AGE is in most cases never collected - it is derived (calculated) from BRTHDTC (birth date) and RFSTDTC (reference start date). Both these are present and required in the DM dataset. So why is AGE than still necessary?
IF SDTM were a database representation, "AGE" would even not be allowed, as it violates the "third normal form", stating that "transitive dependency" is not allowed.
However I suspect that, as it was soon found out that the FDA could not, or did not want to generate a database from each submission, "AGE" was included in the SDTM as a variable of the DM domain.

This turned SDTM into a "View" on a database.
Regenerating a real database from a view on it, is however not an easy task.

In the last few years, we have again and over again seen that new variables are added to the SDTM. In many cases these are derived variables.
So for example, SDTM 1.3 / SDTM-IG 3.1.3 adds following variables to the DM domain:
  • RFXSTDTC: date/time of first study treatment [can however also be obtained from the EX dataset]
  • RFXENDTC: date/time of last study treatment [can however also be obtained from the EX dataset]
  • RFICDTC: date/time of informed consent
  • RFPENDTC: date/time of end of participation [can very probably be obtained from SE (subject elements) and/or from DS (disposition)
  • DTHDTC: date/time of death: this is clearly derived - I cannot image this is a standard field on the Demographics form. I haven't seen it in CDASH either
  • DTHFL: death flag - similar
  • ACTARMCD: actual arm code - this really does not belong here. It should be a variable in the SE (subject elements) domain. But it looks as it was put in DM as either reviewers do not inspect the SE datasets and/or cannot combine it with the DM dataset.
The most obvious example of a derived variable is "--TEST". There is a 1:1 correspondence between "--TESTCD" and "--TEST", clearly violating the 3th normal form of database design. The somewhat crazy thing is that in define.xml this is not taken into account: there are separate codelists for "--TESTCD" and for "--TEST". As "--TEST" is a synonym qualifier for "--TESTCD", the correct construct should be like:
<CodeListItem CodedValue="SYSBP">
    <Decode><TranslatedText>Systolic blood pressure</TranslatedText></Decode>
</CodeListItem>

and "--TEST" should even NOT be in the SDTM dataset.
In a tool that can as well look up the metadata as the data itself, the column "--TEST" would either be generated "on the fly" or the test name (value of --TEST) could e.g. appear as a tooltip when the mouse is positioned over the cell with the test code. With the ancient SASViewer this is of course not possible (as it even does not understand define.xml).

I also have always been puzzled about why we both have --STRESC and --STRESN. For --STRESN the IG says: "copied in numeric format from --STRESC". If the value in --STRESN is a copy of the value in --STRESC, why do we need it then? Maybe to explain the reviewer that the value is a numeric one? But that is already stated in the define.xml (ValueList / ItemDef) isn't it?

So why all these derived variables?

I suspect that the reason is very simple: the FDA is not able to do anything with the submitted datasets than to inspect them with the ancient SASViewer. They are not able to combine datasets with each other (generate views on views), they are not able to generate databases from them.
So each time they need a new "view" they just ask CDISC to define a new (derived) variable that is then added to the standard. The more information the FDA will want to retrieve from the datasets, the more new (derived) variables they will ask for to be included in the SDTM standard.
As can be expected, this will lead to even more "huge" datasets (as the number of variables is continously increasing), and we are then surprised that the FDA is complaining about file sizes!

Is there a way out of this?

I think there is once we have SDTM in XML (ODM-based) format. When both define.xml and SDTM datasets use the same model/format (ODM), we can generate software (e.g. viewers) that allow to "look up" information, or "calculate it on the fly".
For example, for the DM domain, suppose we leave "AGE" out of the dataset. The viewer can then take care that "AGE" is calculated "on the fly" and displayed either as an additional column (although it was not in the dataset), or as a tooltip for one of the cells (e.g. tooltip on USUBJID or on BRTHDTC).
Similarly, for the Findings domains, suppose we leave "--TEST" (test name) out of the dataset and only provide "--TESTCD". As both are connected to each other through the "code / decode" in the metadata, the column "--TEST" can automatically be generated when loading the data from the information in the define.xml. Or the value for --TEST can be supplied as a tooltip on the "--TESTCD" cell in the table.

Once we have such an XML-based format, we can start thinking about deprecating a number of SDTM variables, as it will essentially be the tool (e.g. the viewer) that takes care that they are derived, either from within the same dataset (e.g. AGE) as from other datasets (e.g. RFXSTDTC, RFXENDTC) as from the metadata in define.xml (e.g. --TEST - test name).
This will bring us to a much cleaner SDTM with much less redundant information, and considerably smaller files (I estimate a 50% reduction).

The ultimate solution is however SHARE. Once we can allocate each data point as being a SHARE object, things will become much easier.
There has been a lot of discussion about a format for SHARE. I am however more and more convinced that the format is completly unimportant. I am more and more convinced that SHARE objects can be expressed as RDF, as define.xml and ODM, as Excel (if you like that).

But that's a topic for another blog entry ...


SDTM --ORRESU and controlled terminology

A very interesting discussion (partially by e-mail, partially on the web) has been taken place in the last weeks on the question on whether --ORRESU ("Original Units") must be under control of terminology or not.

The question arose from another discussion on why there is a CDISC codelist for units, although there is already an international standard for that (UCUM), which is well used by the hospital world, and for which the use is mandatory in exchange standards for electronic health records (EHRs) such as HL7-CCD and CDA and ISO-21090.
As I was informed by someone from the FDA, the use of UCUM is also mandatory in SPL (Structured Product Labeling).

The discussion was also triggered by the observation that when one extracts lab data from EHRs and use that in clinical research and it leads to an SDTM submission, the value in LBORRESU in some cases leads to a validation error when using OpenCDISC. A typical example is "mm[Hg]" which is the international symbol (UCUM) for "millimiter mercury" as a unit for the property "pressure". OpenCDISC throws an error as "mm[Hg]" is not in the "UNIT" controlled terminology codelist.
Other people remarked that when using paper case report forms, the way the unit is written on the CRF varies very often (when not preprinted). For example, for a mass concentration, they found it written as "Gr/dl", "Gramm per deciliter", "Gramm/dL", "g/dL" etc.. Only the latter is correct CDISC controlled terminology.
So if I find "Gramm per deciliter" on the CRF, what must go into LBORRESU?
The SDTM-IG for LBORRESU states "original units" for the variable label, but the column "Controlled Terms, CodeList or format" states "(UNIT)", meaning that the value must come from the "UNIT" codelist.
Whether this codelist is extensible (by the sponsor) is not mentioned in the SDTM-IG.
In my opinion, the wording "original units" contradicts with the statement that the "UNIT" codelist must be used.
Only in an ideal world, every researcher would exactly use what is in the CDISC "UNIT" codelist when writing (as narrative text) the unit on the paper CRF.

A response came from a distinguished SDTM pioneer stating that one needs to distinguish between the "actual unit" that came from the CRF, and the way it is represented in SDTM, and which needs to come from the UNIT codelist.
So that would mean that if the researcher had written "mill.Hg" on the CRF, this would never appear in VSORRESU, but needs to be "translated" into "mmHg".
Similarly, when the data comes from an EHR, the UCUM unit "mm[Hg]" will never go into VSORRESU but needs to be translated into "mmHg" again.

If we do so, how much is ORRESU then still "original unit"? Isn't such a translation a "derivation"?
What about "translation errors"? Can we trust that the unit is always translated correctly into one from the UNIT codelist? I (and a lot of other in the discussion) have some serious doubts.

What about tests for which there is a unit available, but for which there is no unit present yet in the UNIT codelist for that property? Do we need to wait submitting to the FDA until the moment that our "new term request" is approved?
And isn't it a bit strange that CDISC just choose to ignore/exclude a standard (UCUM) that is very well used, and choose to develop its own list for units? One statement in the discussion was that UCUM is not well known in smaller labs and by a number of LIMS systems, but is that a reason to develop a whole new list, and then to say that the labs and LIMS should learn that new list, instead of stating that they should learn UCUM? It reminds me about the cartoon that you can find at http://xkcd.com/927/.

One very good suggestion was given by a representative of a biotech company, proposing to have two variables for the "original" units, one named e.g. --RPRESU (new), the "reported result unit" which is "as collected", so not under controlled terminology, and --ORRESU (existing), which is the translation into something that is in the "UNIT" codelist. As such, the "as collected" unit would not be lost, and the reviewer at the FDA can use the (--ORRESU) term that is under CDISC controlled terminology because that is familiar to him/here.

So, when the information comes from an EHR, e.g. "mm[Hg]" (UCUM) would go into VSRPRESU and would then be "translated" into "mmHg" (CDISC-CT) which then goes into VSORRESU.
Which of the two (if not both) would then be made "expected" is also an interesting discussion.

Though this doesn't take away my objections agains "the reinvention of the wheel", I think it could be a very good compromise.

Whether derived variables should go into SDTM at all is another discussion. For example, we must submit "AGE" in the demographics domain although that is usually never collected, but derived from BRTHDTH (birth date) and RFSTDTC (reference start date). Whether such derived variables should be in SDTM is surely related to the question "Is SDTM a database (design), and if so - is it a good one?" which I discussed in a previous blog entry.
My next blog entry will probably go about why SDTM has more and more derived variables.


Saturday, December 29, 2012

Another copy-and-paste frustration

Yesterday and today, I worked on implementing the new SDTM oncology and a number of the new draft SDTM-IG 3.1.4 domains in my SDTM-ELT(TM) software. For each new domain, I created a template define.xml file that can be read by the software.

It was so frustrating!

The new domains come as a PDF file with tables and additional information. During creation of the templates I continously needed to switch between the PDF and between the (smart) XML editor I am using. I needed to copy the variable name, role and datatype and then paste it into the template. I then also needed to add whether the variable is mandatory (required/expected) or not (permissible).

The validation whether everything was correctly implemented was purely visual, comparing a view of the machine-readable template with the tables in the PDF file. Another frustrating task...

In earlier days, the SDTM team also published these tables in an Excel file. Although not a vendor-neutral format, this Excel file at least allowed to automate some things. The way I proceeded was as follows:
First I read the Excel file into OpenOffice Calc (a competitor of Excel). Then I exported the tables as an OpenOffice "odf" file, which is essentially a zip file containing a set of XML files. After renaming and unzipping, one needs to take one of these XML files, and transform it to CDISC define.xml using a stylesheet.
Although each new publication of the SDTM team had differently formatted Excel files, meaning that I needed to adapt the stylesheet to transform to define.xml each time, it was a very good way, and I could generate the templates in a few hours.

The CDISC SDTM team does not publish their tables in a machine-readable format anymore. I do not know why, because I cannot imagine that the team is not using any computer tools to help in generating the tables. So why not publish them in a machine-readable format?

SDTM table definitions in a machine-readable format - that would be cool!

SDTM Supplemental Qualifiers

From the document "Proposal for Alternate Handling of Supplemental Qualifiers" of the CDISC SDTM Team:
"The vision at the time the Supplemental Qualifiers structure was created was that standard review tools used at the FDA would automatically display the Non-Standard-Variables (NSVs) together with the parent data in tabular views. The reality, however, is the representation of NSVs in separate SUPP-- datasets has resulted in increased effort by reviewers to prepare for and perform reviews of SDTM datasets.
The end result is that a data structure that was created to provide a standard method for representing NSVs has not been the best structure for the viewing and analysis of SDTM data by FDA reviewers. This document describes a proposed method by which NSVs and standard variables could be represented together in the parent datasets".

Or in other words: FDA reviewers are uncapable of merging/combining datasets (tables) so that the non-standard variables appear in their original table!

Oh my God: even my undergraduate students would be able to do that!

But why haven't the non-standard-variables not always be kept in their original table? Because they would then not be distinguishable from the standard variables? Yes, it is true that SAS Transport 5 does not have the capability of keeping metadata so that standard-variables and non-standard-variables can be distinguished, except for by naming conventions.

Wait a minute!
Metadata belong in the define.xml isn't it?
So why not just state in the define.xml that a variable is a non-standard variable? For example, suppose we have a non-standard variable "QSLANG" ("Question Language") in the QS (Questionnaires) domain. In the current SDTM implementation it is "banned" to the SUPPQS dataset.

We just keep that variable in the QS dataset and write in the define.xml:

<ItemGroupDef OID="QS" Name="QS" ... def:Label="Questionnaires" ...>
    <ItemRef ItemOID="STUDYID" Mandatory="Yes" OrderNumber="1"/>
    <ItemRef ItemOID="DOMAIN" Mandatory="Yes" OrderNumber="2"/>
    <ItemRef ItemOID="USUBJID" OrderNumber="3"/>
    ....
    <ItemRef ItemOID="QS.QSLANG" OrderNumber=".." def:IsNonStandardVariable="Yes"/>
</ItemGroupDef>

and of course define the corresponding ItemDef with OID "QS.QSLANG":
<ItemDef OID="QS.QSLANG" Name="QSLANG" def:Label="Question Language" DataType="text" Length="2"/>

DONE!

Does this help the reviewer in distinguishing between standard variables and non-standard variables?
Of course not very much. The FDA reviewers usually print out the define.xml (using the stylesheet view) and then inspect the XPT datasets in parallel. Not very modern isn't it?
The simple reason is that the viewer cannot read define.xml files.

Suppose however that the datasets do not come as XPT files, but as CDISC ODM (XML) files.
In CDISC ODM there is always a strong relation between the variable definition (ItemDef) and the data value (ItemData) over the OID. So for example, we would have (extending the above example):

<ItemDef OID="QS.QSLANG" Name="QSLANG" def:Label="Question Language" DataType="text" Length="2"/>

and the value:
<ItemGroupData ItemGroupOID="QS">
    <ItemData ItemOID="STUDYID" Value="STUDYX"/>
    <ItemData ItemOID="DOMAIN" Value="QS"/>
    <ItemData ItemOID="USUBJID" Value="P0001"/>
    ...
    <ItemData ItemOID="QS.QSLANG" Value="en"/>
</ItemGroupData>

This allows us to create software that tabulates the data and e.g. colors it depending whether it is a standard variable or a non-standard variable. This is very easy. It even can be done using an XSLT stylesheet when one wants to display the data in a browser.

It even creates new opportunities, such as:
- the software/stylesheet can easily validate the data, and e.g. color cells in the table (with an additional tooltip) for which there is an SDTM rule violation. For example an empty "QSREASND" can be colored red when the value of "QSSTAT" is "NOT DONE", this as the rule is that QSREASND may not be empty when QSSTAT is populated.
- In the LB (Laboratory) domain, values of "LBSTRESN" can be colored when the value is outside the reference range, i.e. lower than the value from LBSTNRLO or higher than the value of LBSTNRHI.
- rows that contain baseline data (--BLFL=Y) are colored differently
- data is sorted based on more than one criteria

All this is currently NOT possible using the SAS XPT Viewer, because it cannot read the define.xml containing the metadata.

There are much more opportunities: the number of them is only limited by your creativity.

Once we know how the future(?) ODM-based XML format for SDTM (and SEND and ADaM) data will look like, we will pilot such software and demonstrate it.
So ... keep tuned!

Thursday, November 8, 2012

SDTM databases and the FDA

Two days again, my (first year) bachelor students had their "Introduction to Databases" exam.

When looking for a good question on "CREATE VIEW" (or joins in general) I was thinking again about SDTM and whether it is suitable as a database. After all, the FDA has done several attempts to create an SDTM datawarehouse, and as everyone having some database skills knows, you cannot (or almost not) create a datawarehouse without having one or more databases.

So I came to the following exercise:

"Given the following tables:


write an SQL statement to generate the following result table:"

and then I gave them a picture of the result SDTM table for the Laboratory "LB" domain.

(P.S. a correct answer is: "
CREATE VIEW Laboratory AS
SELECT t.STUDYID, t.DOMAIN, t.USUBJID, t.LBSEQ, t.LBREFID, t.LBTESTCD, tc.LBTEST, tc.LBCAT, v.VISITNUM, v.VISIT, v.VISITDY FROM Laboratory_test t, Testcode tc, Visit v WHERE t.LBTESTCD = tc.LBTESTCD AND t.VISITNUM = v.VISITNUM; "  )

This brought me to the following thoughts: "if we submit our SDTM datasets as essentially being a "View" on an SDTM database (see a previous contribution), how can the FDA reconstruct a database from this?"

After all, they want to use this kind of data in a datawarehouse, so they need to start from databases. If they would like to reconstruct the database from the (sas transport) tables, they do need to split the "view" (e.g. the LB sas dataset) into 3 or more tables.
Or can one start from the "view" to populate a datawarehouse?

Splitting up a "view" table into 3 (or more) tables so that the first, second and third normal forms are obeyed, and doing this in an automated way, does not look to be very simple to me.
Just suppose that there is one inconsistency in the "view" (sas dataset) e.g. that for the same LBTESTCD (e.g. "BILI") there is more than one corresponding value for LBTEST (e.g. once "Bilirubin" and one "Billirubin"). What would then happen?

But we do not know whether the FDA really tries to reconstruct the original database, or that it just uses the SDTM (sas) tables "as is".

Any comments (especially from FDA people) are of course very welcome.