Sunday, July 13, 2014

Why SDTM should NOT contain --TEST as a variable

All the findings domains in the SDTM have both --TESTCD (test code) and --TEST (test name) variables. There is a pure 1:1 relation between --TESTCD and --TEST: for each unique value of --TESTCD there is a single unique value of --TEST. For example for LBTESTCD=GLUC, only LBTEST=Glucose is allowed.

Here is a view from a sample SDTM submission:


Nice, isn't it? But did you notice that there is an error? There is an LBTESTCD="FRUCT" with LBTEST="Glucose". Would a reviewer really notice? Can a machine easily find out that this is an error?

Although there is a 1:1 relation between the two, CDISC published separate codelists for them. So there is a codelist for LBTESTCD and another one for LBTEST. A bit strange, isn't it?
The 1:1 relation between individual terms is then established by the "NCI code". For example, both "GLUC" (in codelist LBTESTCD) and "Glucose" (in codelist LBTEST) have the same "NCI code", which is "C105585". So if one wants to validate whether a test code and a test name really fit together, one needs to go via the NCI code, and that is essentially what OpenCDISC is doing.
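To make this concrete, in define.xml terms the two separate codelists look roughly like this (abridged, OIDs illustrative), with the shared NCI code carried as an Alias:

<CodeList OID="CL.LBTESTCD" Name="Laboratory Test Code" DataType="text">
  <EnumeratedItem CodedValue="GLUC">
    <Alias Context="nci:ExtCodeID" Name="C105585"/>
  </EnumeratedItem>
</CodeList>
<CodeList OID="CL.LBTEST" Name="Laboratory Test Name" DataType="text">
  <EnumeratedItem CodedValue="Glucose">
    <Alias Context="nci:ExtCodeID" Name="C105585"/>
  </EnumeratedItem>
</CodeList>

The only thing tying "GLUC" to "Glucose" is the shared Alias "C105585" sitting in two separate codelists.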
Of course, this leads to trouble when extending a codelist with one's own terms for --TESTCD and --TEST. Using the "CDISC way", there is no good way in define.xml to state that an added --TESTCD value belongs together with an added --TEST value. An example would be "FRUCTO" and "Fructose", which are currently not in the CDISC-CT and which would need to be added to two separate codelists.
My colleague Dave Iberson-Hurst also found that this approach (linking over the NCI code) has led to versioning issues, i.e. terms changing names between versions without notice!
We also found a few cases where this linking mechanism leads to wrong values for --TEST.

Although seeing the test name for each given test code is a nice feature for the reviewer, one should ask oneself whether --TEST should really be submitted in the datasets themselves, as this not only violates the third normal form of good database design (see my previous posts) but also blows up the size of the data sets. I estimate that data sets could be approximately 20% smaller if --TEST were not submitted.

There are two major types of solutions for resolving these issues.

The first is to recognize that this 1:1 relation exists, and that --TEST is essentially metadata (data about data) and that the codelists for --TESTCD and --TEST are essentially one codelist, meaning that they should be merged. This can be done using the classic CodeListItem mechanism in define.xml with a "CodedValue" and a "Decode". For example:
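A minimal sketch of such a merged codelist in define.xml (the OID is illustrative; "FRUCTO"/"Fructose" is the sponsor extension mentioned above):

<CodeList OID="CL.LBTESTCD" Name="Laboratory Test Code" DataType="text">
  <CodeListItem CodedValue="GLUC">
    <Decode><TranslatedText>Glucose</TranslatedText></Decode>
  </CodeListItem>
  <CodeListItem CodedValue="FRUCTO">
    <Decode><TranslatedText>Fructose</TranslatedText></Decode>
  </CodeListItem>
</CodeList>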


A viewer can then retrieve the "Decode" value from the define.xml and display it in a column --TEST that is generated by the viewer itself (so --TEST is NOT in the submitted data set). In databases, this corresponds to a JOIN between two tables (one with data and one with metadata).

If a company sticks to published CDISC-CT, a second solution comes into play: web services, i.e. CDISC publishes the controlled terminology and makes it available as a web service, e.g. using REST or SOAP (this could be done through SHARE). A viewer tool (or any other software) then retrieves the value of the submitted --TESTCD (e.g. GLUC) and looks up the corresponding --TEST (test name) using the web service.
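Purely as an illustration (the URL and the response format below are hypothetical; no such CDISC or SHARE service exists yet), such a lookup could look like:

GET https://share.cdisc.example.org/ct/latest/LBTESTCD/GLUC

<term codelist="LBTESTCD" nciCode="C105585">
  <submissionValue>GLUC</submissionValue>
  <testName>Glucose</testName>
</term>

The viewer would then display "Glucose" in a generated --TEST column or as a tooltip, without "Glucose" ever being part of the submitted dataset.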
One of our students, Mr. Wolfgang Hof, has designed and implemented such a web service on a local server at the university. He also implemented it (client side) in our "Smart Dataset-XML viewer": when the user hovers the mouse over a --TESTCD value, the web service is triggered, a remote database is queried, and the details about the test are displayed as a tooltip:



Both approaches essentially correspond to the idea that tools should retrieve metadata for data, and that metadata are kept separated from the data themselves (as is also done in good database design).
An error such as in the above example can then no longer occur...
So if we follow this "good practice" principle, we do not need --TEST anymore.

Let's take the next step and throw --TEST out of the SDTM. It is metadata, not data!





Friday, April 11, 2014

Using UCUM units for CDISC-SEND



At the European CDISC Interchange we once again discussed replacing CDISC controlled terminology for units by UCUM, the latter being the worldwide standard for units, and used everywhere in healthcare and in electronic health records (mandatory in HL7-CDA).
 
Once again, I got the objection that UCUM is not usable in SEND (Standard for Exchange of Nonclinical Data), which is especially about preclinical research using animals, bacteria, etc.
 
The statement that UCUM is not usable for SEND is just not true. Most people who come with this argument haven't read the UCUM specification, so I will explain here how UCUM can (and should) be used for SEND.
 
Consider the following "unit" used in SEND:  g/animal/day
 
It contains three components, one of which is not a unit at all: "animal". Unfortunately, as everywhere in the CDISC-CT for units, objects or properties (like "animal") have been mixed up with real units. But sometimes this is just necessary to compensate for basic errors made in the SDTM and SEND models.
Now, as this also happens in other industries and sciences, UCUM has developed a mechanism for dealing with this. It is called "annotations". So the UCUM notation for "g/animal/day" is:
 
g/{animal}/d
 
with the "topic" (what it is about) of the unit being in curly brackets, the "annotation".
Also remark that „day“ is written in another way as in the CDISC notation.

One of the advantages of using UCUM is that it comes with a machine-executable XML file that describes all the units and their relations (ucum-essence.xml). So UCUM "knows" that a day is 24 hours, that an hour is 60 minutes, and so on, and all that in a machine-executable way. So if you want to convert N g/{animal}/d into "milligram per animal per minute", this can be fully automated (which is impossible when using CDISC-CT).
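For example, the entry for "day" in ucum-essence.xml looks roughly like this (abridged, some attributes omitted):

<unit Code="d" isMetric="no">
  <name>day</name>
  <printSymbol>d</printSymbol>
  <property>time</property>
  <value Unit="h" value="24">24</value>
</unit>

From such entries a program can chain the factors (day to hour to minute) and do the g/{animal}/d to mg/{animal}/min conversion without any human intervention; the {animal} annotation is simply carried along.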
 
The next objection of the SEND people followed immediately: "yes, but you can write anything between the curly brackets (anything in there is just an 'annotation'), so we do not have any control anymore over what people will submit".
Well, the answer to that is pretty easy: CDISC should not control "units" like "g/animal/day", it should control the annotations. So instead of publishing ever-growing lists containing things like "g/animal/day", they should publish lists of allowed "annotations". I picked a few examples from the current SEND-CT that could become such annotations:
{animal}
{cage}
{CAPSULE}
{BAR}
 
remark that "bar" has in CDISC-CT a totally different meaning than in the rest of the world: a "bar" in CDISC-CT is a unit of packaging like "a bar of chocolate" whereas for the rest of the world it is a unit of pressure. By making clear that it is a "UCUM annotation" however (i.e. putting it in curly brackets), it is immeditely clear that it is not a unit for pressure, but something else, and can even be used parallel with the real "bar" unit.
 
Another advantage of having CDISC control the annotations instead of the "units list" itself is that it allows for flexibility. For example, if someone (e.g. an investigator) needs a unit "animals per cage", currently a request must be made to CDISC to extend the "UNIT" codelist with a new term, which takes months, with the possibility that the new-term request is turned down.
 
When using UCUM however, with CDISC having control over the annotations, the term can be used immediately, as "cage" and "animal" are already in the list of allowed annotations. So the investigator can just use:
 
{animal}/{cage}
 
which is a valid UCUM unit.
 
Now the investigator realizes that "animals per cage" is not very precise. He/she has chickens, so what is important for the "density" (a UCUM property, by the way) is the number of chickens per square meter. Instead of needing to request a new term once again, he/she can simply use:
 
{animal}/m2
 
which is again a valid UCUM unit with the additional advantage that a computer can immediately and fully automatically calculate how many "animals per square yard" that is.
 
The investigator however also works with birds that can really fly (in contrast to chickens). In that case, the density is better defined by the number of birds per cubic meter. Without needing to request a new term again, he/she can now write:
 
{animal}/m3
 
which again is fully automatically interconvertible to e.g. "number of animals per gallon".
 
So my proposal to CDISC is: discontinue the development of this ever-growing list of "units" (many of which are not units at all) that are not interconvertible by computers. Start using UCUM and publish lists of allowed annotations. For each SEND (but also SDTM) variable, CDISC can then publish a list of (one or more) "strongly preferred" units. For example, for "height of subjects" (DM.HEIGHT):
cm
m
[in_i]
 
(the latter is UCUM for "inches"). This is a valid set of UCUM units which are fully interconvertible by computers (UCUM "knows" that an inch is 2.54 cm; CDISC-CT does not have that in machine-executable form).
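The corresponding ucum-essence.xml entry for the inch (again abridged, some attributes omitted) shows where that 2.54 cm comes from:

<unit Code="[in_i]" isMetric="no">
  <name>inch</name>
  <printSymbol>in</printSymbol>
  <property>length</property>
  <value Unit="cm" value="2.54">2.54</value>
</unit>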
 
Or for a SEND variable that describes the amount of food for the animals:
 
g/{animal}/d
g/{animal}/wk
 
which are all valid UCUM units, with the additional advantage that even when the investigator has collected the amount of food as "gram per animal per month" (g/{animal}/mo), this can be fully automatically recalculated into one of the above.
 
Comments are very welcome, as always.
 

Saturday, September 21, 2013

Is CDISC Controlled Terminology going the wrong way? A follow up

So I made some "change requests" to the CDISC-CT ...

Here are the answers (in italic) that were sent to me by the CDISC-CT team.

"The CDISC Terminiology team met to discuss your requests. Here are our decisions:

a) pH - Team agrees to remove from Unit codelist.

b) BEATS/MIN - This is a common clinical unit of measure for heart rate and pulse rate. It is classified as a unit of measure by SNOMED-CT. It is classified as a unit of measure by NCPDP (National Council for Prescription Drug Program). Therefore it will not be removed. 

 c) BAR - For CDISC, both units of measure and units of presentation are in the same codelist. It is up to the user based on the definition to select the correct term. This term is published in the FDA SPL list. We understand your concern regarding units of measure versus units of presentation. We have included units for presentation from ISO 11616 as distinguished from units of measure. If BAR for pressure is needed for clinical trial work, please submit a request to add that as a new term. Currently we will leave "BAR" as a dosing unit. 

Concerning a): thank you - there seem to be a few chemists left in this world.
Concerning b): this essentially says: we copy the errors of others.
Concerning c): oh my God ...

How does UCUM deal with the "beats/min" issue?
From the specification I read:
"§6 curly braces 1) The full range of characters 33–126 can be used within a pair of curly braces (‘{’ and ‘}’). The material enclosed in curly braces is called annotation.
2) Annotations do not contribute to the semantics of the unit but are meaningless by definition. Therefore, any fully conformant parser must discard all annotations. Parsers of limited conformace should not value annotations in comparison of units.
3) Annotations do, however, signify the end of a unit symbol.
4) An annotation without a leading symbol implies the default unit 1 (the unity).
5) Curly braces must not be nested.
Curly braces are here because people want annotations and deeply believe that they need annotations. Especially in chemistry and biomedical sciences, there are traditional habits to write annotations at units or instead of units, such as "%vol.", "RBC", "CFU", "kg(wet tis.)", or "mL(total)". These habits are hard to overcome. Any attempt of a coding scheme to restrict this perceived expressiveness will ultimately result in the coding scheme not being adopted, or just "half-way" adopted (which is as bad as not adopted).
Two alternative responses to this reality exist: either give in to the bad habits and blow up of the code with dimension- and meaningless unit atoms, or canalize this habit so that it does no harm. The Unified Code for Units of Measure canalizes this habit using curly braces."


So, for a heart beat frequency, the correct UCUM unit is "/min", but also ("canalized bad habit") the following would be allowed: "{beats}/min".

For "units" such as "bar", "bottle", "capsule" I have now proposed the team to move these into a separate codelist, so that these are really separate from "measument units" (according to wikipedia: "standardized quantity of measurement of a physical quantity").
This would strongly improve machine-processability of SDTM records: a machine can then be explained that "units" of this codelist can NOT be used for comparisons. With the current situation (mixing up of standardized units and non-standardized units) such machine-processability is impossible (or extremely dangerous).

I will follow up as soon as I have a response from the CDISC-CT team.

Wednesday, July 10, 2013

Is CDISC Controlled Terminology going the wrong way?

Today, I downloaded the latest (June 2013) CDISC controlled terminology (CDISC-CT) from the NCI website.
I have been communicating (well, it's more a one-way communication) with the CDISC-CT team for about a year now, as I have strong concerns about the direction CDISC-CT is taking.

As you might know, CDISC-CT is just a list. It does not have a system (unlike LOINC), nor is it a taxonomy (like SNOMED-CT). It's just a list, with all imaginable problems.

Here are my major concerns:
- the CT does not show at all that there is a 1:1 relation between --TESTCD (test code) and --TEST (test name). There are separate lists for each of them, as if they were unrelated.
- CDISC-CT reinvents the wheel: some codelists, like those for lab tests, units of measure, and microorganisms, are unnecessary as there already exists well-established controlled terminology used in healthcare. For lab tests there is LOINC, for units there is UCUM, and for microorganisms there is the "International Code of Nomenclature of Bacteria" (ICNB). All of these have a system or are taxonomies, so why try to replace them with something as basic as a simple list? Does CDISC-CT suffer from the "not invented here" syndrome?
Some within the CDISC-CT team claimed that their codelists are subsets of the mentioned ones. I checked, and it simply is not true! Others told me the CDISC-CT can easily be mapped to the mentioned ones. I checked, and it simply is not true! Due to the system (or better: lack of system) CDISC-CT is using, these lists will grow almost to infinity, and the length of the latest version (>7000 terms) strengthens my opinion on this. I heard of one sponsor who submitted over 500 "new term requests" for lab tests alone.
So why not use something (LOINC, UCUM, ICNB) that is used by everyone in health care and that has proven its merits?

There are also some peculiarities in the CDISC-CT list that draw my attention. What to think of the following:
- the unit of pH is ... PH!
In my first year at high school (well, I do also have a PhD in chemistry) I learned that pH is a property that is dimensionless, i.e. it does not have a unit.

- bar: from the same teacher (my chemistry teacher was also my physics teacher), I learned that "bar" is a well-established unit for the property "pressure". According to CDISC-CT, "bar" is defined as "A dosing measurement based on the bar unit". Do you understand what is meant here? I don't. Is this a circular definition (which is not a definition at all)? Is "bar" meant for something like "a bar of chocolate"? If so, is that a unit? Is a bar of chocolate the same amount in Belgium as it is in the US or Tokyo?
By "unit" I understand something that is the same amount everywhere in the world (well, a "gram" is).

- beats/minute: this is a more subtle one. The definition in CDISC-CT is: "the number of heart beats measured per minute time". Makes sense, doesn't it? No, it does NOT. First of all, the definition: is "beats/minute" limited to "heart beats"? I don't think so; I also use it for drum beats.
Secondly, the CDISC-CT mixes up the property (heart beats) with the unit. A heart beat is measured as a frequency, and some popular units for frequency are "/min" and "/sec".

- "bottle", "box", "can", "capsule", ... Is a "bottle" a "unit"? According to CDISC-CT, it is. I have a lot of bottles at home, and most can contain different amounts of liquids.
CDISC-CT is mixing up the concept of "object class" (in this case the parent class is "container") with the concept of "unit" (unity amount that is the same everywhere in the world).


I shared all these concerns with the CDISC-CT team already more than half a year ago, but they just do not seem to care. About 5 months ago, the CDISC-CT-LAB subteam promised me a written answer to my suggestion not to define their own codelist but to use the existing LOINC standard instead; I still haven't received it (though I have sent several reminders).

Do not misunderstand me - I am not against the development of CDISC controlled terminology! We really need CDISC-CT so that we are all "on the same wavelength", and especially so that we can compare values between different submissions from different sponsors.
But if we do it, we need to do it correctly, and not try to reinvent the wheel. Nor should we forget our basics in physics, chemistry and especially semantics, just for the sake of trying to control everything.

Sunday, April 21, 2013

Alternatives for SAS Transport 5 - 20 years ago

I have often been criticized for criticizing SAS Transport 5 without proposing a better alternative.
This blog contribution is a response to that.

Many, many years ago (I haven't been able to find out exactly how many years ago - if you know, please let me know), the FDA decided on SAS Transport 5 (".xpt" format) for the submission of datasets. It is not clear to me how and why they came to this decision. I discussed it with several people who were already around at that time, but their statements are ambiguous.
It is however clear by now that this was not a good choice, but that is an easy statement to make many years later.

First of all, is the choice for transport format important?
Essentially, SDTM is a semantic standard: it is about content.
SAS XPT, XML, CSV and others are transport formats, you can transport any kind of data with them.
So one could transport SDTM data using any transport format, as the two are essentially independent. Stating e.g. that HL7-v3 is a better format for transporting SDTM data because it allows integration with healthcare is nonsense! It is the same as stating that "using the same truck for transporting cows and for transporting oranges allows us to breed cows that produce orange juice".
However, we cannot expect the FDA to accept SDTM (and SEND and ADaM) data in any arbitrary transport format - a choice (only one, or maybe a few) must be made.

So what could have been the alternatives 20 years ago?
First we must take into account that at that time we did not have define.xml for the metadata (XML was still in its early stages and not well known): metadata was usually submitted as a text file or PDF.

Let us look at a few alternatives ...

A. CSV
A comma-separated (or similar) format has the advantage that it is very compact. It is, for example, used (with a vertical bar as field separator) in HL7-v2 messages with a lot of success.
I recently generated some CSV files from SDTM XPT datasets and found, for example, that the LB dataset from the FDA Pilot "LZZT" study was reduced in file size by a factor of 6. "Great", you would say, when you think about the "file size complaints" of the FDA. However, CSV also has some major disadvantages. First of all: how to declare the variable names and labels? That can be solved by adding two extra lines at the top of the document containing them. But what about associated codelists for coded variables (not to speak of value lists)? That is much more difficult. Of course one can add extra CSV datasets for the coded values (enumerations) too, but how can one make the link between the two?
I think it can be done somewhat artificially, but it is not ideal.
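A sketch of what such a CSV could look like, with two extra header lines for variable names and labels (all values and label texts below are purely illustrative):

STUDYID,DOMAIN,USUBJID,LBSEQ,LBTESTCD,LBORRES,LBORRESU
Study Identifier,Domain Abbreviation,Unique Subject Identifier,Sequence Number,Lab Test Short Name,Result in Original Units,Original Units
STUDY01,LB,STUDY01-0001,1,GLUC,5.4,mmol/L

The codelist for LBTESTCD would then have to live in yet another CSV file, and there is no standard way to declare that the two belong together - which is exactly the problem mentioned above.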

B. SQL
SQL is a data definition and query language for relational databases. It is standardized as ANSI-SQL, has been in place for a very long time, and is very widely used. In a previous blog I wrote that SDTM is not to be considered a database, but a "View" on a database. So instead of obtaining the database itself, the FDA has decided that we must submit "Views" on the database. This has many disadvantages, as in a "View" it is much more difficult to detect data inconsistencies, and it is difficult (if not impossible) to 100% reconstruct the original database. The choice for the "View" was surely related to the choice for SAS Transport 5.
So why not submit the database instead of the view on the database? It would have many advantages, for the submitters as well as for the FDA itself.
How can we exchange a database in a vendor-neutral way (one of the requirements)? The answer is "ANSI SQL". The design and metadata of the database could be exchanged using ANSI-SQL "Create Table" statements, with primary key and foreign keys and everything, so also defining the relations between the tables. The actual data could then be exchanged using "SQL Insert" statements.
What would be the advantages? First of all, we see that each new version of the SDTM contains more and more derived variables. For example, RFXSTDTC (Date/Time of First Study Treatment) contains information that is also already present in the EX (Exposure) dataset. RFXSTDTC was only added to DM because the reviewers are unable to combine both datasets (they just use the viewer, which does not "understand" SDTM). Using ANSI-SQL we could (and we MUST - database "normal forms" require this) get rid of redundant information, which would make our data much more consistent.
Once the FDA had received the "SQL Create Table" and "SQL Insert" statements, it could easily import them into a database (the first step towards a data warehouse, by the way), with consistency being checked via the foreign keys. They could then create their OWN views on the data (ANSI-SQL "Create View"), such as views very similar to the current SDTM datasets.
This would mean that much less data would need to be submitted (I guess a 50% reduction or more) and that the FDA could easily create any possible views on the data, something they currently cannot do.
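A minimal sketch of the "Create Table" / "Insert" / "Create View" idea in ANSI SQL (tables and column sets heavily simplified for illustration):

CREATE TABLE subject (
    usubjid  VARCHAR(40) PRIMARY KEY,
    brthdtc  DATE,
    sex      CHAR(1)
);

CREATE TABLE lb (
    usubjid   VARCHAR(40) NOT NULL REFERENCES subject (usubjid),
    lbseq     INTEGER     NOT NULL,
    lbtestcd  VARCHAR(8)  NOT NULL,
    lborres   VARCHAR(40),
    lborresu  VARCHAR(40),
    PRIMARY KEY (usubjid, lbseq)
);

INSERT INTO lb (usubjid, lbseq, lbtestcd, lborres, lborresu)
VALUES ('STUDY01-0001', 1, 'GLUC', '5.4', 'mmol/L');

-- the receiving agency creates its own views, e.g. one resembling the current LB dataset
CREATE VIEW lb_view AS
SELECT l.usubjid, l.lbseq, l.lbtestcd, l.lborres, l.lborresu, s.sex
FROM lb AS l
JOIN subject AS s ON s.usubjid = l.usubjid;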

Remark that it is still possible to keep the "vertical" table structure as in the SDTM (in the database world, this is called the "entity-attribute-value" (EAV) model). It would also allow keeping the supplemental qualifiers within their parent dataset (a major problem in the current SDTM implementation).

C. Any other?
Would you have had another choice for a vendor-neutral transport format for dataset submission 20 years ago? Please let me know or write a comment.


My personal choice:
If I had been the FDA twenty (or so) years ago, I would have chosen ANSI-SQL as a vendor-neutral data submission format. But I am not the FDA, and I was not working in clinical research yet at that time. I do not know whether it has ever been considered (if you know, let me know).

So why not ANSI-SQL as the new transport format today?
Data exchange between computers, and especially between databases, is nowadays usually not done using SQL. The reason is that we have a better alternative, namely XML. XML is not limited to representing flat, two-dimensional data, but has many more features. In a world "that is not flat", it can be used for representing multi-dimensional data (though SDTM currently remains "flat" - will it after SHARE?) and it can also be used to e.g. visualize data (using stylesheets), something that cannot be done with the SQL format.

Your comments are welcome, as usual!


Why SUPPQUAL sucks

Following up on my previous post on NSVs (Non-Standard Variables), which are banished to SUPPQUAL datasets in SDTM, I was wondering why it would be so hard (for the FDA) to recombine such NSVs with their parent record (or group of records) in the parent domain.
So I tried it myself.

We are currently working hard on an XML-based format for SDTM (and ADaM and SEND) submissions. It is based on ODM "ClinicalData" and "ReferenceData". You can have a sneak preview here. We (one of my Masters students and I) are also working on a "smart" viewer for such data. I will report on the "Smart SDTM Viewer" later.
In the new XML format, NSVs can remain in their parent record, i.e. there is just an "ItemData" element for the NSV within the original parent "ItemGroupData", which also contains all the standard variables. In the define.xml, the corresponding metadata ("ItemDef") then just carries a flag - we use Role="SUPPLEMENTAL QUALIFIER".
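A sketch of what such a record could look like in the ODM-based format (the OIDs, the non-standard variable "LBCLSIG" and all values are purely illustrative):

<ItemGroupData ItemGroupOID="IG.LB">
  <ItemData ItemOID="IT.LB.USUBJID" Value="STUDY01-0001"/>
  <ItemData ItemOID="IT.LB.LBSEQ" Value="17"/>
  <ItemData ItemOID="IT.LB.LBTESTCD" Value="GLUC"/>
  <ItemData ItemOID="IT.LB.LBORRES" Value="5.4"/>
  <!-- the non-standard variable stays in its parent record;
       its ItemDef in the define.xml carries Role="SUPPLEMENTAL QUALIFIER" -->
  <ItemData ItemOID="IT.LB.LBCLSIG" Value="N"/>
</ItemGroupData>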
Also the "Smart SDTM Viewer" supports this: such columns are simply colored differently.

As we needed test files for the new format and the viewer, I took the recently published "Updated Pilot Submission Package" SAS Transport files (available from the "members only" part of the CDISC website) for the famous LZZT study. I converted them into our XML format and then tried to recombine the SUPPXX datasets with their parent datasets. For doing so, I wrote an XSLT-2 stylesheet allowing two XML files to be recombined into a single one.

Everything went well for the SUPPDM, SUPPDS and SUPPAE datasets. Trouble started with SUPPLB.
The latter has over 64000 records and has a size of 55MB. The parent LB.xpt dataset has almost 60000 records and has a size of 33MB. This means that there is somewhat more than 1 NSV per LB record.
The fact alone that the size of SUPPLB is larger than that of the parent dataset shows that this format is not very efficient.
Now I started to run the transformation and waited, waited ... waited. I went to bed late in the evening and was surprised the following morning that the transformation was still running.
Ultimately, it took over 2 days to recombine SUPPLB with LB.

So I looked into my recombination algorithm and tried to improve it. I soon recognized there was not much room for improvement.

The problem came up again when working on the "Smart SDTM Viewer". It has a feature that, when clicking a record in a SUPPXX "tab" or a record in the "RELREC" or "CO" "tab" in the viewer, one is navigated to the parent record with only a few mouse clicks (try this using the SASViewer...). For FDA reviewers, I presume this will be a "killer feature".
Now I wanted to do the opposite, i.e. mark each record in the parent dataset that has one or more corresponding records in either a SUPPQUAL, CO or RELREC dataset (e.g. using another background color). In order to do so, a lookup must be done for each record in the SUPPXX, RELREC and CO datasets.
What is the problem?

If you look at the structure of SUPPQUAL (CO and RELREC are very similar), you see that there is an IDVAR ("identifying variable") and an IDVARVAL ("identifying variable value"). IDVAR can in principle be any variable from the parent dataset. Usually one chooses --SEQ, but other popular identifying variables are --SPID, --CAT, or even VISIT or VISITNUM.
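For illustration, a (simplified and made-up) SUPPLB record using LBSEQ as the identifying variable looks like this:

STUDYID  RDOMAIN  USUBJID       IDVAR  IDVARVAL  QNAM     QLABEL                  QVAL
STUDY01  LB       STUDY01-0001  LBSEQ  17        LBCLSIG  Clinically Significant  N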
Now, when starting the search from the SUPPQUAL record, it is easy, as you know (thanks to IDVAR and IDVARVAL) what to look for in the parent domain.
When starting the search from the parent dataset and record, however, you do not know (and neither does the computer) what to look for. You will probably start with USUBJID and select all records in the SUPPQUAL dataset for that USUBJID. Then, for each of these, you read what IDVAR and IDVARVAL are and check whether they correspond with the selected source record in the parent dataset. If there is a match, you create the NSV "ItemData" and add it as the last variable in the parent record (ItemGroupData). If you have 60,000 records in the LB dataset and 64,000 in the SUPPLB dataset, all these lookups take a lot of time.

Conclusion: the way SUPPQUAL was designed easily allows a lookup from the SUPPQUAL record to the parent record. The other way around, however, is much more difficult, and will always take a lot of computer time if it has to be done for each record in large datasets such as the LB dataset.
So it is no wonder at all that FDA reviewers are unable to view NSVs together with the parent data in their standard review tools.

I do not know what was in the minds of the developers of SUPPQUAL at that time, but I would now consider it a major design error.
There could have been better choices, which I will explain in a future blog.

Saturday, February 2, 2013

SDTM and non-standard variables

The new draft SDTM-IG 1.4 package 2 contains an interesting statement in the "SDS Proposal for alternate Handling of Supplemental Qualifiers" (file: SDS_Proposal-Alternate_Handling_of_Supplemental_Qualifiers.pdf) document. I cite:

"The Supplemental Qualifiers structure was created to address the need to represent NSVs in the SDTM. It consists of a normalized data structure designed to allow for standard representation of what often is a wide variety of sponsor-specific variables.
The vision at the time the Supplemental Qualifiers structure was created was that standard review tools used at the FDA would automatically display the NSVs together with the parent data in tabular views. The reality, however, is the representation of NSVs in separate SUPP-- datasets has resulted in increased effort by reviewers to prepare for and perform reviews of SDTM datasets.
The end result is that a data structure that was created to provide a standard method for representing NSVs has not been the best structure for the viewing and analysis of SDTM data by FDA reviewers".

Where "NSV" means "Non-standard variables".

Essentially the above statement says that reviewers are not able to combine variables in SUPP-- domains with the parent domain and bring them "back" to the original domain.

I am not surprised.

In conversations with FDA representatives, I heard that SDTM data is viewed by reviewers "as is". Usually there is no attempt to generate a database from the submission datasets. Without that, it is not easy to recombine SUPP-- datasets with the parent domain.
Obviously, there is not much database knowledge at the FDA. I have always regarded their attempt to come to a data warehouse as "a bridge too far". If they are not able to generate a database, how would they be able to generate a data warehouse?

The proposed solution that is provided in the above-mentioned document is not new: the ODM team already proposed something very similar five years ago, but the team's ideas for an XML-ODM based replacement of SAS Transport 5 were vetoed by the FDA.
I already wrote about that in the blog post "SDTM Supplemental Qualifiers".

Generating a database from the submission datasets is not made easy by the structure of the SDTM. I already wrote a previous blog on this topic titled "Is SDTM a database and if so - is it a good one".

At the beginning, SDTM was developed to contain "collected data only", and only a minimal amount of "derived data" would be present. The idea was that derived data would go into ADaM.
It was however soon found out that derived data would need to go into the datasets themselves. A simple example is "AGE" in the Demographics (DM) dataset.
AGE is in most cases never collected - it is derived (calculated) from BRTHDTC (birth date) and RFSTDTC (reference start date). Both of these are present in the DM dataset. So why is AGE then still necessary?
If SDTM were a database representation, "AGE" would not even be allowed, as it violates the "third normal form", which states that "transitive dependency" is not allowed.
However, I suspect that, as it was soon found out that the FDA could not, or did not want to, generate a database from each submission, "AGE" was included in the SDTM as a variable of the DM domain.

This turned SDTM into a "View" on a database.
Regenerating a real database from a view on it is however not an easy task.

In the last few years, we have seen over and over again that new variables are added to the SDTM. In many cases these are derived variables.
So, for example, SDTM 1.3 / SDTM-IG 3.1.3 adds the following variables to the DM domain:
  • RFXSTDTC: date/time of first study treatment [can however also be obtained from the EX dataset]
  • RFXENDTC: date/time of last study treatment [can however also be obtained from the EX dataset]
  • RFICDTC: date/time of informed consent
  • RFPENDTC: date/time of end of participation [can very probably be obtained from SE (subject elements) and/or from DS (disposition)]
  • DTHDTC: date/time of death: this is clearly derived - I cannot imagine this is a standard field on the Demographics form. I haven't seen it in CDASH either
  • DTHFL: death flag - similar
  • ACTARMCD: actual arm code - this really does not belong here. It should be a variable in the SE (subject elements) domain. But it looks as if it was put in DM because reviewers either do not inspect the SE datasets and/or cannot combine them with the DM dataset.
The most obvious example of a derived variable is "--TEST". There is a 1:1 correspondence between "--TESTCD" and "--TEST", clearly violating the third normal form of database design. The somewhat crazy thing is that in define.xml this is not taken into account: there are separate codelists for "--TESTCD" and for "--TEST". As "--TEST" is a synonym qualifier for "--TESTCD", the correct construct should be something like:
<CodeListItem CodedValue="SYSBP">
    <Decode><TranslatedText>Systolic blood pressure</TranslatedText></Decode>
</CodeListItem>

and "--TEST" should even NOT be in the SDTM dataset.
In a tool that can look up the metadata as well as the data itself, the column "--TEST" would either be generated "on the fly", or the test name (value of --TEST) could e.g. appear as a tooltip when the mouse is positioned over the cell with the test code. With the ancient SASViewer this is of course not possible (as it does not even understand define.xml).

I have also always been puzzled about why we have both --STRESC and --STRESN. For --STRESN the IG says: "copied in numeric format from --STRESC". If the value in --STRESN is a copy of the value in --STRESC, why do we need it? Maybe to tell the reviewer that the value is a numeric one? But that is already stated in the define.xml (ValueList / ItemDef), isn't it?

So why all these derived variables?

I suspect that the reason is very simple: the FDA is not able to do anything with the submitted datasets other than inspect them with the ancient SASViewer. They are not able to combine datasets with each other (generate views on views), and they are not able to generate databases from them.
So each time they need a new "view" they just ask CDISC to define a new (derived) variable that is then added to the standard. The more information the FDA will want to retrieve from the datasets, the more new (derived) variables they will ask for to be included in the SDTM standard.
As can be expected, this will lead to even more "huge" datasets (as the number of variables is continuously increasing), and then we are surprised that the FDA is complaining about file sizes!

Is there a way out of this?

I think there is, once we have SDTM in an XML (ODM-based) format. When both define.xml and the SDTM datasets use the same model/format (ODM), we can build software (e.g. viewers) that can "look up" information or "calculate it on the fly".
For example, for the DM domain, suppose we leave "AGE" out of the dataset. The viewer can then take care that "AGE" is calculated "on the fly" and displayed either as an additional column (although it was not in the dataset), or as a tooltip for one of the cells (e.g. tooltip on USUBJID or on BRTHDTC).
Similarly, for the Findings domains, suppose we leave "--TEST" (test name) out of the dataset and only provide "--TESTCD". As both are connected to each other through the "code / decode" in the metadata, the column "--TEST" can automatically be generated when loading the data from the information in the define.xml. Or the value for --TEST can be supplied as a tooltip on the "--TESTCD" cell in the table.

Once we have such an XML-based format, we can start thinking about deprecating a number of SDTM variables, as it will essentially be the tool (e.g. the viewer) that takes care that they are derived, whether from within the same dataset (e.g. AGE), from other datasets (e.g. RFXSTDTC, RFXENDTC), or from the metadata in define.xml (e.g. --TEST - test name).
This will bring us to a much cleaner SDTM with much less redundant information, and considerably smaller files (I estimate a 50% reduction).

The ultimate solution is however SHARE. Once we can allocate each data point as being a SHARE object, things will become much easier.
There has been a lot of discussion about a format for SHARE. I am however more and more convinced that the format is completely unimportant: SHARE objects can be expressed as RDF, as define.xml and ODM, or as Excel (if you like that).

But that's a topic for another blog entry ...