Yesterday and today, I worked on implementing the new SDTM oncology domains and a number of the new draft SDTM-IG 3.1.4 domains in my SDTM-ELT(TM) software. For each new domain, I created a template define.xml file that can be read by the software.
It was so frustrating!
The new domains come as a PDF file with tables and additional information. During creation of the templates, I continuously needed to switch between the PDF and the (smart) XML editor I am using. I needed to copy the variable name, role and datatype and paste them into the template. I then also needed to add whether the variable is mandatory (required/expected) or not (permissible).
Validating whether everything was correctly implemented was a purely visual exercise: comparing a view of the machine-readable template with the tables in the PDF file. Another frustrating task...
In earlier days, the SDTM team also published these tables in an Excel file. Although not a vendor-neutral format, this Excel file at least allowed me to automate some things. The way I proceeded was as follows:
First I read the Excel file into OpenOffice Calc (a competitor of Excel). Then I exported the tables as an OpenOffice "odf" file, which is essentially a zip file containing a set of XML files. After renaming and unzipping, one takes one of these XML files and transforms it into CDISC define.xml using a stylesheet.
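By the way, that whole pipeline is easy to automate. A minimal sketch in Python (the file names and the stylesheet are of course my own assumptions; any XSLT processor would do equally well):

import zipfile
from lxml import etree

# An OpenOffice .ods/.odf file is just a zip archive;
# the exported tables live in its "content.xml" entry.
with zipfile.ZipFile("sdtm_tables.ods") as odf:
    content = etree.parse(odf.open("content.xml"))

# Apply a (hypothetical) stylesheet that maps the table rows
# to define.xml ItemGroupDef/ItemDef structures.
transform = etree.XSLT(etree.parse("sdtm_tables_to_define.xsl"))
define_template = transform(content)

with open("define_template.xml", "wb") as out:
    out.write(etree.tostring(define_template, pretty_print=True,
                             xml_declaration=True, encoding="UTF-8"))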
Although each new publication of the SDTM team came with differently formatted Excel files, meaning that I needed to adapt the transformation stylesheet each time, this approach worked very well: I could generate the templates in a few hours.
The CDISC SDTM team no longer publishes its tables in a machine-readable format. I do not know why: I cannot imagine that the team does not use any computer tools to help generate the tables. So why not publish them in a machine-readable format?
SDTM table definitions in a machine-readable format - that would be cool!
Saturday, December 29, 2012
SDTM Supplemental Qualifiers
From the document "Proposal for Alternate Handling of Supplemental Qualifiers" of the CDISC SDTM Team:
"The vision at the time the Supplemental Qualifiers structure was created was that standard review tools used at the FDA would automatically display the Non-Standard-Variables (NSVs) together with the parent data in tabular views. The reality, however, is the representation of NSVs in separate SUPP-- datasets has resulted in increased effort by reviewers to prepare for and perform reviews of SDTM datasets.
The end result is that a data structure that was created to provide a standard method for representing NSVs has not been the best structure for the viewing and analysis of SDTM data by FDA reviewers. This document describes a proposed method by which NSVs and standard variables could be represented together in the parent datasets".
Or in other words: FDA reviewers are incapable of merging/combining datasets (tables) so that the non-standard variables appear in their original table!
Oh my God: even my undergraduate students would be able to do that!
But why weren't the non-standard variables simply kept in their original table in the first place? Because they would then not be distinguishable from the standard variables? Yes, it is true that SAS Transport 5 has no way of keeping metadata that distinguishes standard variables from non-standard variables, other than naming conventions.
Wait a minute!
Metadata belongs in the define.xml, doesn't it?
So why not just state in the define.xml that a variable is a non-standard variable? For example, suppose we have a non-standard variable "QSLANG" ("Question Language") in the QS (Questionnaires) domain. In the current SDTM implementation it is "banished" to the SUPPQS dataset.
We just keep that variable in the QS dataset and write in the define.xml:
<ItemGroupDef OID="QS" Name="QS" ... def:Label="Questionnaires" ...>
  <ItemRef ItemOID="STUDYID" Mandatory="Yes" OrderNumber="1"/>
  <ItemRef ItemOID="DOMAIN" Mandatory="Yes" OrderNumber="2"/>
  <ItemRef ItemOID="USUBJID" OrderNumber="3"/>
  ....
  <ItemRef ItemOID="QS.QSLANG" OrderNumber=".." def:IsNonStandardVariable="Yes"/>
</ItemGroupDef>
and of course define the corresponding ItemDef with OID "QS.QSLANG":
<ItemDef OID="QS.QSLANG" Name="QSLANG" def:Label="Question Language" DataType="text" Length="2"/>
DONE!
Does this help the reviewer in distinguishing between standard variables and non-standard variables?
Of course not very much. The FDA reviewers usually print out the define.xml (using the stylesheet view) and then inspect the XPT datasets in parallel. Not very modern, is it?
The simple reason is that the viewer cannot read define.xml files.
Suppose however that the datasets do not come as XPT files, but as CDISC ODM (XML) files.
In CDISC ODM there is always a strong relation between the variable definition (ItemDef) and the data value (ItemData) via the OID. So for example, we would have (extending the above example):
<ItemDef OID="QS.QSLANG" Name="QSLANG" def:Label="Question Language" DataType="text" Length="2"/>
and the value:
<ItemGroupData ItemGroupOID="QS">
  <ItemData ItemOID="STUDYID" Value="STUDYX"/>
  <ItemData ItemOID="DOMAIN" Value="QS"/>
  <ItemData ItemOID="USUBJID" Value="P0001"/>
  ...
  <ItemData ItemOID="QS.QSLANG" Value="en"/>
</ItemGroupData>
This allows us to create software that tabulates the data and e.g. colors it depending on whether it is a standard or a non-standard variable. This is very easy; it can even be done with an XSLT stylesheet when one wants to display the data in a browser.
It even creates new opportunities, such as:
- the software/stylesheet can easily validate the data and e.g. color cells in the table (with an additional tooltip) where there is an SDTM rule violation. For example, an empty "QSREASND" can be colored red when the value of "QSSTAT" is "NOT DONE", as the rule is that QSREASND may not be empty when QSSTAT is populated (see the sketch after this list).
- In the LB (Laboratory) domain, values of "LBSTRESN" can be colored when they fall outside the reference range, i.e. lower than the value of LBSTNRLO or higher than the value of LBSTNRHI.
- rows that contain baseline data (--BLFL=Y) are colored differently
- data can be sorted on more than one criterion
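To give an idea of how little code is needed, here is a minimal sketch in Python for the first rule, assuming the hypothetical def:IsNonStandardVariable attribute from above, a single file containing both the metadata and the data, and the OID conventions of the snippets above (namespace handling omitted for brevity):

from lxml import etree

tree = etree.parse("qs.xml")  # hypothetical ODM file with metadata and data

# Variables flagged as non-standard in the metadata part
nsv_oids = {ref.get("ItemOID") for ref in tree.iter("ItemRef")
            if ref.get("IsNonStandardVariable") == "Yes"}

for record in tree.iter("ItemGroupData"):
    values = {i.get("ItemOID"): i.get("Value") for i in record.iter("ItemData")}
    # color cells of non-standard variables differently
    for oid in values.keys() & nsv_oids:
        print(f"{oid}: non-standard variable -> use a different cell color")
    # SDTM rule: QSREASND may not be empty when QSSTAT is 'NOT DONE'
    if values.get("QS.QSSTAT") == "NOT DONE" and not values.get("QS.QSREASND"):
        print("violation: QSREASND is empty although QSSTAT = NOT DONE")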
All this is currently NOT possible using the SAS XPT Viewer, because it cannot read the define.xml containing the metadata.
There are many more opportunities: their number is only limited by your creativity.
Once we know what the future(?) ODM-based XML format for SDTM (and SEND and ADaM) data will look like, we will pilot such software and demonstrate it.
So ... stay tuned!
"The vision at the time the Supplemental Qualifiers structure was created was that standard review tools used at the FDA would automatically display the Non-Standard-Variables (NSVs) together with the parent data in tabular views. The reality, however, is the representation of NSVs in separate SUPP-- datasets has resulted in increased effort by reviewers to prepare for and perform reviews of SDTM datasets.
The end result is that a data structure that was created to provide a standard method for representing NSVs has not been the best structure for the viewing and analysis of SDTM data by FDA reviewers. This document describes a proposed method by which NSVs and standard variables could be represented together in the parent datasets".
Or in other words: FDA reviewers are uncapable of merging/combining datasets (tables) so that the non-standard variables appear in their original table!
Oh my God: even my undergraduate students would be able to do that!
But why haven't the non-standard-variables not always be kept in their original table? Because they would then not be distinguishable from the standard variables? Yes, it is true that SAS Transport 5 does not have the capability of keeping metadata so that standard-variables and non-standard-variables can be distinguished, except for by naming conventions.
Wait a minute!
Metadata belong in the define.xml isn't it?
So why not just state in the define.xml that a variable is a non-standard variable? For example, suppose we have a non-standard variable "QSLANG" ("Question Language") in the QS (Questionnaires) domain. In the current SDTM implementation it is "banned" to the SUPPQS dataset.
We just keep that variable in the QS dataset and write in the define.xml:
<ItemGroupDef OID="QS" Name="QS" ... def:Label="Questionnaires" ...>
<ItemRef ItemOID="STUDYID" Mandatory="Yes" OrderNumber="1"/>
<ItemRef ItemOID="DOMAIN" Mandatory="Yes" OrderNumber="2"/>
<ItemRef ItemOID="USUBJID" OrderNumber="3"/>
....
<ItemRef ItemOID="QS.QSLANG" OrderNumber=".." def:IsNonStandardVariable="Yes"/>
</ItemGroupDef>
and of course define the corresponding ItemDef with OID "QS.QSLANG":
<ItemDef OID="QS.QSLANG" Name="QSLANG" def:Label="Question Language" DataType="text" Length="2"/>
DONE!
Does this help the reviewer in distinguishing between standard variables and non-standard variables?
Of course not very much. The FDA reviewers usually print out the define.xml (using the stylesheet view) and then inspect the XPT datasets in parallel. Not very modern isn't it?
The simple reason is that the viewer cannot read define.xml files.
Suppose however that the datasets do not come as XPT files, but as CDISC ODM (XML) files.
In CDISC ODM there is always a strong relation between the variable definition (ItemDef) and the data value (ItemData) over the OID. So for example, we would have (extending the above example):
<ItemDef OID="QS.QSLANG" Name="QSLANG" def:Label="Question Language" DataType="text" Length="2"/>
and the value:
<ItemGroupData ItemGroupOID="QS">
<ItemData ItemOID="STUDYID" Value="STUDYX"/>
<ItemData ItemOID="DOMAIN" Value="QS"/>
<ItemData ItemOID="USUBJID" Value="P0001"/>
...
<ItemData ItemOID="QS.QSLANG" Value="en"/>
</ItemGroupData>
This allows us to create software that tabulates the data and e.g. colors it depending whether it is a standard variable or a non-standard variable. This is very easy. It even can be done using an XSLT stylesheet when one wants to display the data in a browser.
It even creates new opportunities, such as:
- the software/stylesheet can easily validate the data, and e.g. color cells in the table (with an additional tooltip) for which there is an SDTM rule violation. For example an empty "QSREASND" can be colored red when the value of "QSSTAT" is "NOT DONE", this as the rule is that QSREASND may not be empty when QSSTAT is populated.
- In the LB (Laboratory) domain, values of "LBSTRESN" can be colored when the value is outside the reference range, i.e. lower than the value from LBSTNRLO or higher than the value of LBSTNRHI.
- rows that contain baseline data (--BLFL=Y) are colored differently
- data is sorted based on more than one criteria
All this is currently NOT possible using the SAS XPT Viewer, because it cannot read the define.xml containing the metadata.
There are much more opportunities: the number of them is only limited by your creativity.
Once we know how the future(?) ODM-based XML format for SDTM (and SEND and ADaM) data will look like, we will pilot such software and demonstrate it.
So ... keep tuned!
Thursday, November 8, 2012
SDTM databases and the FDA
Two days ago, my (first-year) bachelor students had their "Introduction to Databases" exam.
When looking for a good question on "CREATE VIEW" (or joins in general), I was thinking again about SDTM and whether it is suitable as a database. After all, the FDA has made several attempts to create an SDTM data warehouse, and as everyone with some database skills knows, you can hardly create a data warehouse without having one or more databases.
So I came to the following exercise:
"Given the following tables:
write an SQL statement to generate the following result table:"
and then I gave them a picture of the result SDTM table for the Laboratory "LB" domain.
(P.S. a correct answer is:
CREATE VIEW Laboratory AS
SELECT t.STUDYID, t.DOMAIN, t.USUBJID, t.LBSEQ, t.LBREFID, t.LBTESTCD,
       tc.LBTEST, tc.LBCAT,
       v.VISITNUM, v.VISIT, v.VISITDY
FROM Laboratory_test t, Testcode tc, Visit v
WHERE t.LBTESTCD = tc.LBTESTCD
  AND t.VISITNUM = v.VISITNUM; )
This brought me to the following thought: "if we submit our SDTM datasets as essentially being a view on an SDTM database (see a previous contribution), how can the FDA reconstruct a database from this?"
After all, they want to use this kind of data in a data warehouse, so they need to start from databases. If they want to reconstruct the database from the (SAS Transport) tables, they need to split the "view" (e.g. the LB SAS dataset) into three or more tables.
Or can one start from the "view" to populate a datawarehouse?
Splitting up a "view" table into three (or more) tables so that the first, second and third normal forms are obeyed, and doing this in an automated way, does not look very simple to me.
Just suppose that there is one inconsistency in the "view" (SAS dataset), e.g. that for the same LBTESTCD (e.g. "BILI") there is more than one corresponding value for LBTEST (e.g. once "Bilirubin" and once "Billirubin"). What would then happen?
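Detecting such inconsistencies is at least easy to automate. A small sketch in Python, assuming the LB dataset has been exported to a CSV file:

import csv
from collections import defaultdict

# Collect every LBTEST spelling that occurs for each LBTESTCD
names = defaultdict(set)
with open("lb.csv", newline="") as f:
    for row in csv.DictReader(f):
        names[row["LBTESTCD"]].add(row["LBTEST"])

for code, labels in sorted(names.items()):
    if len(labels) > 1:
        # e.g. BILI -> ['Bilirubin', 'Billirubin']
        print(f"inconsistent test names for {code}: {sorted(labels)}")

What a normalization routine should then do with the conflicting rows is, of course, exactly the open question.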
But we do not know whether the FDA really tries to reconstruct the original database, or whether it just uses the SDTM (SAS) tables "as is".
Any comments (especially from FDA people) are of course very welcome.
Saturday, November 3, 2012
Some comments on the "Pink sheet" article
The Pink Sheet, a well-known online journal on regulatory, legislative, legal and business developments, published an article on October 23 titled "FDA Data-Standards Landslide: CDISC's Model Wins Docket Comment Contest".
Here are some of the highlights followed by my personal comments (which are not necessarily those of the CDISC XML Technology Team - IMPORTANT):
"... Therefore, the agency wishes to replace it with an open set of standards “that will support interoperability,” and FDA asked for comments on CDISC/ODM versus a competing set of standards from Health Level 7."
The question here is whether the ancient SAS-XPT transport format currently used for electronic SDTM, SEND and ADaM submissions should be replaced by either a CDISC ODM based format or by an HL7-v3 based one (probably CDA-like, or a totally new HL7 XML-based format). Please note that the discussion is about a transport format, i.e. the SDTM standard itself would not necessarily need to change. Or in other words: the SDTM domains and variables would remain the same; it is just the way they are formatted and transported (the transport format) that would change.
"Pharma and bio industry majors, including Novo Nordisk AS, Novartis AG,Sanofi,Amgen Inc.,Merck Sharp & Dohme Ltd., Astellas Pharma US Inc., BioClinica andBiogen Idec Inc., lined up behind CDISC."
That is of course good to hear. In the past, sponsor companies did not care too much about transport formats; they just concentrated on semantic standards like SDTM. But they did realize that the XPT format was a "dead end" with very many disadvantages, leading to a lot of problems. They now also realize that CDISC ODM may be an excellent modern replacement, and that HL7-v3 is not a suitable format for submissions to the FDA (see e.g. my article "Ten good reasons why an HL7-XML message is not always the best solution as a format for a CDISC standard (and especially not for submission data)").
"Amgen’s comments were typical of the tenor of the support: “CDISC ODM is already well integrated into clinical data systems and there is a broad knowledge of this standard within the BioPharma industry already. It supports both metadata and data exchange, was developed for the exchange of clinical research data and is 21 CFR Part 11 compliant.
However, the
company added, it will be necessary to anticipate and deal with
variations in interpretation by different industry users of the standard
and between sponsors and FDA, as well as “handling multiple
versions within and across submissions for the same compound.”
The latter is of course correct. We will need to extend the ODM standard or develop an extension for it (as we did for define.xml), and we will need to publish an implementation guide. We will also develop an ISO-Schematron to clearly define most of the business rules, such as the rules for the use of --STAT and --REASND. So we will not allow different interpretations and "dialects" of the standard to arise.
"Novartis' comments argued that 'given
that CDISC ODM is kept closer in synch with the CDISC standards
themselves and the industry has been watching and moving towards being
able to provide CDISC SDTM [CDISC’s
Study Data Tabulation Model]-compliant submissions, it would seem to
make sense that using CDISC ODM as an exchange standard makes more sense
than any other'. Novartis added that work remains to be
done on ODM to bring it into compliance with the widespread pharma
industry use of relational technology for data storage and analysis, as
well as on ODM’s extensions."
Of course, we also realize that work still needs to be done on ODM. Some of us are already taking inventory of what should be done better in a next version of ODM, or what should be added. For example, we need even more support for data points coming from electronic health records (see some of my articles in another blog), even though ODM already has such support through its extension mechanism.
"Merck’s backing of ODM was more
conditional. The company said it currently uses ODM for third-party data
but prefers to keep using SAS version 9 or an ASCII file format
“because the ODM is ‘performance heavy,’
meaning that it slows system performance.” Therefore, Merck’s comments
suggested FDA use the “Define.XML” subset of ODM, although elsewhere in
its comments it cautioned against using XML (extensible markup
language) itself as a submission standard, since
translating all study data into this format would be time-consuming and
overly demanding of FDA system resources."
The remark that ODM is "performance heavy" is not entirely true. I admit that this was the case 5-10 years ago, but since then XML parsing software and libraries have evolved enormously and show much better performance. My own observation is that reading and displaying a very large SDTM dataset in XML (given well-written software) is as performant as using the SASViewer for XPT files.
Furthermore, some newer technologies such as VTD-XML make it possible to minimize memory usage when parsing very large XML files, such that files with a disk size of up to about 50% of the available memory can still be parsed and displayed.
Also, my own experience is that for the same content, XML files with their tags do not have to be larger than SAS-XPT files, as the latter waste a lot of disk space due to their fixed-length records.
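VTD-XML itself is a Java/C library, but the streaming idea is easy to illustrate. A sketch using Python's lxml, which processes one record at a time and immediately frees it again (the file name is invented; namespace handling omitted):

from lxml import etree

count = 0
# Iterate over ItemGroupData elements (one per record) without
# ever building the complete document tree in memory.
for _, record in etree.iterparse("lb_huge.xml", tag="ItemGroupData"):
    count += 1                       # ... tabulate/validate the record here ...
    record.clear()                   # free the record's children
    while record.getprevious() is not None:
        del record.getparent()[0]    # drop already-processed siblings
print(count, "records processed")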
Merck's comment on the use of define.xml builds further on this. My answer is that of course the new ODM-based format for electronic submissions will be based on define.xml!
Just as for collected data, where the ODM element "MetaDataVersion" contains the metadata and the element "ClinicalData" contains the actual data (and the two have to match each other), the new format (although very little will really be new) will have the define.xml for the metadata, and another (maybe new) element will contain the actual submission domain data (which has to match the metadata in the define.xml).
"Commenters also
pointed out that clinical trials typically last so long that the data
standards may change while the study is under way, creating a big
problem for sponsors; that data standard disharmony
creates problems for “secondary data re-use” in future studies; and
that therapeutic area data standards are desperately needed".
The ODM team has always been very keen on keeping ODM "backward compatible", i.e. a valid ODM file of an earlier version is (except for the namespace usage) also a valid ODM file under the newer versions. When we do deprecate an element or attribute, we continue to support it for two intermediate versions. This is in huge contrast with HL7, where versions 2 and 3 of the standard are totally incompatible (0% reuse possible).
And even then: for define.xml, for example, we will provide an XSLT stylesheet that transforms define.xml v1.0 files into define.xml v2.0 files.
"Several commenters noted that there is a
need for international harmonization of clinical data standards, a fact
highlighted even by groups that oppose CDISC/ODM. The Vereniging
EN13606 Consortium, a Dutch
group that backs a European communication standard called ISO 13606
EHR, and a group at Oxford University that backs other international
standards development initiatives both called for better harmonization."
As I already showed in a previous contribution, it is already possible today to include data points from electronic health records (EHRs) in ODM. In that blog contribution I demonstrated how this can be done for data points from CDA-based EHRs, but the same is equally true for data points coming from EN13606-based EHRs (such as OpenEHR).
The commenters are right about international harmonization, but what they point to is essentially the "EHR standards war" between HL7 on one side and OpenEHR/ISO13606 on the other side.
It should not be the task of CDISC to solve this conflict, but our submission standard should be able to support both, which it already does.
Your comments are as always highly appreciated.
Sunday, September 23, 2012
Define.xml draft 2.0 is out
The long-awaited version 2.0 of the define.xml standard has now been published for public review by CDISC. It can be downloaded from the CDISC website at: http://www.cdisc.org/define-xml. If you are interested in define.xml, please download the distribution, review it and send your comments not later than October 1st. There is a separate file in the distribution explaining how to submit your comments.
Version 2.0 (draft) is based on the latest version of the ODM standard (v.1.3.1) taking away many of the limitations of v.1.0 of define.xml.
I am currently writing up my comments (good and bad) and already have more than 10 pages. The overall impression is however good - I think we are again taking a major step forward.
Some things I liked especially in the draft v.2.0:
- the "WhereClause" allows to define which units of measurement and categories (--CAT variables) for which cases (based on ValueLists e.g. for --ORRES).
- the "MethodOID" - "MethodDef" pair allows much better to add mapping information.
Things I did not like:
- "MeasurementRef" - "MeasurementDef" pairs are not supported. They are regarded as extensions to the standard. This is a bit stupid, as it breaks "end-to-end" processing. Instead the "WhereClause" needs to be used, which means an extra processing step.
The "WhereClause" is more general and is also applicable to other variables (such as --CAT). The problem however is not a define.xml or ODM problem, it is an SDTM problem!
Stupidly enough, variable qualifiers such as --ORRESU have been defined in SDTM as extra variables, i.e. extra columns in the SDTM table, instead of as attributes (a third dimension) of the parent variable (--ORRES). The consequence is that there is no way in SDTM to check whether the pair --ORRES / --ORRESU really matches and makes sense. The same applies to the --TESTCD (test code) / --TEST (test name) pair (see also my previous post).
Had ODM been used for transporting SDTM data, then everything would have been much simpler.
So the current "WhereClause" mechanism must be regarded as the best possible solution for supplying metadata to variables of a badly designed SDTM.
- "SASFormatName" is mandatory for CodeLists. Why? "SASFormatName" is not vendor-neutral! My personal opinion is that such an attribute should go into a "vendor" extension. However, it is traditionally already in ODM for a long time (SAS is a market leader in our industry). The fact that it has been made mandatory in define.xml 2.0 draft seems to have to do with that FDA reviewers want to have the ability to automatically create SAS tables for codelists. But why can't FDA reviewers assign SAS format names for codelists if they want to?
Things I suggest for improvement:
- explicitly allow "FormalExpression" as a child of "MethodDef". The draft define.xml allows adding a reference to an external file with the calculation or imputation (source) code. However, it makes much more sense to allow the code to be included within the define.xml itself. ODM 1.3.1 has an excellent, existing mechanism for that: FormalExpression.
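To make that suggestion concrete, here is a sketch (my own proposal, not something from the draft specification) that generates what such a MethodDef could look like; the OID, the Context value and the expression are invented for illustration:

from lxml import etree

# Proposed structure: the derivation travels inside define.xml itself,
# as a FormalExpression child of MethodDef.
method = etree.Element("MethodDef", OID="MT.AGE",
                       Name="Age at screening", Type="Computation")
expr = etree.SubElement(method, "FormalExpression", Context="XPath")
expr.text = "floor((date($SCREENDT) - date($BRTHDT)) div 365.25)"
print(etree.tostring(method, pretty_print=True).decode())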
Saturday, July 21, 2012
Is SDTM a database (design), and if so - is it a good one?
SDTM 1.3 and SDTM-IG 3.1.3 have just been published. A good time to review and discuss what SDTM in fact is. SDTM consists of related tables, so is it a database (design)?
At the university I am teaching databases and database design, so I do have some authority to make an analysis from a theoretical point of view.
Let us take a typical table from the SDTM: the Vital Signs table. A snapshot view is shown below.
First of all the keys: the combination of STUDYID, USUBJID and VSSEQ can be defined as the primary key, although the latter is an artificial key, usually generated in the very last stage of table generation.
USUBJID is certainly a foreign key to USUBJID in the DM table, and VISIT a foreign key to the TV (Trial Visits) table. Or is VISITNUM the foreign key to the TV table?
Here is where the problems start ...
In our first database course (second semester of the bachelor programme) we teach the students about the first, second and third "normal forms" (usually abbreviated as NF). A good database design is in the third NF. A bad database design is in the second NF, and a very bad database design is in the first NF. A database that doesn't obey the first NF is not worth the designation "database".
The difference between a database design in the third NF and one in the second NF is that the former does not have any "transitive dependencies" anymore. A transitive dependency is an A -> B -> C dependency. Let me explain a bit.
Suppose a table with customer number (primary key), first and family name, postal code (zip code) and city.
If I know the customer number, I immediately know the customer's first and family name by a simple lookup (that is what we call a dependency). I also know his postal code. So we have two dependencies:
CUSTOMERNUMBER -> FIRSTNAME, FAMILYNAME and
CUSTOMERNUMBER -> POSTALCODE
There is however still another dependency which is not on the primary key, i.e.
POSTALCODE -> CITY
since if I know the postal (zip) code, I automatically know the name of the city.
So we have:
CUSTOMERNUMBER -> POSTALCODE -> CITY
This is exactly what is meant by a "transitive dependency", and the third NF states that these should be avoided.
Why?
Let us take an example again: I have a customer John Smith with customer number 12345 and zip code 10001, living in Manhattan, New York, all in the same table. Everything OK.
I now add a new customer (I do an insert into the database table) with customer number 23456, name Mary Poppins, postal code 10001 (again) and city Chicago. Agree?
Most of you will immediately protest, as the postal code 10001 does not belong to Chicago! It belongs to Manhattan, New York!
However, a database management system will accept such an insert, as it has no way of knowing that there is a relationship between postal code 10001 and Manhattan. If I know "10001", I immediately know that the customer lives in Manhattan (note that I assume here that this is a database of customers living in the USA only).
So, what does the theory of the third NF teach us to do in such a case? It teaches us that we should split the table into two tables: the "customer" table keeps the postal (zip) code along with the other information (but not the city), and the postal code becomes a foreign key to another table that contains the postal codes and their corresponding cities. When adding a new customer, it is then no longer possible to create an invalid postal code / city pair, which increases the probability of keeping the database consistent.
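A minimal sketch in Python/SQLite of what this split buys us (the table and column names are just illustrative):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
con.execute("CREATE TABLE postalcode (postalcode TEXT PRIMARY KEY, city TEXT)")
con.execute("""CREATE TABLE customer (
                   customernumber INTEGER PRIMARY KEY,
                   firstname TEXT, familyname TEXT,
                   postalcode TEXT REFERENCES postalcode(postalcode))""")

con.execute("INSERT INTO postalcode VALUES ('10001', 'Manhattan, New York')")
con.execute("INSERT INTO customer VALUES (12345, 'John', 'Smith', '10001')")

# There is no longer any column in which an invalid 10001/Chicago pair
# could be stored, and an unknown postal code is rejected outright:
try:
    con.execute("INSERT INTO customer VALUES (23456, 'Mary', 'Poppins', '99999')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # FOREIGN KEY constraint failed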
Let us now return to our SDTM VS table. Obviously it is a hypervertical table (more or less following the so-called Entity-Attribute-Value (EAV) model). Does it obey the third NF? Or does it have transitive dependencies? Let us have a look.
We immediately see that there is a 1:1 relationship between VSTESTCD and VSTEST: if I know the value of VSTESTCD, then the value of VSTEST is fixed. For example, when VSTESTCD has the value "SYSBP", the value of "VSTEST" MUST be "Systolic Blood Pressure". It cannot be e.g. "Heart Rate".
If I do have such a case (the combination VSTESTCD=SYSBP and VSTEST=Heart Rate), tools like OpenCDISC will give me an error. If I build up a database with such a discrepancy, however, the database management system will not protest at all.
The same is true for VISITNUM and VISIT in the VS table. They are dependent on each other: if VISITNUM is given, VISIT is fixed, and vice versa.
If I generate a database and do an insert into the VS table with a value of "1" for VISITNUM and "Visit 1" for VISIT, and then (for example for another subject) do an insert with "1" and "Visit 7", the database management system will not protest at all, as it cannot know that there is this 1:1 relationship.
So: SDTM cannot be considered as a good database design, as it does not obey the third NF!
So what is SDTM? Why did the designers of SDTM put VSTESTCD and VSTEST, which have a dependency, in the same table, and make VISITNUM and VISIT columns of the same table, although that violates the rule of the third NF?
The reason is simple: because the reviewers wanted it that way.
A reviewer wants to see VSTESTCD and VSTEST at the same time because he/she maybe doesn't know that "HR" means "Heart Rate". You might think this is a stupid argument, but think about the LB domain with its large number of test codes: a reviewer cannot know them all.
So, essentially, SDTM is a VIEW on a database, as we would implement it for end users that only have read access. For those not familiar with the concept of database views, here is a link to the Wikipedia entry.
So now you might ask: why don't we send the FDA the SDTM database (tables) with a correct design (third NF, meaning minimal risk of inconsistencies), so that the FDA can load it into the data warehouse and then provide views on it to the reviewers? That would be ideal, wouldn't it?
The reason is that the FDA does not have a functioning SDTM data warehouse (there have been several attempts/versions of JANUS, but none of them were really of any use, and the project has now been outsourced to NCI), and that the reviewers use the SDTM tables "as is", i.e. just using the SAS viewing tools (SASViewer).
As far as I know, the received SDTM tables are never used to populate a reviewer's database, and where there is no database, views can of course not be provided.
Conclusions: SDTM is not a database (design), it is a VIEW on a database. However, most people (including the FDA reviewers) use it as a database, and do not try to recover the (or a) database from the view.
Thursday, March 1, 2012
Null flavors in SDTM - a good idea?
CDISC recently published the "revised Trial Summary datasets" for the SDTM standard and implementation guide. One of the new "features" of the TS dataset is that it has a number of so-called "null flavors". You can think of a null flavor as something similar to "reason not done" (--REASND), but then enumerated. For the TS dataset the enumerations (each a "flavor of null") are:
- NI (no information)
- INV (invalid)
- OTH (other)
- PINF (positive infinite)
- NINF (negative infinite)
- UNC (unencoded)
- DER (derived)
- UNK (unknown)
- ASKU (asked but unknown)
- NAV (temporarily unavailable)
- NASK (not asked)
- QS (quantity sufficient)
- TRC (trace)
- MSK (masked)
- NA (not applicable)
It is clear that the idea, and its CDISC-SDTM-TS implementation, come from the HL7-v3 world, as the enumerated values (and not by accident I believe) are exactly the same as those in HL7-v3.
The use of "null flavors" is however even within HL7 highly contested. For me, and for many others, it is e.g. very illogical that a value can be null and positive infinite at the same time.
So some implementations of HL7-v3, such as the Austrian "Entlassungsbrief" (similar to the US "Continuity of Care Document" (CCD)), have limited the enumeration to the absolute minimum - in this case to only two allowed values.
Personally, I am always suspicious when a list of enumerations has, say, more than 5-6 values, especially when it is not about "hard" characteristics. You can make an enumeration for "gender", like F (female), M (male) and U (unknown), which is good enough for 99.9% of the cases.
But in the "null flavor" case here, everything is pretty subjective ...
For example, when to apply "NI" and when to apply "UNK"?
And when we know that a value is "<1mg", should we add it as a value, or should we set it to NULL and fill the "null flavor" with "TRC" (trace)?
And what to think about "positive infinity" and "negative infinity"? These are surely not null!
Wasn't this introduced due to the inability of SAS Transport 5 to represent the "∞" character?
In XML (Schema) one can simply define a value as being of type "xs:double", which includes "INF" (positive infinity) and "-INF" (negative infinity).
Let us have a look why the authors of the new TS-SDTM introduced "null flavors". The argumentation (copied from the document) is:
"The proposal to include a null flavor variable to supplement the TSVAL variable in the Trial Summary dataset arose when it was realized that the Trial Summary model did not have a good way to represent the fact that a protocol placed no upper limit on the age of study subjects.When the trial summary parameter is AGEMAX, then TSVAL should have a value expressed as an ISO8601 time duration (e.g., P43Y for 43 years old or P6M for 6 months old).While it would be possible to allow a value such as NONE or UNBOUNDED to be entered in TSVAL, ..."
OK, but wait a minute ... why should the maximum age be expressed as an ISO-8601 duration? That datatype was never designed for this. And what about a maximum age criterion like "at least 30 years older than the age at which birth was last given"? The latter could surely be a valid "age" criterion (though it could also be part of an inclusion criterion). So in my opinion, the developers of the SDTM have abused the "duration" datatype here.
And for the AGEMAX parameter, shouldn't "unbounded" simply be "null" (i.e. there is none), or alternatively "∞"? But yes, the latter cannot be represented in SAS Transport 5 ...
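A small sketch of how natural this becomes once the datatype allows infinity: xs:double's "INF" maps directly onto IEEE-754 infinity. The duration parsing below is deliberately crude and assumes years-only values:

def parse_agemax(value: str) -> float:
    """Read a TSVAL for AGEMAX; "INF" would mean: no upper age limit."""
    if value == "INF":               # xs:double lexical form for +infinity
        return float("inf")          # no null flavor needed
    if value.startswith("P") and value.endswith("Y"):
        return float(value[1:-1])    # e.g. "P43Y" -> 43.0 years
    raise ValueError(f"unsupported AGEMAX representation: {value!r}")

print(parse_agemax("P43Y"))          # 43.0
print(parse_agemax("INF"))           # inf
print(44 <= parse_agemax("INF"))     # True: every age passes an unbounded max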
What do you think? Is it a good idea to have "null flavors" in SDTM? Or do you think it isn't?
Just let me know ...
Sunday, February 5, 2012
Study Design in XML (SDM-XML): what is still missing?
Last year, our CDISC volunteer team published the "Study Design Model in XML" (SDM-XML), an extension to the ODM standard, filling some functionality gaps (i.e. study design features I had already been asking for for years) in the existing ODM standard.
We are now almost a year later and have had the time to implement SDM-XML in our tools (such as the ODM Study Designer). We have also tested the model against many use cases (even as a possible replacement for submitting study design information to the regulatory authorities).
So it is now time to make an inventory.
We found that the model is already used (or has been prototyped) intensively for finding possible study subjects using patient information such as that from EHRs (based on the inclusion/exclusion part of the model). We also found that our "workflow" model of SDM-XML can easily be transformed into BPMN-2-XML, which is the generally accepted standard for workflows. This e.g. allows importing a clinical workflow into a hospital information system (HIS) and integrating it with the patient-care workflow.
There are a few things however that are still missing in our model:
- subactivities: i.e. activities within activities. For example, there can be an activity "place an ambulatory ECG device", which itself consists of a number of subactivities. In ODM/SDM-XML this is not supported (yet). This also has to do with the lack of support for subworkflows.
- subworkflows: i.e. workflows within workflows. Building on the previous example, the activity "place an ambulatory ECG device" will consist of a number of subactivities which need to be executed in a certain order, and maybe with some forking or branches, so needing a subworkflow.
- swimlanes. Those who know a bit about workflows also know the concept of "swimlanes". For those who don't: I could try to explain, but Wikipedia does a much better job here.
In order to implement swimlanes, we do however need "roles", such as "principal investigator", "study nurse" or "monitor". We don't have this in ODM; we only have the concept of "user" (which is in principle a person) with a "UserType" attribute. The latter could be used for "role", but this had better be avoided, as its current enumerated list is just too limited and not adequate.
So we would need to extend the ODM further than we already did.
At the same time, we have the problem that "StudyEvent" (for a visit) is no longer a good concept. In SDM-XML, a visit is essentially a container for a number of activities, which can be data-collection activities or activities in which no data is captured (such as placing the ambulatory ECG device). If we allow activities to have subactivities (and subworkflows), isn't a StudyEvent then nothing more than an Activity?
Also, in my opinion, we should introduce an element "Role" in the "AdminData" section of the ODM and allow roles to be assigned to "Users" (P.S. a person can have one or more roles).
This could then allow us to introduce the concept of "swimlanes" in the workflow part of ODM/SDM-XML.
I already have a list of other wishes for ODM.
If we find the necessary (human) resources, wouldn't it be time to start thinking about a future ODM Version 1.4?
Let me know what you think!
Wednesday, January 11, 2012
Define.xml - an extension or a subset of ODM?
In my second-to-last post, I wrote about the discussion within the define.xml team (which is currently working on define.xml v2.0) on whether ODM elements not explicitly mentioned in the define.xml specification should be allowed or not. The majority of the team finds that they should not (i.e. that they should be forbidden). The people who work on end-to-end processing insist that they should be allowed.
A typical example is "MeasurementUnitRef" and "MeasurementUnitDef".
As define.xml was developed as a "vendor extension" to ODM, I looked into what the ODM specification says about vendor extensions.
Section 2.4 of the ODM specification (v1.3.1), which is about vendor extensions, states:
- The extension may add new XML elements and attributes, but may not render any standard ODM elements or attributes obsolete
- Removing all vendor extensions from an extended ODM file must result in a meaningful and accurate standard ODM file
- Applications that use extended ODM files must also accept standard ODM files
Whether they then do something with the extension information is another matter.
So for me it is clear that the define.xml team is not allowed to forbid the use of ODM elements that are not explicitly mentioned in the define.xml spec, nor to state that such elements "are not part of the standard".
If they do so, they break the rules of the ODM standard on which they base their extension.
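By the way, the second rule is easy to test mechanically: strip everything that is not in the ODM namespace and check that a meaningful ODM file remains. A sketch in Python, assuming a file in the ODM 1.3 namespace (the def:-namespaced elements and attributes of define.xml are exactly such vendor extensions; the file name is invented):

from lxml import etree

ODM_NS = "{http://www.cdisc.org/ns/odm/v1.3}"  # ODM 1.3 namespace, Clark form

tree = etree.parse("define_extended.xml")      # hypothetical extended ODM file
for elem in list(tree.getroot().iter()):
    if not (isinstance(elem.tag, str) and elem.tag.startswith(ODM_NS)):
        parent = elem.getparent()
        if parent is not None:
            parent.remove(elem)                # drop vendor-extension element
        continue
    for name in list(elem.attrib):             # drop vendor-extension attributes
        if name.startswith("{") and not name.startswith(ODM_NS):
            del elem.attrib[name]
print(etree.tostring(tree, pretty_print=True).decode())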