Working on and with CDISC Standards: 2013

Saturday, September 21, 2013

Is CDISC Controlled Terminology going the wrong way? A follow up

So I made some "change requests" to the CDISC-CT ...

Here are the answers (in italic) that were send to me by the CDISC-CT team.

"The CDISC Terminiology team met to discuss your requests. Here are our decisions:

a) pH - Team agrees to remove from Unit codelist.

b) BEATS/MIN - This is a common clinical unit of measure for heart rate and pulse rate. It is classified as a unit of measure by SNOMED-CT. It is classified as a unit of measure by NCPDP (National Council for Prescription Drug Program). Therefore it will not be removed.

c) BAR - For CDISC, both units of measure and units of presentation are in the same codelist. It is up to the user based on the definition to select the correct term. This term is published in the FDA SPL list. We understand your concern regarding units of measure versus units of presentation. We have included units for presentation from ISO 11616 as distinguished from units of measure. If BAR for pressure is needed for clinical trial work, please submit a request to add that as a new term. Currently we will leave "BAR" as a dosing unit.

Concerning a) thank you - there seem to be a few chemists left on this world
Concerning b) essentially says: we copy the errors of others
Concerning c) oh my God ...

How does UCUM deal with the "beats/min" issue?
From the specification I read:
"§6 curly braces 1) The full range of characters 33–126 can be used within a pair of curly braces (‘{’ and ‘}’). The material enclosed in curly braces is called annotation.
2) Annotations do not contribute to the semantics of the unit but are meaningless by definition. Therefore, any fully conformant parser must discard all annotations. Parsers of limited conformace should not value annotations in comparison of units.
3) Annotations do, however, signify the end of a unit symbol.
4) An annotation without a leading symbol implies the default unit 1 (the unity).
5) Curly braces must not be nested.
Curly braces are here because people want annotations and deeply believe that they need annotations. Especially in chemistry and biomedical sciences, there are traditional habits to write annotations at units or instead of units, such as “%vol.”, “RBC”, “CFU”, “kg(wet tis.)”, or “mL(total)”. These habits are hard to overcome. Any attempt of a coding scheme to restrict this percieved expressiveness will ultimately result in the coding scheme not being adopted, or just “half-way” adopted (which is as bad as not adopted).
Two alternative responses to this reality exist: either give in to the bad habits and blow up of the code with dimension- and meaningless unit atoms, or canalize this habit so that it does no harm. The Unified Code for Units of Measure canalizes this habit using curly braces."

So, for a heart beat frequency, the correct UCUM unit is "/min", but also ("canalized bad habit") the following would be allowed "{beats}/min".

For "units" such as "bar", "bottle", "capsule" I have now proposed the team to move these into a separate codelist, so that these are really separate from "measument units" (according to wikipedia: "standardized quantity of measurement of a physical quantity").
This would strongly improve machine-processability of SDTM records: a machine can then be explained that "units" of this codelist can NOT be used for comparisons. With the current situation (mixing up of standardized units and non-standardized units) such machine-processability is impossible (or extremely dangerous).

I will follow up as soon as I have a response from the CDISC-CT team.

Wednesday, July 10, 2013

Is CDISC Controlled Terminology going the wrong way?

Today, I downloaded the latest (June 2013) CDISC controlled terminology (CDISC-CT) from the NCI website.
I am already communicating (well, it's more a one-way communication) with the CDISC-CT team for about a year now, as I do have strong concerns about the direction CDISC-CT is taking.

As you might know, CDISC-CT is just a list. It does not have a system (unlike LOINC), nor it is a taxonomy (like SNOMED-CT). It's just a list, with all imaginable problems.

Here are my major concerns:
- the CT does not show at all that there is a 1:1 relation between --TESTCD (test code) and --TEST (test name). There are separate lists for each of them, as if these were unrelated.
- CDISC-CT reinvents the wheel: some codelists like those for lab tests, units of measure, and microorganisms are unnecessary as there already exists well established CT used by healthcare. For lab tests there is LOINC, for units there is UCUM, and for microorganisms there is the "International Code of Nomenclature of Bacteria" (ICNB). All these have a system or are taxonomies, so why try to replace them with something basic like a simple list? Does CDISC-CT suffer under the "not invented here syndrome"?
Some within the CDISC-CT team claimed that their codelists are subsets of the mentioned ones. I checked, and it simply is not true! Others told me the CDISC-CT can easily be mapoed to the mentioned ones. I checked and it simply is not true! Due to the system (or better: lack of system) CDISC-CT is using, these lists will grow to almost infinity, and the length of the latest version (>7000 terms) strengthens my opinion on this. I heard of one sponsor who submitted over 500 "new term requests", only for lab tests.
So why not use something (LOINC, UCUM, ICNB) that is used by everyone in health care and that has proven its merits?

There are also some peculiarities in the CDISC-CT list that draw my attention. What to think of the following:
- the unit of pH is ... PH!
In my first year at high school (well, I do also have a PhD in chemistry) I learned that pH is a property that is dimensionless, i.e. it does not have a unit.
- bar: from the same teacher (my chemistry teacher was also my physics teacher), I learned that "bar" is a (even SI) unit for the property "pressure". According to CDISC-CT "bar" is defined by "A dosing measurement based on the bar unit". Do you understand what is meant here? I don't. Is this a circular definition (also not a definition). Is "bar" meant for something like "a bar of chocolate"? If so, is that a unit? Is a bar of chocolate the same amount in Belgium as it is in the US or Tokyo?
Under "unit" I understand that it is the same amount everywhere in the world (well, a "gram" is).
- beats/minute: this is a more subtle one. The definition in CDISC-CT is: "the number of heart beats measured per minute time". Makes sense, isn't it? No: it does NOT. First of all, the definition - is "beats/minute" limited to "heart beats"? I don't think so, I usually also use it for drum beats.
Secondly, the CDISC-CT mixes up the property (heart beats) with the units. A heart beat is measured as a frequeny, and some popular units for frequency are "/min" and "/sec".
- "bottle", "box", "can", "capsule", ... Is a "bottle" a "unit"? According to CDISC-CT, it is. I have a lot of bottles at home, and most can contain different amounts of liquids.
CDISC-CT is mixing up the concept of "object class" (in this case the parent class is "container") with the concept of "unit" (unity amount that is the same everywhere in the world).

I shared all these concerns with the CDISC-CT already more than half a year ago, but they just do not seem to care. The CDISC-CT-LAB subteam promised me to write up an answer to my suggestion not to define their own codelist, but to use the existing LOINC standard instead about 5 months ago, but I still haven't received it (though I send several reminders).

Do not misunderstand me - I am not against the development of CDISC controlled terminology! We really need CDISC-CT to be all "on the same wavelength" and especially to be able to compare values between different submissions of different sponsors.
But if we do, we need to do it correctly, and not try to reinvent the wheel. Nor should we forget our basics in physics, chemistry and especially in semantics, just for the sake of trying to controll everything.

Sunday, April 21, 2013

Alternatives for SAS Transport 5 - 20 years ago

I have been often criticized that I criticize SAS Transport 5 without proposing a better alternative.
This blog contribution is a response to that.

Many many years ago (I haven't been able to find out exactly how many years ago - If you know it, please let me know), the FDA decided for SAS Transport 5 (".xpt" format) for submission of datasets. It is not clear to me how and why they came to this decision. I discussed it with several people that were already around at that time, but their statements are not unambiguous.
It is currently however clear that this was not a good choice, but that is an easy statement many years later.

First of all, is the choice for transport format important?
Essentially, SDTM is a semantic standard: it is about content.
SAS XPT, XML, CSV and others are transport formats, you can transport any kind of data with them.
So, one could transport SDTM data using any transport format, as the two are essentially independant. Stating e.g. that HL7-v3 is a better format for transporting SDTM data as it allows integration with healthcare is nonsense! It is the same as stating that "using the same truck for transporting cows and for transporting oranges allows us to breed cows that produce orange juice".
However, we cannot expect from the FDA that it accepts SDTM (and SEND and ADaM) data in any transport format - a choice (only one, or maybe a few) must be selected.

So what could have been the alternatives 20 years ago?
First we must take into account that at that time we did not have define.xml for the metadata (XML was still in it's early stages and now well known): metadata was usually submitted as a text file or PDF.

Let us look at few alternatives ...

A. CSV
Comma-separated (or similar) format has the advantage that it is very compact. It is for example (using a vertical bar as field separator) used in HL7-v2 messages with a lot of success.
I recently generated some CSV files from SDTM XPT datasets and for example found that the LB dataset from the FDA Pilot "LZZT" study was reduced by a factor 6 in file size. "Great" you would say when you think about the "file size complaints" of the FDA. However CSV also has some major disadvantages. First of all: how to declare the variable names and labels? That can be solved by adding two extra lines at the top of the document containing them. But what about associated codelists for coded variables (not to speak about valuelists)? That is much more difficult. Of course one can add extra CSV datasets for the coded values (enumerations) too, but how can one make the link between both?
I think it can be done somewhat artificially, but it is not ideal.

B. SQL
SQL is a data definition and query language for relational databases. It is standardized as ANSI-SQL and already very long in place and very much used. In a previous blog I wrote that SDTM is not to be considered as a database, but as a "View" on a database. So instead of obtaining the database itself, the FDA has decided that we must submit "Views" on the database. This has many disadvantages, as in a "View" it is much more difficult to detect data inconsistencies, and it is difficult (if not impossible) to 100% reconstruct the original database. The choice for the "View" surely was surely related to the choice for SAS Transport 5.
So why not submit the database instead of the view on the database? It would have many advantages, as well for the submittors as for the FDA itself.
How can we exchange a database in a vendor-neutral way (one of the requirements)? The answer is "ANSI SQL". The design and metadata of the database could be exchanged using ANSI-SQL "Create Table" statements, with primary key and foreign keys and everything, so also defining the relations between the tables. The actual data could then be exchanged using "SQL Insert" statements.
What would be the advantages? First of all we see that SDTM is in each new version containing new and more derived variables. For example, RFXSTDTC (Date/Time of first Study Treatment) is also already present in the EX (Exposure) dataset. RFXSTDTC was only added to DM because the reviewers are unable to combine both datasets (they just use the viewer which is not "understanding" SDTM). Using ANSI-SQL we could (and we MUST - database "normal forms" require this) get rid of redundant information, which would mean that our data is much more consistent.
Once the FDA would have received the "SQL Create Table" and "SQL Insert" datasets, it could easily import them into a database (the first step in using a data warehouse by the way), with consistencies being checked over the foreign keys. They could then create their OWN views on the data (ANSI-SQL "Create View", such as views very similar to the current SDTM datasets.
This would mean that much less data would need to be submitted (I guess a 50% reduction or more) and that the FDA could easily create any possible views on the data, something they currently cannot do.

Remark that it is still possible to keep the "vertical" table structure as in the SDTM (in the database world, this is called the "entity-attribute-value" (EAV) model. It would also allow to keep the supplemental qualifiers within their parent dataset (a major problem in the current SDTM implementation).

C. Any other?
Would you have had another choice for a vendor-neutral transport format for dataset submission 20 years ago? Please let me know or write a comment.

My personal choice:
If I would have been the FDA twenty (or so) years ago, I would have chosen for ANSI-SQL as a vendor-neutral data submission format. But I am not the FDA, and I was not working in clinical research yet at that time. I do not know whether it has ever been considered (if you know, let me know).

So why not ANSI-SQL as the new transport format today?
Data exchange between computers and especially databases is usually not done using SQL nowadays. The reason is that we have a better alternative, namely XML. XML is not limited to representing flat, two-dimensional data, but has much more features. In a world "that is not flat", it can be used for representing many-dimensional data (though SDTM currently remains "flat" - will it after SHARE?) and it can also be used to e.g. visualize data (using stylesheets) something that cannot be done using the SQL format.

Your comments are welcome, as usual!

Why SUPPQUAL sucks

Following up on my previous post on NSVs (Non-Standard Variables) which are banned to SUPPQUAL datasets in SDTM, I was wondering why it would be so hard (for the FDA) to recombine such NSVs with their parent record (or group of records) in the parent domain.
So I tried it myself.

We are currently working hard on an XML based format for SDTM (and ADaM and SEND) submissions. It is based on ODM "ClinicalData" and "ReferenceData". You can have a sneak preview here. We (one of my Masters student and I) am also working on a "smart" viewer for such data. I will report on the "Smart SDTM Viewer" later.
In the new XML format, we support that NSVs remain in their parent record, i.e. there will just be an "ItemData" for the NSV within the parent original "ItemGroupData" which contains also all the standard variables. In the define.xml, the corresponding metadata "ItemDef" then just has a flag - we use Role="SUPPLEMENTAL QUALIFIER".
Also the "Smart SDTM Viewer" supports this: such columns are simply colored differently.

As we needed test files for new format and the viewer, I took the recently published "Updated Pilot Submission Package" SAS Transport files (available from the "members only" part of the CDISC website) for the famous LZZT study. I converted them into our XML format and then tried to recombine SUPPXX datasets with their parent dataset. For doing so, I wrote an XSLT-2 stylesheet allowing two XML files to be recombined into a single one.

Everything went well for the SUPPDM, SUPPDS and SUPPAE datasets. Trouble started with SUPPLB.
The latter has over 64000 records and has a size of 55MB. The parent LB.xpt dataset has almost 60000 records and has a size of 33MB. This means that there is somewhat more than 1 NSV per LB record.
The fact alone that the size of SUPPLB is larger than that of the parent dataset shows that this format is not very efficient.
Now I started to run the transformation and waited, waited ... waited. I went to bed late in the evening and was surprised the following morning that the transformtion was still running.
Ultimately, it took over 2 days to recombine SUPPLB with LB.

So I looked into my recombination algorithm and tried to improve it. I soonly recognized there was not much room for improvement.

The problem came up again when working on the "Smart SDTM Viewer". It has a feature that when clicking a record in a SUPPXX "tab" or a record in the "RELREC" or "CO" "tab" in the viewer, one can be navigated to the parent record with a few mouseclicks only (try this using the SASViewer...). For FDA reviewers, I presume this will be a "killer feature".
Now I wanted to do the opposite, i.e. each record in the parent dataset that has one or more corresponding records in either a SUPPQUAL, CO or RELREC dataset would be marked (e.g. using another background color). In order to do so, a lookup must be done for each record in SUPPXX, RELREC and CO datasets.
What is the problem?

If you look at the structure of SUPPQUAL (CO and RELREC are very similar), you see that there is an IDVAR ("identifying variable") and IDVARVAL ("identifying variable value"). IDVAR can in principal be any variable from the parent dataset. Usually one chooses for --SEQ, but other popular identifiying variables are --SPID, --CAT, or even --VISIT or --VISITNUM.
Now when starting searching from the SUPPQUAL record, it is easy, as you know (thanks to IDVAR and IDVARVAL) what to look for in the parent domain.
When starting searching from the parent dataset and record however, you do not know (and the computer either) what to look for. You will probably start with USUBJID, and select all records in the SUPPQUAL dataset for that USUBJID. Then for each of these, you read what IDVAR and IDVARVAL are and check whether these correspond with the selected source record in the parent dataset. If there is a match, you create the NSV "ItemData" and add it as the last variable in the parent record (ItemGroupData). If you have 60000 records in the LB dataset, and 64000 in the SUPPLB dataset, all these lookups take a lot of time.

Conclusion: the way SUPPQUAL was developed easily allows to do a lookup from the SUPPQUAL record for the parent record. The other way around however is much more difficult, and will always take a lot of computer time if it has to be done for each record in large datasets such as the LB dataset.
So it is no wonder at all that FDA reviewers are unable to view NSVs together with the parent data in their standard review tools.

I do not know what was in the mind of the developers of SUPPQUAL at that time, but I would now consider it as a major design error.
There could have better choices, which I will explain in a future blog.

Saturday, February 2, 2013

SDTM and non-standard variables

The new draft SDTM-IG 1.4 package 2 contains an interesting statement in the "SDS Proposal for alternate Handling of Supplemental Qualifiers" (file: SDS_Proposal-Alternate_Handling_of_Supplemental_Qualifiers.pdf) document. I cite:

"The Supplemental Qualifiers structure was created to address the need to represent NSVs in the SDTM. It consists of a normalized data structure designed to allow for standard representation of what often is a wide variety of sponsor-specific variables.
The vision at the time the Supplemental Qualifiers structure was created was that standard review tools used at the FDA would automatically display the NSVs together with the parent data in tabular views. The reality, however, is the representation of NSVs in separate SUPP-- datasets has resulted in increased effort by reviewers to prepare for and perform reviews of SDTM datasets.
The end result is that a data structure that was created to provide a standard method for representing NSVs has not been the best structure for the viewing and analysis of SDTM data by FDA reviewers".

Where "NSV" means "Non-standard variables".

Essentially the above statement says that reviewers are not able to combine variables in SUPP-- domains with the parent domain and bring them "back" to the original domain.

I am not surprised.

In conversations with FDA representatives, I heard that SDTM data is viewed by reviewers "as is". Usually there is no attempt to generate a database from the submission datasets. Without that, it is not easy to recombine SUPP-- datasets with the parent domain.
Obviously, there is not much database knowledge at the FDA. I have always regarded their attempt to come to a datawarehouse as "a bridge too far". If they are not able to generate a database, how would they be able to generate a datawarehouse?

The proposed solution that is provided in the above mentioned document is not new: the ODM team already proposed something very similar five years ago, but the teams ideas for an XML-ODM based replacement of SAS Transport 5 were vetoed by the FDA.
I did already write about that in the blog "SDTM Supplemental Qualifiers".

Generating a database from the submission datasets is not made easy by the structure of the SDTM. I already wrote a previous blog on this topic titled "Is SDTM a database and if so - is it a good one".

At the beginning, SDTM was developed to contain "collected data only", and only a minimal amount of "derived data" would be present. The idea was that derived data would go into ADaM.
It however soon was found out that derived data would need to go into the datasets themselves. A simple example is "AGE" in the Demographics (DM) dataset.
AGE is in most cases never collected - it is derived (calculated) from BRTHDTC (birth date) and RFSTDTC (reference start date). Both these are present and required in the DM dataset. So why is AGE than still necessary?
IF SDTM were a database representation, "AGE" would even not be allowed, as it violates the "third normal form", stating that "transitive dependency" is not allowed.
However I suspect that, as it was soon found out that the FDA could not, or did not want to generate a database from each submission, "AGE" was included in the SDTM as a variable of the DM domain.

This turned SDTM into a "View" on a database.
Regenerating a real database from a view on it, is however not an easy task.

In the last few years, we have again and over again seen that new variables are added to the SDTM. In many cases these are derived variables.
So for example, SDTM 1.3 / SDTM-IG 3.1.3 adds following variables to the DM domain:

RFXSTDTC: date/time of first study treatment [can however also be obtained from the EX dataset]
RFXENDTC: date/time of last study treatment [can however also be obtained from the EX dataset]
RFICDTC: date/time of informed consent
RFPENDTC: date/time of end of participation [can very probably be obtained from SE (subject elements) and/or from DS (disposition)
DTHDTC: date/time of death: this is clearly derived - I cannot image this is a standard field on the Demographics form. I haven't seen it in CDASH either
DTHFL: death flag - similar
ACTARMCD: actual arm code - this really does not belong here. It should be a variable in the SE (subject elements) domain. But it looks as it was put in DM as either reviewers do not inspect the SE datasets and/or cannot combine it with the DM dataset.

The most obvious example of a derived variable is "--TEST". There is a 1:1 correspondence between "--TESTCD" and "--TEST", clearly violating the 3th normal form of database design. The somewhat crazy thing is that in define.xml this is not taken into account: there are separate codelists for "--TESTCD" and for "--TEST". As "--TEST" is a synonym qualifier for "--TESTCD", the correct construct should be like:
<CodeListItem CodedValue="SYSBP">
<Decode><TranslatedText>Systolic blood pressure</TranslatedText></Decode>
</CodeListItem>

and "--TEST" should even NOT be in the SDTM dataset.
In a tool that can as well look up the metadata as the data itself, the column "--TEST" would either be generated "on the fly" or the test name (value of --TEST) could e.g. appear as a tooltip when the mouse is positioned over the cell with the test code. With the ancient SASViewer this is of course not possible (as it even does not understand define.xml).

I also have always been puzzled about why we both have --STRESC and --STRESN. For --STRESN the IG says: "copied in numeric format from --STRESC". If the value in --STRESN is a copy of the value in --STRESC, why do we need it then? Maybe to explain the reviewer that the value is a numeric one? But that is already stated in the define.xml (ValueList / ItemDef) isn't it?

So why all these derived variables?

I suspect that the reason is very simple: the FDA is not able to do anything with the submitted datasets than to inspect them with the ancient SASViewer. They are not able to combine datasets with each other (generate views on views), they are not able to generate databases from them.
So each time they need a new "view" they just ask CDISC to define a new (derived) variable that is then added to the standard. The more information the FDA will want to retrieve from the datasets, the more new (derived) variables they will ask for to be included in the SDTM standard.
As can be expected, this will lead to even more "huge" datasets (as the number of variables is continously increasing), and we are then surprised that the FDA is complaining about file sizes!

Is there a way out of this?

I think there is once we have SDTM in XML (ODM-based) format. When both define.xml and SDTM datasets use the same model/format (ODM), we can generate software (e.g. viewers) that allow to "look up" information, or "calculate it on the fly".
For example, for the DM domain, suppose we leave "AGE" out of the dataset. The viewer can then take care that "AGE" is calculated "on the fly" and displayed either as an additional column (although it was not in the dataset), or as a tooltip for one of the cells (e.g. tooltip on USUBJID or on BRTHDTC).
Similarly, for the Findings domains, suppose we leave "--TEST" (test name) out of the dataset and only provide "--TESTCD". As both are connected to each other through the "code / decode" in the metadata, the column "--TEST" can automatically be generated when loading the data from the information in the define.xml. Or the value for --TEST can be supplied as a tooltip on the "--TESTCD" cell in the table.

Once we have such an XML-based format, we can start thinking about deprecating a number of SDTM variables, as it will essentially be the tool (e.g. the viewer) that takes care that they are derived, either from within the same dataset (e.g. AGE) as from other datasets (e.g. RFXSTDTC, RFXENDTC) as from the metadata in define.xml (e.g. --TEST - test name).
This will bring us to a much cleaner SDTM with much less redundant information, and considerably smaller files (I estimate a 50% reduction).

The ultimate solution is however SHARE. Once we can allocate each data point as being a SHARE object, things will become much easier.
There has been a lot of discussion about a format for SHARE. I am however more and more convinced that the format is completly unimportant. I am more and more convinced that SHARE objects can be expressed as RDF, as define.xml and ODM, as Excel (if you like that).

But that's a topic for another blog entry ...

SDTM --ORRESU and controlled terminology

A very interesting discussion (partially by e-mail, partially on the web) has been taken place in the last weeks on the question on whether --ORRESU ("Original Units") must be under control of terminology or not.

The question arose from another discussion on why there is a CDISC codelist for units, although there is already an international standard for that (UCUM), which is well used by the hospital world, and for which the use is mandatory in exchange standards for electronic health records (EHRs) such as HL7-CCD and CDA and ISO-21090.
As I was informed by someone from the FDA, the use of UCUM is also mandatory in SPL (Structured Product Labeling).

The discussion was also triggered by the observation that when one extracts lab data from EHRs and use that in clinical research and it leads to an SDTM submission, the value in LBORRESU in some cases leads to a validation error when using OpenCDISC. A typical example is "mm[Hg]" which is the international symbol (UCUM) for "millimiter mercury" as a unit for the property "pressure". OpenCDISC throws an error as "mm[Hg]" is not in the "UNIT" controlled terminology codelist.
Other people remarked that when using paper case report forms, the way the unit is written on the CRF varies very often (when not preprinted). For example, for a mass concentration, they found it written as "Gr/dl", "Gramm per deciliter", "Gramm/dL", "g/dL" etc.. Only the latter is correct CDISC controlled terminology.
So if I find "Gramm per deciliter" on the CRF, what must go into LBORRESU?
The SDTM-IG for LBORRESU states "original units" for the variable label, but the column "Controlled Terms, CodeList or format" states "(UNIT)", meaning that the value must come from the "UNIT" codelist.
Whether this codelist is extensible (by the sponsor) is not mentioned in the SDTM-IG.
In my opinion, the wording "original units" contradicts with the statement that the "UNIT" codelist must be used.
Only in an ideal world, every researcher would exactly use what is in the CDISC "UNIT" codelist when writing (as narrative text) the unit on the paper CRF.

A response came from a distinguished SDTM pioneer stating that one needs to distinguish between the "actual unit" that came from the CRF, and the way it is represented in SDTM, and which needs to come from the UNIT codelist.
So that would mean that if the researcher had written "mill.Hg" on the CRF, this would never appear in VSORRESU, but needs to be "translated" into "mmHg".
Similarly, when the data comes from an EHR, the UCUM unit "mm[Hg]" will never go into VSORRESU but needs to be translated into "mmHg" again.

If we do so, how much is ORRESU then still "original unit"? Isn't such a translation a "derivation"?
What about "translation errors"? Can we trust that the unit is always translated correctly into one from the UNIT codelist? I (and a lot of other in the discussion) have some serious doubts.

What about tests for which there is a unit available, but for which there is no unit present yet in the UNIT codelist for that property? Do we need to wait submitting to the FDA until the moment that our "new term request" is approved?
And isn't it a bit strange that CDISC just choose to ignore/exclude a standard (UCUM) that is very well used, and choose to develop its own list for units? One statement in the discussion was that UCUM is not well known in smaller labs and by a number of LIMS systems, but is that a reason to develop a whole new list, and then to say that the labs and LIMS should learn that new list, instead of stating that they should learn UCUM? It reminds me about the cartoon that you can find at http://xkcd.com/927/.

One very good suggestion was given by a representative of a biotech company, proposing to have two variables for the "original" units, one named e.g. --RPRESU (new), the "reported result unit" which is "as collected", so not under controlled terminology, and --ORRESU (existing), which is the translation into something that is in the "UNIT" codelist. As such, the "as collected" unit would not be lost, and the reviewer at the FDA can use the (--ORRESU) term that is under CDISC controlled terminology because that is familiar to him/here.

So, when the information comes from an EHR, e.g. "mm[Hg]" (UCUM) would go into VSRPRESU and would then be "translated" into "mmHg" (CDISC-CT) which then goes into VSORRESU.
Which of the two (if not both) would then be made "expected" is also an interesting discussion.

Though this doesn't take away my objections agains "the reinvention of the wheel", I think it could be a very good compromise.

Whether derived variables should go into SDTM at all is another discussion. For example, we must submit "AGE" in the demographics domain although that is usually never collected, but derived from BRTHDTH (birth date) and RFSTDTC (reference start date). Whether such derived variables should be in SDTM is surely related to the question "Is SDTM a database (design), and if so - is it a good one?" which I discussed in a previous blog entry.
My next blog entry will probably go about why SDTM has more and more derived variables.