At the European CDISC Interchange we once again discussed replacing CDISC controlled terminology for units with UCUM, the latter being the worldwide standard for units, used everywhere in healthcare and in electronic health records (and mandatory in HL7-CDA).
Once again, I got the objection that UCUM is not usable in SEND (Standard for Exchange of Nonclinical Data), which deals mainly with preclinical research using animals, bacteria, etc.
The statement that UCUM is not usable for SEND is simply not true. Most people who raise this argument haven't read the UCUM specification, so I will explain here how UCUM can (and should) be used for SEND.
Consider the following "unit" used in SEND: g/animal/day
It contains three components, one of which is not a unit at all: "animal". Unfortunately, throughout the CDISC-CT for units, objects or properties (like "animals") have been mixed up with real units. Sometimes this is simply necessary to compensate for basic errors made in the SDTM and SEND models.
Because this also happens in other industries and sciences, UCUM has developed a mechanism for dealing with it, called "annotations". So the UCUM notation for "g/animal/day" is:
g/{animal}/d
with the "topic" of the unit (what it is about) placed in curly brackets: the "annotation". Also note that "day" is written differently ("d") than in the CDISC notation.
One of the advantages of using UCUM is that it comes with a machine-readable XML file (ucum-essence.xml) that describes all the units and their relations. So UCUM "knows" that a day is 24 hours, that an hour is 60 minutes and so on, all in a machine-readable way. So if you want to calculate what N g/{animal}/d amounts to in "milligram per animal per minute" (mg/{animal}/min), this can be fully automated (which is impossible when using CDISC-CT).
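To make that calculation concrete, here is a minimal Python sketch. It does not parse ucum-essence.xml. Instead it assumes a small, hand-copied table of factors (the names GRAMS_PER_MASS_UNIT, SECONDS_PER_TIME_UNIT, split_rate_unit and convert_rate are all made up for this illustration) and only handles units of the form mass/{annotation}/time.

```python
from fractions import Fraction

# Assumption: a small hand-copied subset of UCUM factors, for illustration only.
# A real tool would derive these from ucum-essence.xml.
GRAMS_PER_MASS_UNIT = {"g": Fraction(1), "mg": Fraction(1, 1000), "kg": Fraction(1000)}
SECONDS_PER_TIME_UNIT = {
    "s": Fraction(1),
    "min": Fraction(60),
    "h": Fraction(3600),
    "d": Fraction(86400),
}


def split_rate_unit(unit):
    """Split a unit of the form mass/{annotation}/time into its three parts."""
    parts = unit.split("/")
    if len(parts) != 3 or not (parts[1].startswith("{") and parts[1].endswith("}")):
        raise ValueError(f"expected mass/{{annotation}}/time, got {unit!r}")
    return parts[0], parts[1], parts[2]


def convert_rate(value, source, target):
    """Convert a value between two mass/{annotation}/time units."""
    src_mass, src_ann, src_time = split_rate_unit(source)
    dst_mass, dst_ann, dst_time = split_rate_unit(target)
    if src_ann != dst_ann:
        raise ValueError("annotations differ, so the units describe different things")
    # Mass is in the numerator, time in the denominator, hence the inverted ratio.
    mass_factor = GRAMS_PER_MASS_UNIT[src_mass] / GRAMS_PER_MASS_UNIT[dst_mass]
    time_factor = SECONDS_PER_TIME_UNIT[dst_time] / SECONDS_PER_TIME_UNIT[src_time]
    return Fraction(value) * mass_factor * time_factor


print(convert_rate(1440, "g/{animal}/d", "mg/{animal}/min"))  # prints 1000
```

The annotation plays no role in the arithmetic; the sketch only checks that it is the same on both sides.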
The next objection from the SEND people followed immediately: "yes, but you can write anything between the curly brackets (which mark that part as an 'annotation'), so we no longer have any control over what people will submit".
Well, the answer to that is pretty easy: CDISC should not control "units" like "g/animal/day"; it should control the annotations. So instead of publishing ever-growing lists containing things like "g/animal/day", it should publish lists of allowed "annotations". I picked out a few examples from the current SEND-CT that could be handled this way:
{animal}
{cage}
{CAPSULE}
{BAR}
Note that "bar" has a totally different meaning in CDISC-CT than in the rest of the world: a "bar" in CDISC-CT is a unit of packaging, like "a bar of chocolate", whereas for the rest of the world it is a unit of pressure. By marking it as a "UCUM annotation" (i.e. putting it in curly brackets), it is immediately clear that it is not a unit of pressure but something else, and it can even be used alongside the real "bar" unit.
Another advantage of having CDISC control the annotations instead of the "units list" itself is that it allows for flexibility. Today, if someone (e.g. an investigator) needs a unit "animals per cage", a request must be made to CDISC to extend the "UNIT" codelist with a new term, which takes months, with the possibility that the request is turned down.
When using UCUM with CDISC controlling the annotations, however, the term can be used immediately, as "cage" and "animal" are already in the list of allowed annotations. So the investigator can just use:
{animal}/{cage}
which is a valid UCUM unit.
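Checking a submitted unit against such a list of allowed annotations is easy to automate. The Python sketch below is only an illustration: the ALLOWED_ANNOTATIONS set is hypothetical (it just repeats the four examples listed above; CDISC publishes no such list today), and the function name is my own. It checks the annotations only, not whether the rest of the expression is valid UCUM.

```python
import re

# Hypothetical list of allowed annotations, taken from the examples in this post.
ALLOWED_ANNOTATIONS = {"animal", "cage", "CAPSULE", "BAR"}

# An annotation is whatever sits between one pair of curly brackets.
ANNOTATION_PATTERN = re.compile(r"\{([^{}]*)\}")


def disallowed_annotations(unit):
    """Return the annotations in a unit expression that are not on the allowed list."""
    return [a for a in ANNOTATION_PATTERN.findall(unit) if a not in ALLOWED_ANNOTATIONS]


print(disallowed_annotations("{animal}/{cage}"))  # prints []
print(disallowed_annotations("g/{chicken}/d"))    # prints ['chicken']
```

A plain "bar" (the pressure unit) contains no annotation at all, so it passes this check untouched, while "{BAR}" is recognized as the packaging annotation.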
Now the investigator realizes that "animals per cage" is not very precise. He/she has chickens, so what matters for the "density" (a UCUM property, by the way) is the number of chickens per square meter. Instead of having to request a new term once again, he/she can simply use:
{animal}/m2
which is again a
valid UCUM unit with the additional advantage that a computer can
immediately and fully automatically calculate how many "animals per
square yard" that is.
The investigator, however, also works with birds that can really fly (in contrast to chickens). In that case, the density is better defined by the number of birds per cubic meter. Again without having to request a new term, he/she can now write:
{animal}/m3
which again is fully automatically interconvertible to e.g. "number of animals per gallon".
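As a rough illustration of these two conversions, here is a small Python sketch. The factors are hand-copied from the definitions of the international inch and yard and the US gallon rather than read from ucum-essence.xml, and the table and function names are invented for this example. I am assuming the UCUM codes [syd_i] for square yard and [gal_us] for the US gallon.

```python
from fractions import Fraction

INCH_IN_M = Fraction(254, 10000)  # international inch: 2.54 cm
YARD_IN_M = 36 * INCH_IN_M        # 1 yard = 36 inches

# Assumption: a tiny hand-made table of (dimension, size in m2 or m3).
DENOMINATORS = {
    "m2": ("area", Fraction(1)),
    "[syd_i]": ("area", YARD_IN_M ** 2),
    "m3": ("volume", Fraction(1)),
    "[gal_us]": ("volume", 231 * INCH_IN_M ** 3),  # US gallon = 231 cubic inches
}


def convert_density(value, source, target):
    """Convert a count density such as {animal}/m2 into another denominator unit."""
    src_ann, src_den = source.split("/")
    dst_ann, dst_den = target.split("/")
    if src_ann != dst_ann:
        raise ValueError("annotations differ, so the units describe different things")
    src_dim, src_size = DENOMINATORS[src_den]
    dst_dim, dst_size = DENOMINATORS[dst_den]
    if src_dim != dst_dim:
        raise ValueError("cannot convert between an area and a volume")
    # A larger target denominator holds proportionally more animals.
    return Fraction(value) * dst_size / src_size
```

For example, convert_density(10, "{animal}/m2", "{animal}/[syd_i]") gives the number of animals per square yard for 10 animals per square meter, and a call mixing "m2" with "m3" is rejected because the dimensions differ.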
So my proposal to CDISC is: discontinue the development of this ever-growing list of "units" that are not really units and that are not interconvertible by computers. Start using UCUM and publish lists of allowed annotations. For each SEND (but also SDTM) variable, CDISC can then publish a list of (one or more) "strongly preferred" units. For example, for "height of subjects" (DM.HEIGHT):
cm
m
[in_i]
(the latter is UCUM for "inches"). This is a valid set of UCUM units which are fully interconvertible by computers (UCUM "knows" that an inch is 2.54 cm; CDISC-CT does not have that in machine-readable form).
Or for a SEND variable that describes the amount of food for the animals:
g/{animal}/d
g/{animal}/wk
which are all valid UCUM units, with the additional advantage that even when the investigator has collected the amount of food as "gram per animal per month" (g/{animal}/mo), this can be fully automatically recalculated into one of the above.
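A "strongly preferred" list per variable could then drive an automatic normalization step. The Python sketch below shows the idea for the food example only. The PREFERRED_FOOD_UNITS list and the function name are hypothetical, and the time factors are hand-copied (UCUM defines "mo" as the mean Julian month, i.e. 365.25 / 12 = 30.4375 days) rather than read from ucum-essence.xml.

```python
from fractions import Fraction

# Assumption: hand-copied UCUM time factors, expressed in days.
DAYS_PER_PERIOD = {"d": Fraction(1), "wk": Fraction(7), "mo": Fraction("30.4375")}

# Hypothetical "strongly preferred" list for a food-amount variable.
PREFERRED_FOOD_UNITS = ["g/{animal}/d", "g/{animal}/wk"]


def to_preferred(value, unit):
    """Return (value, unit), converted to the first preferred unit if necessary."""
    if unit in PREFERRED_FOOD_UNITS:
        return Fraction(value), unit
    target = PREFERRED_FOOD_UNITS[0]
    prefix, period = unit.rsplit("/", 1)
    target_prefix, target_period = target.rsplit("/", 1)
    if prefix != target_prefix:
        raise ValueError(f"cannot normalize {unit!r} to {target!r}")
    # The period is in the denominator: a shorter period means a smaller value.
    factor = DAYS_PER_PERIOD[target_period] / DAYS_PER_PERIOD[period]
    return Fraction(value) * factor, target


print(to_preferred(487, "g/{animal}/mo"))  # prints (Fraction(16, 1), 'g/{animal}/d')
```

Values already in a preferred unit are passed through unchanged, so the investigator keeps the choice between per day and per week.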
Comments are very welcome, as always.
Thanks for the helpful examples, Jozef. The case for UCUM seems very solid to me.
There seem to be three objections to the use of UCUM:
The first is that whilst UCUM provides documentation of prefixes, core elements, and symbols that can be used to create 'standard' unit terms for consumption by the biomedical and healthcare industry, it does not provide a finite list of standardised, electronically consumable units of measure terminology. The custodians of UCUM would say that is because that list is, to all intents and purposes, infinite. Nonetheless, that does not preclude CDISC from creating a UCUM compliant list, following the UCUM rules. The CDISC unit codelist is not UCUM compliant.
The second is that unit fragments contained within UCUM do not cover the breadth of units required for all data submissions to FDA. I think you have nicely shown that not to be the case. One of the issues we have in clinical research is that the definitions of data elements and terminology are often ambiguous. The UCUM approach is explicit and eliminates any such ambiguity. I really like that.
The third is that UCUM has multiple terms with the same meaning (e.g. g/l and mg/ml). I see the complete coverage that UCUM provides as a benefit, not a hindrance. Often there is no worldwide agreement on preferred units (e.g. subjects' weight is measured in kg in some countries and in lbs in others). When GSK use local labs in studies (particularly cancer studies), we have to deal with very many units (i.e. units that would not be found in the CDISC terminology). In particular, we spend quite a bit of time determining conversion factors from esoteric units to our preferred unit. We also have to deal with cases where there is a business need to convert data from preferred units to other units. It would be much nicer to have an (industry) tool that could handle that for us, and that would require compliance with UCUM. None of that precludes CDISC from deciding which of the UCUM-compliant units are the CDISC preferred units and recording that in the form of a terminology.