At the European CDISC Interchange we
once again discussed replacing CDISC controlled terminology for units with
UCUM, the worldwide standard for units, which is used everywhere in
healthcare and in electronic health records (it is mandatory in HL7-CDA).
Once again, I got the objection that
UCUM is not usable in SEND (Standard for Exchange of Nonclinical Data),
which covers preclinical research using animals, bacteria,
etc.
The statement that UCUM is not
usable for SEND is simply not true. Most people who make this argument
haven't read the UCUM specification, so I will explain here how
UCUM can (and should) be used for SEND.
Consider the following "unit" used in SEND: g/animal/day
It contains 3
components, one of which is not a unit at all: "animal". Unfortunately, as
everywhere in CDISC-CT for units, objects or properties (like "animal") have been
mixed up with real units. Sometimes, though, this is necessary to
compensate for basic errors made in the SDTM and SEND models.
Now, as this
also happens in other industries and sciences, UCUM has developed a mechanism for dealing
with it, called "annotations". So the UCUM notation for "g/animal/day" is:
g/{animal}/d
where the "topic" of the unit (what it is about) is placed in curly brackets: the "annotation".
Also note that
"day" is written differently ("d") than in the CDISC notation.
One of the advantages of using UCUM is
that it comes with a machine-readable XML file that describes all the
units and their relations (ucum-essence.xml). So UCUM "knows" that a day is 24 hours, that
an hour is 60 minutes and so on, all in a form that software can
process. So if you want to calculate what N g/{animal}/d is in
"milligram per animal per minute", this can be fully automated (which is
impossible when using CDISC-CT).
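To make this concrete, here is a minimal sketch of such a conversion. It is only an illustration: the FACTORS table is a small hand-written subset of what ucum-essence.xml provides, the function names are made up for this example, and the parser only handles simple "a/b/c" expressions. Annotations are treated as the number 1, which is how UCUM handles them in calculations.

```python
import re

# Hand-written subset of scale factors (mass in grams, time in seconds).
# Assumption: a real tool would derive these from ucum-essence.xml.
FACTORS = {
    "g": 1.0, "mg": 1e-3, "kg": 1e3,
    "s": 1.0, "min": 60.0, "h": 3600.0, "d": 86400.0,
}

# An annotation is any text in curly brackets, e.g. {animal}.
ANNOTATION = re.compile(r"^\{[^{}]*\}$")


def term_factor(term):
    """Scale factor of one term; annotations count as the number 1."""
    if ANNOTATION.match(term):
        return 1.0
    return FACTORS[term]


def scale(unit):
    """Scale factor of a simple 'a/b/c' expression (no exponents, no parentheses)."""
    numerator, *denominators = unit.split("/")
    result = term_factor(numerator)
    for term in denominators:
        result /= term_factor(term)
    return result


def convert(value, source, target):
    """Convert value from the source unit to the target unit."""
    return value * scale(source) / scale(target)


print(convert(1.44, "g/{animal}/d", "mg/{animal}/min"))  # approximately 1.0
```

So 1.44 g/{animal}/d is 1 mg/{animal}/min (1440 mg spread over 1440 minutes), and no human had to look anything up.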
The next
objection of the SEND people followed immediately: "yes, but you can
write anything between the curly brackets (which mark that part as an
'annotation'), so we no longer have any control over what people will
submit".
Well, the answer
to that is pretty easy: CDISC should not control "units" like "g/animal/day", it should control the annotations. So instead of
publishing ever-growing lists containing things like "g/animal/day",
they should be publishing lists of allowed "annotations". I picked out a
few examples from the current SEND-CT that could serve as annotations:
{animal}
{cage}
{CAPSULE}
{BAR}
Note that "bar" has a totally different meaning in CDISC-CT than in the rest of the
world: a "bar" in CDISC-CT is a unit of packaging like "a bar of
chocolate", whereas for the rest of the world it is a unit of pressure.
By making clear that it is a "UCUM annotation", however (i.e. putting it
in curly brackets), it is immediately clear that it is not a unit of
pressure but something else, and it can even be used in parallel with the
real "bar" unit.
Another
advantage of having CDISC control the annotations instead of the "units list" itself is that it allows for flexibility. Today, if
someone (e.g. an investigator) needs a unit "animals per cage", a
request must be made to CDISC to extend the "UNIT" codelist with a new
term, which takes months, with the possibility that the new term request
is turned down.
When using UCUM,
however, with CDISC having control over the annotations, the term can
be used immediately, as "cage" and "animal" are already in the list of
allowed annotations. So the investigator can just use:
{animal}/{cage}
which is a valid UCUM unit.
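Checking a submitted unit against such a list of allowed annotations is easy to automate. The following is a minimal sketch: the ALLOWED_ANNOTATIONS set simply contains the four examples given above (it is not a published CDISC list), and the function name is made up for this illustration.

```python
import re

# Assumption: a hypothetical CDISC-published list of allowed annotations,
# filled here with the four examples from this post.
ALLOWED_ANNOTATIONS = {"animal", "cage", "CAPSULE", "BAR"}


def disallowed_annotations(unit):
    """Return the annotations in a UCUM expression that are not on the allowed list."""
    found = re.findall(r"\{([^{}]*)\}", unit)
    return [name for name in found if name not in ALLOWED_ANNOTATIONS]


print(disallowed_annotations("{animal}/{cage}"))  # []
print(disallowed_annotations("g/{chicken}/d"))    # ['chicken']
```

A validation tool would combine this with an ordinary UCUM syntax check for the remainder of the expression.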
Now the
investigator realizes that "animals per cage" is not very precise.
He/she has chickens, so what is important for the "density" (a UCUM
property, by the way) is the number of chickens per square meter. Instead
of needing to request a new term once again, he/she can simply use:
{animal}/m2
which is again a
valid UCUM unit with the additional advantage that a computer can
immediately and fully automatically calculate how many "animals per
square yard" that is.
The investigator,
however, also works with birds that can really
fly (in contrast to chickens). In that case, the density is better defined by the number of
birds per cubic meter. Without needing to request a new term
again, he/she can now write:
{animal}/m3
which again is fully automatically interconvertible to e.g. "number of animals per gallon".
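The two density conversions above can be sketched as follows. The constants are the exact international definitions (1 yard = 0.9144 m, 1 inch = 0.0254 m); I assume here that "gallon" means the US gallon of 231 cubic inches, which UCUM writes as [gal_us], and the square yard is [syd_i]. The function names are made up for this illustration.

```python
YARD_IN_M = 0.9144   # international yard, exact
INCH_IN_M = 0.0254   # international inch, exact

SQUARE_YARD_IN_M2 = YARD_IN_M ** 2        # UCUM [syd_i]
US_GALLON_IN_M3 = 231 * INCH_IN_M ** 3    # UCUM [gal_us], assumed meaning of "gallon"


def per_m2_to_per_square_yard(density):
    """Convert {animal}/m2 to {animal}/[syd_i]."""
    return density * SQUARE_YARD_IN_M2


def per_m3_to_per_us_gallon(density):
    """Convert {animal}/m3 to {animal}/[gal_us]."""
    return density * US_GALLON_IN_M3


print(per_m2_to_per_square_yard(10))   # about 8.36 animals per square yard
print(per_m3_to_per_us_gallon(1000))   # about 3.79 animals per US gallon
```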
So my proposal to
CDISC is: discontinue the development of this ever-growing list of "units"
that are not real units and are not interconvertible by computers.
Start using UCUM and publish lists of allowed annotations. For each SEND
(but also SDTM) variable, CDISC can then publish a list of (one or
more) "strongly preferred" units. For example for "height of subjects"
(DM.HEIGHT):
cm
m
[in_i]
(the latter is
UCUM for "inches"). This is a valid set of UCUM units which are fully
interconvertible by computers (UCUM "knows" that an inch is 2.54 cm;
CDISC-CT does not have that in machine-readable form).
Or for a SEND variable that describes the amount of food for the animals:
g/{animal}/d
g/{animal}/wk
which are both
valid UCUM units, with the additional advantage that even when the
investigator has been collecting the amount of food as "gram per animal
per month" (g/{animal}/mo), this can be fully automatically converted to either of the above.
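As a small sketch of that last conversion: UCUM defines "mo" as the mean Julian month, i.e. one twelfth of a Julian year of 365.25 days, and "wk" as 7 days. The function names below are made up for this illustration.

```python
DAYS_PER_MONTH = 365.25 / 12   # UCUM "mo": mean Julian month = 30.4375 d
DAYS_PER_WEEK = 7.0            # UCUM "wk"


def per_month_to_per_day(value):
    """Convert g/{animal}/mo to g/{animal}/d."""
    return value / DAYS_PER_MONTH


def per_month_to_per_week(value):
    """Convert g/{animal}/mo to g/{animal}/wk."""
    return value * DAYS_PER_WEEK / DAYS_PER_MONTH


print(per_month_to_per_day(304.375))   # 10.0 g/{animal}/d
print(per_month_to_per_week(304.375))  # 70.0 g/{animal}/wk
```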
Comments are very welcome, as always.