On the last day of this year, I was working on a new software offering for the generation and validation of define.xml files. It is meant for people and organizations that do not use SDTM-ETL, the SAS Clinical Standards Toolkit, or similar tools, i.e. for people who need to generate their define.xml "post-SDTM-generation". The latter is far from ideal, but is still common in (too) many organizations. The tool is also meant for people who are sick and tired of generating define.xml from an Excel worksheet through a "black box" tool that has no documentation, no user-friendly graphical user interface, and that produces "garbage define.xml" in many cases. Those who use such a tool know what I mean ...
So, when working on the tool, I was confronted with an issue that has been bothering me for a number of years: SDTM labels.
In the SDTM-IG, one finds a label for each variable that is at least recommended for the specific domain. Additional standard variables can be added, and for these, one needs to look up the (prototype) variable in the "Study Data Tabulation Model" document and take the label from there.
So far so good. Then run the usual validation tool (also used by the FDA) on the define.xml, and you might get a good number of errors of the kind "Error: Variable Label in standard domain must match the variable label described in corresponding CDISC implementation guide" (note that the error message itself is in principle wrong, as it is also applied to standard variables that are not in the SDTM-IG but are in the SDTM document). Let the puzzling start!!!
Some of my customers have spent hours on such error messages, only to find out that there was an "uppercase-lowercase" mismatch in a single character of the label. For example "Dosing Frequency Per Interval" versus "Dosing Frequency per Interval" (this is an easy one). Or they found out that the difference was a dot, as in "Character Result/Finding in Std. Format" versus "Character Result/Finding in Std Format".
Let us do the following thought experiment: add a dot at the end of each of the labels in your define.xml and then submit it to the FDA. You will probably get your submission back with the message that it did not pass the "data fitness" program. Is your submission really of bad quality because of a dot at the end of each variable label? Did that dot change the meaning of your variables?
I also had customers complaining that they got errors even when making very minor changes to the SDTM labels, changes that explained much better to the reviewers what the variable was about. Although the minor change considerably improved the quality of their variable description, and thus of their submission, the validation tool used by the FDA claimed the opposite.
They (in my opinion correctly) argued that the validation software should not give an error, as the SDTM-IG nowhere states that labels must be provided exactly as given in the IG (I haven't been able to find such a statement either...).
In my opinion, validation tools should be based on "risk evaluation" or "risk assessment". I do not know yet how that would need to work in the case of SDTM and define.xml, but I do already see how we can take a few steps in that direction. One of the things I would like to see for SDTM variable labels is a "label equality assessment" between the "expected" label (from the SDTM-IG or the SDTM standard) and the "actual" label, quantified by a "label equality percentage". So in my new software, when validating the SDTM labels in the define.xml for compliance, I implemented the following stages:
- the "expected" and "actual" label are (case sensitive) identical. No problem => full compliance
- the "expected" and "actual" label are (case insensitive) identical, meaning that there is only an "uppercase-lowercase" problem (so semantically, they are equal) => an appropriate warning is generated
- the "expected" and "actual" label are different => calculate the "likeness" or "equality" number
So, how do I quantify "how much" the label was changed? I did a bit of research and found the "Levenshtein distance" between two strings: it counts the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one string into the other. This (non-negative integer) number can then easily be transformed into an "equality number" or "likeness" using the following formula:
equality number = 1.0 - (LD / max(string-length(s1),string-length(s2)))
where LD = Levenshtein distance
s1 = first string (i.e. "actual" label)
s2 = second string (i.e. "expected" label)
The result is a number between 0.0 (completely different labels) and 1.0 (completely identical labels).
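In code, this could look roughly as follows (again a Python sketch under the same assumptions, not the tool's actual implementation):

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn s1 into s2 (classic dynamic programming)."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def equality_number(actual: str, expected: str) -> float:
    """1.0 = completely identical labels, 0.0 = completely different labels."""
    if not actual and not expected:
        return 1.0
    distance = levenshtein(actual, expected)
    return 1.0 - distance / max(len(actual), len(expected))
```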
A few examples are given below:
| SDTM variable | Expected Label | Actual Label | Equality Number |
|---|---|---|---|
| LBTESTCD | Lab Test or Examination Name | Laboratory Test or Examination Name | 0.80 |
| MBSTRESC | Character Result/Finding in Std Format | Character Result/Finding in Std. Format | 0.97 |
| TADTC | Date/Time of Accountability Assessment | Date/Time of Drug Accountability Assessment | 0.88 |
| MBSTRESC | Character Result/Finding in Std Format | The quick brown fox jumps over the lazy dog | 0.12 |
As one can see, this "equality number" or "likeness number" quantifies much better how "much alike" the provided label is to the label expected from the SDTM-IG or the SDTM standard. Much better than the check implemented in the validation software used by the FDA, where the outcome can only be "0" (not identical) or "1" (identical, case sensitive).
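Using the sketch above, the first two table rows can, for example, be reproduced as follows:

```python
# Reproducing the first two table rows (values rounded to two decimals)
print(round(equality_number("Laboratory Test or Examination Name",
                            "Lab Test or Examination Name"), 2))            # 0.8
print(round(equality_number("Character Result/Finding in Std. Format",
                            "Character Result/Finding in Std Format"), 2))  # 0.97
```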
And this is the way I implemented it in my software.
Another small step towards "smart" validation software tools ...