Tuesday, August 30, 2016

Why --LOBXFL should not be in SDTM

In my previous post from last week, I argued that the new SDTM variable --LOBXFL (Last Observation Before Exposure Flag) should not be in SDTM, as it is a derived variable, and can easily be calculated "on the fly" by review tools.
I also promised to implement such an "on the fly" derivation in the open-source review tool "Smart Dataset-XML Viewer". The latter already has features for "on the fly" calculation of other derived variables such as EPOCH and --DY.

It took me about 6 hours to implement the new feature "highlight last observation before exposure" in the Viewer. When the user now selects "Options - Settings" and navigates to the "Smart features" tab, a new option becomes visible:

The new option, near the bottom, is "Derive and Highlight Last Observation records before first Exposure". Essentially, it highlights the records for which the future --LOBXFL would have the value "Y", but derived "on the fly" instead of relying on the flag in the record.
Also note the first two checkboxes, which allow "on the fly" derivation of the first and last study treatment (based on EX) and display the result as a tooltip on the DM record, essentially making RFXSTDTC (Date/Time of First Study Treatment) superfluous.

When checking the checkbox "Derive and Highlight Last Observation records before first exposure", a new dialog is displayed, asking the user to choose between two additional options:

It asks the user whether the derivation of the "last observation before first exposure" records should be based on trusting RFXSTDTC in DM, or whether the tool should itself retrieve the first exposure for each subject from EX. The second option is of course the better one, as reviewers should essentially make their own judgements and not rely on derived information (which may be erroneous) submitted by the sponsor.
However, for the sake of demonstration, let us first "trust" the submitted value of RFXSTDTC (Date/Time of First Study Treatment).
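
To make the second option concrete, here is a minimal Python sketch of how a first exposure date/time per subject could be derived from EX. It is not the Viewer's actual source code. The function name and the dictionary-based record layout are assumptions for illustration, as is the premise that EXSTDTC values are ISO 8601 strings that sort chronologically as plain text.

```python
from typing import Dict, Iterable, Mapping


def first_exposure_by_subject(ex_records: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Return the earliest EXSTDTC per USUBJID.

    Assumption: EXSTDTC values are ISO 8601 strings in extended format,
    so that plain string comparison orders them chronologically.
    """
    first: Dict[str, str] = {}
    for rec in ex_records:
        subject = rec.get("USUBJID")
        start = rec.get("EXSTDTC")
        if not subject or not start:
            continue  # skip records without a subject or a start date/time
        if subject not in first or start < first[subject]:
            first[subject] = start
    return first


# Made-up EX records, for illustration only
ex = [
    {"USUBJID": "S-001", "EXSTDTC": "2014-01-16"},
    {"USUBJID": "S-001", "EXSTDTC": "2014-01-02"},
    {"USUBJID": "S-002", "EXSTDTC": "2012-08-05T09:30"},
    {"USUBJID": "S-002", "EXSTDTC": ""},
]
print(first_exposure_by_subject(ex))
```

Subjects without any EX record (screen failures, for example) simply do not appear in the result.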

After loading the SDTM submission datasets, let us have a quick look at the DM dataset. Here it is:

One sees that the first and last date/time of study treatment exposure are displayed as a tooltip on the "USUBJID" cell, making RFXSTDTC and RFXENDTC superfluous (also note that subject 1057 was a screen failure). For subject 1015, the date/time of first exposure is 2014-01-02, as derived from the EX records.

Let us now inspect the VS (vital signs) dataset. I moved some columns around (another of the many features of the viewer) to obtain a more "natural" order of the variables.

One sees that 3 records for DIABP (diastolic blood pressure) are highlighted. Their VSDTC (date/time of collection) is identical and equal to the first treatment date.
This already raises a first discussion point about baseline flags, which is really about data quality: if the treatment and observation time points are not collected precisely (i.e. including the time, not only the date), one cannot always know whether an observation was made before or after the first treatment. In this case, one only knows that the observations were made ON the same day as the first treatment.
Also, we see that the sponsor-assigned baseline flags (VSBLFL=Y) are correct.
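
The precision problem can be illustrated with a small Python sketch. The function name and its three return values are my own invention for this example, and it assumes well-formed ISO 8601 strings without time zone offsets. Both values are compared only down to the precision they share; if they are equal at that precision, the order cannot be established.

```python
def relation_to_exposure(obs_dtc: str, exposure_dtc: str) -> str:
    """Classify an observation as 'before', 'after' or 'indeterminate'
    relative to the first exposure.

    Assumption: both arguments are ISO 8601 strings (e.g. "2014-01-02"
    or "2014-01-02T08:30") without time zone offsets.
    """
    # Compare only the leading part that both values have in common
    n = min(len(obs_dtc), len(exposure_dtc))
    obs_part, exp_part = obs_dtc[:n], exposure_dtc[:n]
    if obs_part < exp_part:
        return "before"
    if obs_part > exp_part:
        return "after"
    # Equal at the shared precision: the order cannot be determined
    return "indeterminate"


print(relation_to_exposure("2014-01-01", "2014-01-02"))
print(relation_to_exposure("2014-01-02", "2014-01-02"))
print(relation_to_exposure("2014-01-02T08:00", "2014-01-02T09:15"))
```

With dates only, an observation on the day of first treatment is "indeterminate"; once both values carry a time, the same day can be resolved into "before" or "after".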

Let us look somewhat further in the table:

We see that the last observation for "HEIGHT" before first study treatment is highlighted, as are 3 records for "PULSE" (Pulse Rate). However, we also see that the sponsor did not set a baseline flag on the highlighted "HEIGHT" record. It may have been forgotten, or it may have been decided that "HEIGHT" is irrelevant for the analysis of this study. A reviewer may judge differently.
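
A derivation of this kind can be sketched in a few lines of Python. This is an illustration under assumptions, not the Viewer's implementation: the function name and record layout are hypothetical, and the rule "latest collection date/time per subject and test code that is not clearly after the first exposure, ties included" is my reading of the behaviour described here, where records collected on the day of first treatment are highlighted.

```python
from typing import Dict, List, Mapping, Sequence, Tuple


def last_obs_before_exposure(
    records: Sequence[Mapping[str, str]],
    first_exposure: Mapping[str, str],
    dtc_var: str = "VSDTC",
    testcd_var: str = "VSTESTCD",
) -> List[int]:
    """Return the indexes of the records to highlight.

    Assumption: date/times are ISO 8601 strings; a record counts as
    eligible when it is not clearly after the subject's first exposure.
    """
    latest: Dict[Tuple[str, str], str] = {}
    for rec in records:
        subject = rec.get("USUBJID", "")
        dtc = rec.get(dtc_var, "")
        exposure = first_exposure.get(subject, "")
        if not dtc or not exposure:
            continue  # no collection date, or subject never exposed
        n = min(len(dtc), len(exposure))
        if dtc[:n] > exposure[:n]:
            continue  # clearly after the first exposure
        key = (subject, rec.get(testcd_var, ""))
        if key not in latest or dtc > latest[key]:
            latest[key] = dtc
    return [
        i
        for i, rec in enumerate(records)
        if rec.get(dtc_var)
        and latest.get((rec.get("USUBJID", ""), rec.get(testcd_var, ""))) == rec.get(dtc_var)
    ]


# Made-up VS records, for illustration only
vs = [
    {"USUBJID": "S-001", "VSTESTCD": "DIABP", "VSDTC": "2013-12-26"},
    {"USUBJID": "S-001", "VSTESTCD": "DIABP", "VSDTC": "2014-01-02"},
    {"USUBJID": "S-001", "VSTESTCD": "DIABP", "VSDTC": "2014-01-02"},
    {"USUBJID": "S-001", "VSTESTCD": "DIABP", "VSDTC": "2014-01-16"},
    {"USUBJID": "S-001", "VSTESTCD": "HEIGHT", "VSDTC": "2013-12-26"},
    {"USUBJID": "S-002", "VSTESTCD": "DIABP", "VSDTC": "2013-12-26"},
]
print(last_obs_before_exposure(vs, {"S-001": "2014-01-02"}))
```

In this made-up data, subject S-002 has no first exposure and therefore gets no highlighted records, while the single HEIGHT record of S-001 is selected regardless of whether anyone flagged it as baseline.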

For the second subject (1023), we find:

Something is strange here! The first three records for DIABP are marked as "last observation before first study treatment", but the baseline flags set by the sponsor are not on these records. Instead, they appear on the observations of the next visit.

What happened?
Did the sponsor assign the baseline flags incorrectly? Or did something else happen?
Another possibility is that RFXSTDTC was incorrectly derived by the sponsor (in DM), and as we decided to base the "on the fly" derivation on RFXSTDTC (which reviewers should, in my opinion, not do), the "last observation" records are incorrectly assigned.

So let's not trust the submitted RFXSTDTC and let the tool derive it from the EX records:

And then inspect the generated table for subject 1023 again:

We now see that the highlighted records (derived "on the fly") correspond to the records for which the sponsor set the baseline flag to "Y".
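
Such a visual comparison could also be automated. The following Python sketch lists the records where the sponsor's baseline flag and a derived "last observation before exposure" selection disagree. The function name is hypothetical, and the derived selection is simply passed in as a list of record indexes, however it was obtained.

```python
from typing import Iterable, List, Mapping, Sequence


def flag_mismatches(
    records: Sequence[Mapping[str, str]],
    derived_indexes: Iterable[int],
    flag_var: str = "VSBLFL",
) -> List[int]:
    """Return the indexes of records where the submitted flag and the
    derived selection disagree (flagged but not derived, or vice versa)."""
    derived = set(derived_indexes)
    mismatches: List[int] = []
    for i, rec in enumerate(records):
        sponsor_flag = rec.get(flag_var, "") == "Y"
        if sponsor_flag != (i in derived):
            mismatches.append(i)
    return mismatches


# Made-up records: the sponsor flagged the second visit (indexes 2 and 3)
vs_flags = [
    {"VSTESTCD": "DIABP", "VSBLFL": ""},
    {"VSTESTCD": "DIABP", "VSBLFL": ""},
    {"VSTESTCD": "DIABP", "VSBLFL": "Y"},
    {"VSTESTCD": "DIABP", "VSBLFL": "Y"},
]
print(flag_mismatches(vs_flags, [0, 1]))  # derived selection on the first visit
print(flag_mismatches(vs_flags, [2, 3]))  # derived selection on the second visit
```

A mismatch is a prompt for the reviewer to look closer, not proof of an error: a sponsor may deliberately leave a test such as HEIGHT without a baseline flag.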

If we go back to the DM record for this subject, everything becomes clear:

We see that, somehow, the value for RFXSTDTC was not correctly assigned by the sponsor. It states "2012-08-02", whereas the real first exposure date/time (derived "on the fly" from EX and displayed in the tooltip) is "2012-08-05".
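
A cross-check of this kind is easy to sketch in Python. The function name and the simplified record layout are assumptions, and the example uses only the dates mentioned in this post, with the RFXSTDTC value for subject 1015 assumed to be equal to its first exposure date. The comparison is a plain string comparison, so a submitted value with a different precision than the EX value would also be reported.

```python
from typing import Dict, Iterable, List, Mapping, Tuple


def rfxstdtc_discrepancies(
    dm_records: Iterable[Mapping[str, str]],
    ex_records: Iterable[Mapping[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (USUBJID, submitted RFXSTDTC, first EXSTDTC) for every
    subject where the two values differ.

    Assumption: ISO 8601 strings that sort chronologically as text.
    """
    first: Dict[str, str] = {}
    for rec in ex_records:
        subject, start = rec.get("USUBJID", ""), rec.get("EXSTDTC", "")
        if subject and start and (subject not in first or start < first[subject]):
            first[subject] = start
    issues: List[Tuple[str, str, str]] = []
    for rec in dm_records:
        subject = rec.get("USUBJID", "")
        submitted = rec.get("RFXSTDTC", "")
        derived = first.get(subject, "")
        if submitted != derived:
            issues.append((subject, submitted, derived))
    return issues


# Simplified records using the dates mentioned in this post
dm = [
    {"USUBJID": "1015", "RFXSTDTC": "2014-01-02"},  # assumed equal to EX
    {"USUBJID": "1023", "RFXSTDTC": "2012-08-02"},
]
ex = [
    {"USUBJID": "1015", "EXSTDTC": "2014-01-02"},
    {"USUBJID": "1023", "EXSTDTC": "2012-08-05"},
]
print(rfxstdtc_discrepancies(dm, ex))
```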


These results show again that:
  • derived variables should NOT be in SDTM, as they can easily be calculated or derived "on the fly" by review tools
  • derived variables mean data redundancy, which is always bad in data sets: if two values for the same data point differ, one cannot know which one is incorrect
  • reviewers should NEVER, NEVER make decisions based on derived variables that were submitted by the sponsor, be it baseline flags, --DY values, or EPOCH values. They should use their tools for deriving them themselves directly from the real source data.
  • implementing such "on the fly" derivations in review tools is a "piece of cake". It took me just 6 hours to implement the current one in the "Smart Dataset-XML Viewer". Implementing other similar features cost me even less time.

I still need to clean up my source code a bit, and will then publish a new version of the software, including the source code, on the SourceForge project website. Once done, I will let you know through a comment.

As usual, your comments are very welcome.

Also read the follow-up post "--LOBXFL: a follow up"
