Wednesday, September 7, 2022

CDISC Standards Validation: Why we need CDISC CORE

I regularly take a look at the "Pinnacle21 Forum", as I can often help people with Define-XML questions or issues, being one of the Define-XML standard developers myself.
I can also often help people with SDTM questions that otherwise remain unanswered by P21 itself for days or weeks.

Yesterday, I found an entry that once again made me sad and angry at the same time ...
Here it is:

and the answer from P21:


So, I went looking into rule SD1339 myself. From the P21 website listing the validation rules, we find:


The rule doesn't say anything about what exactly is meant by the "start date" and the "collection date", nor which of the two must be used as the basis for the epoch. It also shows that it is an "FDA-only" rule, so not originating from CDISC or any other organization, and surely not from a company.
So I looked into the FDA publication itself, which can be downloaded as an Excel file from here. For rule SD1339/FDAB022 we find:

which provides even less detail ...

So, is this a good validation rule?

What should a good validation rule look like?

Anyone working in the rule validation world knows that there are some very important parts in rules that should always be present, one being the "preconditions":

The preconditions explain (list) under which conditions the rule is applicable. For SDTM this means "for which domains and which variables the rule is applicable". In the P21 rule description, this information is available; in the FDA description it is not described in sufficient detail.
However, it must also be described (also as part of the preconditions) which dependencies exist. For SDTM this means that for Findings domains, there is a dependency on the collection date (--DTC), whereas for Events and Interventions domains there is a dependency on the start date (--STDTC).
This information, however, is provided neither in the P21 nor in the FDA information for rule SD1339/FDAB022.
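To make the dependency concrete, here is a minimal Python sketch of how such a precondition could be written down explicitly. It only encodes what is stated above (Findings use --DTC; Events and Interventions use --STDTC). The function names, the simplified list of elements and the half-open date intervals are assumptions made for illustration, not the FDA's or P21's actual logic.

```python
from datetime import date

# Observation class -> suffix of the timing variable the epoch depends on,
# as described in the text above.
REFERENCE_SUFFIX = {
    "FINDINGS": "DTC",
    "EVENTS": "STDTC",
    "INTERVENTIONS": "STDTC",
}


def reference_variable(domain, observation_class):
    """Return the variable name the epoch check depends on, e.g. LBDTC or AESTDTC."""
    return domain + REFERENCE_SUFFIX[observation_class.upper()]


def expected_epoch(record, domain, observation_class, elements):
    """Derive the epoch for a record from a (hypothetical, simplified) element list.

    `elements` is a list of (epoch, start, end) tuples with `date` values;
    intervals are treated as start-inclusive and end-exclusive (an assumption).
    Returns None when the precondition is not met or no interval matches.
    """
    value = record.get(reference_variable(domain, observation_class))
    if not value or len(value) < 10:
        return None  # no complete date available: the rule cannot be evaluated
    day = date.fromisoformat(value[:10])
    for epoch, start, end in elements:
        if start <= day < end:
            return epoch
    return None


elements = [
    ("SCREENING", date(2022, 1, 1), date(2022, 1, 15)),
    ("TREATMENT", date(2022, 1, 15), date(2022, 3, 1)),
]
lab_record = {"LBDTC": "2022-01-10T08:30", "EPOCH": "SCREENING"}
print(expected_epoch(lab_record, "LB", "Findings", elements) == lab_record["EPOCH"])
```

With the dependency spelled out like this, anyone can see which date drives the check and in which cases the rule simply does not apply.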

Why is this so important?

If the FDA publishes rules, sponsors must obtain "legal certainty". They must be 100% sure, for each and every case, not only whether the rule applies or not (no "wiggle room"), but also that the outcome of applying the rule is 100% correct.
With a (vague) rule description as above, this is however not possible at all.

So, one wonders who at the FDA developed this rule, and why it is described in such an incredibly vague way. When one asks the FDA, one does not get an answer.

And, if the rule description by the FDA is so vague, how is it that P21 has a somewhat more precise (although still insufficient) description of the rule? Do they know more than other parties (like other software vendors) are allowed to know?

The suspicion arises that these validation rules were not developed by the FDA, but by an external company on behalf of the FDA. There is nothing wrong with this, provided that the rules developed by the external provider have undergone quality control by the FDA and meet all quality criteria for validation rules, such as a very precise description of the "preconditions" and "dependencies".
From the very vague "rule description" in the FDA Excel worksheet, however, it looks as if such a QC and validation by the FDA was never performed.

Such a procedure (or lack of a good one) easily leads to a so-called "black box" implementation:

When the external company that developed the rules on behalf of the FDA (and publishes them on its website) then develops, distributes and sells software that implements these rules, but without publishing the details of how these rules were implemented (in this case the dependency on --DTC and --STDTC), it is impossible for the user of the software to understand what preconditions the software uses to decide whether a record violates the rule or not (hence "black box"). As one can understand, this can easily lead to "false positives".
Essentially, "legal certainty" cannot be provided at all this way.

Another example is the rule SD0026/CG0425 "Missing value for --ORRESU, when --ORRES is provided". From the P21 website:


Again, no preconditions, i.e. when the rule applies and when not, or what the dependencies are, are provided.

Everyone with some knowledge of SDTM will immediately scream that this rule is nonsense! There are so many situations where no unit in --ORRESU should be provided, e.g. for VSTESTCD=FRMSIZE (frame size). The expected values for VSORRES for VSTESTCD=FRMSIZE are "SMALL", "MEDIUM" and "LARGE". There is no unit at all.

But if we run the validation software, we do not get a warning when VSORRESU is not populated for VSTESTCD=FRMSIZE. According to the rule description however, the software should throw a warning.
So, it looks as if some preconditions have been built into the software which we (i.e. everyone except P21) do not know about. And we cannot find out either, as the source code of the software is not public.
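The difference between the rule as described and the rule as it apparently behaves can be illustrated with a small Python sketch. The first function implements the published text literally. The second adds one possible precondition: a list of test codes for which no unit is expected. That list and both function names are purely hypothetical; we do not know what the actual software does.

```python
# Hypothetical list of test codes for which no unit is expected.
# FRMSIZE is the example from the text; a real list would be much longer.
UNITLESS_TESTCDS = {"FRMSIZE"}


def sd0026_as_described(record, domain):
    """Literal reading of the rule text: --ORRES populated but --ORRESU missing."""
    orres = record.get(f"{domain}ORRES")
    orresu = record.get(f"{domain}ORRESU")
    return bool(orres) and not orresu


def sd0026_with_precondition(record, domain, unitless=UNITLESS_TESTCDS):
    """Same check, but only applied when a unit can be expected at all."""
    if record.get(f"{domain}TESTCD") in unitless:
        return False  # precondition not met: the rule does not apply
    return sd0026_as_described(record, domain)


frame_size = {"VSTESTCD": "FRMSIZE", "VSORRES": "MEDIUM", "VSORRESU": ""}
weight = {"VSTESTCD": "WEIGHT", "VSORRES": "70", "VSORRESU": ""}

print(sd0026_as_described(frame_size, "VS"))       # flagged by the literal rule
print(sd0026_with_precondition(frame_size, "VS"))  # not flagged once the precondition is explicit
print(sd0026_with_precondition(weight, "VS"))      # still flagged: a unit is expected here
```

As long as such a precondition lives only inside closed software, nobody outside the vendor can tell which of the two versions is actually being executed.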

Again, do we get "legal certainty" this way?

Why we need CORE

And this is exactly why we need CDISC CORE.
For each rule, CORE will publish the exact implementation (with all preconditions) as "open source", and everyone (including software vendors) can exactly see how the rule is (or needs to be) implemented. So, no "hidden" preconditions anymore...
These rules are and will be published in YAML format. This has the advantage that they are easy to read and understand. An example:

At the same time they are however also machine-executable.
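To give an idea of what "readable and machine-executable at the same time" means, here is a minimal Python sketch in which a rule is plain data and a few lines of generic code evaluate it. This is explicitly not the CORE rule schema: the rule ID, the keys and the operator names are invented for illustration, and the example check (a result is populated although --STAT is "NOT DONE") is just a convenient demonstration case.

```python
# A rule expressed as plain data. All keys and the ID are hypothetical.
RULE = {
    "id": "DEMO-001",
    "description": "--ORRES is populated although --STAT is NOT DONE",
    "classes": ["FINDINGS"],  # precondition: observation classes the rule applies to
    "violation_when": [       # all conditions must hold for a record to be flagged
        {"variable": "--STAT", "operator": "equals", "value": "NOT DONE"},
        {"variable": "--ORRES", "operator": "non_empty"},
    ],
}


def _matches(condition, value):
    """Evaluate a single condition against a variable value."""
    operator = condition["operator"]
    if operator == "equals":
        return value == condition["value"]
    if operator == "non_empty":
        return value not in (None, "")
    raise ValueError(f"unknown operator: {operator}")


def violations(rule, domain, observation_class, records):
    """Return the indices of the records that violate the rule."""
    if observation_class not in rule["classes"]:
        return []  # precondition not met: the rule does not apply to this domain
    flagged = []
    for index, record in enumerate(records):
        if all(
            _matches(c, record.get(c["variable"].replace("--", domain)))
            for c in rule["violation_when"]
        ):
            flagged.append(index)
    return flagged


lab_records = [
    {"LBSTAT": "NOT DONE", "LBORRES": "5.1"},
    {"LBSTAT": "NOT DONE", "LBORRES": ""},
    {"LBSTAT": "", "LBORRES": "5.1"},
]
print(violations(RULE, "LB", "FINDINGS", lab_records))
```

The point of the design is that the preconditions are part of the published rule itself, rather than being buried in the code that executes it.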

The CORE rules are currently being developed by a large group of volunteers. The large number of people involved (especially from sponsor companies) greatly surprised me. Usually it is very hard to find volunteers for technical work. The very well organized way in which the rules are developed, together with the choice of YAML, seems to have created a "low threshold" for people wanting to contribute. Among the large number of people currently developing the rules, we see a lot of non-programmers who nevertheless have a solid knowledge of (and especially extensive experience with) the CDISC standards.
That there are so many volunteers also shows that there really is a need for CORE (also for FDA rules), and, in my interpretation, that many sponsors are sick and tired of having to work with "black box" software and with rule sets that are non-transparent, having many "hidden" preconditions and generating lots of "false positives".

Recently, I decided to join the project as a volunteer, my first task being to work on the SEND rules for SENDIG-3.1, described in the "SEND Conformance Rules v4.0".
I already worked on SENDIG rules several years ago, but then using an XQuery implementation. This set of rules (in XQuery format) is still available as open source. The problem with XQuery, however, is that it is only applicable to CDISC's own Dataset-XML format, which (even after 10 years) hasn't been adopted yet by the FDA, although it would solve many problems.

This also meant that the impact of these "Open Rules" was very limited ...
CORE will however essentially be "submission format neutral". The "Minimum Viable Product" will concentrate on SAS Transport 5 format. It will however be very easy to also implement it for other formats such as CSV (comma-separated values) and the new CDISC Dataset-JSON format, for which there will soon be a "Hackathon". I say "very easy to implement" because for modern formats such as JSON and XML there are very many libraries available and millions of developers who know how to work with these formats, in contrast to SAS Transport (did you ever try to work with XPT files outside SAS?).

Other advantages of using CORE

One of the advantages of using CORE will also be that users can select (filter) rules. This has been nicely demonstrated by Nick de Donder as part of a CORE webinar. With P21 it is "all or nothing", meaning that P21 is very hard to use during the development of the datasets, when some information is still missing.

Another great advantage is that sponsors and service providers will be able to develop additional rules "on top" of the existing CDISC, FDA and PMDA rules. This will be extremely useful for checking internal rules, such as development rules.
This is currently completely impossible with the P21 software.

As we already have a lot of experience in the "open" implementation of CDISC-related validation rules, and we are currently gaining experience with CORE and the YAML implementation, we will be offering services in the future to sponsors and service providers to extend the rules with their own (additional) ones.


CORE will really be needed also for FDA rules. If the FDA decides to cooperate with CDISC on this, for the first time in history, we will have FDA validation rules that are fully transparent, including all preconditions and dependencies (currently "hidden"), and for the first time in history, this will finally create "legal certainty" for sponsors when validating their submission datasets against the FDA rules.

Any vendor (or even open source initiative) can then start offering a software implementation of the CORE-FDA rules. As CORE delivers the "reference implementation", every compliant software implementation will need to provide the same outcome for the same (test) case, otherwise the software is not "CORE compliant". This also means that it will no longer matter which CORE-compliant software the sponsor is using, and which CORE-compliant software the FDA is using, as, due to CORE being the "reference implementation", both must always provide exactly the same outcomes. Both FDA and sponsors are then free to select any CORE-compliant software for their validation work. This is especially important for sponsors, allowing them to move away from "we must use the software from vendor X, as the FDA is also using it". It allows them to select validation software based on quality, user-friendliness, extensibility (e.g. adding internal rules), and last but not least, cost.
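The idea of "same outcome for the same test case" can be sketched in a few lines of Python. Everything here is hypothetical: the validator functions, the test cases and the notion of reporting violations as record indices are assumptions for illustration, not how CORE compliance is actually assessed.

```python
def compare_with_reference(reference, candidate, test_cases):
    """Return the names of the test cases where the candidate deviates from the reference.

    `reference` and `candidate` are functions taking a list of records and
    returning the indices of the records they flag as violations.
    """
    return [
        name
        for name, records in test_cases.items()
        if sorted(reference(records)) != sorted(candidate(records))
    ]


# Two toy implementations of the same check (result missing although a test code is given).
def reference_validator(records):
    return [i for i, r in enumerate(records) if r.get("VSTESTCD") and not r.get("VSORRES")]


def vendor_validator(records):
    # Deviates from the reference: only checks the first record.
    return [i for i, r in enumerate(records[:1]) if r.get("VSTESTCD") and not r.get("VSORRES")]


test_cases = {
    "all_populated": [{"VSTESTCD": "WEIGHT", "VSORRES": "70"}],
    "second_missing": [
        {"VSTESTCD": "WEIGHT", "VSORRES": "70"},
        {"VSTESTCD": "HEIGHT", "VSORRES": ""},
    ],
}

print(compare_with_reference(reference_validator, reference_validator, test_cases))
print(compare_with_reference(reference_validator, vendor_validator, test_cases))
```

An empty list means the candidate agrees with the reference on every test case; any name in the list points to a case where the two tools would give a sponsor different answers.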

Some final words

As CORE is all "open source", and part of the "CDISC Open-Source Alliance" (COSA) initiative, I expect that many software vendors will start offering CORE-compliant validation tools, or tools with CORE integrated, such as mapping tools, this time not based on "black box" implementations of rules that also cause many false positives.

So, this time there will be real competition, based on quality (also of services), rather than being based on "what the FDA is using".