Friday, July 26, 2019

CDISC Validation: PointCross – Pinnacle21 v.3 comparison: Part 4: some more hot topics

In the fourth part of our series, we look at some more "hot topics" that were real problems with the Pinnacle21 validation software in prior versions. We will look into whether these were corrected in v.3.0.0 and compare how the new MySEND validation software from PointCross Life Sciences treats these cases.
We will especially concentrate on two major groups of items: uniqueness of records and when and how --ORRESU and --STRESU values ("Original Result Units" and "Standardized Result Units") must be populated in SDTM and SEND.

Uniqueness of records based on dataset keys defined in the define.xml

Some time ago, there was a report of false positive errors when using Pinnacle21 v.2.2 when people chose to use keys different from the example keys in the SDTM-IG.
As a test, I changed the definition for AE in a define.xml file and removed the key assignment for AESTDTC. I then added a second "Anxiety" record for the same patient with clearly different start and end dates.
According to the (remaining) keys defined in the define.xml (STUDYID, USUBJID, AEDECOD), this should lead to a "duplicate records" error.
Neither MySEND nor Pinnacle21 v.3.0 reported any issues. So it looks as if both packages ignore the key definitions from the define.xml and use their own ones (whatever these are remains a secret to me…).
The other way around, I duplicated a record in my LB (Laboratory) dataset for "Vitamin B12" and got a "duplicate records" error from both validators. MySEND reported that this was based on the combination of USUBJID, LBTESTCD, LBDTC, LBSPEC, LBMETHOD, VISITDY, LBDY, LBCAT, VISITNUM.
Pinnacle21 reported that this was based on the combination of USUBJID, LBTESTCD, LBCAT, LBSPEC, LBMETHOD, VISITNUM, VISITDY, LBDTC. The difference between the two is LBDY.
However, I never declared LBDY as a dataset key in my define.xml, nor did I for LBSPEC.
This seems to confirm that in both packages, the "uniqueness keys" from the define.xml are mostly or completely ignored.

I then changed the data to contain LBTPT and LBTPTNUM, gave the two records that were reported as non-unique different values for LBTPTNUM and LBTPT, and registered LBTPT as a key variable in the define.xml. The errors should now go away. They did in the case of Pinnacle21. However, when I gave both records the same value for LBTPTNUM (making them non-unique again), no issue was reported.
For MySEND, the error also went away when adding LBTPT and LBTPTNUM, assigning a key to LBTPTNUM, and giving the two records different values for LBTPTNUM.
When the two records had the same LBTPTNUM, however, no error was thrown either. The reason is probably that I had to assign different values for LBDTC, as I otherwise got some other errors.
Altogether, it looks as if both MySEND and Pinnacle21 have their own rules for defining what "record uniqueness" is, and ignore the key definitions from the define.xml. This is essentially a completely wrong approach, as the define.xml is "the sponsor's truth" and the keys should be taken solely from the define.xml.
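What a sponsor would expect can be sketched in a few lines of Python: duplicate detection driven purely by the keys declared in the define.xml. The records below just mirror the AE example discussed above; this is an illustration of the expected behavior, not either validator's actual logic.

```python
from collections import Counter

def find_duplicates(records, key_vars):
    """Flag records that share the same values for the given key variables.

    `key_vars` is meant to come from the define.xml "KeySequence"
    declarations, i.e. the sponsor's keys, not the validator's own.
    """
    counts = Counter(tuple(rec.get(var) for var in key_vars) for rec in records)
    return [key for key, n in counts.items() if n > 1]

# Two "Anxiety" records for the same subject, with different start dates:
ae = [
    {"STUDYID": "S1", "USUBJID": "S1-001", "AEDECOD": "Anxiety", "AESTDTC": "2019-01-05"},
    {"STUDYID": "S1", "USUBJID": "S1-001", "AEDECOD": "Anxiety", "AESTDTC": "2019-02-10"},
]

# With AESTDTC removed from the keys, the records collide:
print(find_duplicates(ae, ["STUDYID", "USUBJID", "AEDECOD"]))  # one duplicate key
# With AESTDTC restored as a key, they are unique again:
print(find_duplicates(ae, ["STUDYID", "USUBJID", "AEDECOD", "AESTDTC"]))  # -> []
```

With keys taken solely from the define.xml, the AE test case above would correctly be reported as a duplicate, which is exactly what neither validator did.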

Uniqueness of records based on TESTCD, SPEC, METHOD …

One of the major problems of the use of (post-coordinated) CDISC-CT for laboratory tests is that there is no perfect way to automatically detect uniqueness of records without using the keys from the define.xml, and even then ... Suppose the following situation. A subject is tested for glucose in urine using a dipstick method. The result is "+2" (ordinal). As the test is positive, a quantitative test is also performed, again using a dipstick method, with a result of 2 mmol/L. Both records in SDTM have the same values for LBTESTCD ("GLUC"), LBSPEC ("URINE"), and LBMETHOD ("DIPSTICK"), and even the same value for LBDTC, as both tests were done on the same sample (LBDTC = "Date/Time of Specimen Collection"). The major difference, however, is the LOINC code: in the first case (ordinal), the LOINC code is 25428-4, whereas in the second case (quantitative), it is 22705-8.
How do both the validation tools treat this situation? In our define.xml, we set "LBLOINC" as one of the keys (using "KeySequence") to tell the system (and of course the reviewers) that this was a key in our database. Do the validation tools accept our choices for the uniqueness keys?
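Reading the sponsor's keys out of the define.xml is not hard either. The following is a minimal sketch using only the Python standard library; it assumes a simplified ODM structure (ItemRef/@KeySequence resolved via ItemDef) and ignores the define extension namespace that real define.xml files also carry. The OIDs in the fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

ODM = "{http://www.cdisc.org/ns/odm/v1.3}"

def dataset_keys(define_xml, domain):
    """Return the key variables for a domain, ordered by KeySequence."""
    root = ET.fromstring(define_xml)
    # Map ItemOID -> variable name via the ItemDef elements
    names = {d.get("OID"): d.get("Name") for d in root.iter(ODM + "ItemDef")}
    keys = []
    for igd in root.iter(ODM + "ItemGroupDef"):
        if igd.get("Name") != domain:
            continue
        for ref in igd.iter(ODM + "ItemRef"):
            seq = ref.get("KeySequence")
            if seq is not None:
                keys.append((int(seq), names.get(ref.get("ItemOID"))))
    return [name for _, name in sorted(keys)]

# A stripped-down define.xml fragment (hypothetical OIDs):
snippet = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <ItemGroupDef OID="IG.LB" Name="LB">
    <ItemRef ItemOID="IT.USUBJID" KeySequence="2"/>
    <ItemRef ItemOID="IT.STUDYID" KeySequence="1"/>
    <ItemRef ItemOID="IT.LBLOINC" KeySequence="3"/>
  </ItemGroupDef>
  <ItemDef OID="IT.STUDYID" Name="STUDYID"/>
  <ItemDef OID="IT.USUBJID" Name="USUBJID"/>
  <ItemDef OID="IT.LBLOINC" Name="LBLOINC"/>
</ODM>"""

print(dataset_keys(snippet, "LB"))  # -> ['STUDYID', 'USUBJID', 'LBLOINC']
```

With LBLOINC among the keys, the ordinal and the quantitative glucose records differ in at least one key value, and no "duplicate records" error should be thrown.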
Pinnacle21 v.3.0.0 reports a warning for one of the records: 

It seems that it overrides our own choice of the uniqueness keys defined in the define.xml (yes, we did indeed add the define.xml using the GUI) with "its own" (i.e. what Pinnacle21 thinks the keys should be).
MySEND also gives an error with a somewhat different message:

It argues that there cannot be 2 measurements with the same datetime of collection. However, for LB, LBDTC is the datetime of the sample collection (and not of the measurement). So if two measurements were performed on the same sample, these should never be marked as "duplicate records".
So also here we get a false positive error, as here too the choice of the uniqueness keys in the define.xml is fully ignored.

Units: "ORRESU when ORRES is provided" and "STRESU when STRESC is provided"

Among the most contested rules are the FDA rules "Missing value for --ORRESU, when --ORRES is provided" (SD0026) and "Missing value for --STRESU, when --STRESC is provided" (SD0043). We all know that there are so many cases where there is no unit: simple examples are "pH", and all tests that provide ordinal or narrative values. So, these rules just don't make sense.
In our test set, the hematocrit (LBTESTCD=HCT) values are provided as fractions, e.g. "0.43", without a unit. For all such records, both MySEND and Pinnacle21 state there is an issue, referencing the above-mentioned rules. Probably, both packages assume that all hematocrit values are, or have to be, reported with "%" as the unit, but I have nowhere found such a rule. Also, in both packages, the rule seems not to be applied when the value of LBSTRESC cannot be converted to a number. So, on what basis is it decided whether the rule is applied? On the value itself? That is, and remains, intransparent.
An improvement relative to the past is that no such errors are thrown anymore when the test is "pH" (LBTESTCD=PH). We all know (or should know) that there is no unit for pH, as it is the logarithm of a ratio.
Essentially, these two rules should not exist: they are nonsense, and there is currently no correct way to 100% accurately describe when a test has a unit and when not, especially not using CDISC coding systems.
The case may, however, be different when using LOINC codes.
Essentially, the LOINC code itself contains indirect information about whether a unit is expected. First of all, one of the parts of the "LOINC Name" is the "Scale". If the value for "Scale" is "Qn" (meaning "Quantitative"), there is already a good chance that there is a unit. But not always… If the value of the part "Property" is either "VFr" (volume fraction) or "MFr" (mass fraction), there might be a unit or not. LOINC also provides an "example UCUM unit" when one is available, but that does not mean that that unit must be used.
For our "hematocrit" example, the LOINC code 4544-3 ("Hematocrit [Volume Fraction] of Blood by Automated count") has an example (UCUM) unit "%", but given that the "Property" is a fraction, this does not mean that "%" must be used. Essentially, it should be possible to develop a set of rules to determine whether LBORRESU must have a value, based on the LOINC code and/or the UCUM notation. For example, "%" as a UCUM unit cannot be reduced to a combination of the 7 base units, and neither can "[pH]".
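As an illustration of what such a rule set could look like, here is a sketch in Python. The Scale/Property values in the lookup table are illustrative (in practice they would come from the LOINC database itself, and a UCUM library would be needed as well), and the classification rules are my own proposal, not anything implemented by either validator.

```python
# Illustrative LOINC "parts" (Scale and Property) for the tests discussed above
LOINC_PARTS = {
    "4544-3":  {"scale": "Qn",  "property": "VFr"},    # hematocrit: volume fraction
    "22705-8": {"scale": "Qn",  "property": "SCnc"},   # glucose in urine, quantitative
    "25428-4": {"scale": "Ord", "property": "ACnc"},   # glucose in urine, dipstick ordinal
    "2746-6":  {"scale": "Qn",  "property": "LsCnc"},  # pH: logarithmic, hence unitless
}

UNITLESS_PROPERTIES = {"LsCnc"}       # log-scale quantities such as pH
FRACTION_PROPERTIES = {"VFr", "MFr"}  # may be "%", or a bare number between 0 and 1

def unit_expectation(loinc_code):
    """Classify whether --ORRESU must, may, or must not be populated."""
    parts = LOINC_PARTS.get(loinc_code)
    if parts is None:
        return "unknown"
    if parts["scale"] != "Qn" or parts["property"] in UNITLESS_PROPERTIES:
        return "no unit expected"
    if parts["property"] in FRACTION_PROPERTIES:
        return "unit optional"
    return "unit required"

print(unit_expectation("4544-3"))   # -> unit optional
print(unit_expectation("2746-6"))   # -> no unit expected
```

Only for tests classified as "unit required" would rules like SD0026 and SD0043 then be justified; for "unit optional" and "no unit expected" tests, no error may be thrown at all.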
I did not find any indication, however, that either MySEND or Pinnacle21 ever made an attempt to develop such rules. As long as such a clear set of rules is not publicly available, it does not make sense to implement rules SD0026 and SD0043: they should simply be removed.


Conclusions

Both packages unfortunately seem to ignore the "uniqueness keys" provided by the define.xml (the "sponsor's truth") and have their own (partially intransparent) rules for what counts as "unique". One argument I heard in the past for not taking the keys from the define.xml is that "many define.xml files are not correct". That however is "the world upside down", and is like saying: "many people ignore the red traffic light, so let us remove all traffic lights".
Both packages try to implement the FDA rules SD0026 and SD0043, although these rules should not exist, as there is currently no way to correctly find out whether a test has a unit or not. Even with the help of the LOINC code this remains tricky, as the examples with mass fraction or volume fraction show, where "%" is a valid unit, but where the fraction can also be a number between 0 and 1, without a unit.

Next time we will report on a "code review" of the Pinnacle21 v.3 CLI source code (PointCross MySEND is not open source).

Prior in the series:
Part 1: Installation
Part 2: Validation features
Part 3: Hot topics 1

Wednesday, June 26, 2019

CDISC Validation: PointCross – Pinnacle21 v.3 comparison: Part 3: hot topics 1

In our third part, we make comparisons between Pinnacle21 Validator 3.0 and MySEND 1.0 for some "hot topics", i.e. topics that were highly problematic in earlier versions of the Pinnacle21 validator (for MySEND, we can't say, as there are no earlier versions).


Labels for variables and datasets

A "hot topic" coming back over and over again in the Pinnacle21 user forum are the labels for variables and datasets. The famous message "label mismatch" (35 "hits" in the forum) is very well known… One of the reasons is that, in some cases, variable labels have been published that are longer than 40 characters (the limit for SAS XPT), and that Pinnacle21 then took the liberty of defining itself what the label should be. The most famous example is the variable PESTRESC.
With the recent release of the "CDISC Library" which is (according to CDISC itself) the "CDISC truth", this issue should essentially be resolved. An overview of our test results is given below.

SDTM-IG version: 3.2
PESTRESC label according to SDTM-IG and CDISC Library: "Character Result/Finding in Std Format"
Validation result MySEND (FDA rules): "SDTM/dataset variable label mismatch" for the label "Character Result/Finding in Standard Format"
Validation result Pinnacle21 v.3.0 (FDA rules): *1
*1 Pinnacle21 v.3.0 seems to compare the variable label with the "ItemDef Description" from the define.xml when define.xml is provided. In case there is a mismatch between them, it gives an error with a clear error message.
In case no define.xml is provided, it does not give an error for the label "Character Result/Finding in Standard Format". In case the define.xml is present and the label is "Character Result/Finding in Standard Format" in both, no error is thrown.
In case the "label" is completely wrong in both the dataset and the define.xml (e.g. using "test" as the label for PESTRESC), it gives an error "SDTM/dataset variable label mismatch".

So it looks as if Pinnacle21 has made progress here, no longer throwing an error when the label for PESTRESC does not correspond to what it thinks the label should be, whereas MySEND still seems to follow what Pinnacle21 did in earlier releases.

Note that the 40-character limitation is an artificial one, due to the fact that the FDA (and PMDA) still require the completely outdated XPT format to be used. In modern times, the transport format is independent of the content standard and does not limit what the content can be. XPT is a disaster in this sense. HL7 FHIR shows how it can be done: one standard, three transport formats (XML, JSON, Turtle).

Order of variables: EPOCH in SV ("Subject Visit")

Another hot topic that pops up over and over again is the correct order of variables in a dataset, especially when "timing variables" are added to an "observation" dataset.
The correct order for timing variables is:


Searching for "wrong order" on the Pinnacle21 forum leads to 35 entries.
Just as an example, there was a complaint on the forum about the correct order of "EPOCH" in SV (Subject Visits). "EPOCH" is not described for "SV" in the SDTM-IG, but the FDA wants it anyway (omitting it leads to a validation error), so it needs to be inserted. The author of the entry put it after "VISIT" and before "SVSTDTC" (VISITDY and "TAETORD" were absent), which seems perfectly OK. He/she still got an "SD1079" error.

Unfortunately, there was no reaction from Pinnacle21 at all.

However: SV is a "special purpose" domain, not a "domain based on the 3 general observation classes", so one may wonder whether the rule is really applicable to SV anyway. There is also no other indication of what the order should be in SV.
So, as a test, we added "EPOCH" after "VISITDY" and before "SVSTDTC" in our test dataset, and looked at what the validators say about it (this was a false positive error in Pinnacle21 v.2.2.0, see e.g. …).

We re-generated the SV dataset, NOT using SAS software. The SV dataset then contained 17 records.
When using Pinnacle21 v.3.0 (using FDA rules, SDTM-IG 3.2, and XPT for the format), we did not get any error or warning regarding the order of the variables. A bit surprisingly, we did get errors that there are "null" values for STUDYID and USUBJID for record 18, even though we only have 17 records. When using Dataset-XML as the format, this error disappears.

When using MySEND (using FDA rules, SDTM-IG 3.2, and XPT format), we did get a message "Model permissible variable added into standard domain". It does not say, however, whether this is an "info", a "warning" or an "error". This message was a typical warning in Pinnacle21 v.2.2.0, leading to a lot of confusion, as one also got a warning when leaving out "EPOCH". So, whatever one did (adding EPOCH or leaving it out), one ALWAYS got a warning. This approach seems to have been abandoned in Pinnacle21 v.3.0.0, but it is still present (without a severity, however) in MySEND.

The good news is that, for SV, neither software package generates a false positive error when "EPOCH" is added in the right place.

We must however emphasize that in modern IT, the order of the variables in such datasets is fully irrelevant. Essentially, SDTM is a "view" on a database (the original database being omitted in the submission), and in modern databases, the order of the variables (columns) in such a "view" is completely irrelevant. The probable reason for this "order" requirement in SDTM is again the outdated XPT format, and the outdated tools reviewers are using at the FDA, such as the "SAS System Viewer", which is not even supported anymore by SAS itself. In our own open-source "Smart Submission Dataset Viewer", the order is of no importance: the columns can anyway be moved from one place to another (which is not possible with the SAS System Viewer).

CodeLists: TS

It is always very interesting to see how codelists are treated. Essentially, the define.xml is the "sponsor's truth", so ideally, a validator should first check whether the codelist as given in the define.xml (which is very often a subset of the one published) matches the one from CDISC, taking "extensibility" and "extended values" into account. When all that is OK, the validator should check the values in the datasets against those in the define.xml.
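This two-step check can be sketched in a few lines of Python (an illustration of the desired behavior, not either validator's actual logic):

```python
def check_codelist(values, define_terms, cdisc_terms, extensible):
    """Check dataset values against the define.xml codelist first,
    and report extended values only when the CDISC codelist is
    not extensible."""
    issues = []
    for value in values:
        if value not in define_terms:
            issues.append((value, "error: value not in define.xml codelist"))
        elif value not in cdisc_terms and not extensible:
            issues.append((value, "error: extended value, but codelist is not extensible"))
        # In the define.xml codelist, and either published by CDISC or a
        # legitimate extension: no report at all.
    return issues

# TSPARMCD-like example: "TEST" is an extended value declared in the define.xml
cdisc = {"TITLE", "SSTDTC"}           # (subset of) the published CDISC codelist
define = {"TITLE", "SSTDTC", "TEST"}  # the sponsor's codelist from the define.xml

print(check_codelist(["TITLE", "TEST"], define, cdisc, extensible=True))  # -> []
```

In this setup, a correctly declared extended value in an extensible codelist produces no report at all, exactly as argued below for the "TEST" term.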

That this does not always work very well is shown in a bug report for Pinnacle21 for the TS dataset, where it was reported that in some cases, values for TSPARMCD are not checked against the controlled terminology.
In the case of Pinnacle21, I kept getting the message "TSPARMCD value not found in 'Trial Summary Parameter Test Code' extensible codelist" for the term "TEST", but now got a warning (in v.3.0) instead of an error (in v.2.2) when "TEST" was not in the codelist at all.
For me, this means that Pinnacle21 has not stepped away from its definition of a warning as "something that may be unusual". In my opinion, when a term is correctly defined as an extended value in the define.xml, and it appears in the dataset, there should be no report of it at all.