Sunday, April 21, 2013

Alternatives for SAS Transport 5 - 20 years ago

I have been often criticized that I criticize SAS Transport 5 without proposing a better alternative.
This blog contribution is a response to that.

Many many years ago (I haven't been able to find out exactly how many years ago - If you know it, please let me know), the FDA decided for SAS Transport 5 (".xpt" format) for submission of datasets. It is not clear to me how and why they came to this decision. I discussed it with several people that were already around at that time, but their statements are not unambiguous.
It is currently however clear that this was not a good choice, but that is an easy statement many years later.

First of all, is the choice for transport format important?
Essentially, SDTM is a semantic standard: it is about content.
SAS XPT, XML, CSV and others are transport formats, you can transport any kind of data with them.
So, one could transport SDTM data using any transport format, as the two are essentially independant. Stating e.g. that HL7-v3 is a better format for transporting SDTM data as it allows integration with healthcare is nonsense! It is the same as stating that "using the same truck for transporting cows and for transporting oranges allows us to breed cows that produce orange juice".
However, we cannot expect from the FDA that it accepts SDTM (and SEND and ADaM) data in any transport format - a choice (only one, or maybe a few) must be selected.

So what could have been the alternatives 20 years ago?
First we must take into account that at that time we did not have define.xml for the metadata (XML was still in it's early stages and now well known): metadata was usually submitted as a text file or PDF.

Let us look at few alternatives ...

A. CSV
Comma-separated (or similar) format has the advantage that it is very compact. It is for example (using a vertical bar as field separator) used in HL7-v2 messages with a lot of success.
I recently generated some CSV files from SDTM XPT datasets and for example found that the LB dataset from the FDA Pilot "LZZT" study was reduced by a factor 6 in file size. "Great" you would say when you think about the "file size complaints" of the FDA. However CSV also has some major disadvantages. First of all: how to declare the variable names and labels? That can be solved by adding two extra lines at the top of the document containing them. But what about associated codelists for coded variables (not to speak about valuelists)? That is much more difficult. Of course one can add extra CSV datasets for the coded values (enumerations) too, but how can one make the link between both?
I think it can be done somewhat artificially, but it is not ideal.

B. SQL
SQL is a data definition and query language for relational databases. It is standardized as ANSI-SQL and already very long in place and very much used. In a previous blog I wrote that SDTM is not to be considered as a database, but as a "View" on a database. So instead of obtaining the database itself, the FDA has decided that we must submit "Views" on the database. This has many disadvantages, as in a "View" it is much more difficult to detect data inconsistencies, and it is difficult (if not impossible) to 100% reconstruct the original database. The choice for the "View" surely was surely related to the choice for SAS Transport 5.
So why not submit the database instead of the view on the database? It would have many advantages, as well for the submittors as for the FDA itself.
How can we exchange a database in a vendor-neutral way (one of the requirements)? The answer is "ANSI SQL". The design and metadata of the database could be exchanged using ANSI-SQL "Create Table" statements, with primary key and foreign keys and everything, so also defining the relations between the tables. The actual data could then be exchanged using "SQL Insert" statements.
What would be the advantages? First of all we see that SDTM is in each new version containing new and more derived variables. For example, RFXSTDTC (Date/Time of first Study Treatment) is also already present in the EX (Exposure) dataset. RFXSTDTC was only added to DM because the reviewers are unable to combine both datasets (they just use the viewer which is not "understanding" SDTM). Using ANSI-SQL we could (and we MUST - database "normal forms" require this) get rid of redundant information, which would mean that our data is much more consistent.
Once the FDA would have received the "SQL Create Table" and "SQL Insert" datasets, it could easily import them into a database (the first step in using a data warehouse by the way), with consistencies being checked over the foreign keys. They could then create their OWN views on the data (ANSI-SQL "Create View", such as views very similar to the current SDTM datasets.
This would mean that much less data would need to be submitted (I guess a 50% reduction or more) and that the FDA could easily create any possible views on the data, something they currently cannot do.

Remark that it is still possible to keep the "vertical" table structure as in the SDTM (in the database world, this is called the "entity-attribute-value" (EAV) model. It would also allow to keep the supplemental qualifiers within their parent dataset (a major problem in the current SDTM implementation).

C. Any other?
Would you have had another choice for a vendor-neutral transport format for dataset submission 20 years ago? Please let me know or write a comment.


My personal choice:
If I would have been the FDA twenty (or so) years ago, I would have chosen for ANSI-SQL as a vendor-neutral data submission format. But I am not the FDA, and I was not working in clinical research yet at that time. I do not know whether it has ever been considered (if you know, let me know).

So why not ANSI-SQL as the new transport format today?
Data exchange between computers and especially databases is usually not done using SQL nowadays. The reason is that we have a better alternative, namely XML. XML is not limited to representing flat, two-dimensional data, but has much more features. In a world "that is not flat", it can be used for representing many-dimensional data (though SDTM currently remains "flat" - will it after SHARE?) and it can also be used to e.g. visualize data (using stylesheets) something that cannot be done using the SQL format.

Your comments are welcome, as usual!


1 comment:

  1. I suspect these deciosns have their roots in the world 30 & 40 years ago not 20 years ago (ie the 90's). The alternatives at the time were binary and mostly not cross platform and ISAM files might have been attractive. For SAS at the time transport meant across platforms (VMS, P1me, DG Nova, MVS, VM/CMS, Unix, HP 3000) not to other products. I am not sure SQL was avaialable on all those platforms, neither did all DBs support importing tables very well.
    As I heard it the FDA forced SAS to publish the transport file spec so they could mandate it for all imports. It's limitations became more obvious as its replacement format got more capable and as SAS relaxed rules about variable length (in V8, around 20 years ago).

    ReplyDelete