Sunday, November 12, 2017

Excel and other flat file data to CDISC ODM

I have been so naive ...

After 15 years of CDISC ODM standard, I believed that its use is well spread and every EDC system supports in (at least have the capability to export as ODM). The enemy is another ...

I recently was in the clinical research center of a large hospital (no, I won't say where is was). On the wall there was a list with all the clinical studies that were running (over 100 studies). Some of them are run over the whole country. The table also showed a column with "EDC system". To my surprise, in about 60% of the studies, the EDC system listed was ... MS Excel. Even one of the (national) largest studies, with thousands of patients was using Excel as "the EDC system".

I recently had a potential customer for my SDTM-ETL mapping software. I gave two web conference demonstrations, as the customer has offices in several parts of the world. One of the questions I received was "can it read data from Excel worksheets?". I was astonished ...

The  SDTM-ETL mapping software uses CDISC ODM as input. Metadata is read from the "Study" element of an ODM file, allowing to use drag-and-drop for many of the SDTM variables. The mappings are stored in a define.xml file. Upon execution, the clinical data is read from an ODM file with clinical data.

When Excel is the data source however, what is the metadata? At the best, one has one or a few rows containing a label or an ID.

This all made me rethink ...

So I developed a new software to allow to transform exports from Excel as CSV files, but also other character delimited "flat" files, to CDISC ODM. I gave the software the name "ODM Generator".
With this software, clinical data centers that still use Excel or get data in any other flat file format (often the case for laboratory data) can ow easily transform these data into CDISC ODM and then combine these with metadata and clinical data from other sources.

Here is a snapshot of the main window:

In this case, the user already has selected a "flat" data file, and the system automatically recognizes that the vertical bar "|" was used as a delimiter. You can then set whether string are embedded in single or double quotes (usually the software will detect it itself - this is especially important when the delimiter is a comma). Also, if the first line of your file contains the headers of the columns, one should check the checkbox "First line contains column names". This will not only remove the first line from the loaded data, but later also propose the column names as the names for the ODM items.
In most cases, you will want to view the data as a table to see whether the import has been correctly. In order to do so, click the button "Show file as table". A new window than opens and shows all the data as a table:

Each column simply obtains a simple identifier: "F1", "F2", and so on. In case the first line contains the column headers, these are shown as a tooltip on the column:

 One can then start generating the ODM metadata by clicking the button "Start generating ODM metadata". The system will then analyze the data in the dataset and make a first proposal for the data types, the maximal length and other "ODM Item" metadata:

In case the first line contains the column names, the result is slightly different:

i.e. the column names will be proposed the become the "Item Name". One can now start completing the proposed metadata that will be used to generate an ODM.
Using the "validate" button (near the bottom) is always a good idea as some of the fields are mandatory or optional depending of the datatype:

A result of such a validation may for example be:

For fields 12, 13, and 18, a maximal length of 0 was set. In this case, this was due to the complete absence of data for this field in the source data.

In the next step, we want to assign the field containing the subject ID, and the visits, the forms, and the groups of items.
Based on the metadata, we see that the field "Subject ID or Identifier" (F12) is a very good candidate. We however also see (from the table with the actual data) that it is never populated. So we will use field 11 (Screen ID) as the subject identifier.
In order to do so, we use the dropdown for "Field for Subject ID", and select "F11":

The second level we need is the visit. Here field 20 "Visit ID or number" is a very good candidate:

and select "F20" from the dropdown for "Field for Visit (StudyEvent)":

We also need to define the ID of the form ("Field for form"). However, we don't find anything. Obviously, we might want that all these laboratory data go into a single form. So, we don't select anything for the dropdown "Field for Form". If no selection in one of the dropdowns on top of the window has been made, the system will always assume that there is a single one, and generate it itself. So the top of our window looks like:

In many cases, we will not want that all data goes into the ODM. For example, we could limit the data that go into the ODM to those that we later want to map to SDTM using the SDTM-ETL software. So we only select those fields that are of interest for further processing. For example:

We include fields 5 (Study ID), 8 (Site ID), 9 (Investigator ID), 10 (Investigator Name), 15 (Subject Sex) and a good number of other ones (not visible in the above image). Remark that in most cases, we will not select fields for which there is no data. In this case, we can easily detect them as the suggest value for "Length" is 0. This will not always be the case, as we might have a file where some data is missing, whereas we expect it in later files.
Also remark that field 11 has disappeared from the list, as we selected it for the "Field for Subject ID" and thus is included automatically.
When we now do a validation (using the "Validate" button) we see that only the selected fields are validated. If we still find validation issues, we might want to change our selection.

It is now time to save our mappings!
We do so using the button "Save mappings" which we find near the bottom of the window:

A file chooser will pop up, with a file chooser, allowing you to select a file and location. The generated file is a simple text file, with the content looking like:

When we later want to continue or work, or transform another file with similar data, we can simply load this mappings file from within the main window using the "Load Prior Mappings" button:

This mappings file can also later be used when doing batch processing of files with data (software currently in development).

In our "mappings" window, we can now start generating the ODM metadata and data using the button "Export as ODM":

A dialog is displayed:

You might want to either generate only the metadata (this will generate an ODM "Study" file), clinical data only (this will generate an ODM "ClinicalData" file), or a file with as well metadata (Study definition) as well as clinical data.
The second option will usually be used when you already generated the metadata before and you have loaded earlier developed mappings.

If we want to have our ODM study metadata and the clinical data in separate files, we first select "Metadata only", do an export, then select "Clinical data only" and then do a separate export for the clinical data only.
In our case, we want to have as well metadata as clinical data, so we select "Metadata + clinical data". After clicking the "OK" button, a new dialog is displayed asking us to provide a study identifier (in ODM: StudyOID). We use "MyStudy":


After "OK" again, a new dialog is displayed asking us for the study name, description and protocol title. These will go into the ODM elements "StudyName", "StudyDescription" and "ProtocolName": 

Again after "OK", a file selector is displayed allowing us to define where the file needs to be created and with which file name.
The ODM file that is then generated looks like:

with 3 "StudyEvents" generated (from the data itself), a single "default" form (as we selected nothing for the dropdown for "Form"), and a single "default" ItemGroup:

For each of the selected fields, we also see the metadata generated:

and with clinical data having:

When collapsing each of the "SubjectData", we see that clinical data for 3 subjects (taken from the source file) have been generated:

If we have saved the study metadata as a separate file, we could start refining the study metadata e.g. using the "ODM Study Designer" software, add the "question" for each data point definition (optional), adding codelists and their references (e.g. for "Subject Sex"), and even add SDTM mapping information using the "Alias" mechanism. We can validate our ODM metadata and data using the (for CDISC members free) "ODM Checker", and generate and execute mappings to SDTM using the SDTM-ETL software.

In near future, we will extend the features of the "ODM Generator". A few ideas we already have:
  • batch execution using saved mappings
  • different ItemGroups. For example for lab files, this can be based on the "test category"
  • further refinement of the automatically generated proposals in the metadata
With this new product, I hope that the 60% of the studies in the clinical research center of the large hospital I recently visited and that are using MS Excel as "the EDC system", will now be easily transformed to CDISC ODM, the worldwide standard for exchange of clinical research data and metadata.

No comments:

Post a Comment