A Hundred Formats, No Translation

Clinical Trial Data Uses 100+ Incompatible Formats Across Registries, Sponsors, and Regulators

health, digital

Problem Statement

Clinical trial data is generated by sponsors, contract research organizations (CROs), academic medical centers, and regulatory agencies in formats that cannot be combined without extensive manual harmonization. CDISC standards (SDTM, ADaM, CDASH) define common data models, but adoption is uneven: FDA and Japan's PMDA require CDISC for submissions, while EMA and most other national regulators accept proprietary formats. Implementation also varies within CDISC itself. A 2023 TransCelerate audit found that SDTM datasets from different sponsors used the same variable names for different data elements in 23% of cases. Electronic health record (EHR) data used for real-world evidence studies arrives in FHIR, HL7 v2, CDA, or proprietary formats, none of which map cleanly to CDISC trial data models. As a result, combining data across trials for meta-analysis, safety signal detection, or regulatory review requires months of manual harmonization per study.
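The variable-name collisions described above can be made concrete with a small sketch. It assumes each sponsor's variable-level metadata is available as a dictionary; the sponsor names, variable names, labels, and units below are invented for illustration and do not come from any real dataset.

```python
from collections import defaultdict

# Hypothetical variable-level metadata per sponsor (all values invented).
SPONSOR_METADATA = {
    "sponsor_a": {
        "HBA1C": {"label": "Hemoglobin A1c", "unit": "%"},
        "VISITNUM": {"label": "Visit Number", "unit": None},
    },
    "sponsor_b": {
        "HBA1C": {"label": "Hemoglobin A1c", "unit": "mmol/mol"},
        "VISITNUM": {"label": "Visit Number", "unit": None},
    },
}

def find_conflicts(metadata):
    """Return variables that share a name but differ in label or unit."""
    seen = defaultdict(dict)  # variable name -> {(label, unit): [sponsors]}
    for sponsor, variables in metadata.items():
        for name, definition in variables.items():
            key = (definition["label"], definition["unit"])
            seen[name].setdefault(key, []).append(sponsor)
    # A variable conflicts when more than one distinct definition exists.
    return {name: defs for name, defs in seen.items() if len(defs) > 1}

conflicts = find_conflicts(SPONSOR_METADATA)
for name, definitions in conflicts.items():
    print(name, definitions)
```

A check like this only catches differences that are written down in metadata. Differences in measurement protocol or interpretation that never reach the metadata would still need manual review.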

Why This Matters

Drug development costs $1–2 billion per approved compound, and a substantial fraction of that cost is data management: reconciling formats, cleaning variables, and mapping terminologies across sites, sponsors, and regulators. The inability to aggregate clinical trial data efficiently also delays safety signal detection, because adverse events that are visible only when data is pooled across sponsors may go undetected for years. The FDA's Real-World Evidence program depends on integrating trial data with EHR data, but the format gap between CDISC-structured trial data and HL7/FHIR-structured clinical data makes this integration a major bottleneck. Patients with rare diseases are particularly affected: trial populations are small, so combining data across all available studies is essential for statistical power, yet format incompatibility makes pooled analysis prohibitively expensive.

What’s Been Tried

CDISC standards have been in development since 1997 and adopted by FDA since 2004, but implementation inconsistency persists because the standards provide vocabulary without enforcing usage rules, so sponsors interpret controlled terminology differently. The OMOP Common Data Model, maintained by the Observational Health Data Sciences and Informatics (OHDSI) community, addresses EHR-to-research conversion but creates a parallel ecosystem that doesn't interoperate with CDISC. ClinicalTrials.gov collects trial metadata but not the underlying data. Attempts at universal patient identifiers (to link a patient's trial data with their EHR data) have been repeatedly blocked by privacy concerns and political opposition. The EU's European Health Data Space aims to enable cross-border clinical data exchange but relies on member states adopting compatible implementations, which is the same voluntary-adoption problem that CDISC faces.

What Would Unlock Progress

Three capabilities would unlock progress. The first is automated semantic mapping tools that translate between CDISC, OMOP, and FHIR representations of the same clinical concepts, so that parties do not all have to adopt a single standard and translation happens at the boundaries instead. The second is standardized variable-level metadata (including units, coding systems, and measurement protocols) embedded in the data files rather than kept in external documentation, so that format translation can be automated. The third is federated analysis platforms that query data in place without requiring centralized aggregation: each site maintains its own format, and the analysis query is translated at each site boundary.
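The following is a minimal sketch of how embedded unit metadata and federated querying could fit together, not a description of any existing platform. It assumes two hypothetical sites that store HbA1c under different field names and units; the site layouts and function names are invented. The only external fact used is the IFCC-to-NGSP master equation for HbA1c, NGSP (%) = 0.09148 × IFCC (mmol/mol) + 2.152.

```python
# Each site keeps its own layout; field names and values are invented.
SITE_A = {
    "meta": {"field": "hba1c", "unit": "%"},  # NGSP percent
    "rows": [{"hba1c": 6.5}, {"hba1c": 7.0}],
}
SITE_B = {
    "meta": {"field": "HBA1C_IFCC", "unit": "mmol/mol"},  # IFCC units
    "rows": [{"HBA1C_IFCC": 48.0}, {"HBA1C_IFCC": 53.0}],
}

def to_ngsp_percent(value, unit):
    """Convert an HbA1c value to NGSP percent using the embedded unit."""
    if unit == "%":
        return value
    if unit == "mmol/mol":
        # IFCC-to-NGSP master equation.
        return 0.09148 * value + 2.152
    raise ValueError(f"no conversion defined for unit {unit!r}")

def local_summary(site):
    """Intended to run at the site: translate locally, return aggregates only."""
    field = site["meta"]["field"]
    unit = site["meta"]["unit"]
    values = [to_ngsp_percent(row[field], unit) for row in site["rows"]]
    return {"n": len(values), "total": sum(values)}

def pooled_mean(sites):
    """Coordinator step: combine site-level aggregates into a pooled mean.

    In this sketch local_summary is called directly; in a real deployment
    it would be a remote call so that row-level data never leaves the site.
    """
    summaries = [local_summary(site) for site in sites]
    n = sum(s["n"] for s in summaries)
    return sum(s["total"] for s in summaries) / n

mean_hba1c = pooled_mean([SITE_A, SITE_B])
print(round(mean_hba1c, 3))
```

The design choice illustrated here is that the coordinator never sees field names or units. Each site's metadata drives its own translation, and an unknown unit fails loudly rather than being pooled silently.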

Entry Points for Student Teams

A team could select a specific clinical domain (e.g., oncology or diabetes) and map how a single data element (e.g., tumor response, HbA1c measurement) is represented across CDISC SDTM, OMOP CDM, and FHIR, documenting where semantic differences prevent automated translation and proposing a mapping specification. A data engineering team could prototype an automated CDISC-to-FHIR translation layer for a subset of common data elements using existing mapping tables and test it against publicly available clinical trial datasets (from the YODA Project or Vivli). Relevant disciplines: biomedical informatics, clinical research, data engineering, health policy.
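As a starting point for the translation-layer prototype, a single-element mapping might look like the sketch below. It assumes an SDTM LB (laboratory) record held as a plain dictionary and emits a FHIR Observation as a dictionary. The one-entry mapping table, the function name, the example record, and the choice of `"final"` as status are illustrative assumptions. LOINC 4548-4 is the code for hemoglobin A1c in blood.

```python
# Hypothetical mapping table: SDTM lab test code -> (LOINC code, display).
# Only one entry is shown; a real table would be curated and reviewed.
LBTESTCD_TO_LOINC = {
    "HBA1C": ("4548-4", "Hemoglobin A1c/Hemoglobin.total in Blood"),
}

def lb_to_observation(record):
    """Translate one SDTM LB record (a dict) into a FHIR Observation dict."""
    testcd = record["LBTESTCD"]
    if testcd not in LBTESTCD_TO_LOINC:
        # Unmapped codes are exactly the cases a team would need to document.
        raise KeyError(f"no LOINC mapping for LBTESTCD {testcd!r}")
    loinc_code, display = LBTESTCD_TO_LOINC[testcd]
    return {
        "resourceType": "Observation",
        "status": "final",  # assumption: trial results are treated as final
        "code": {
            "coding": [
                {"system": "http://loinc.org", "code": loinc_code, "display": display}
            ]
        },
        "subject": {"identifier": {"value": record["USUBJID"]}},
        "effectiveDateTime": record["LBDTC"],
        # The unit string is copied as-is; mapping it to a UCUM code is omitted.
        "valueQuantity": {"value": record["LBSTRESN"], "unit": record["LBSTRESU"]},
    }

# Invented example record using standard SDTM LB variable names.
example = {
    "STUDYID": "STUDY01",
    "USUBJID": "STUDY01-001",
    "LBTESTCD": "HBA1C",
    "LBSTRESN": 6.8,
    "LBSTRESU": "%",
    "LBDTC": "2023-05-14",
}
observation = lb_to_observation(example)
print(observation["code"]["coding"][0]["code"], observation["valueQuantity"])
```

Even this small case surfaces open questions a mapping specification would have to settle, such as how to express units as coded values and how to carry the study identifier, which this sketch drops.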

Genome Tags

Constraint

data, regulatory

Domain

health, digital

Scale

global

Failure

ignored-context, adoption-barrier

Breakthrough

data-integration, algorithm

Stakeholders

multi-institution

Temporal

static

Tractability

proof-of-concept

Source Notes

Targets C7 (Data Interoperability). Matches C7's structural criterion: data exists in separate organizational systems (sponsors, CROs, regulators, hospitals), each format reflects operational needs, no single organization can mandate adoption, and the absence of interoperability prevents system-level capabilities (safety signal detection, real-world evidence, rare disease data pooling). The HL7 FHIR model — C7's primary transfer candidate — is directly relevant here as one of the competing formats. Distinct from `health-insulin-delivery-interoperability` (which is about medical device communication protocols, not clinical trial data formats) and `health-device-recall-udi-tracking` (which is about device tracking adoption, not data format interoperability).