Common Data Model Overview

Updated 6 December 2025
  • Common Data Model (CDM) is a formal, shared schema that standardizes domain-specific entities, relationships, and process semantics for clear data integration.
  • CDMs define core entities, attributes, and mappings that enable efficient ETL workflows, semantic interoperability, and reproducible analytics in diverse fields.
  • By aligning heterogeneous data sources into canonical formats, CDMs ensure robust lineage tracking, improved query performance, and scalable multi-institutional workflows.

A Common Data Model (CDM) is a formal, shared schema for representing domain-specific entities, relationships, and attributes in a consistent, machine-readable fashion. CDMs serve as the backbone for semantic interoperability and data integration across heterogeneous sources, supporting standardized analytics, reproducible research, and scalable multi-institutional workflows. Prominent examples include the OMOP CDM in health informatics and the ISDA Common Domain Model in post-trade finance. These models encode not only data structures but, in advanced instances, the domain's process semantics, guaranteeing unambiguous data exchange, accurate lineage tracking, and consistent process automation.

1. Conceptual Foundations and Representational Scope

A CDM prescribes a canonical representation of real-world objects, relationships, and events within a domain, making explicit both data semantics and—where required—process logic.

  • In biomedical informatics, as instantiated in the OMOP CDM, schemas formalize persons, encounters, diagnoses, procedures, measurements, drug exposures, and unstructured observations, with each entity mapped to standardized terminologies (e.g., SNOMED CT, RxNorm, LOINC). Layered parent–child relationships mirror the hierarchical structure of clinical events (Person → Encounter → Document → Concept) (Liu et al., 2019).
  • In financial services, as in the ISDA CDM, structure and workflow are unified: every TradableProduct (e.g., InterestRateSwap) is recursively defined by its economic terms and legs; lifecycle events (Execution, Reset, Settlement) are encoded as composable primitives acting on explicit before–after TradeState pairs (Bakshi et al., 2021, Clack, 2017).

CDMs thus anchor both data and process standardization. They are typically technology-agnostic abstractions, designed to be mapped onto diverse physical storage (relational, document, event-sourced) and execution environments.

2. Schema Architecture and Entity Design

A CDM schema formalizes core entities, data types, allowed values, and relationships. The design prioritizes unambiguous mapping from heterogeneous source systems while preserving extensibility for domain evolution.

OMOP CDM Example (Desmond et al., 12 Nov 2025, Liu et al., 2019):

  • Primary Entity Tables (v5.0.2): Person, Visit Occurrence, Condition Occurrence, Procedure Occurrence, Drug Exposure, Measurement, Observation.
  • Unstructured Data: The Observation table encapsulates NLP-extracted concepts, contextual modifiers (negation, certainty), and raw text segment references.
  • Vocabulary Standardization: All clinical events reference canonical codes (SNOMED CT, RxNorm, LOINC) mapped via UMLS CUIs and ATHENA.
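As a rough illustration of how a clinical event row references a standardized vocabulary, the following Python sketch models two OMOP-style records. The field names follow the public OMOP CDM table definitions, but the lookup table, the concept ID values, and the helper `to_condition` are hypothetical placeholders, not real ATHENA content.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical (vocabulary, source code) -> standard concept_id lookup.
# The IDs are placeholders for illustration, not real ATHENA identifiers.
SOURCE_TO_CONCEPT = {
    ("ICD10CM", "E11.9"): 1001,
    ("LOCAL", "DM2"): 1001,
}


@dataclass(frozen=True)
class Person:
    person_id: int
    year_of_birth: int


@dataclass(frozen=True)
class ConditionOccurrence:
    condition_occurrence_id: int
    person_id: int                 # foreign key to Person
    condition_concept_id: int      # standardized concept
    condition_start_date: date
    condition_source_value: str    # original code kept for traceability


def to_condition(row_id, person, vocabulary, code, start):
    """Map one source diagnosis to a standardized row (hypothetical helper)."""
    # OMOP uses concept_id 0 to mean "no matching concept".
    concept_id = SOURCE_TO_CONCEPT.get((vocabulary, code), 0)
    return ConditionOccurrence(row_id, person.person_id, concept_id, start, code)
```

Two different source codings resolve to the same standard concept here, which is the property that lets one query run unchanged across sites.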

Custom Domain-Specific CDM Example (Wegner et al., 2021):

  • Attribute: Each CDM variable (e.g., AGE, SEX, RW_MMSTOT) is an explicit object, with fields for name, datatype, unit, permissible value range, and optional ontology URI.
  • Entity-Attribute-Value Model: DataPoint records (pid, attribute, value, timestamp) ensure extensibility for novel instruments and facilitate partial mapping from disparate cohorts.
  • Ontology Alignment: CDM Attributes can be linked to OMOP and FHIR definitions, supporting export or integration with global standards.
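The attribute and entity-attribute-value design above can be sketched in a few lines of Python. This is a minimal sketch, not the Data Steward Tool's actual implementation: the class layout, the `validate` method, the `make_datapoint` helper, and the example unit and range for `AGE` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple, Union

Value = Union[int, float, str]


@dataclass(frozen=True)
class Attribute:
    name: str
    datatype: type
    unit: Optional[str] = None
    value_range: Optional[Tuple[float, float]] = None  # inclusive bounds
    ontology_uri: Optional[str] = None

    def validate(self, value) -> bool:
        """Check the datatype and, if defined, the permissible range."""
        if not isinstance(value, self.datatype):
            return False
        if self.value_range is not None:
            low, high = self.value_range
            return low <= value <= high
        return True


@dataclass(frozen=True)
class DataPoint:
    pid: str
    attribute: Attribute
    value: Value
    timestamp: datetime


# Example attributes; the unit and range are illustrative assumptions.
AGE = Attribute("AGE", int, unit="years", value_range=(0, 120))
SEX = Attribute("SEX", str)


def make_datapoint(pid, attribute, value, timestamp):
    """Create a DataPoint only if the value satisfies the attribute."""
    if not attribute.validate(value):
        raise ValueError(f"{value!r} is not valid for {attribute.name}")
    return DataPoint(pid, attribute, value, timestamp)
```

Because every observation is a row rather than a column, a new instrument only requires a new `Attribute` object, not a schema migration.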

ISDA CDM Example (Clack, 2017, Bakshi et al., 2021, Nair et al., 2020):

  • Algebraic Data Types: Key types include Event, Operation (Before/After event list pairs), TradeState, Party, Amount, and EconomicsReference.
  • Operations and State Transitions: Each Operation transitions contract state according to a precise, audit-ready function, with lineage and timestamp reconstruction for every datum.
  • Process semantics are not adjunct—they are embedded at the model level, guaranteeing that every participant executes and records events identically.
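The algebraic-data-type style described above can be approximated in Python with frozen dataclasses and a union type. The ISDA CDM itself is not written in Python, so the sketch below only mirrors the idea. The field choices, the status labels, and the `settle` helper are hypothetical, and `Operation` is simplified to a single before/after state pair.

```python
from dataclasses import dataclass, replace
from typing import Union


@dataclass(frozen=True)
class Party:
    name: str


@dataclass(frozen=True)
class Amount:
    value: float
    currency: str


@dataclass(frozen=True)
class TradeState:
    buyer: Party
    seller: Party
    notional: Amount
    status: str  # hypothetical labels such as "executed" or "settled"


# A sum type: an Event is exactly one of the listed variants.
@dataclass(frozen=True)
class Execution:
    pass


@dataclass(frozen=True)
class Settlement:
    paid: Amount


Event = Union[Execution, Settlement]


@dataclass(frozen=True)
class Operation:
    event: Event
    before: TradeState
    after: TradeState


def settle(state: TradeState, paid: Amount) -> Operation:
    """Build the operation that records a settlement (illustrative only)."""
    after = replace(state, status="settled")  # new value, input untouched
    return Operation(Settlement(paid), before=state, after=after)
```

Immutability is the point of the design choice: the before state is never overwritten, so each operation is its own audit record.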

3. Data Integration, ETL, and Semantic Mapping

Transformation of diverse, institution-specific data into a CDM involves ETL workflows with semantic mapping, unit normalization, and often value transformation.

Schema-Agnostic ETL Example (OMOP) (Desmond et al., 12 Nov 2025):

  • YAML-driven mappings: Each OMOP table is produced by config files specifying source keys, column logic (Column, LookupMapping, StaticMapping, Func), and—when needed—document flattening rules for semi-structured sources (MongoDB).
  • Provenance Tracking: A central mapping table records all source→CDM key relationships, timestamps, and processed status, supporting full traceability.
  • Incremental Logic: Instead of truncating, ETL computes $\Delta_t = S_t \setminus M_{t-1}$ (new/changed source PKs since last successful run), ensuring efficiency and reproducibility at scale (case: 2.7M patients, 27M encounters, 54M measurements processed; 97% quality dashboard pass rate).
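The incremental step and the provenance mapping table can be sketched together as a set difference over primary keys. This is a minimal sketch under stated assumptions, not the cited framework's code: the function names, the in-memory dictionary standing in for the mapping table, and its field names are hypothetical. It detects new keys only. Catching changed rows would additionally need a version or hash in the key.

```python
from datetime import datetime, timezone


def compute_delta(source_keys, mapping_table):
    """Return S_t minus M_{t-1}: source keys with no processed mapping yet."""
    processed = {pk for pk, rec in mapping_table.items() if rec["processed"]}
    return set(source_keys) - processed


def run_increment(source_rows, mapping_table, next_cdm_id):
    """Load only the delta and record source -> CDM key provenance."""
    delta = compute_delta(source_rows.keys(), mapping_table)
    for pk in sorted(delta):  # sorted for a reproducible ID assignment
        mapping_table[pk] = {
            "cdm_id": next_cdm_id,
            "loaded_at": datetime.now(timezone.utc),
            "processed": True,
        }
        next_cdm_id += 1
    return delta, next_cdm_id
```

Running the function twice over the same source yields an empty delta on the second run, which is the property that makes reruns cheap.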

Semantic Mapping and Transformation Example (Wegner et al., 2021):

  • Each new source variable is matched (exact/fuzzy) to a CDM Attribute, documented via DataModelAttributeMapping (source→target, optional value transform $\tau_{x\to\alpha(x)}$).
  • Integration defined as $f(p, x, v) = (p, \alpha(x), \tau_{x\to\alpha(x)}(v))$, with domain/range constraints enforced and unit normalization ensured where applicable.
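Read as code, the integration function takes a participant identifier, a source variable name, and a value, renames the variable via the attribute mapping, and applies the value transform. The sketch below assumes this reading. The source variable names, the months-to-years transform, and the `suggest_attribute` helper (which uses the standard library's `difflib` for the fuzzy step) are hypothetical illustrations, not the cited tool's matching logic.

```python
import difflib
from dataclasses import dataclass
from typing import Callable, Optional

CDM_ATTRIBUTES = ["AGE", "SEX", "RW_MMSTOT"]


@dataclass(frozen=True)
class DataModelAttributeMapping:
    source: str
    target: str                                   # alpha(x)
    transform: Callable = lambda value: value     # tau, identity by default


def suggest_attribute(source_name: str) -> Optional[str]:
    """Exact match first, then the closest fuzzy match (or None)."""
    name = source_name.upper()
    if name in CDM_ATTRIBUTES:
        return name
    close = difflib.get_close_matches(name, CDM_ATTRIBUTES, n=1, cutoff=0.6)
    return close[0] if close else None


# Hypothetical reviewed mappings for one cohort.
MAPPINGS = {
    "age_months": DataModelAttributeMapping("age_months", "AGE", lambda v: v // 12),
    "mmse_total": DataModelAttributeMapping("mmse_total", "RW_MMSTOT"),
}


def integrate(p, x, v):
    """f(p, x, v) = (p, alpha(x), tau(v))."""
    mapping = MAPPINGS[x]
    return (p, mapping.target, mapping.transform(v))
```

In this sketch the fuzzy suggestion only proposes a target. The mapping that is actually applied is the reviewed entry in `MAPPINGS`.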

4. Process Modeling and Event Semantics

Advanced CDMs further encode business process or clinical workflow semantics to eliminate ambiguity in system behavior, lifecycle management, and auditability.

ISDA CDM Process Semantics (Clack, 2017, Bakshi et al., 2021, Nair et al., 2020):

  • Events: Modeled with exacting state transitions (e.g., $\mathit{Reset}(\mathcal{S}, t)$, $\mathit{Settle}(\mathcal{S}, t)$) embedded directly in the CDM definition.
  • Operation as Before/After Event Pair: Feeds the deterministic function $\text{statetransition}: (\text{Operation}_t, \text{state}_t) \to \text{state}_{t+1}$.
  • Lineage and Versioning: All numeric values support provenance trees (Orig, Derived, timestamp), and state updates propagate their lineage, supporting full reconstructability and audit.
  • Consistency: Centralized implementations trivially ensure a single-versioned truth, whereas distributed deployments (e.g., DLT) must address CAP trade-offs. In either case, the CDM itself exposes all semantics required for strong or eventual consistency protocols.
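A deterministic state transition with value-level lineage might look like the following Python sketch. It is an illustration under assumptions rather than the ISDA CDM's definition: the `Orig` and `Derived` node layout, the `ResetOperation` type, the single-field `TradeState`, and the "observed rate plus spread" rule are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Orig:
    value: float
    source: str
    timestamp: datetime


@dataclass(frozen=True)
class Derived:
    value: float
    rule: str
    inputs: Tuple["Tracked", ...]   # child nodes of the provenance tree
    timestamp: datetime


Tracked = Union[Orig, Derived]


def origins(node: Tracked) -> List[Orig]:
    """Walk the provenance tree down to its original inputs."""
    if isinstance(node, Orig):
        return [node]
    found: List[Orig] = []
    for child in node.inputs:
        found.extend(origins(child))
    return found


@dataclass(frozen=True)
class TradeState:
    rate: Tracked


@dataclass(frozen=True)
class ResetOperation:
    observed: Orig
    spread: Orig
    at: datetime


def state_transition(op: ResetOperation, state: TradeState) -> TradeState:
    """Pure function: (operation_t, state_t) -> state_{t+1}."""
    new_rate = Derived(
        value=op.observed.value + op.spread.value,
        rule="observed + spread",
        inputs=(op.observed, op.spread),
        timestamp=op.at,
    )
    return TradeState(rate=new_rate)
```

Because the function has no side effects, two parties applying the same operation to the same state obtain equal results, and any derived number can be traced back through `origins`.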

Biomedical Example (Cohort Retrieval) (Liu et al., 2019):

  • Clinical event history is modeled via parent–child associations (Person, Encounter, Document, ExtractedConcept), permitting both document-level and patient-level querying.
  • Unified index supports concept-based (CDM code) and full-text IR using BM25 and hierarchical document scoring, enabling direct evaluation of complex cohort definitions over integrated structured/unstructured data.
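To make the document-to-patient roll-up concrete, the sketch below scores documents with a textbook BM25 formula and then aggregates to patients by taking each patient's best document. The aggregation rule, the whitespace tokenizer, the parameter values, and the toy corpus are assumptions for illustration. The cited system's actual hierarchical scoring may differ.

```python
import math
from collections import Counter, defaultdict

K1, B = 1.2, 0.75  # commonly used BM25 parameter defaults

# Toy corpus (invented): document id -> text, document id -> patient id.
DOCS = {
    "d1": "patient denies chest pain",
    "d2": "type 2 diabetes mellitus on metformin",
    "d3": "hypertension controlled",
    "d4": "diabetes screening negative",
}
DOC_TO_PATIENT = {"d1": "p1", "d2": "p1", "d3": "p2", "d4": "p3"}


def bm25_scores(query, docs):
    """Return {doc_id: BM25 score} for a whitespace-tokenized query."""
    tokens = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokens)
    avgdl = sum(len(t) for t in tokens.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for t in tokens.values() for term in set(t))
    scores = {}
    for d, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + K1 * (1 - B + B * len(toks) / avgdl)
            score += idf * tf[term] * (K1 + 1) / norm
        scores[d] = score
    return scores


def patient_scores(doc_scores, doc_to_patient):
    """Roll document scores up to patients (assumed rule: best document)."""
    best = defaultdict(float)
    for d, s in doc_scores.items():
        p = doc_to_patient[d]
        best[p] = max(best[p], s)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `patient_scores(bm25_scores("diabetes metformin", DOCS), DOC_TO_PATIENT)` ranks the patient whose note mentions both terms first. Keeping the document-level scores around is what permits both document-level and patient-level querying over the same index.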

5. Interoperability, Extensibility, and Real-World Impact

The adoption of a CDM is primarily motivated by requirements for cross-system interoperability, data/statistical harmonization, and computational reproducibility. Evidence from multiple domains demonstrates substantial gains.

  • Biomedical Data Integration and Research Utility: The OMOP CDM has become the de facto standard for multi-institutional observational health studies, supporting domain-spanning analytics, standardized cohort definition, and reproducible pipelines. The alignment of both structured EHR fact tables and unstructured NLP-extracted concepts onto a unified CDM hierarchy materially boosts performance: cohort retrieval P@5 improved from 0.54 (structured-only) and 0.74 (unstructured-only) to 0.90 for the integrated OMOP CDM approach in CREATE, with downstream portability across OMOP-compliant sites (Liu et al., 2019). The Data Steward Tool, leveraging a disease-specific CDM with 277 standardized variables, enables semantic integration across dementia research studies and immediate export to OMOP/FHIR for downstream analytics (Wegner et al., 2021).
  • Flowsheet Data Example: Integration of highly granular flowsheet measurements into OMOP can follow a rapid JSON-dump approach (complete but unstandardized) or a labor-intensive semantic mapping (LOINC) workflow yielding maximal research utility. A plausible implication is that hybrid, staged ingestion (raw, then iterative codification) best balances data completeness and standardization (Seto et al., 2021).
  • Financial Services Post-Trade Transformation: The ISDA CDM, especially when embedded alongside an Authoritative Data Store (ADS), eliminates inconsistent processes and data duplication industry-wide. Adoption scenarios mapped via a four-step path (ETL overlay, local replica, hybrid, and fully native) estimate operational savings up to \$2.56B annually (≈80% of CDM-impacted dealer spend), cycle-time reductions of 50–70% in critical workflows, and a drop in matching errors from ~5% to <0.5% (Nair et al., 2020). Integration architectures range from classic centralized ADS operated by FMIs to DLT-facilitated, decentralized stores, with the CDM fully specifying event emission, trade state, and invariants (Bakshi et al., 2021, Clack, 2017).

6. Limitations, Open Challenges, and Future Directions

  • Semantic Coverage Trade-Offs: Generic CDM schemas (e.g., OMOP) may not encompass domain-specific or rapidly-evolving constructs (e.g., neurological submeasures, FAQ forms in ataxia research). Domain-specific CDMs offer rapid extensibility but raise harmonization and cross-community reuse challenges (Wegner et al., 2021).
  • Automating Mappings: Current pipelines for mapping source variables or free-text to CDM attributes are partly manual (edit distance, lookup, clinician review). Automating this process, using supervised ML, ontological reasoning, or collaborative artifact sharing, remains an open area (Seto et al., 2021).
  • Consistency and Versioning in Distributed Environments: For process-integrated CDMs (e.g., ISDA CDM), strong consistency and auditability are native at model level, but guarantees in decentralized operational deployments depend on implementation-layer consensus or state-replication protocol choices (Clack, 2017, Bakshi et al., 2021).
  • Performance, Scalability, and Provenance Cost: At scale (>50M measurement rows, >90M notes), even optimal ETL frameworks experience performance bottlenecks. Provenance-aware lineage tracking and incremental refresh strategies partially mitigate but do not eliminate computational cost (Desmond et al., 12 Nov 2025).
  • Inter-Model Operability: While export and mapping to other standard formalisms (FHIR, i2b2, RDF/OWL) are feasible, full algorithmic integration requires agreement on semantics at the process as well as data level; partial homomorphism definitions are frequently invoked but rarely fully realized (Wegner et al., 2021, Desmond et al., 12 Nov 2025).

7. Summary Table: Representative CDMs and Use Cases

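| Domain              | CDM Name     | Key Application                                           |
|---------------------|--------------|-----------------------------------------------------------|
| Healthcare          | OMOP CDM     | EHR harmonization, cohort discovery, federated analytics  |
| Biomedical research | Dementia CDM | Multi-study variable integration, semantic data sharing   |
| Finance             | ISDA CDM     | Post-trade, lifecycle events, event-sourced settlement    |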

CDMs are integral to the construction of interoperable, reproducible, and auditable distributed data ecosystems. Their rigorous specification of entities, events, and transformations forms the backbone for large-scale data integration, cohort discovery, and end-to-end automated workflows across medicine and finance (Desmond et al., 12 Nov 2025, Wegner et al., 2021, Clack, 2017, Bakshi et al., 2021, Nair et al., 2020, Liu et al., 2019, Seto et al., 2021).
