CRITICAL Dataset: Multi-Institutional EHR Benchmark

Updated 13 September 2025
  • The CRITICAL dataset is a large-scale, multi-institutional EHR repository comprising 1.95 billion records, with standardized clinical trajectories across pre-ICU, ICU, and post-ICU settings.
  • It addresses vocabulary heterogeneity by mapping diverse source codes to SNOMED-CT and standardizing units, thereby reducing data sparsity and enhancing interoperability.
  • The dataset supports reproducible machine learning benchmarks on clinical prediction tasks, providing robust performance metrics and comprehensive audit trails.

The CRITICAL dataset, as described in "The CRITICAL Records Integrated Standardization Pipeline (CRISP): End-to-End Processing of Large-scale Multi-institutional OMOP CDM Data" (Luo et al., 10 Sep 2025), is a multi-institutional electronic health record (EHR) resource encompassing 1.95 billion records from 371,365 patients across four Clinical and Translational Science Award (CTSA) hubs. It provides comprehensive coverage of pre-ICU, ICU, and post-ICU encounters in both inpatient and outpatient settings, capturing extensive longitudinal and cross-institutional clinical trajectories. The CRISP pipeline, developed to process this dataset, implements systematic data quality management, cross-vocabulary mapping to a unified SNOMED-CT standard, and modular parallelized transformation, and it establishes baseline machine learning benchmarks across multiple clinical prediction tasks.

1. Scale, Scope, and Data Characteristics

The CRITICAL dataset exhibits large scale and diversity, featuring 1.95 billion rows corresponding to 371,365 patients and 38 million visits, with data distributed across 17 OMOP CDM tables. The MEASUREMENT table alone contains over 1.4 billion rows. The dataset's observed time horizon spans a median duration of 3.11 years per patient, with some patient records exceeding 31.8 years of longitudinal follow-up. Clinical coverage includes pre-ICU, ICU, and post-ICU events spanning both inpatient and outpatient domains, thus delivering end-to-end patient trajectories, a feature not present in legacy benchmarks such as MIMIC or eICU.

Table: Institutional Coverage and Record Volume

Institutions     Patients    Total Records
4 CTSA sites     371,365     1.95 billion

This longitudinal, multi-setting architecture enables a breadth of outcome studies, including those requiring pre-hospitalization baselines, long-term follow-up, and care transitions.

2. Vocabulary Heterogeneity and Data Harmonization

A central technical challenge addressed by CRISP is cross-vocabulary heterogeneity. The raw OMOP CDM extractions contain >150,000 unique source concepts from approximately 30 vocabularies (e.g., SNOMED, RxNorm, CPT4, ICD9CM, ICD10CM, ICD10PCS, HCPCS, LOINC). Within any given table, multiple coding systems may occur concurrently; for instance, ICD10PCS accounts for roughly 57.8% of PROCEDURE_OCCURRENCE concepts, with SNOMED and CPT4 also present.

The consequence is a highly sparse and fragmented feature space: identical events may be encoded with different codes across institutions or time periods, hampering downstream ML utility.

To harmonize, CRISP performs crosswalk-driven many-to-one mapping of all source vocabularies to SNOMED-CT. Deduplication proceeds via composite keys:

\text{Composite Key} = \text{person\_id} + \text{SNOMED\_id} + \text{datetime}

This mapping ensures concept-level uniformity, de-fragmenting feature matrices and reducing sparsity. Units for quantitative measurements are standardized to UCUM conventions (e.g., temperature converted from Fahrenheit to Celsius), and fragmented episodes are merged within two-hour windows to reconstruct continuous care trajectories.
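
A minimal pandas sketch of these harmonization steps, assuming illustrative column names (source_concept_id, snomed_id, unit, value, start, end) that the summary does not specify:

```python
import pandas as pd

def harmonize(events: pd.DataFrame, crosswalk: pd.DataFrame) -> pd.DataFrame:
    """Map source concepts to SNOMED-CT, deduplicate on the composite key,
    and standardize units. Column names are illustrative, not CRISP's schema."""
    # Many-to-one crosswalk: every source concept resolves to one SNOMED concept.
    mapped = events.merge(
        crosswalk[["source_concept_id", "snomed_id"]],
        on="source_concept_id",
        how="left",
    )
    # Composite-key deduplication: person_id + SNOMED_id + datetime.
    mapped = mapped.drop_duplicates(subset=["person_id", "snomed_id", "datetime"])
    # Example UCUM unit standardization: Fahrenheit -> Celsius.
    is_f = mapped["unit"].eq("degF")
    mapped.loc[is_f, "value"] = (mapped.loc[is_f, "value"] - 32.0) * 5.0 / 9.0
    mapped.loc[is_f, "unit"] = "Cel"
    return mapped

def merge_episodes(visits: pd.DataFrame, gap_hours: float = 2.0) -> pd.DataFrame:
    """Merge care fragments separated by <= gap_hours into single episodes."""
    visits = visits.sort_values(["person_id", "start"]).copy()
    prev_end = visits.groupby("person_id")["end"].shift()
    gap = (visits["start"] - prev_end).dt.total_seconds() / 3600.0
    # A new episode begins at each patient's first visit or after a long gap.
    visits["episode_id"] = (gap.isna() | (gap > gap_hours)).cumsum()
    return (visits.groupby(["person_id", "episode_id"])
                  .agg(start=("start", "min"), end=("end", "max"))
                  .reset_index())
```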

3. Modular Pipeline Architecture and Computational Optimization

CRISP features a five-stage modular pipeline architecture:

  1. Exploratory Data Analysis: Automated reporting of missingness (>95% flagged), row counts, and patient-level summaries (e.g., ICU admission rates, demographics).
  2. Cleaning and Preprocessing: Invalid values and duplicates are removed via composite primary keys; temporal consistency is enforced (start < end).
  3. Cross-Vocabulary Mapping: Source codes (ICD9CM, ICD10CM, etc.) are mapped to SNOMED, with deduplication as described above.
  4. Unit Standardization and Outlier Removal: Outliers are efficiently pruned using T-Digest calculations on multi-billion-row tables, preserving only the 1st–99th percentile range (a sketch of this stage follows the list).
  5. Extraction and Labeling: Data is reorganized from table-centric to patient-centric folders; encounter and label extraction (e.g., ICU status, event types) is performed for downstream ML tasks.
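
As referenced in stage 4 above, a minimal two-pass sketch of t-digest-based outlier pruning, using the open-source tdigest Python package as a stand-in (the summary does not name CRISP's actual implementation, and the file name is illustrative):

```python
import pandas as pd
from tdigest import TDigest  # pip install tdigest; assumed stand-in library

def prune_outliers(chunk_reader, column="value", lo=1.0, hi=99.0):
    """Two-pass, chunked percentile pruning. Pass 1 streams chunks through a
    t-digest to approximate the 1st/99th percentiles without holding a
    multi-billion-row table in memory; pass 2 filters against them."""
    digest = TDigest()
    for chunk in chunk_reader():  # pass 1: estimate percentiles
        digest.batch_update(chunk[column].dropna())
    p_lo, p_hi = digest.percentile(lo), digest.percentile(hi)
    for chunk in chunk_reader():  # pass 2: keep the 1st-99th percentile range
        yield chunk[chunk[column].between(p_lo, p_hi)]

# Usage: a callable that re-opens the table so it can be streamed twice.
reader = lambda: pd.read_csv("measurement.csv", chunksize=1_000_000)
cleaned = pd.concat(prune_outliers(reader), ignore_index=True)
```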

Parallelization and chunk-based loading yield a 4–6× speedup, with full-dataset processing completed in under 24 hours on standard hardware (12-core CPU, 64GB RAM). Adaptive patient-centric directory structures further optimize file I/O given variable patient record density.
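
A sketch of chunked, parallel table processing under those hardware assumptions; worker count, chunk size, and file name are illustrative rather than CRISP's actual defaults:

```python
from multiprocessing import Pool
import pandas as pd

def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for one stage's per-chunk work (cleaning, mapping, etc.)."""
    return chunk.drop_duplicates()

def run_parallel(path: str, workers: int = 12, chunksize: int = 1_000_000) -> pd.DataFrame:
    # Stream the table in bounded-memory chunks and fan them out across
    # worker processes; results are concatenated as they complete.
    with Pool(processes=workers) as pool:
        chunks = pd.read_csv(path, chunksize=chunksize)
        return pd.concat(pool.imap(process_chunk, chunks), ignore_index=True)

if __name__ == "__main__":
    table = run_parallel("measurement.csv", workers=12)
```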

4. Baseline Machine Learning Benchmarks

The harmonized dataset supports reproducible ML benchmarking on canonical clinical prediction tasks. Seven model types are evaluated: Logistic Regression, Random Forest, Gradient Boosting, XGBoost, MLP, LSTM, and Temporal Convolutional Networks (TCN). Benchmarks use features extracted from five core OMOP tables over specified observation windows (e.g., the first 24 hours after ICU admission, split into 4-hour bins).
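
For example, turning raw measurements into the 0–24 hour, 4-hour-binned feature matrix described above might look like the following sketch; the column names and mean aggregation are assumptions, not taken from the paper:

```python
import pandas as pd

def bin_features(meas: pd.DataFrame, icu_start: pd.Series) -> pd.DataFrame:
    """Aggregate measurements from the first 24h after ICU admission into
    six 4-hour bins, one row per patient. `icu_start` maps person_id to
    the ICU admission timestamp (illustrative schema)."""
    hours = (meas["datetime"] - meas["person_id"].map(icu_start))
    hours = hours.dt.total_seconds() / 3600.0
    in_window = hours.between(0, 24)
    meas = meas.loc[in_window].copy()
    # Bin index 0..5; an observation at exactly hour 24 falls in the last bin.
    meas["bin"] = (hours[in_window] // 4).astype(int).clip(upper=5)
    return meas.pivot_table(index="person_id",
                            columns=["snomed_id", "bin"],
                            values="value", aggfunc="mean")
```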

Prediction tasks and representative performance (AUROC):

Task                           Models Evaluated           AUROC Range
7-/30-day mortality            LR, XGB, LSTM, TCN, MLP    >0.78
ICU length of stay (>3/7 d)    LR, XGB, LSTM, TCN, MLP    0.619–0.755
Readmission risk               LR, XGB, LSTM, TCN, MLP    0.619–0.755
Sepsis onset (48 h / 7 d)      LR, XGB, LSTM, TCN, MLP    see paper

Five-fold cross-validation is performed on patient-level splits, and each performance metric is reported as an empirical baseline for reference in subsequent clinical AI modeling efforts.
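
Patient-level five-fold splitting can be reproduced with scikit-learn's GroupKFold; the sketch below pairs it with logistic regression and AUROC as one representative model/metric combination (the summary does not specify the exact splitting utility used):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

def patient_level_cv(X, y, patient_ids, n_splits=5):
    """Five-fold CV in which no patient contributes rows to both the
    training and test folds, preventing patient-level leakage."""
    aurocs = []
    for train_idx, test_idx in GroupKFold(n_splits).split(X, y, groups=patient_ids):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        scores = model.predict_proba(X[test_idx])[:, 1]
        aurocs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aurocs)), float(np.std(aurocs))
```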

5. Data Quality Management and Audit Trails

CRISP implements column-wise and table-wise missingness tracking. Tables/columns exceeding 95% missingness are pruned. All preprocessing stages—including cleaning, mapping, deduplication, and unit conversions—are logged with comprehensive audit trails. Temporal inconsistency (e.g., negative time intervals), duplicate records (composite key collisions), and anomalous unit values (based on percentile calculations) are systematically removed.
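
A minimal sketch of the missingness audit and 95% pruning rule, with Python's standard logging standing in for CRISP's audit-trail mechanism (the logger name and threshold parameterization are illustrative):

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("crisp.audit")  # illustrative logger name

def prune_sparse_columns(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop columns whose missingness exceeds `threshold`, logging each
    decision so the preprocessing step leaves an auditable record."""
    missingness = df.isna().mean()
    to_drop = missingness[missingness > threshold].index.tolist()
    for col in to_drop:
        log.info("dropped column %s (%.1f%% missing)", col, 100 * missingness[col])
    return df.drop(columns=to_drop)
```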

6. Access, Usability, and Extensibility

Detailed processing documentation, source code, and baseline implementations are provided, enabling reproducibility and facilitating extension to custom prediction tasks, additional institutions, or updated OMOP versions. The modular architecture permits selective reprocessing (e.g., running only the mapping or only the cleaning module) and parameter tuning (e.g., the number of workers for parallelization).
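
A toy sketch of what such selective reprocessing could look like; the stage names and registry are placeholders, since this summary does not describe CRISP's actual module interface:

```python
from typing import Callable, Dict

# Placeholder stage implementations mirroring the five pipeline modules.
def run_eda(**kw): print("EDA report", kw)
def run_cleaning(**kw): print("cleaning/preprocessing", kw)
def run_mapping(**kw): print("cross-vocabulary mapping", kw)
def run_units(**kw): print("unit standardization/outliers", kw)
def run_extraction(**kw): print("extraction/labeling", kw)

STAGES: Dict[str, Callable] = {
    "eda": run_eda, "clean": run_cleaning, "map": run_mapping,
    "units": run_units, "extract": run_extraction,
}

def run_pipeline(selected=("clean", "map"), workers: int = 12) -> None:
    """Run only the requested modules, e.g. re-mapping without re-cleaning."""
    for name in selected:
        STAGES[name](workers=workers)

run_pipeline(selected=("map",), workers=8)  # reprocess only the mapping stage
```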

The pipeline design democratizes access for multi-institutional EHR researchers by removing barriers imposed by dataset heterogeneity and large-scale preprocessing complexity. By providing audit trails and patient-centric data reorganization, CRISP supports regulatory review, secondary analysis, and data sharing consistent with privacy standards.

7. Implications and Future Directions

The CRITICAL dataset, in conjunction with the CRISP pipeline, establishes a new paradigm for clinical AI benchmarking. The end-to-end harmonization and comprehensive baseline evaluations pave the way for generalizable, robust clinical prediction models and large-scale health equity research. A plausible implication is that further expansion—including ingestion of additional institutions, vocabularies, or advanced phenotyping algorithms—will progressively enhance the utility of the CRITICAL dataset as a foundation for method development, retrospective analysis, and population-scale modeling.

Importantly, systematic vocabulary mapping, deduplication, and auditability are now preconditions for trustworthy clinical modeling; heterogeneous, raw multi-institutional EHR datasets necessitate sophisticated pipelines to achieve reproducible and interpretable results. The open-source nature of CRISP further accelerates adoption and collaboration across the clinical informatics community.

In summary, the CRITICAL dataset and the CRISP pipeline together define current best practices for the preparation, harmonization, and benchmarking of large, multi-institutional EHR datasets for clinical prediction and health equity research (Luo et al., 10 Sep 2025).

References

Luo et al. (10 Sep 2025). "The CRITICAL Records Integrated Standardization Pipeline (CRISP): End-to-End Processing of Large-scale Multi-institutional OMOP CDM Data."
