CRISP: Integrated EHR Standardization Pipeline
- CRISP is an integrated framework that standardizes EHR data, delivering harmonized, ML-ready datasets via modular stages like EDA, cleaning, and vocabulary mapping.
- It employs stringent data quality checks with detailed audit trails to ensure reproducibility and high-confidence model development.
- The pipeline’s scalable, parallel processing architecture efficiently handles billions of records on commodity hardware, enabling rapid EHR harmonization.
The CRITICAL Records Integrated Standardization Pipeline (CRISP) is an end-to-end, modular data processing framework designed for large-scale, multi-institutional electronic health record (EHR) harmonization and ML-ready curation. Deployed on the CRITICAL dataset, which comprises 1.95 billion records from 371,365 patients across four Clinical and Translational Science Award (CTSA) institutions, CRISP integrates exploratory data analysis, stringent cleaning, cross-vocabulary mapping, deduplication, unit standardization, and rigorous audit trails. Its architectural innovations enable efficient processing of raw Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) datasets, yielding semantically harmonized, reproducibly benchmarked outputs to support clinical AI research and health equity investigations (Luo et al., 10 Sep 2025).
1. Modular Pipeline Architecture
CRISP is structured as a five-stage, fully modular pipeline, with an optional sixth stage dedicated to model benchmarking.
| Stage | Purpose | Outputs/Audit Artifacts |
|---|---|---|
| Stage 1: EDA | Missingness, table/patient/date stats | audit/eda/ (JSONs) |
| Stage 2: Cleaning | Invalid, duplicate, temporal filtering | audit/invalid/, /duplicates/, /temporal/ |
| Stage 3: Vocabulary | Crosswalk mapping to SNOMED-CT | audit/unmapped/ |
| Stage 4: Standardize | Outlier removal, unit conversion | audit/unitconv/, /outlier/ |
| Stage 5: Patient-level | Feature matrix construction | patient-centric dataset |
| Benchmarking Module | Model performance evaluation | Baseline AUROC values |
Each stage functions as an independent component that accepts parameter files, versions its outputs, and serializes all transformations to detailed log directories. Researchers may execute, reconfigure, or skip stages without recomputing prior stages' outputs, facilitating iterative experimentation and rapid prototyping.
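As a minimal sketch of this contract (the function names and audit layout here are illustrative, not CRISP's actual API), each stage can be modeled as a callable that takes a table plus parameters, returns the transformed table alongside an audit record, and serializes that record under its own audit directory:

```python
import json
from pathlib import Path

def run_stage(name, transform, table, params, audit_root="audit"):
    """Apply one pipeline stage and serialize its audit artifacts.

    `transform` returns (cleaned_table, audit_dict); the audit dict is
    written as JSON under <audit_root>/<name>/ so the stage can be
    inspected or re-run independently of the others.
    """
    out_dir = Path(audit_root) / name
    out_dir.mkdir(parents=True, exist_ok=True)
    cleaned, audit = transform(table, params)
    (out_dir / "audit.json").write_text(json.dumps(audit, indent=2))
    return cleaned

# Example: a trivial stage that drops rows with a null concept_id.
def drop_null_concepts(rows, params):
    kept = [r for r in rows if r.get("concept_id")]
    return kept, {"input_rows": len(rows), "dropped": len(rows) - len(kept)}

rows = [{"concept_id": 123}, {"concept_id": None}]
cleaned = run_stage("invalid", drop_null_concepts, rows, params={})
```

Because each stage only communicates through its inputs, parameters, and audit directory, stages can be reordered, re-run, or skipped without touching the others.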
2. Data Quality Management and Provenance
Component (a) encompasses Stages 1 and 2, focused on transparent quality management and comprehensive auditable provenance.
Stage 1: Exploratory Data Analysis (EDA)
- For each OMOP table, column missingness is computed; columns whose missingness exceeds a configured threshold are dropped.
- Table- and longitudinal-level statistics (row counts, patient counts, date ranges) are serialized as JSONs.
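A minimal sketch of the missingness computation (the threshold value below is illustrative; CRISP's cutoff is set per run via its parameter files):

```python
import pandas as pd

def column_missingness(df: pd.DataFrame) -> dict:
    """Fraction of null values per column."""
    return df.isna().mean().to_dict()

def drop_sparse_columns(df: pd.DataFrame, max_missing: float) -> pd.DataFrame:
    """Drop columns whose missingness exceeds `max_missing`."""
    rates = column_missingness(df)
    keep = [c for c, r in rates.items() if r <= max_missing]
    return df[keep]

df = pd.DataFrame({"value": [1.0, None, 3.0], "unit": [None, None, None]})
slim = drop_sparse_columns(df, max_missing=0.9)
# "unit" (100% missing) exceeds the threshold and is dropped
```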
Stage 2: Data Cleaning Workflow
- Invalid-concept filtering archives records with null or zero concept_id fields.
- Duplicate elimination applies a composite-key criterion: a record is dropped if another record already exists with identical values on every field of the key.
- Temporal validation filters out records where start_time ≥ end_time or date > today.
- Each cleaning step is logged and archived, enabling full reversibility or inspection of each transformation.
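The duplicate and temporal filters can be sketched as follows; the composite key shown is an assumption based on the fields named in Stage 3, and the removed records are returned rather than silently discarded, mirroring the archival behavior described above:

```python
import pandas as pd

def clean(df: pd.DataFrame, key=("person_id", "concept_id", "datetime")):
    """Drop composite-key duplicates and temporally invalid records,
    returning the cleaned frame plus archives of what was removed."""
    dup_mask = df.duplicated(subset=list(key), keep="first")
    duplicates = df[dup_mask]          # archived for audit
    df = df[~dup_mask]

    now = pd.Timestamp.now()
    bad_time = (df["start_time"] >= df["end_time"]) | (df["start_time"] > now)
    temporal = df[bad_time]            # archived for audit
    df = df[~bad_time]
    return df, duplicates, temporal

df = pd.DataFrame({
    "person_id": [1, 1, 2],
    "concept_id": [10, 10, 20],
    "datetime": ["2020-01-01"] * 3,
    "start_time": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02"]),
    "end_time": pd.to_datetime(["2020-01-02", "2020-01-02", "2020-01-01"]),
})
cleaned, dups, temporal = clean(df)
# one duplicate and one start_time >= end_time record are archived
```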
This comprehensive audit infrastructure ensures strict data provenance, a requirement for high-confidence model development and regulatory-grade reproducibility.
3. Cross-Vocabulary Mapping, Deduplication, and Unit Standardization
Component (b) spans Stage 3 (cross-vocabulary mapping) and Stage 4 (unit standardization).
Cross-Vocabulary Mapping (Stage 3)
- Heterogeneous coding patterns across MEASUREMENT, OBSERVATION, PROCEDURE_OCCURRENCE, and DEVICE_EXPOSURE tables are harmonized via a precomputed crosswalk from source concept IDs to SNOMED-CT standard concept IDs.
- Unmappable records are archived for auditability.
- Post-mapping deduplication targets the composite key `['person_id', 'standard_concept_id', 'datetime']`.
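A sketch of the crosswalk join, unmapped-record archiving, and post-mapping deduplication (the crosswalk table layout and the SNOMED codes are illustrative):

```python
import pandas as pd

# Precomputed crosswalk: source concept -> SNOMED-CT standard concept.
crosswalk = pd.DataFrame({
    "source_concept_id": [1001, 1002],
    "standard_concept_id": [44054006, 38341003],  # illustrative SNOMED codes
})

records = pd.DataFrame({
    "person_id": [1, 1, 2],
    "source_concept_id": [1001, 1001, 9999],  # 9999 has no mapping
    "datetime": ["2020-01-01T08:00"] * 3,
})

mapped = records.merge(crosswalk, on="source_concept_id", how="left")
unmapped = mapped[mapped["standard_concept_id"].isna()]  # archived for audit
mapped = mapped.dropna(subset=["standard_concept_id"])

# Post-mapping deduplication on the composite key from the text.
mapped = mapped.drop_duplicates(
    subset=["person_id", "standard_concept_id", "datetime"])
```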
Unit Standardization and Outlier Removal (Stage 4)
- Numerical features are standardized to UCUM units. Representative conversions:
  - Temperature: °C = (°F − 32) × 5/9
  - Weight: kg = lb × 0.45359237
  - Height: cm = in × 2.54
- Outlier removal employs t-digest algorithms for approximate percentile computation, removing values outside the 1st–99th percentiles.
- Physiological range sanity checks are performed.
- Visit records within two-hour windows are consolidated to reconstruct contiguous care episodes.
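A sketch of the numeric standardization and percentile filtering, using NumPy's exact percentiles in place of the pipeline's streaming t-digest (a t-digest computes approximate percentiles in bounded memory, which matters at billions of records but not in this toy example):

```python
import numpy as np

def fahrenheit_to_celsius(f):
    return (f - 32.0) * 5.0 / 9.0

def clip_outliers(values, lo_pct=1, hi_pct=99):
    """Keep values within the [1st, 99th] percentile range."""
    lo, hi = np.percentile(values, [lo_pct, hi_pct])
    return values[(values >= lo) & (values <= hi)]

temps_f = np.array([96.8, 98.6, 100.4, 500.0])  # last value is a data error
temps_c = fahrenheit_to_celsius(temps_f)
kept = clip_outliers(temps_c)  # the 260 °C artifact falls outside the range
```

In the full pipeline the percentile filter is followed by physiological range checks (e.g., rejecting temperatures incompatible with life), so both statistical and domain-knowledge guards apply.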
All transformation rules and audit artifacts are preserved, guaranteeing semantic consistency and numerical comparability across sites.
4. Parallel-Optimized Execution and Scalability
Component (c) operationalizes the modular pipeline with parallel, memory-efficient processing:
- CRISP employs chunked I/O (1 million-row blocks), bounded-memory worker pools, and asynchronous execution (worker-pool size is configurable).
- Pseudocode (Python `multiprocessing` sketch; `read_in_chunks` and `process_stage` are pipeline-internal helpers):

```python
from multiprocessing import Pool

pool = Pool(processes=p)  # p: configured worker-pool size
for chunk in read_in_chunks(table, size=chunk_size):
    pool.apply_async(process_stage, args=(chunk, params))
pool.close()
pool.join()
```
- Empirical benchmarks on standard hardware (12 cores, 64 GB RAM): the full 278.97 GB dataset, comprising 1.95B records across 371,365 patients, processes end-to-end in under 24 hours.
- This approach allows end-to-end harmonization on commodity resources, obviating the need for specialized clusters.
5. Baseline Predictive Modeling and Benchmarking
Component (d) is an optional module for establishing reproducible modeling performance standards:
- Four ICU prediction tasks: mortality (7/30-day), length of stay (>3/>7 days), readmission (7/30/90 days), and sepsis onset (post-ICU, within 48h/7 days).
- Feature matrix constructed from top 800 frequent concepts, discretized in 4-hour bins over a 24-hour observation window.
- Models: Logistic Regression (LR), Random Forest (RF), Gradient Boosting (GB), XGBoost, Multi-Layer Perceptron (MLP), LSTM, Temporal Convolutional Network (TCN).
- Five-fold cross-validation over an 80/20 train/test split.
- Reported AUROC values:
| Task | Best AUROC (Model) |
|---|---|
| Mortality 7d | 0.814 (MLP) |
| Mortality 30d | 0.835 (MLP) |
| LOS >3d | 0.756 (MLP) |
| LOS >7d | 0.767 (MLP) |
| Readmission 7d | 0.755 (XGB) |
| Readmission 30d | 0.743 (MLP) |
| Readmission 90d | 0.746 (XGB) |
| Sepsis 48h | 0.912 (MLP) |
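The feature-matrix construction above can be sketched as follows: each patient's events are bucketed into 4-hour bins over the 24-hour observation window, one column per concept (the two concepts and the last-value aggregation below are illustrative stand-ins for the top-800 vocabulary):

```python
import numpy as np

def bin_events(events, concept_index, window_h=24, bin_h=4):
    """events: iterable of (hours_since_admission, concept_id, value).
    Returns a (n_bins, n_concepts) matrix holding the last observed
    value per concept in each bin; unobserved cells stay NaN."""
    n_bins = window_h // bin_h
    X = np.full((n_bins, len(concept_index)), np.nan)
    for t, concept, value in events:
        if 0 <= t < window_h and concept in concept_index:
            X[int(t) // bin_h, concept_index[concept]] = value
    return X

concepts = {"heart_rate": 0, "temp_c": 1}  # stand-ins for top-800 concepts
events = [(1.5, "heart_rate", 88), (5.0, "heart_rate", 92), (2.0, "temp_c", 37.1)]
X = bin_events(events, concepts)  # shape (6, 2): six 4-hour bins
```

The resulting matrix (flattened or kept sequential) feeds the tabular models and the LSTM/TCN baselines, respectively.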
These baselines reveal both the value and remaining challenges of model and feature engineering in highly heterogeneous multi-institutional EHR contexts.
6. Enhancing Reproducibility and Community Access
CRISP advances reproducibility and democratizes large-scale EHR analytics:
- Vocabulary Harmonization: CRISP maps 150,671 unique source concepts spanning 30 vocabularies into SNOMED CT and UCUM standards, densifying feature space and improving cross-site semantic integrity.
- Configurable Modular Architecture: Independent, parameterized modules facilitate selective reprocessing and iterative design.
- Auditability: Every data action and transformation produces human-readable logs and machine-readable artifacts; provenance is traceable at the granularity of individual records.
- Open Source Availability: All code, comprehensive documentation, and processed outputs are made publicly available (https://github.com/AaronLuo00/CRISP-Pipeline), reducing time investment from months to days and shifting focus from preprocessing to modeling.
- Technical Inclusion: By processing hundred-gigabyte, billion-row datasets on standard hardware in less than a day, CRISP lowers infrastructure barriers for research teams of varying technical capacities.
Collectively, these elements yield a transparent, high-throughput pipeline that transforms multi-site OMOP CDM data into harmonized, ML-ready artifacts, supporting cross-institutional model generalization, health equity analysis, and reproducible clinical AI research (Luo et al., 10 Sep 2025).