CRISP: Integrated EHR Standardization Pipeline

Updated 14 January 2026
  • CRISP is an integrated framework that standardizes EHR data, delivering harmonized, ML-ready datasets via modular stages like EDA, cleaning, and vocabulary mapping.
  • It employs stringent data quality checks with detailed audit trails to ensure reproducibility and high-confidence model development.
  • The pipeline’s scalable, parallel processing architecture efficiently handles billions of records on commodity hardware, enabling rapid EHR harmonization.

The CRITICAL Records Integrated Standardization Pipeline (CRISP) is an end-to-end, modular data processing framework designed for large-scale, multi-institutional electronic health record (EHR) harmonization and ML-ready curation. Deployed on the CRITICAL dataset, which comprises 1.95 billion records from 371,365 patients across four Clinical and Translational Science Award (CTSA) institutions, CRISP integrates exploratory data analysis, stringent cleaning, cross-vocabulary mapping, deduplication, unit standardization, and rigorous audit trails. Its architectural innovations enable efficient processing of raw Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) datasets, yielding semantically harmonized, reproducibly benchmarked outputs to support clinical AI research and health equity investigations (Luo et al., 10 Sep 2025).

1. Modular Pipeline Architecture

CRISP is structured as a five-stage, fully modular pipeline, with an optional sixth stage dedicated to model benchmarking.

Stage | Purpose | Outputs/Audit Artifacts
Stage 1: EDA | Missingness; table/patient/date statistics | audit/eda/ (JSONs)
Stage 2: Cleaning | Invalid, duplicate, and temporal filtering | audit/invalid/, /duplicates/, /temporal/
Stage 3: Vocabulary | Crosswalk mapping to SNOMED CT | audit/unmapped/
Stage 4: Standardize | Outlier removal, unit conversion | audit/unitconv/, /outlier/
Stage 5: Patient-level | Feature matrix construction | patient-centric dataset
Benchmarking module | Model performance evaluation | Baseline AUROC values

Each stage functions as an independent component that accepts parameter files, versions its outputs, and serializes all transformations to detailed log directories. Researchers may execute, reconfigure, or skip stages without recomputing prior outputs, facilitating iterative experimentation and rapid prototyping.
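The stage contract described above can be sketched as follows. This is a minimal illustration of the pattern (independent stages, parameterized transforms, per-stage audit artifacts); the function and directory names are hypothetical, not taken from the CRISP codebase.

```python
import json
from pathlib import Path

def run_stage(name, transform, records, params, audit_dir="audit"):
    """Run one pipeline stage: apply `transform`, then serialize an audit log.

    `transform` returns (kept_records, audit_info); each stage writes its
    artifacts to its own directory, so stages can be re-run independently.
    """
    kept, audit = transform(records, params)
    out = Path(audit_dir) / name
    out.mkdir(parents=True, exist_ok=True)
    (out / "log.json").write_text(json.dumps(audit, indent=2))
    return kept

# Illustrative stage in the spirit of Stage 2: drop records whose
# concept_id is null or zero, and record counts for the audit trail.
def drop_invalid(records, params):
    kept = [r for r in records if r.get("concept_id")]
    return kept, {"input": len(records), "kept": len(kept)}
```

Because each stage only reads its inputs and parameter file and writes versioned outputs, any stage can be skipped or re-run in isolation.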

2. Data Quality Management and Provenance

Component (a) encompasses Stages 1 and 2, focused on transparent quality management and comprehensive auditable provenance.

Stage 1: Exploratory Data Analysis (EDA)

  • For each OMOP table, column missingness is computed; column $j$ is dropped if $\mathrm{missing}_j / \mathrm{total}_j > 0.95$.
  • Table- and longitudinal-level statistics (row counts, patient counts, date ranges) are serialized as JSONs.
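The 95% missingness rule above can be sketched with pandas. This is an illustrative sketch, assuming each OMOP table is loaded as a DataFrame; the actual CRISP implementation may differ.

```python
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop columns whose missing fraction (missing_j / total_j) exceeds `threshold`."""
    missing_frac = df.isna().mean()                 # per-column missing fraction
    keep = missing_frac[missing_frac <= threshold].index
    return df[keep]
```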

Stage 2: Data Cleaning Workflow

  • Invalid-concept filtering archives records with null or zero concept_id fields.
  • Duplicate elimination applies a composite-key criterion: a record $r_2$ is dropped if $\exists\, r_1$ such that $r_1.\mathit{person\_id} = r_2.\mathit{person\_id}$, $r_1.\mathit{concept\_id} = r_2.\mathit{concept\_id}$, and $r_1.\mathit{datetime} = r_2.\mathit{datetime}$.
  • Temporal validation filters out records where start_time ≥ end_time or date > today.
  • Each cleaning step is logged and archived, enabling full reversibility or inspection of each transformation.

This comprehensive audit infrastructure ensures strict data provenance, a requirement for high-confidence model development and regulatory-grade reproducibility.

3. Cross-Vocabulary Mapping, Deduplication, and Unit Standardization

Component (b) spans Stage 3 (cross-vocabulary mapping) and Stage 4 (unit standardization).

Cross-Vocabulary Mapping (Stage 3)

  • Heterogeneous coding patterns across the MEASUREMENT, OBSERVATION, PROCEDURE_OCCURRENCE, and DEVICE_EXPOSURE tables are harmonized via a precomputed crosswalk $f(\mathrm{source\_vocab}, \mathrm{raw\_code}) \to \mathrm{SNOMED\_CT\_code}$.
  • Unmappable records are archived for auditability.
  • Post-mapping deduplication targets ['person_id','standard_concept_id','datetime'].
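The crosswalk lookup and unmapped-record archiving can be sketched as follows. The crosswalk entries here are illustrative examples, not the pipeline's actual mapping tables.

```python
import pandas as pd

# Hypothetical precomputed crosswalk: (source_vocab, raw_code) -> SNOMED CT code.
# Entries are illustrative only.
CROSSWALK = {
    ("LOINC", "8867-4"): "364075005",   # heart rate
    ("ICD10CM", "I10"): "38341003",     # hypertensive disorder
}

def map_to_snomed(df: pd.DataFrame):
    """Map source codes via the crosswalk; return (mapped, unmapped)."""
    keys = list(zip(df["source_vocab"], df["raw_code"]))
    df = df.assign(standard_concept_id=[CROSSWALK.get(k) for k in keys])
    mapped = df[df["standard_concept_id"].notna()]
    # In CRISP, unmapped records are archived under audit/unmapped/.
    unmapped = df[df["standard_concept_id"].isna()]
    return mapped, unmapped
```

The post-mapping deduplication then runs on the standardized key, e.g. `df.drop_duplicates(subset=['person_id', 'standard_concept_id', 'datetime'])`.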

Unit Standardization and Outlier Removal (Stage 4)

  • Numerical features are standardized to UCUM units. Conversion equations include:
    • Temperature: $C = (F - 32) \times \frac{5}{9}$
    • Weight: $\mathrm{kg} = \mathrm{lb} \times 0.45359237$
    • Height: $\mathrm{cm} = \mathrm{in} \times 2.54$
  • Outlier removal employs t-digest algorithms for approximate percentile computation, removing values outside the 1st–99th percentiles.
  • Physiological range sanity checks are performed.
  • Visit records within two-hour windows are consolidated to reconstruct contiguous care episodes.
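The conversions and percentile filter above can be sketched as follows. CRISP uses t-digest for approximate percentiles on streaming data; this sketch substitutes exact percentiles from the standard library for simplicity.

```python
import statistics

# Unit conversions as given in the text.
def f_to_c(f):
    return (f - 32) * 5 / 9

def lb_to_kg(lb):
    return lb * 0.45359237

def in_to_cm(inches):
    return inches * 2.54

def percentile_filter(values, lo=1, hi=99):
    """Keep values inside the [lo, hi] percentile band (exact percentiles;
    CRISP approximates these with t-digest for scalability)."""
    qs = statistics.quantiles(values, n=100, method="inclusive")
    low, high = qs[lo - 1], qs[hi - 1]
    return [v for v in values if low <= v <= high]
```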

All transformation rules and audit artifacts are preserved, guaranteeing semantic consistency and numerical comparability across sites.

4. Parallel-Optimized Execution and Scalability

Component (c) operationalizes the modular pipeline with parallel, memory-efficient processing:

  • CRISP employs chunked I/O (1 million-row blocks), bounded-memory worker pools, and asynchronous execution (default pool size $p = 8$, configurable).
  • Pseudocode (Python multiprocessing sketch):

    from multiprocessing import Pool

    pool = Pool(p)  # bounded worker pool, default p = 8
    for chunk in read_in_chunks(table, size=chunk_size):
        # each bounded-memory chunk is processed asynchronously
        pool.apply_async(process_stage, args=(chunk, params))
    pool.close()
    pool.join()
  • Empirical benchmarks on standard hardware (12 cores, 64 GB RAM) yield $\mathrm{speedup} = T_\mathrm{seq}/T_\mathrm{par} \approx 4$–$6\times$; 278.97 GB comprising 1.95B records across 371,365 patients processes in under 24 hours.
  • This approach allows end-to-end harmonization on commodity resources, obviating the need for specialized clusters.
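The `read_in_chunks` helper referenced in the pseudocode can be sketched with pandas chunked reading. The CSV-based loader and 1M-row default are assumptions for illustration; CRISP's actual loaders may differ.

```python
import pandas as pd

def read_in_chunks(table_path, size=1_000_000):
    """Yield bounded-memory blocks of a table (1M-row default, as in CRISP).

    Assumes a CSV on disk; each yielded chunk is an independent DataFrame,
    so workers never hold more than one block in memory at a time.
    """
    yield from pd.read_csv(table_path, chunksize=size)
```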

5. Baseline Predictive Modeling and Benchmarking

Component (d) is an optional module for establishing reproducible modeling performance standards:

  • Four ICU prediction tasks: mortality (7/30-day), length of stay (>3/>7 days), readmission (7/30/90 days), and sepsis onset (post-ICU, within 48h/7 days).
  • Feature matrix constructed from top 800 frequent concepts, discretized in 4-hour bins over a 24-hour observation window.
  • Models: Logistic Regression (LR), Random Forest (RF), Gradient Boosting (GB), XGBoost, Multi-Layer Perceptron (MLP), LSTM, Temporal Convolutional Network (TCN).
  • Five-fold cross-validation with an 80/20 train/test split per fold.
  • Reported AUROC values:
Task | Best AUROC (Model)
Mortality 7d | 0.814 (MLP)
Mortality 30d | 0.835 (MLP)
LOS >3d | 0.756 (MLP)
LOS >7d | 0.767 (MLP)
Readmission 7d | 0.755 (XGB)
Readmission 30d | 0.743 (MLP)
Readmission 90d | 0.746 (XGB)
Sepsis 48h | 0.912 (MLP)
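The feature matrix construction described above (top frequent concepts, 4-hour bins over a 24-hour observation window) can be sketched in pandas. Column names and the mean aggregation are illustrative assumptions, not the benchmarking module's exact design.

```python
import pandas as pd

def build_feature_matrix(events: pd.DataFrame, concepts,
                         window_h=24, bin_h=4) -> pd.DataFrame:
    """Sketch: one row per patient, one column per (concept, 4-hour bin)
    over the first 24 hours of the ICU stay.

    `events` is assumed to carry person_id, concept_id, hours_from_admit,
    and value columns; `concepts` is the top-k frequent concept set
    (top 800 in the CRISP benchmarks).
    """
    ev = events[events["hours_from_admit"] < window_h].copy()
    ev = ev[ev["concept_id"].isin(concepts)]
    ev["bin"] = (ev["hours_from_admit"] // bin_h).astype(int)
    return ev.pivot_table(index="person_id",
                          columns=["concept_id", "bin"],
                          values="value", aggfunc="mean")
```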

These baselines reveal both the value and remaining challenges of model and feature engineering in highly heterogeneous multi-institutional EHR contexts.

6. Enhancing Reproducibility and Community Access

CRISP advances reproducibility and democratizes large-scale EHR analytics:

  • Vocabulary Harmonization: CRISP maps 150,671 unique source concepts spanning 30 vocabularies into SNOMED CT and UCUM standards, densifying feature space and improving cross-site semantic integrity.
  • Configurable Modular Architecture: Independent, parameterized modules facilitate selective reprocessing and iterative design.
  • Auditability: Every data action and transformation produces human-readable logs and machine-readable artifacts; provenance is traceable at the granularity of individual records.
  • Open Source Availability: All code, comprehensive documentation, and processed outputs are made publicly available (https://github.com/AaronLuo00/CRISP-Pipeline), reducing time investment from months to days and shifting focus from preprocessing to modeling.
  • Technical Inclusion: By processing hundred-gigabyte, billion-row datasets on standard hardware in less than a day, CRISP lowers infrastructure barriers for research teams of varying technical capacities.

Collectively, these elements yield a transparent, high-throughput pipeline that transforms multi-site OMOP CDM data into harmonized, ML-ready artifacts, supporting cross-institutional model generalization, health equity analysis, and reproducible clinical AI research (Luo et al., 10 Sep 2025).
