MIMIC-Extract: ICU Data Pipeline
- MIMIC-Extract is an open-source pipeline that structures raw MIMIC-III data into standardized, analysis-ready formats for machine learning applications.
- It employs a modular architecture with distinct stages for cohort selection, time-varying feature extraction, and interventions mapping.
- The pipeline integrates seamlessly with Python libraries like Pandas, NumPy, PyTorch, and TensorFlow, enhancing reproducibility and extensibility in critical care research.
MIMIC-Extract is an open-source data extraction, preprocessing, and representation pipeline for the MIMIC-III electronic health record (EHR) database, specifically tailored for critical care machine learning applications. It systematically converts raw, complex MIMIC-III tables into standardized, analysis-ready dataframes and tensors compatible with mainstream machine learning frameworks, while preserving the temporal and semantic structure of clinical data (Wang et al., 2019).
1. Modular Pipeline Architecture
MIMIC-Extract is structured as a sequential pipeline composed of four primary modules:
- Cohort Selection: This module queries the MIMIC-III relational database to identify the first ICU stay of each adult subject (age ≥ 15 years) with a length of stay between 12 hours and 10 days. The output is a patients table containing one row per ICU stay with associated static demographic and outcome variables.
- Time-Varying Features Extraction: Raw measurements from chartevents, labevents, and input/output sources (including fluids and procedures) are processed to:
  - Convert all units to standard targets (detailed in Section 2).
  - Filter and clamp outliers to physiologic or extreme bounds.
  - Map raw MIMIC ItemIDs to 104 clinical aggregate concepts via a supplied CSV.
  - Discretize all time-stamped observations into hourly bins.
  - Compute hourly per-concept statistics: mean, count, and standard deviation.

  The outputs are the vitals_labs and vitals_labs_mean tables, each indexed by (subject_id, hadm_id, icustay_id, hours_in).
- Interventions Extraction: Using data from procedureevents, inputevents_mv, and charted ventilator indicators, the pipeline generates binary, hourly time-series for interventions including mechanical ventilation, vasopressor use (by drug and overall), crystalloid and colloid boluses, and non-invasive ventilation (NIV) durations. The result is the interventions table, indexed consistently with the time-varying vitals/labs.
- Persistence and Presentation: All resulting tables can be serialized to HDF5, CSV, or NumPy formats for immediate compatibility with pandas, NumPy, PyTorch, or TensorFlow without intermediate reformatting.
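As a rough illustration of how start/end intervals can become a binary hourly series, the sketch below marks every hour that an intervention interval overlaps. The function name, the column names start_hour and end_hour, and the overlap rule are assumptions made for this example, not the pipeline's actual code.

```python
import numpy as np
import pandas as pd


def intervals_to_hourly_flags(intervals: pd.DataFrame, n_hours: int) -> np.ndarray:
    """Return a 0/1 vector of length n_hours.

    Hour h is set to 1 if any interval overlaps [h, h + 1).
    Interval bounds are hours since ICU admission (hypothetical columns).
    """
    flags = np.zeros(n_hours, dtype=np.int8)
    for start, end in zip(intervals["start_hour"], intervals["end_hour"]):
        first = max(int(np.floor(start)), 0)
        last = min(int(np.ceil(end)), n_hours)  # exclusive upper bound
        if last > first:
            flags[first:last] = 1
    return flags


# Two illustrative ventilation episodes within an 8-hour stay.
vent = pd.DataFrame({"start_hour": [1.5, 6.0], "end_hour": [3.2, 7.0]})
vent_flags = intervals_to_hourly_flags(vent, n_hours=8)
print(vent_flags.tolist())  # [0, 1, 1, 1, 0, 0, 1, 0]
```

Treating any partial overlap as a full "on" hour is one possible convention; a stricter rule (for example, requiring a minimum fraction of the hour) would change only the computation of first and last.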
2. Standardized Data Processing Functions
MIMIC-Extract implements a set of rigorous preprocessing functions:
- Unit Conversions: Laboratory and vital sign data are normalized to standard SI units using explicit formulas (e.g., °F to °C, mg/dL to mmol/L for concentrations with molecular weight correction, mmHg to kPa). For example:
$$c_{\text{mmol/L}} = \frac{10 \cdot c_{\text{mg/dL}}}{M}$$
where $M$ is the analyte's molecular weight in g/mol.
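The conversions named above can be sketched as plain functions. The function names are hypothetical and chosen for this example; the constants are the standard physical conversion factors.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def mg_dl_to_mmol_l(conc_mg_dl: float, molecular_weight: float) -> float:
    """Convert a mass concentration (mg/dL) to a molar one (mmol/L).

    mg/dL * 10 gives mg/L; dividing by the molecular weight (g/mol,
    numerically equal to mg/mmol) gives mmol/L.
    """
    return conc_mg_dl * 10.0 / molecular_weight


def mmhg_to_kpa(pressure_mmhg: float) -> float:
    """Convert a pressure from mmHg to kPa (1 mmHg is about 0.133322 kPa)."""
    return pressure_mmhg * 0.133322


# Glucose has a molecular weight of roughly 180.156 g/mol.
print(round(fahrenheit_to_celsius(98.6), 2))
print(round(mg_dl_to_mmol_l(90.0, 180.156), 2))
print(round(mmhg_to_kpa(120.0), 2))
```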
- Outlier Detection and Clamping: Each concept has four numerical thresholds in variable_ranges.csv: two extreme bounds (values outside them are set to missing) and two physiologic limits (values lying between an extreme bound and the neighboring physiologic limit are clamped to that limit). Optional population-based (z-score, IQR) filters are also available.
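A minimal sketch of the four-threshold rule, assuming the thresholds have already been read into plain numbers. The function name, argument names, and the heart-rate bounds used below are illustrative assumptions, not values taken from variable_ranges.csv.

```python
import numpy as np
import pandas as pd


def clean_values(
    values: pd.Series,
    outlier_low: float,
    valid_low: float,
    valid_high: float,
    outlier_high: float,
) -> pd.Series:
    """Apply the two-tier rule: drop extreme values, clamp implausible ones."""
    values = values.astype(float)
    # Values beyond the extreme bounds are treated as errors and set to missing.
    extreme = (values < outlier_low) | (values > outlier_high)
    cleaned = values.mask(extreme)
    # Remaining values outside the physiologic limits are clamped to them;
    # NaN entries pass through clip unchanged.
    return cleaned.clip(lower=valid_low, upper=valid_high)


# Illustrative (made-up) heart-rate thresholds in beats per minute.
raw = pd.Series([-5.0, 10.0, 80.0, 300.0, 500.0])
cleaned = clean_values(raw, outlier_low=0, valid_low=20, valid_high=250, outlier_high=390)
print(cleaned.tolist())  # [nan, 20.0, 80.0, 250.0, nan]
```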
- Feature Aggregation: Because identical clinical concepts may be documented under different ItemIDs across MIMIC-III systems (CareVue, MetaVision), all ItemIDs are mapped to aggregate clinical concepts via a provided lookup table. Observations within each hourly bucket are then aggregated by mean, count, and standard deviation. No weighting is applied to multiple values within each hour.
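The hourly bucketing and unweighted aggregation can be sketched with a pandas groupby. The column names and toy values are assumptions for this example, and the standard deviation shown is the pandas default (sample standard deviation, ddof=1), which the text above does not specify.

```python
import pandas as pd

# Toy observations already mapped to a clinical concept (hypothetical columns).
obs = pd.DataFrame(
    {
        "icustay_id": [1, 1, 1, 1],
        "concept": ["heart rate"] * 4,
        "hours_since_admit": [0.2, 0.7, 1.1, 3.5],
        "value": [80.0, 90.0, 100.0, 70.0],
    }
)

# Assign each observation to an hourly bucket (floor of elapsed hours).
obs["hours_in"] = obs["hours_since_admit"].astype(int)

# Unweighted per-hour statistics; hours with a single value get std = NaN.
hourly = obs.groupby(["icustay_id", "hours_in", "concept"])["value"].agg(
    ["mean", "count", "std"]
)
print(hourly)
```

Hours with no observations (hour 2 here) simply have no row, which is why missingness has to be handled downstream.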
3. Representation of Clinical Time-Series Data
The pipeline produces four core Pandas DataFrames. The three time-varying tables are indexed by (subject_id, hadm_id, icustay_id, hours_in) and can be viewed as dense arrays over N = N_icustays stays and T_max hours; the static patients table has one row per stay and no hours_in level:
- patients: static covariates, one row per ICU stay.
- vitals_labs: a 3D array of shape [N, T_max, 3×104] holding the mean, count, and standard deviation of each clinical concept per hour.
- vitals_labs_mean: [N, T_max, 104], containing only the mean per concept per hour.
- interventions: [N, T_max, K], where K is the number of binary hourly intervention indicators.
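One way to go from the long, multi-indexed table to the dense [N, T_max, F] view is sketched below. The helper to_tensor is hypothetical, the toy frame only mimics the index layout described above, and padding with NaN is an assumption about how absent hours should be represented.

```python
import numpy as np
import pandas as pd


def to_tensor(df: pd.DataFrame, t_max: int):
    """Pack a (..., icustay_id, hours_in)-indexed frame into [N, t_max, F].

    Hours that are absent from the frame stay NaN; hours >= t_max are dropped.
    Returns the array and the stay ids in row order.
    """
    stays = df.index.get_level_values("icustay_id")
    stay_ids = stays.unique()
    stay_pos = {sid: i for i, sid in enumerate(stay_ids)}

    rows = stays.map(stay_pos).to_numpy()
    hours = df.index.get_level_values("hours_in").to_numpy()
    keep = hours < t_max

    out = np.full((len(stay_ids), t_max, df.shape[1]), np.nan)
    out[rows[keep], hours[keep]] = df.to_numpy(dtype=float)[keep]
    return out, list(stay_ids)


index = pd.MultiIndex.from_tuples(
    [(1, 100, 10, 0), (1, 100, 10, 1), (1, 100, 10, 2), (2, 200, 20, 0), (2, 200, 20, 1)],
    names=["subject_id", "hadm_id", "icustay_id", "hours_in"],
)
vitals_mean = pd.DataFrame(
    {"heart rate": [80.0, 82.0, 85.0, 70.0, 72.0], "glucose": [5.0, np.nan, 5.5, 6.1, 6.0]},
    index=index,
)

tensor, stay_ids = to_tensor(vitals_mean, t_max=3)
print(tensor.shape)  # (2, 3, 2)
```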
Sampling is at a fixed hourly resolution. For modeling, both fixed (e.g., first 24-hour window) and sliding window dynamic input paradigms are supported. Imputation of missing values is left to the downstream pipeline, with forward-filling and inclusion of mask/timestamp channels (e.g., as in GRU-D) cited as common strategies.
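Since imputation is left to the user, the following is only one possible downstream treatment: forward-filling a single stay and building the mask and time-since-last-observation channels in the style used by GRU-D. The function name is hypothetical, and leading gaps are deliberately left as NaN for a later step (such as mean filling) to handle.

```python
import numpy as np


def ffill_with_mask(x: np.ndarray):
    """Forward-fill one stay's hourly features.

    x has shape [T, F] with NaN marking missing entries. Returns:
      filled: x with each gap replaced by the last observed value,
      mask:   1.0 where the value was observed, 0.0 where it was missing,
      delta:  hours since the previous time step that held an observation
              (0 at the first hour), assuming one-hour steps.
    """
    n_steps, n_features = x.shape
    observed = ~np.isnan(x)
    filled = x.copy()
    delta = np.zeros((n_steps, n_features))
    for t in range(1, n_steps):
        missing = ~observed[t]
        filled[t, missing] = filled[t - 1, missing]
        # Reset to 1 hour after an observed step, otherwise keep accumulating.
        delta[t] = np.where(observed[t - 1], 1.0, delta[t - 1] + 1.0)
    return filled, observed.astype(np.float32), delta


x = np.array([[1.0], [np.nan], [np.nan], [4.0]])
filled, mask, delta = ffill_with_mask(x)
print(filled.ravel().tolist())  # [1.0, 1.0, 1.0, 4.0]
print(mask.ravel().tolist())    # [1.0, 0.0, 0.0, 1.0]
print(delta.ravel().tolist())   # [0.0, 1.0, 2.0, 3.0]
```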
4. Integration and Workflow
End-users can load, manipulate, and convert the extracted data for use in machine learning experiments with minimal technical overhead:
- Python/Pandas/NumPy Integration: Data can be loaded from HDF5 or CSV using pandas, then converted to wide-format NumPy arrays for model input.
- PyTorch/TensorFlow Ready: The documentation provides code snippets for custom Dataset objects (PyTorch) and for direct feeding into tf.data.Dataset (TensorFlow). Examples cover reshaping, batching, and basic window extraction.
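The documentation's own snippets are not reproduced here. As a framework-agnostic stand-in, the sketch below shows fixed and sliding window extraction on a single stay with NumPy; the function names and the [T, F] layout are assumptions for this example. The resulting arrays could then be wrapped in a PyTorch Dataset or a tf.data.Dataset.

```python
import numpy as np


def fixed_window(stay: np.ndarray, n_hours: int) -> np.ndarray:
    """Return the first n_hours rows of a [T, F] stay (fewer if T < n_hours)."""
    return stay[:n_hours]


def sliding_windows(stay: np.ndarray, window: int, stride: int = 1) -> np.ndarray:
    """Return all length-`window` slices of a [T, F] stay as [n, window, F]."""
    starts = range(0, stay.shape[0] - window + 1, stride)
    slices = [stay[s : s + window] for s in starts]
    if not slices:
        # Stay shorter than the window: no complete window exists.
        return np.empty((0, window, stay.shape[1]), dtype=stay.dtype)
    return np.stack(slices)


# Toy stay with 6 hours and 2 features.
stay = np.arange(12, dtype=float).reshape(6, 2)
first_four = fixed_window(stay, n_hours=4)
windows = sliding_windows(stay, window=4, stride=1)
print(first_four.shape, windows.shape)  # (4, 2) (3, 4, 2)
```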
A high-level workflow, spanning SQL cohort extraction, iterative preprocessing and cleaning, concept mapping, binning, aggregation, and export, is provided as annotated pseudocode. Extending it with downstream labels (e.g., mortality) and model targets is straightforward.
5. Extensibility and Customization
MIMIC-Extract is explicitly architected for extensibility with several mechanisms:
- Keyword Arguments: Parameters such as minimum age, duration, choices on grouping, and feature missingness thresholds can be specified without editing code.
- Custom Resource Files: Users may edit itemid_to_variable_map.csv (for new mappings) and variable_ranges.csv (for feature-specific thresholds) to incorporate additional concepts or redefine data cleaning bounds.
- Embedded SQL: The distributed SQL scripts can be cloned and modified to include additional fields such as SOFA score, prescriptions, fluid volumes, and diagnosis-to-phenotype mappings.
- Custom DataFrame Hooks: Custom time-series features can be injected by extending the extractor with additional SQL queries or by merging on downstream tables, enabling integration of ICD diagnostic codes or multimodal data (e.g., free-text notes).
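To make the resource-file mechanism concrete, the sketch below extends a small ItemID-to-concept lookup and applies it to raw events. The column names itemid and concept are simplified assumptions and do not reflect the actual layout of itemid_to_variable_map.csv; the in-memory CSV stands in for the real file.

```python
import io

import pandas as pd

# Stand-in for a mapping file; the real file's columns may differ.
mapping_csv = io.StringIO(
    "itemid,concept\n"
    "211,heart rate\n"      # CareVue heart rate
    "220045,heart rate\n"   # MetaVision heart rate
)
item_map = pd.read_csv(mapping_csv)

# Add a new concept by appending a row, as a user might do by editing the CSV.
extra = pd.DataFrame({"itemid": [50931], "concept": ["glucose"]})
item_map = pd.concat([item_map, extra], ignore_index=True)

# Raw events keyed by ItemID; 999999 is a made-up id with no mapping.
events = pd.DataFrame(
    {"itemid": [211, 220045, 50931, 999999], "value": [80.0, 82.0, 95.0, 1.0]}
)

# An inner merge keeps only events whose ItemID maps to a known concept.
mapped = events.merge(item_map, on="itemid", how="inner")
print(mapped)
```

After the merge, both heart-rate ItemIDs share one concept label, so later grouping by concept pools measurements from both charting systems.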
6. Benchmark Cohort and Reproducibility
Running the default pipeline reproduces a standardized 34,000-patient adult ICU cohort, with well-documented, fully auditable data cleaning and transformation steps. This enables consistent benchmark definition, rapid adaptation to new feature sets, and turnkey integration into modeling frameworks for tasks such as hospital mortality prediction, length of stay forecasting, or intervention prediction.
All source code, SQL, and resource files are available at https://github.com/MLforHealth/MIMIC_Extract, facilitating direct community adoption and reproducibility (Wang et al., 2019).