MIMIC-IV: Open ICU EHR Corpus
- MIMIC-IV is a comprehensive open-source ICU/EHR database with detailed patient demographics, hospital events, and clinical time series from over 383,000 patients.
- Its modular architecture organizes data into core, hospital, and ICU modules, enabling reproducible cohort extraction and versatile clinical research.
- The database underpins diverse predictive modeling tasks, benchmark extraction pipelines, and extensions for NLP and multimodal analysis in translational medicine.
The Medical Information Mart for Intensive Care IV (MIMIC-IV) is the world's largest open-source, de-identified electronic health record (EHR) corpus focused on critical and emergency care, sourced from Beth Israel Deaconess Medical Center spanning 2008–2019. MIMIC-IV provides granular data on over 383,000 patients, comprising ICU encounters, emergency department visits, hospital stays, laboratory results, physiologic time series, intervention records, and clinical narratives. Its modular schema and robust linkage infrastructure underpin a broad spectrum of research in clinical informatics, machine learning, multimodal modeling, and translational medicine. Numerous benchmark datasets and extraction pipelines, including curated extensions (e.g., MIMIC-IV-ED, MIMIC-IV-Note, MIMIC-Sepsis, MIMIC-IV-Ext-22MCTS, MIMIC-IV-Ext-PE), leverage the foundational structure and scale of MIMIC-IV to enable reproducible task definitions, cohort construction, and federated analyses across diverse clinical domains.
1. Data Structure, Modules, and Coverage
MIMIC-IV employs a modular architecture, partitioned into three main modules: core, hosp, and icu. The core module contains patient demographics (PATIENTS), admissions metadata (ADMISSIONS), and keys for cross-linkage (from v2.0 onward these tables are distributed within hosp). The hosp module aggregates hospital-wide events (LABEVENTS, MICROBIOLOGYEVENTS, PROCEDURES_ICD, DIAGNOSES_ICD), including diagnostic codes (ICD-9/ICD-10). The icu module indexes ICU-specific time series via ICUSTAYS, CHARTEVENTS (hourly/categorical bedside data), INPUTEVENTS, OUTPUTEVENTS, and procedural interventions (Bui et al., 2024). Each patient carries a subject_id, each hospital admission a hadm_id, and each ICU stay a stay_id, enabling tracking across units and temporal windows.
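As a rough illustration of cross-table linkage, the sketch below joins ICU stays to their admission and patient records through the identifiers subject_id, hadm_id, and stay_id. The rows are invented toy values, and the helper `link_stays` is hypothetical; real extractions would typically run the equivalent join in SQL or a dataframe library.

```python
from collections import defaultdict

# Toy rows: column names follow MIMIC-IV identifiers, values are invented.
patients = [{"subject_id": 1, "gender": "F", "anchor_age": 64}]
admissions = [
    {"subject_id": 1, "hadm_id": 100, "admittime": "2150-01-01 08:00"},
    {"subject_id": 1, "hadm_id": 101, "admittime": "2150-03-10 14:30"},
]
icustays = [
    {"subject_id": 1, "hadm_id": 100, "stay_id": 9000, "los": 2.5},
    {"subject_id": 1, "hadm_id": 100, "stay_id": 9001, "los": 1.0},
]


def link_stays(patients, admissions, icustays):
    """Attach admission and patient attributes to every ICU stay (hypothetical helper)."""
    patient_by_id = {p["subject_id"]: p for p in patients}
    admission_by_id = {a["hadm_id"]: a for a in admissions}
    linked = []
    for stay in icustays:
        row = dict(stay)
        row.update(admission_by_id[stay["hadm_id"]])    # join on hadm_id
        row.update(patient_by_id[stay["subject_id"]])   # join on subject_id
        linked.append(row)
    return linked


linked = link_stays(patients, admissions, icustays)

# One admission can contain several ICU stays.
stays_per_admission = defaultdict(list)
for row in linked:
    stays_per_admission[row["hadm_id"]].append(row["stay_id"])
```

Here admission 100 maps to two ICU stays, while admission 101 has none and therefore does not appear in the stay-level table.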
Irregularity and sparsity are inherent; measurements can be missing for hours or days, with sampling intervals ranging from sub-minute to multi-day. Time-dependent features (vitals, labs, medications) are supplemented by static attributes (age, sex, comorbidities). Data tables contain NaNs for unrecorded times, with downstream pipelines constructing explicit missingness indicators (Liao et al., 2023). Event data is recorded at granularities ranging from discrete procedure flags to continuous waveform samples (e.g., ECG at up to 500 Hz (Lukyanenko et al., 2024)).
Recent releases (v2.2 and beyond) introduce schema upgrades and additional stays; v3.1, for instance, incorporates ~15% more ICU stays, updated code mappings, and refined linkage (Guo, 12 Jan 2026). Extensions like MIMIC-IV-Note store unstructured clinical narratives and enable new NLP benchmarks, while MIMIC-IV-ED provides structured emergency department episode data (Xie et al., 2021).
2. Benchmark Extraction Pipelines and Reproducible Cohort Construction
A core strength of MIMIC-IV is its support for extensible cohort extraction and reproducible preprocessing, as formalized by published pipelines such as METRE, Gupta et al.'s Data Pipeline, and others (Gupta et al., 2022, Liao et al., 2023). These frameworks modularize extraction into a sequential workflow:
- Cohort selection: Filter by age, ICU length of stay, disease condition (ICD grouping), data completeness, and time-to-event (e.g., CKD cohort in first 48h (Bui et al., 2024), Sepsis-3 definitions (Huang et al., 28 Oct 2025)).
- Static table extraction: Demographics, comorbidities (Charlson, Elixhauser), insurance, admission/discharge metadata, and diagnostic codes.
- Time-dependent feature extraction: Hourly aggregations of vitals, labs, physiological outputs, procedural interventions (e.g., vasopressors, ventilation, fluid balances).
- Handling missingness: Forward-fill, linear interpolation, KNN-imputation, or explicit binary masking; strategies are chosen per variable, and variables with more than 80% missingness are dropped (Huang et al., 28 Oct 2025).
- Outlier removal: Clinical range thresholds per Harutyunyan et al.; winsorization and unit harmonization.
- Custom normalization and splitting: Min-max scaling of continuous features, one-hot/categorical encoding of diagnostics/interventions, standardized train/validation/test splits.
These pipelines output structured static and hourly-aligned data frames (typically in CSV, Parquet, or PostgreSQL table formats), accompanied by reproducibility logs detailing user choices (feature sets, time bin width, exclusion rules) (Gupta et al., 2022). Additional code packages support seamless extension to new phenotypes and clinical endpoints.
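The workflow above can be sketched in a few small functions. This is a minimal illustration under stated assumptions, not the code of METRE or any published pipeline: the function names, the age and length-of-stay cutoffs, and the plausible-range table `RANGES` are all hypothetical, and only the 80% missingness cutoff comes from the text.

```python
def select_cohort(stays, min_age=18, min_los_hours=24):
    """Keep adult stays lasting at least `min_los_hours` (cutoffs are assumptions)."""
    return [s for s in stays if s["age"] >= min_age and s["los_hours"] >= min_los_hours]


def missing_rate(values):
    """Fraction of time bins without a recorded value."""
    return sum(v is None for v in values) / len(values)


def clip_to_range(values, low, high):
    """Clip recorded values to a plausible clinical range; None stays None."""
    return [None if v is None else min(max(v, low), high) for v in values]


def min_max_scale(values):
    """Scale observed values to [0, 1]; fit on training data only in practice."""
    observed = [v for v in values if v is not None]
    lo, hi = min(observed), max(observed)
    span = (hi - lo) or 1.0  # avoid division by zero for constant features
    return [None if v is None else (v - lo) / span for v in values]


stays = [
    {"stay_id": 1, "age": 70, "los_hours": 50},
    {"stay_id": 2, "age": 16, "los_hours": 80},
    {"stay_id": 3, "age": 45, "los_hours": 10},
]
cohort = select_cohort(stays)

# Hourly feature columns for one stay (toy values).
features = {
    "heart_rate": [80, None, 300, 90, 100],
    "lactate": [None, None, None, None, None],
}
RANGES = {"heart_rate": (0, 250), "lactate": (0, 30)}  # hypothetical thresholds

processed = {}
for name, values in features.items():
    if missing_rate(values) > 0.8:  # drop variables with more than 80% missingness
        continue
    low, high = RANGES[name]
    processed[name] = min_max_scale(clip_to_range(values, low, high))
```

Keeping each step as a separate function with explicit parameters makes it straightforward to log the chosen cutoffs alongside the output, which is the kind of record the reproducibility logs are meant to capture.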
3. Task Definitions and Predictive Modeling Benchmarks
MIMIC-IV supports a wide array of clinically relevant supervised and unsupervised learning tasks. Benchmarks are systematically defined, with explicit formulas for labels and metrics:
- Mortality prediction: In-hospital, ICU, or short-term mortality (binary classification), with input windows of first 24–72h post-admission. AUROC, AUPRC, and group-stratified metrics are standard (Bui et al., 2024, Meng et al., 2021).
- Length of stay (LOS) prediction: Regression (remaining LOS at time t) or classification (LOS above a threshold), typically using time series up to t and static covariates. The regression target at time t is formalized as the number of future hours until ICU exit (Rocheteau et al., 2020).
- Readmission and phenotype prediction: Label defined by re-admission within τ days, or diagnosis of a target phenotype at next visit (Gupta et al., 2022).
- Disease-specific modeling: Sepsis (Sepsis-3 criteria, shock onset), delirium (unsupervised clustering, subgroup assignment), PE (binary/ordinal label based on radiology NLP), trajectory subtyping (e.g., delirium subphenotypes via clustering and semi-supervised expansion) (Huang et al., 28 Oct 2025, Zhao et al., 2021, Lam et al., 2024).
- Multimodal/fusion models: Survival modeling with image, text, and structured data (e.g., SAPS-II + CXR + report text in ICU mortality) (Lin et al., 2023).
Modeling frameworks include logistic regression, LSTM, TCN, Transformer encoders, gradient-boosted trees (XGBoost), pointwise convolutional networks, self-attentive tabular models (AutoInt), and modern state-space neural networks (Mamba) (Contreras et al., 2023, Guo, 12 Jan 2026).
Recent works stress the need for task-standardization: time-window alignments, variable selection, label definitions, controlled splits, and external cross-validation (e.g., model transfer from eICU to MIMIC-IV yields small AUC drops for mortality, but larger shifts for ventilation/ARF tasks) (Liao et al., 2023).
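The label definitions for the mortality, LOS, and readmission tasks can be written down compactly. The sketch below is one plausible reading of those definitions, not the exact formulation of any cited benchmark; the function names are hypothetical, and the field names admittime, dischtime, and deathtime are assumed to follow the admissions table.

```python
from datetime import datetime, timedelta


def in_hospital_mortality(admission):
    """1 if death was recorded between admission and discharge, else 0."""
    death = admission["deathtime"]
    return int(death is not None
               and admission["admittime"] <= death <= admission["dischtime"])


def remaining_los_hours(icu_outtime, t):
    """Regression target at time t: hours remaining until ICU exit."""
    return max((icu_outtime - t).total_seconds() / 3600.0, 0.0)


def readmitted_within(dischtime, next_admittime, tau_days):
    """1 if the patient's next admission starts within tau_days of discharge."""
    if next_admittime is None:
        return 0
    gap = next_admittime - dischtime
    return int(timedelta(0) <= gap <= timedelta(days=tau_days))


# Toy example with invented timestamps.
admission = {
    "admittime": datetime(2150, 1, 1, 8),
    "dischtime": datetime(2150, 1, 5, 8),
    "deathtime": None,
}
y_mortality = in_hospital_mortality(admission)
y_los = remaining_los_hours(datetime(2150, 1, 3, 8), datetime(2150, 1, 2, 8))
y_readmit_30 = readmitted_within(admission["dischtime"], datetime(2150, 1, 20), 30)
y_readmit_7 = readmitted_within(admission["dischtime"], datetime(2150, 1, 20), 7)
```

Writing labels as pure functions of timestamps keeps the observation window and the prediction horizon explicit, which is where benchmark definitions most often diverge.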
4. Handling Irregular Time Series, Missingness, and Feature Engineering
MIMIC-IV's event-driven irregularity and high missingness motivate sophisticated preprocessing and encoding strategies. Common approaches include:
- Aggregation: Resample events to hourly or multi-hour bins (e.g., 1h, 2h, 4h, 6h windows), track last/mean/min/max measurement per bin (Huang et al., 28 Oct 2025, Rocheteau et al., 2020).
- Missing data: Explicit binary mask channels (recorded/not), decay indicators (encoding the hours since the last observation), or transformer-style [MASK] tokens for MLVM (Santos et al., 26 Feb 2025).
- Static vs dynamic fusion: Concatenate static features (demographics, comorbidities) with time-varying representations; embed categorical codes using BioBERT/BlueBERT embeddings for semantic richness (Santos et al., 26 Feb 2025, Lin et al., 2023).
- Forward-filling and imputation: Forward-fill within stays, global mean imputation for residuals, KNN-imputation for moderate missingness (Huang et al., 28 Oct 2025).
- Outlier removal and normalization: Thresholded clipping, unit harmonization, min-max or z-score scaling, winsorization at empirical percentiles.
- Temporal encoding: Position/time embeddings for transformers; summary statistics for tabular models; event triplets (time, code, value) for state-space models (Contreras et al., 2023, Guo, 12 Jan 2026).
Feature engineering pipelines track and audit all choices for reproducibility and support rapid prototyping of new extraction runs, cohort definitions, and modeling tasks (Gupta et al., 2022).
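A minimal sketch of the aggregation and missingness encodings listed above: irregular events are binned by hour keeping the last value per bin, then a binary mask, a time-since-last-observation channel, and a forward-filled series are derived. The function names and the convention of counting elapsed time from the start of the stay before the first observation are assumptions made for illustration.

```python
def bin_hourly(events, n_hours, bin_width=1):
    """Keep the last recorded value per bin; empty bins are None.

    `events` is a list of (hours_since_admission, value) tuples.
    """
    n_bins = n_hours // bin_width
    bins = [None] * n_bins
    for t, value in sorted(events):
        idx = int(t // bin_width)
        if 0 <= idx < n_bins:
            bins[idx] = value  # later events overwrite earlier ones in the same bin
    return bins


def mask_delta_ffill(bins, bin_width=1):
    """Return (mask, hours since last observation, forward-filled values)."""
    mask, delta, filled = [], [], []
    last_value = None
    elapsed = 0.0  # hours since the last observed bin (or since stay start)
    for value in bins:
        if value is None:
            elapsed += bin_width
            mask.append(0)
        else:
            elapsed = 0.0
            last_value = value
            mask.append(1)
        delta.append(elapsed)
        filled.append(last_value)  # stays None until the first observation
    return mask, delta, filled


# Toy heart-rate events at irregular times (hours since admission).
events = [(0.2, 80), (0.7, 84), (3.5, 90)]
bins = bin_hourly(events, n_hours=6)
mask, delta, filled = mask_delta_ffill(bins)
```

Values that remain None after forward-filling (bins before the first observation) would still need a fallback such as a global mean, as noted in the imputation bullet.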
5. Model Evaluation, Interpretability, and Fairness Audits
Evaluation across tasks follows rigorous conventions. Split strategies (pre-registered, patient-level), cross-validation, and bootstrapping underpin robust metric reporting. Key metrics include:
- Discrimination and error: AUROC, AUPRC, C-index for survival analysis, mean absolute deviation/error for regression, Cohen's kappa for categorical prediction (Bui et al., 2024, Rocheteau et al., 2020, Lin et al., 2023).
- Calibration: Expected Calibration Error (ECE), Maximum Calibration Error (MCE) in evaluation scripts (Gupta et al., 2022).
- Fairness: Group-based AUROC (macro/min/minority), demographic parity, equalized odds/opportunity gaps. Fairness audits demonstrate variable treatment and prediction disparities (e.g., under-prescription of mechanical ventilation for Black patients), stratified model performance (IMV-LSTM yields smallest gap in AUC across groups), and recommendations for reweighting/debiasing protocols (Meng et al., 2021).
- Interpretability: Analysis of feature importance by perturbation (ArchDetect, SHAP), gradient-based methods (Integrated Gradients, DeepLift), and model-intrinsic signals (attention weights for IMV-LSTM, transformer layers). Cross-checking against known clinical factors (e.g., SAPS-II components, laboratory markers) is routine (Meng et al., 2021).
Downstream ablation studies and modality-specific analyses elucidate the contribution of interventions (medications, procedures) versus context (demographics, vitals), showing that the core signal arises from active treatments and physiological responses (Guo, 12 Jan 2026). Transparency in modeling, down to logging SQL definitions for queries or time-to-cohort generation, supports reproducibility and validation (Attrach et al., 27 Jun 2025).
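To make the discrimination, calibration, and fairness metrics concrete, the sketch below computes AUROC from pairwise comparisons, a binary Expected Calibration Error over equal-width probability bins, and the gap in AUROC across groups. It is a plain-Python illustration on invented data; the function names are hypothetical, and evaluation scripts in published pipelines may bin or weight differently.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def expected_calibration_error(labels, probs, n_bins=10):
    """Weighted mean |observed rate - mean predicted probability| over bins."""
    total = len(labels)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        last = b == n_bins - 1
        idx = [i for i, p in enumerate(probs) if lo <= p < hi or (last and p == 1.0)]
        if not idx:
            continue
        confidence = sum(probs[i] for i in idx) / len(idx)
        observed = sum(labels[i] for i in idx) / len(idx)
        ece += len(idx) / total * abs(observed - confidence)
    return ece


def group_auroc_gap(labels, scores, groups):
    """Per-group AUROC and the max-min gap across groups."""
    per_group = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        per_group[g] = auroc([labels[i] for i in idx], [scores[i] for i in idx])
    return per_group, max(per_group.values()) - min(per_group.values())


# Invented predictions for two demographic groups.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.8, 0.3, 0.6, 0.7, 0.4, 0.1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

overall = auroc(labels, scores)
per_group, gap = group_auroc_gap(labels, scores, groups)
ece = expected_calibration_error([1, 0], [0.9, 0.1])
```

In this toy case the overall AUROC looks reasonable while one group sits at chance level, which is the kind of disparity a group-stratified audit is designed to surface.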
6. Extensions, NLP Tasks, and Semantically-Enriched Datasets
MIMIC-IV underpins a rapidly growing set of publicly released extensions:
- MIMIC-IV-Note and MIMIC-IV-Ext-22MCTS: Clinical event extraction from discharge summaries using LLM-driven annotation, contextual chunking (BM25, BGE semantic search), and relative timestamp assignment via Llama-3.1-8B. Output: 22.6M event|time pairs across 267,284 notes, with timestamps grouped into temporal bins for classification (Wang et al., 1 May 2025).
- Temporal NLP benchmarks: Fine-tuning BERT/GPT-2 on temporally-annotated event sequences yields up to +10% accuracy in PubMedQA and +3% NDCG in TREC clinical trial matching, outperforming text-only baselines (Wang et al., 1 May 2025). Event correlation classification leverages timestamped pairs for consequence/antecedent inference.
- PE phenotype dataset (MIMIC-IV-Ext-PE): Radiology report NLP with VTE-BERT achieves 92% sensitivity, 88% PPV for acute PE detection in 19,942 CTPA studies, external validation by manual adjudication, and multi-source labeling integration (Lam et al., 2024).
- Conversational LLM-access (M3): Model Context Protocol enables schema-inspected translation from natural language to SQL, returning precise cohort counts and structured results with a ~93% reduction in analyst time and >90% accuracy (Attrach et al., 27 Jun 2025).
- Sepsis trajectories: MIMIC-Sepsis builds a cohort of 35,239 patients aligned by Sepsis-3, with timed interventions, and supports dynamic (shock, vasopressor, LOS) and static (mortality) benchmarking (Huang et al., 28 Oct 2025).
These extensions demonstrate how MIMIC-IV's structure supports semantic, temporal, and treatment-focused modeling, overcoming the limitations of pure tabular approaches.
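The schema-inspected, SQL-based cohort counting described for conversational access can be illustrated with an in-memory SQLite database. This is only a sketch of the general pattern (inspect the schema, run a parameterized query, log the SQL for audit); it is not M3's implementation, the helper names are hypothetical, and the table holds invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE icustays (subject_id INTEGER, hadm_id INTEGER, stay_id INTEGER, los REAL)"
)
conn.executemany(
    "INSERT INTO icustays VALUES (?, ?, ?, ?)",
    [(1, 100, 9000, 2.5), (1, 100, 9001, 0.5), (2, 200, 9002, 4.0)],
)


def table_columns(conn, table):
    """Inspect the schema so generated SQL only references existing columns."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


query_log = []  # audit trail of every executed statement


def run_logged(conn, sql, params=()):
    """Execute a parameterized query and record it for reproducibility."""
    query_log.append((sql, params))
    return conn.execute(sql, params).fetchall()


columns = table_columns(conn, "icustays")

# Cohort count: distinct patients with at least one ICU stay of >= 1 day.
COHORT_SQL = "SELECT COUNT(DISTINCT subject_id) FROM icustays WHERE los >= ?"
n_patients = run_logged(conn, COHORT_SQL, (1.0,))[0][0]
```

Keeping the executed SQL in a log means a reported cohort count can always be traced back to the exact query that produced it.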
7. Public Impact, Limitations, and Research Directions
MIMIC-IV is recognized as the canonical open-source ICU/EHR resource for model development, benchmarking, interpretability, fairness analysis, multimodal fusion, and NLP innovation. Publicly available extraction and evaluation codebases (METRE, Gupta pipeline, ICU-BERT, DT-ICU, APRICOT-M, Sepsis, Note/Ext-MCTS) advance the field's transparency and generalizability. Validation across external databases (eICU, UFH), cohort harmonization, and federated benchmarking test transferability and expose the impact of center-specific practice patterns (Liao et al., 2023, Contreras et al., 2023).
Challenges remain in variable mapping, irregularity handling, outlier documentation, and rare event calibration. Single-center bias (Beth Israel only) partially limits demographic and practice generalizability, though external validations demonstrate small AUC drops for core tasks and highlight areas where feature harmonization is critical (Contreras et al., 2023). Missingness thresholding and minimum event coverage filters are necessary to maintain high-quality model input (Guo, 12 Jan 2026).
Future directions include continuous updating of schema, deeper multimodal fusion (text–image–signal–tabular), expansion to federated learning, integration with OMOP for scalable variable mapping, and leaderboard-style benchmarking for progress tracking. Reproducibility, fairness, and open-data standards are consistently emphasized across published works. MIMIC-IV remains indispensable for method development in clinical AI, EHR processing, and large-scale biological signal modeling.