MIMIC-III: ICU EHR Database

Updated 16 May 2026

MIMIC-III is a comprehensive, de-identified electronic health record database from ICU patients, providing detailed physiological measurements, lab results, and free-text clinical notes.
It supports multi-modal analytics by integrating structured, unstructured, and semi-structured data, enabling robust cohort definition and predictive modeling.
Benchmark tasks using MIMIC-III include mortality prediction, phenotyping, and ICD coding, which advance machine learning and clinical decision-making research.

The Medical Information Mart for Intensive Care III (MIMIC-III) is a large-scale, de-identified database comprising electronic health records (EHR) from intensive care unit (ICU) patients at Beth Israel Deaconess Medical Center between 2001 and 2012. Widely adopted in clinical informatics, machine learning, and natural language processing research, MIMIC-III provides detailed physiological measurements, laboratory results, interventions, diagnoses, free-text notes, and outcome data, supporting a broad range of tasks from mortality prediction to phenotyping and knowledge graph construction.

1. Structure, Content, and Cohort Definition

MIMIC-III v1.4 contains 46,520 unique patients, 58,976 hospital admissions, and 61,532 ICU stays spanning 2001–2012. The database incorporates both adult and neonatal ICU encounters, though most studies restrict analyses to adults (age ≥ 16). Key tables include ADMISSIONS, ICUSTAYS, CHARTEVENTS, LABEVENTS, OUTPUTEVENTS, SERVICES, DIAGNOSES_ICD, NOTEEVENTS (free-text notes), and PRESCRIPTIONS, among others (Wang et al., 2020).

A prototypical cohort definition excludes patients under 16 (including neonates) and readmissions, and restricts to first ICU stays of sufficient duration (e.g., >24 or >48 h), so that each patient contributes a single stay with enough recorded data and repeat stays cannot leak information between training and test sets (Wang et al., 2020, Nallabasannagari et al., 2020). The adult in-hospital mortality rate is approximately 11.5%, and the median age is 65.8 years.
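The selection logic above can be sketched in a few lines. This is a minimal illustration, not the pipeline of any cited study: the record keys (`subject_id`, `icustay_id`, `intime`, `outtime`, `age`) are hypothetical, and `age` is assumed to be precomputed. The sketch picks each patient's first stay before filtering, so a patient whose first stay is too short is dropped rather than replaced by a later stay.

```python
from datetime import datetime, timedelta


def select_cohort(stays, min_age=16, min_los_hours=48):
    """Keep each patient's first ICU stay if it meets the age and duration criteria.

    `stays` is a list of dicts with hypothetical keys:
    subject_id, icustay_id, intime, outtime (datetimes), age (years).
    """
    # Step 1: find the earliest ICU stay per patient (drops readmissions).
    first_stay = {}
    for stay in stays:
        current = first_stay.get(stay["subject_id"])
        if current is None or stay["intime"] < current["intime"]:
            first_stay[stay["subject_id"]] = stay

    # Step 2: apply the age and length-of-stay filters to those first stays.
    min_los = timedelta(hours=min_los_hours)
    cohort = [
        s for s in first_stay.values()
        if s["age"] >= min_age and (s["outtime"] - s["intime"]) >= min_los
    ]
    return sorted(cohort, key=lambda s: s["subject_id"])


t0 = datetime(2101, 1, 1)
stays = [
    {"subject_id": 1, "icustay_id": 10, "intime": t0,
     "outtime": t0 + timedelta(hours=72), "age": 70},
    {"subject_id": 1, "icustay_id": 11, "intime": t0 + timedelta(days=30),
     "outtime": t0 + timedelta(days=35), "age": 70},   # readmission
    {"subject_id": 2, "icustay_id": 20, "intime": t0,
     "outtime": t0 + timedelta(hours=12), "age": 55},  # stay too short
    {"subject_id": 3, "icustay_id": 30, "intime": t0,
     "outtime": t0 + timedelta(hours=96), "age": 0},   # neonate
]
cohort_ids = [s["icustay_id"] for s in select_cohort(stays)]
print(cohort_ids)  # [10]
```

Changing `min_los_hours` reproduces the 24 h versus 48 h variants; the resulting cohort sizes depend on each study's exact criteria.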

Across studies, cohort sizes vary with inclusion criteria:

38,418 adult first long (≥48 h) ICU stays (Wang et al., 2020)
35,348 unique patients with single ICU admission ≥24 h (Nallabasannagari et al., 2020)
35,627 adult first ICU admissions surviving ≥24 h (Purushotham et al., 2017)
Specific smaller phenotype cohorts (e.g., 3,301 septic patients for sepsis-AKI prediction (Roknaldin et al., 2024))

2. Data Modalities, Preprocessing, and Feature Engineering

MIMIC-III supports multi-modal analytics, encompassing:

Structured: Time-stamped vital signs, laboratory values, medication administrations, interventions
Unstructured: Clinical free-text notes (nursing/physician/respiratory/radiology/discharge summaries)
Semi-structured: Diagnosis/procedure code fields

Preprocessing establishes standardized, analyzable forms:

Temporal alignment: Time series resampled to fixed intervals (typically 1-h bins for 24–48 h) to address irregular sampling; forward- and backward-filling or population mean imputation for missing values (Wang et al., 2020, Purushotham et al., 2017).
Normalization: Physiological units are harmonized (e.g., all temperatures in Fahrenheit), and variables standardized for neural models (Wang et al., 2020, Purushotham et al., 2017).
Missing data: "NaN" tokens for categorical models, imputation for regression/classification (Nallabasannagari et al., 2020). For critical event sequences, some studies skip interpolation altogether and instead add explicit missingness indicators (binary flags) (Kuo et al., 2021).
Text processing: Removal of identifiers, tokenization, stemming, stop-word elimination, bag-of-words/TF-IDF representation or embeddings for NLP (Li et al., 2018).
Feature sets: Studies use curated (e.g., SAPS-II 17 features), raw (e.g., 135 time-stamped variables), or task-optimized subsets (e.g., 23 correlating features for AKI prediction) (Wang et al., 2020, Purushotham et al., 2017, Roknaldin et al., 2024).
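The temporal alignment and missing-data steps listed above can be illustrated with a small sketch. It assumes a single variable given as hypothetical `(hours_since_admission, value)` pairs, averages measurements within each 1-hour bin, then applies forward fill, backward fill, and a population-mean fallback, returning a binary mask of which bins were actually observed. The cited studies differ in the details (aggregation function, fill order, window length), so this is one plausible arrangement rather than a reference implementation.

```python
import math
from statistics import mean


def bin_hourly(events, n_hours):
    """Average (hours_since_admission, value) measurements into 1-hour bins."""
    buckets = [[] for _ in range(n_hours)]
    for hours, value in events:
        idx = math.floor(hours)  # floor, so negative offsets are not mapped to bin 0
        if 0 <= idx < n_hours:
            buckets[idx].append(value)
    return [mean(b) if b else None for b in buckets]


def impute(series, population_mean):
    """Forward-fill, then backward-fill, then fall back to a population mean.

    Returns (filled_values, mask) where mask[i] == 1 if bin i was observed.
    """
    mask = [0 if v is None else 1 for v in series]
    filled = list(series)

    last = None
    for i, v in enumerate(filled):            # forward fill
        if v is None:
            filled[i] = last
        else:
            last = v

    nxt = None
    for i in range(len(filled) - 1, -1, -1):  # backward fill for leading gaps
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]

    # A variable never measured for this stay gets the population mean.
    filled = [population_mean if v is None else v for v in filled]
    return filled, mask


heart_rate = [(1.2, 80.0), (1.7, 90.0), (3.5, 100.0)]
binned = bin_hourly(heart_rate, n_hours=5)
filled, mask = impute(binned, population_mean=85.0)
print(binned)  # [None, 85.0, None, 100.0, None]
print(filled)  # [85.0, 85.0, 85.0, 100.0, 100.0]
print(mask)    # [0, 1, 0, 1, 0]
```

Keeping `mask` alongside the filled values lets a downstream model distinguish measured from imputed bins, which is the role of the missingness indicators mentioned above.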

3. Benchmark Tasks and Machine Learning Methodologies

MIMIC-III is the canonical dataset for benchmarking clinical event prediction, phenotype classification, and medical code assignment:

Classification and Forecasting

Task	Metric(s)	Common Models
In-hospital mortality	AUROC, PR-AUC	LSTM/GRU RNNs, Logistic Regression, Deep Feed-forward Nets
Length-of-stay (LOS)	AUROC, PR-AUC, MSE	Same as above
ICD code assignment	Micro/Macro-F₁	CNN, BERT, Attention, Multimodal Ensembles
Phenotyping	AUROC, Micro-F₁	Transformers, Semi-supervised Encoders
AKI prediction	AUROC, F₁	Logistic Regression, SVM, Random Forest, LightGBM
RL-based treatment	Accuracy, WIS	Batch-constrained Q-learning, CDEs

Deep Sequence Models: Three-layer LSTM architectures on fixed-length binned time series model temporal dependencies in event-driven variables (e.g., hourly vital signs and labs), yielding improved early warning for mortality and clinical deterioration (Wang et al., 2020). Studies consistently report that recurrent neural networks outperform logistic regression baselines by modest but reproducible margins (e.g., RNN-LSTM AUROC 0.600 vs. logistic regression AUROC 0.560 for ICU mortality (Wang et al., 2020)).
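Because the comparisons above are stated in AUROC, it may help to recall what the metric measures: the probability that a randomly chosen positive case (e.g., a patient who died) receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition. The quadratic pairwise loop is chosen for clarity, not speed, and the labels and scores are made-up toy values.

```python
def auroc(labels, scores):
    """AUROC as P(score of random positive > score of random negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.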

Multimodal and Multitask Learning: Joint embeddings of structured (time series) and unstructured (notes) data via dual Transformer encoders, contrastive pretraining, and masked token prediction yield superior performance, especially with few labels (King et al., 2023). Self-supervised and semi-supervised learning approaches show AUROC gains for mortality and phenotyping when ≤10% of outcome labels are available.

ICD Coding: Document-to-sequence BERT architectures with sequence-level attention, applied to discharge summaries (truncated to 2,500 tokens), advance the state-of-the-art in multi-label ICD-9 classification, surpassing strong CNN-attention baselines in macro-/micro-F₁ (Heo et al., 2021). Multimodal ensembling (unstructured, structured, and semi-structured models) further boosts ICD-10 coding accuracy (e.g., micro-F₁ 0.7633, micro-AUC 0.9541) (Xu et al., 2018).
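ICD coding is a multi-label problem, so the micro- and macro-F₁ figures above can differ substantially: micro-F₁ pools true positives, false positives, and false negatives over all codes (so frequent codes dominate), whereas macro-F₁ averages per-code F₁ scores (so rare codes count equally). The following sketch computes both from per-admission code sets; the admissions and predictions are invented for illustration.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(true_sets, pred_sets):
    """Return (micro_f1, macro_f1) for multi-label code assignment."""
    labels = sorted(set().union(*true_sets, *pred_sets))
    if not labels:
        return 0.0, 0.0
    counts = {c: [0, 0, 0] for c in labels}  # [tp, fp, fn] per code
    for truth, pred in zip(true_sets, pred_sets):
        for c in labels:
            if c in pred and c in truth:
                counts[c][0] += 1
            elif c in pred:
                counts[c][1] += 1
            elif c in truth:
                counts[c][2] += 1
    # Macro: average the per-code F1 scores.
    macro = sum(f1(*counts[c]) for c in labels) / len(labels)
    # Micro: pool the counts across codes, then compute a single F1.
    tp, fp, fn = (sum(col) for col in zip(*counts.values()))
    return f1(tp, fp, fn), macro


true_codes = [{"401.9", "428.0"}, {"401.9"}, {"250.00"}]
pred_codes = [{"401.9"}, {"401.9", "428.0"}, set()]
micro, macro = micro_macro_f1(true_codes, pred_codes)
print(round(micro, 4), round(macro, 4))  # 0.5714 0.3333
```

In this toy case the frequent code is predicted perfectly while the two rarer codes are missed, so micro-F₁ (4/7) is well above macro-F₁ (1/3).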

4. Generation of Synthetic and Linked Data Resources

MIMIC-III has enabled the development of derivative resources:

Synthetic datasets: GAN-generated synthetic time series for acute hypotension and sepsis, constructed from MIMIC-III binned data, are published under privacy-preserving licenses. Stringent privacy audits (minimum Euclidean distance, record linkage risk ≤0.045%) are used to establish low re-identification risk. The released datasets are intended to facilitate offline reinforcement learning research with fewer legal and ethical barriers than the source records (Kuo et al., 2021).
Knowledge graphs: The PDD Graph is constructed by mining MIMIC-III for patients, prescriptions, and diagnoses, entity-linking these concepts to DrugBank and ICD-9 ontologies via statistical translation models and explicit constraint matching. This facilitates federated SPARQL queries, e.g., for treatment-outcome associations and adverse drug interactions (Wang et al., 2017).
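One of the privacy checks named above, the minimum Euclidean distance, can be sketched as follows: for each synthetic record, find the distance to its nearest real record, and treat a distance of zero as an exact copy of a real patient. This is only an illustration of the idea on made-up two-dimensional records; the audit procedure and thresholds actually used by Kuo et al. are not specified here and may differ.

```python
import math


def min_distance_to_real(synthetic, real):
    """For each synthetic record, the Euclidean distance to its nearest real record."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]


# Toy feature vectors (hypothetical, already on a common scale).
real_records = [(0.0, 0.0), (3.0, 4.0)]
synthetic_records = [(0.0, 0.0), (6.0, 8.0)]

distances = min_distance_to_real(synthetic_records, real_records)
n_copies = sum(1 for d in distances if d == 0.0)
print(distances)  # [0.0, 5.0]
print(n_copies)   # 1
```

In practice the features would be standardized first so that no single variable dominates the distance, and the brute-force search would be replaced by a nearest-neighbor index for large datasets.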

5. Challenges in Coding, Annotation, and Validation

Systematic errors in manual ICD code annotation, notably under-coding of prevalent diagnoses and chronic conditions, challenge the integrity of downstream benchmarking and training:

For the top 10 most common codes, under-coding rates up to 35% are observed, with ~16% under-coding for “Unspecified essential hypertension” (ICD-9 401.9) (Searle et al., 2020).
A semi-supervised named-entity recognition and linking strategy combining MedCAT and manual annotation creates a high-precision "silver standard" for ICD codes. Manual review (κ = 0.85) reveals that longer and more complex admissions are systematically more heavily under-coded (Searle et al., 2020).
Uncritical use of MIMIC-III's original codes as gold standards is discouraged; proper validation and augmentation protocols are required to reduce label noise in clinical NLP research.
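To make the under-coding figures above concrete, the sketch below computes a per-code under-coding rate under one assumed definition: among admissions where a silver-standard annotation says the code applies, the fraction whose originally assigned codes omit it. The definition, the admission IDs, and the code sets are all illustrative assumptions and are not taken from Searle et al.

```python
def undercoding_rate(assigned, silver, code):
    """Share of admissions where `code` is in the silver standard but was not assigned.

    `assigned` and `silver` map a (hypothetical) admission ID to a set of ICD-9 codes.
    """
    expected = [adm for adm, codes in silver.items() if code in codes]
    if not expected:
        return 0.0
    missed = sum(1 for adm in expected if code not in assigned.get(adm, set()))
    return missed / len(expected)


assigned_codes = {1: {"401.9"}, 2: {"428.0"}, 3: set(), 4: {"401.9"}}
silver_codes = {
    1: {"401.9"},
    2: {"401.9", "428.0"},
    3: {"401.9"},
    4: {"401.9"},
}
rate = undercoding_rate(assigned_codes, silver_codes, "401.9")
print(rate)  # 0.5
```

A rate computed this way is only as reliable as the silver standard itself, which is why the manual review step described above matters.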

6. Practical Impact, Limitations, and Future Directions

MIMIC-III is established as the benchmark critical care database for academic research, with unique advantages stemming from its fine-grained, longitudinal, and de-identified nature. Studies repeatedly demonstrate the value of:

Multi-source modeling: Integrating all structured and unstructured EHR modalities yields significant AUC/PR-AUC improvements over single-source models for mortality, LOS, and critical event prediction (Nallabasannagari et al., 2020).
Temporal sequence modeling: LSTM/GRU-based architectures outperform static-feature models, especially for early detection of adverse outcomes (Wang et al., 2020, Purushotham et al., 2017).
Multimodality and semi-supervised methods: These approaches provide substantial gains under label scarcity and improve clinical interpretability (King et al., 2023, Xu et al., 2018).

However, limitations persist:

MIMIC-III is single-center (Beth Israel Deaconess), potentially limiting generalizability and introducing biases related to population composition and ICU practice patterns (Wang et al., 2020, Roknaldin et al., 2024).
Many studies exclude available high-frequency modalities (e.g., bedside waveforms) due to technical challenges.
Annotation errors, coding discrepancies, and missing data require robust pre-processing pipelines.
Most outcome models remain association-based or predictive, with causal discovery, transfer learning for individualized treatment, and prospective validation still emerging (Wang et al., 2025, Wang et al., 2022).

Future work will necessitate external validation on multi-institutional datasets, incorporation of additional temporal and clinical context (e.g., ventilator settings, continuous infusions), and greater model transparency for clinical integration. Synthetic datasets and linked knowledge graphs will further catalyze methodological development in reinforcement learning, ontology discovery, and large-scale biomedical informatics (Kuo et al., 2021, Wang et al., 2017).