MIMIC-IV-Note Clinical Notes Corpus

Updated 29 January 2026

MIMIC-IV-Note is a comprehensive corpus of de-identified clinical notes from ICU, ward, and clinic settings, widely used in clinical NLP research.
The dataset underpins methods for temporal event extraction and semantic analysis, facilitating robust risk prediction and automated text generation.
Integration of unstructured notes with structured EHR data has demonstrated significant improvements in prognostic modeling and clinical decision support.

MIMIC-IV-Note is the unstructured clinical free-text corpus released as part of the widely used MIMIC-IV electronic health record dataset. Assembled from real ICU, ward, and clinic settings (2008–2019), MIMIC-IV-Note provides the noteevents table containing ~400,000 de-identified notes, including discharge summaries, progress notes, radiology reports, and nursing entries, all linked to the corresponding structured data via patient and admission identifiers. It serves as a foundational resource for NLP, information extraction, risk prediction, cohort phenotyping, and clinical decision support tasks, and is central to recent advances in temporal event extraction, semantic understanding, and automated text generation in healthcare.

1. Structure and Content of MIMIC-IV-Note

MIMIC-IV-Note was designed to capture a broad spectrum of clinical documentation, providing timestamped, multi-type free-text notes mapped to patient admissions (subject_id, hadm_id, stay_id). Note types include discharge summaries (rich longitudinal episodes), progress notes (daily updates), radiology reports, and specialized entries from nursing or physician teams (Noori et al., 3 Sep 2025). The discharge summaries are notably lengthy, averaging >1,775 words per entry (Damm et al., 2024), and contain multi-section narratives covering admission events, clinical course, interventions, and discharge recommendations. Sections are not strictly templated, leading to significant heterogeneity in length, vocabulary, and contextual density. All notes are de-identified, preserving privacy via standardized PHI placeholders.
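
As a minimal sketch of the linkage described above, the following groups notes by the (subject_id, hadm_id) pair and attaches them to admission records. The rows are invented toy data, and the fields note_id, note_type, and admittime are illustrative assumptions rather than a statement of the released schema.

```python
from collections import defaultdict

# Toy rows standing in for note and admission records (all values invented).
notes = [
    {"note_id": "n1", "subject_id": 10, "hadm_id": 100, "note_type": "discharge"},
    {"note_id": "n2", "subject_id": 10, "hadm_id": 100, "note_type": "radiology"},
    {"note_id": "n3", "subject_id": 11, "hadm_id": 200, "note_type": "discharge"},
]
admissions = [
    {"subject_id": 10, "hadm_id": 100, "admittime": "2150-01-01 08:00"},
    {"subject_id": 11, "hadm_id": 200, "admittime": "2151-03-04 12:30"},
    {"subject_id": 12, "hadm_id": 300, "admittime": "2152-06-07 09:15"},
]


def group_notes_by_admission(note_rows):
    """Index notes by the (subject_id, hadm_id) key shared with structured data."""
    grouped = defaultdict(list)
    for note in note_rows:
        grouped[(note["subject_id"], note["hadm_id"])].append(note)
    return grouped


def attach_notes(admission_rows, note_rows):
    """Return admissions with a 'notes' list; admissions without notes get []."""
    grouped = group_notes_by_admission(note_rows)
    return [
        {**adm, "notes": grouped.get((adm["subject_id"], adm["hadm_id"]), [])}
        for adm in admission_rows
    ]


linked = attach_notes(admissions, notes)
```

Keeping admissions that have no notes (with an empty list) mirrors a left join, which avoids silently dropping patients from a cohort.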

2. Temporal Clinical Event Extraction

The challenge of transforming MIMIC-IV-Note into usable temporal clinical time series data is addressed by the MIMIC-4-Ext-22MCTS pipeline (Wang et al., 1 May 2025). This process involves:

Chunking: Each discharge summary is tokenized, partitioned into short windows (≤5 tokens), with 10-token pre- and post-context, resulting in 200–400 contextualized chunks per summary.
BM25 and Semantic Retrieval: The “Brief Hospital Course” (BHC) is used as the query. Each chunk’s relevance is scored using BM25:

$\mathrm{score}_{BM25}(q,d) = \sum_{i=1}^{|q|} \mathrm{IDF}(q_i) \frac{f(q_i,d)(k_1+1)}{f(q_i,d) + k_1 (1-b + b|d|/\mathrm{avgdl})}$

as well as semantic retrieval using cosine similarity (embeddings with BAAI/bge-large-en), retaining chunks with $\mathrm{sim}(q,d)\ge0.75$.

LLM Verification: Candidate chunks are processed via Llama-3.1-8B with a prompt guiding both event extraction and temporal inference, assigning timestamps as integer hours relative to admission ($\Delta t = t_{\mathrm{event}} - t_{\mathrm{admission}}$, 0 = admission).
The result is MIMIC-4-Ext-22MCTS: 22,588,586 event-timestamp pairs, each entry annotated with textual event, relative time, and discrete time bins.
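
The chunking and BM25 steps can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: whitespace tokenization is assumed, the IDF term uses the common smoothed form $\log(1 + (N - n + 0.5)/(n + 0.5))$ because the text does not define it, and $k_1 = 1.5$, $b = 0.75$ are conventional defaults rather than reported settings.

```python
import math
from collections import Counter


def chunk_with_context(tokens, window=5, context=10):
    """Split tokens into windows of at most `window` tokens, each padded with
    up to `context` tokens before and after (clipped at the note boundaries)."""
    chunks = []
    for start in range(0, len(tokens), window):
        end = min(start + window, len(tokens))
        lo = max(0, start - context)
        hi = min(len(tokens), end + context)
        chunks.append(tokens[lo:hi])
    return chunks


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the tokenized query with BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Assumed smoothed IDF; always non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            denom = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

In the described pipeline the query would be the tokenized BHC and the documents the contextualized chunks; the semantic filter ($\mathrm{sim}(q,d)\ge0.75$) would be applied separately with an embedding model.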

This dataset explicitly overcomes the lack of timestamps and excessive length of MIMIC-IV-Note, enabling timeline-aware modeling for risk prediction and trajectory analysis (Wang et al., 1 May 2025).
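
To make the relative-time convention concrete, the sketch below computes $\Delta t$ in integer hours from two absolute datetimes. In the pipeline the offsets are inferred from text by the LLM, so this only illustrates the definition; rounding down to the hour (so that pre-admission events become negative) is an assumption.

```python
import math
from datetime import datetime


def relative_hours(event_time, admission_time):
    """Integer hours from admission to event; 0 means the admission hour.
    Flooring is an assumed convention, the source only says 'integer hours'."""
    delta = event_time - admission_time
    return math.floor(delta.total_seconds() / 3600)


admission = datetime(2150, 1, 1, 8, 0)
later = relative_hours(datetime(2150, 1, 2, 10, 30), admission)
earlier = relative_hours(datetime(2150, 1, 1, 6, 30), admission)
```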

3. Semantic Analysis and Clinical Concept Relationships

Noori et al. leverage MIMIC-IV-Note for SNOMED CT concept mapping, subsetting concepts identified by clinical NER (e.g., cTAKES/MedCAT), and exploring latent semantic and statistical relationships (Noori et al., 3 Sep 2025). Methods include:

Co-occurrence Statistics: Pointwise Mutual Information (PMI) and normalized PMI (NPMI):

$\mathrm{PMI}(c_i, c_j) = \log \frac{p(c_i, c_j)}{p(c_i) p(c_j)}$

$\mathrm{NPMI}(c_i, c_j) = \frac{\mathrm{PMI}(c_i, c_j)}{-\log p(c_i, c_j)}$

quantifying the association strength between concepts within notes.

Embedding-Based Similarity: Concepts are embedded with ClinicalBERT and BioBERT. Cosine similarity in embedding space is computed for all concepts.
Correlation Analysis: Co-occurrence and semantic proximity show a weak but positive Spearman correlation ($\rho\approx0.24$ early, $0.36$ late), indicating that embeddings uncover clinically meaningful but rarely co-occurring concept associations.
Clustering: K-means applied to embeddings yields interpretable groups (symptoms, labs, chronic conditions, cardiovascular disease), which stratify patient phenotypes, showing outcome and resource utilization differences.
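
The PMI and NPMI definitions above can be worked through on a toy example. The sketch assumes probabilities are estimated as note-level frequencies (a concept counts once per note), which the text does not specify; the concept names are invented.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_npmi(notes_concepts):
    """Return {(c_i, c_j): (PMI, NPMI)} for every concept pair that co-occurs
    in at least one note, using note-level frequency estimates."""
    n = len(notes_concepts)
    single = Counter()
    pair = Counter()
    for concepts in notes_concepts:
        uniq = sorted(set(concepts))
        single.update(uniq)
        pair.update(combinations(uniq, 2))
    out = {}
    for (a, b), c_ab in pair.items():
        p_ab = c_ab / n
        pmi = math.log(p_ab / ((single[a] / n) * (single[b] / n)))
        # When the pair appears in every note, -log p is 0; use the limit value 1.
        npmi = pmi / -math.log(p_ab) if p_ab < 1.0 else 1.0
        out[(a, b)] = (pmi, npmi)
    return out


toy_notes = [{"sepsis", "fever"}, {"sepsis", "fever"}, {"sepsis"}, {"fracture"}]
assoc = pmi_npmi(toy_notes)
pmi_value, npmi_value = assoc[("fever", "sepsis")]
```

Here $p(\text{fever}) = 0.5$, $p(\text{sepsis}) = 0.75$, and $p(\text{fever}, \text{sepsis}) = 0.5$, so $\mathrm{PMI} = \log(4/3)$ and $\mathrm{NPMI} = \log(4/3)/\log 2$.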

These semantic analyses facilitate enhanced documentation completeness, concept suggestion, outcome modeling, and the definition of data-driven clinical phenotypes directly from MIMIC-IV-Note (Noori et al., 3 Sep 2025).
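
The comparison between embedding similarity and co-occurrence can be sketched with a plain-Python cosine similarity and Spearman rank correlation. This is an illustrative implementation (Pearson correlation on average ranks), not the authors' code; in practice the inputs would be per-pair cosine similarities and NPMI values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed inputs.
    Assumes neither input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```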

4. Automated Text Generation and Summarization

The WisPerMed framework exploits MIMIC-IV-Note for sequence-to-sequence modeling of discharge summary composition, particularly targeting the “Brief Hospital Course” and “Discharge Instructions” sections (Damm et al., 2024). Key elements include:

Section Extraction: Automated regex-based extraction of BHC and DI regions, with the rest of the summary serving as multi-modal context.
Model Architectures: Few-shot prompting (WizardLM-2-8x22B) and instruction tuning (Llama-3-8B, Mistral-7B, OpenBioLLM-70B) with LoRA adapters.
Priming: Prior tuning on Asclepius (158,000 synthetic Q&A notes) enhances clinical style adherence.
Dynamic Expert Selection (DES): Output selection across models using a normalized, weighted score sum over factuality (MEDCON, METEOR), readability (FKGL, DCRS, CLI), and length, with DES5 (length-based heuristic) achieving a top overall score of 0.332 in the BioNLP ACL “Discharge Me!” challenge.
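
A normalized, weighted score sum of the kind DES uses might look like the sketch below. The min-max normalization, the weights, the metric values, and the treatment of readability grades as lower-is-better are all hypothetical choices for illustration; the source does not give the actual formula.

```python
def min_max(values):
    """Scale values to [0, 1]; constant inputs map to 0.5 to avoid dividing by 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_output(candidates, weights, lower_is_better=()):
    """Pick the candidate with the highest weighted sum of normalized metrics.

    candidates: {model_name: {metric_name: value}}
    weights: {metric_name: weight}
    lower_is_better: metrics whose normalized score is inverted (assumption).
    """
    names = list(candidates)
    totals = {name: 0.0 for name in names}
    for metric, weight in weights.items():
        norm = min_max([candidates[name][metric] for name in names])
        if metric in lower_is_better:
            norm = [1.0 - v for v in norm]
        for name, value in zip(names, norm):
            totals[name] += weight * value
    best = max(names, key=totals.get)
    return best, totals


# Invented metric values for two hypothetical model outputs.
candidates = {
    "model_a": {"meteor": 0.30, "fkgl": 12.0},
    "model_b": {"meteor": 0.40, "fkgl": 9.0},
}
best, totals = select_output(
    candidates, weights={"meteor": 0.5, "fkgl": 0.5}, lower_is_better={"fkgl"}
)
```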

This demonstrates that MIMIC-IV-Note enables competitive automated medical documentation, freeing clinical time and supporting operational scalability (Damm et al., 2024).
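
Regex-based section extraction can be sketched as below. The header strings and the "Header:" layout are assumptions about how sections appear in a note, and the example note is invented; real notes are not strictly templated, so a production pattern would need more variants.

```python
import re

# Assumed header inventory; real notes may use other spellings or orderings.
KNOWN_HEADERS = [
    "Chief Complaint",
    "Brief Hospital Course",
    "Discharge Medications",
    "Discharge Instructions",
]


def extract_section(note, header, known_headers=KNOWN_HEADERS):
    """Return the text after `header:` up to the next known header or the end
    of the note, or None if the header is absent."""
    stops = [re.escape(h) + r"\s*:" for h in known_headers if h != header]
    if stops:
        end = r"(?=\n\s*(?:" + "|".join(stops) + r")|\Z)"
    else:
        end = r"\Z"
    pattern = re.escape(header) + r"\s*:\s*(.*?)" + end
    match = re.search(pattern, note, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None


example_note = (
    "Chief Complaint: chest pain\n"
    "Brief Hospital Course: Admitted with NSTEMI.\n"
    "Treated with heparin.\n"
    "Discharge Medications: aspirin\n"
    "Discharge Instructions: Follow up with cardiology.\n"
)
bhc = extract_section(example_note, "Brief Hospital Course")
di = extract_section(example_note, "Discharge Instructions")
```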

5. Note Segmentation and Structuring for Downstream Applications

Reliable boundary detection within MIMIC-IV-Note is critical for downstream NLP and information extraction (Surana et al., 28 Dec 2025). CNSight benchmarks methods for segmenting 1,000 discharge summaries:

Model	Sentence F1	Freetext F1	Avg F1
MedSpaCy	78.0	88.3	83.2
GPT-5-mini	80.8	63.9	72.4
Gemini 2.5 Flash	78.5	60.8	69.7
Claude 4.5 Haiku	76.5	47.8	62.2
LLaMA-2-7B	8.9	52.4	30.7

Rule-Based Baselines (MedSpaCy, Regex): Competitive for structured headers (F1 up to 88.3).
Transformer Models: Small, open-source LLMs are substantially weaker on sentence-level segmentation (LLaMA-2-7B: 8.9 F1), while API-hosted large models (GPT-5-mini) balance precision/recall at the sentence level (F1 80.8) but trail MedSpaCy on freetext boundaries (63.9 vs. 88.3).
Integration Guidance: Rule-based approaches preferable for boundary-critical extraction. Large LLMs excel in robust segment classification at modest API cost.
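
The Avg F1 column in the table is consistent with an unweighted mean of the sentence and freetext F1 scores. The short check below reproduces it from the reported values; the dictionary layout is only an illustration.

```python
# (sentence F1, freetext F1, reported Avg F1), copied from the table above.
results = {
    "MedSpaCy": (78.0, 88.3, 83.2),
    "GPT-5-mini": (80.8, 63.9, 72.4),
    "Gemini 2.5 Flash": (78.5, 60.8, 69.7),
    "Claude 4.5 Haiku": (76.5, 47.8, 62.2),
    "LLaMA-2-7B": (8.9, 52.4, 30.7),
}


def average_f1(sentence_f1, freetext_f1):
    """Unweighted mean of the two segmentation F1 scores."""
    return (sentence_f1 + freetext_f1) / 2


# Largest gap between the recomputed mean and the reported (rounded) value.
max_gap = max(
    abs(average_f1(sent, free) - reported)
    for sent, free, reported in results.values()
)
best_model = max(results, key=lambda name: average_f1(*results[name][:2]))
```

Every recomputed mean falls within rounding distance (0.05) of the reported figure, and MedSpaCy has the highest average.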

High-fidelity segmentation of MIMIC-IV-Note is essential for NER, cohort identification, summarization, and structuring tasks (Surana et al., 28 Dec 2025).

6. Prognostic Modeling and Integration with Structured Data

Predictive modeling on ICU cohorts using MIMIC-IV-Note has demonstrated substantial improvements in risk stratification when unstructured notes are combined with structured data (Mamatov et al., 20 Oct 2025). The workflow includes:

Extraction of discharge summaries and radiology reports (noteevents) linked to index ICU stays.
Preprocessing (lowercasing, tokenization, stopword removal), parallel TF-IDF (vectorized, SVD-reduced) and BioBERT (mean-pooled, PCA-reduced) pipelines; textual features concatenated with structured clinical variables.
Feature selection via LASSO and XGBoost, with intersection retained for final model, collinearity filtered (VIF < 5).
Multivariate logistic regression:

$\mathrm{logit}(p) = \beta_0 + \sum_j \beta_j x_j^{(\text{struct})} + \sum_k \gamma_k x_k^{(\text{text})}$

Resulting model achieves AUC=0.918 (vs. 0.753 structured-only), with recall improving from 0.71 to 0.88.
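
The feature-intersection step and the logistic model above can be sketched as follows. Feature names and coefficient values are hypothetical; the sketch only shows how structured and text-derived terms enter the same linear predictor before the inverse logit is applied.

```python
import math


def select_features(lasso_selected, xgb_selected):
    """Keep only features chosen by both selectors (the intersection)."""
    return sorted(set(lasso_selected) & set(xgb_selected))


def predict_probability(intercept, struct_coefs, struct_x, text_coefs, text_x):
    """p = sigmoid(beta_0 + sum_j beta_j x_j(struct) + sum_k gamma_k x_k(text))."""
    z = intercept
    z += sum(struct_coefs[name] * struct_x[name] for name in struct_coefs)
    z += sum(text_coefs[name] * text_x[name] for name in text_coefs)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical feature names: two structured variables and one SVD text component.
kept = select_features(["age", "lactate", "svd_3"], ["lactate", "svd_3", "sofa"])
p = predict_probability(
    intercept=-1.0,
    struct_coefs={"lactate": 0.5},
    struct_x={"lactate": 2.0},
    text_coefs={"svd_3": 1.0},
    text_x={"svd_3": 0.0},
)
```

With these invented values the linear predictor is $-1.0 + 0.5 \cdot 2.0 + 1.0 \cdot 0.0 = 0$, so the predicted probability is 0.5.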

Effective integration of MIMIC-IV-Note thus sharply increases discriminatory performance and net clinical benefit, supporting personalized ICU interventions (Mamatov et al., 20 Oct 2025).

7. Significance and Research Trajectory

MIMIC-IV-Note is a linchpin for modern clinical NLP and machine learning research. Its unstructured documentation underlies advances in time-series event extraction (Wang et al., 1 May 2025), semantic relationship mining (Noori et al., 3 Sep 2025), robust note segmentation (Surana et al., 28 Dec 2025), automated generation (Damm et al., 2024), and multimodal prognostic modeling (Mamatov et al., 20 Oct 2025). Challenges remain, including the lack of strict templating, annotation noise, and limited generalizability to other settings, but MIMIC-IV-Note’s heterogeneity and scale continue to foster methodological innovation. Its role is expanding; a plausible implication is that temporal and semantic structuring of clinical text will become mandatory in real-time decision support, outcome prediction, and advanced phenotyping pipelines across global healthcare environments.