MIMIC-IV-Ext-22MCTS Dataset
- MIMIC-IV-Ext-22MCTS is a temporally annotated clinical time-series dataset containing over 22 million event–timestamp pairs from discharge summaries.
- It employs a hybrid retrieval pipeline using BM25 and semantic search to efficiently extract short-span clinical events with explicit or inferred temporal cues.
- The dataset improves benchmark performance in risk prediction, medical Q&A, clinical trial matching, and text generation, providing actionable insights for healthcare research.
The MIMIC-IV-Ext-22MCTS dataset is a large-scale, temporally annotated clinical time-series corpus constructed for risk prediction and related machine learning tasks in healthcare. Containing 22,588,586 event–timestamp pairs from 267,284 discharge summaries, it systematically addresses the limitations of unstructured clinical narrative data in MIMIC-IV-Note by extracting short-span clinical events with inferred or explicit relative temporal information. The dataset supports benchmarking for causal-correlation classification, medical question answering, clinical trial matching, and clinical text generation, and has been demonstrated to yield substantial improvements in standard model performance (Wang et al., 1 May 2025).
1. Data Origination and Preprocessing
MIMIC-IV-Ext-22MCTS is derived from the “discharge summary” section of MIMIC-IV-Note, comprising 331,794 free-text narratives (mean length 2,267 ± 914 tokens) documenting hospitalizations. The principal challenges in utilizing this corpus for machine learning are document lengths exceeding transformer context limits (e.g., BERT's 512 tokens) and the sparse, implicit temporal cues attached to clinical events. To mitigate these issues, each summary is segmented into atomic “chunks”: blocks of at most five consecutive tokens plus five-token left/right context windows. For a token sequence $T = (t_1, \dots, t_N)$, the $j$-th chunk is $c_j = (t_{5(j-1)+1}, \dots, t_{5j})$; the left-context tokens $(t_{5(j-1)-4}, \dots, t_{5(j-1)})$ and right-context tokens $(t_{5j+1}, \dots, t_{5j+5})$ are included when available. This yields 100–400 chunks per summary.
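The chunking scheme can be sketched in a few lines of pure Python. This is an illustrative implementation, not the released code; the function name `chunk_summary` mirrors the pseudo-API shown later in this article:

```python
def chunk_summary(tokens, chunk_size=5, ctx=5):
    """Split a token sequence into consecutive `chunk_size`-token blocks,
    each padded with up to `ctx` context tokens on either side."""
    chunks = []
    for start in range(0, len(tokens), chunk_size):
        core = tokens[start:start + chunk_size]
        left = tokens[max(0, start - ctx):start]                      # left context, when available
        right = tokens[start + chunk_size:start + chunk_size + ctx]   # right context, when available
        chunks.append(left + core + right)
    return chunks

tokens = ["pt", "admitted", "with", "fever", "and", "cough", "started", "on", "abx"]
chunks = chunk_summary(tokens)  # two chunks for a 9-token summary
```

Chunks near the document boundaries simply receive shorter context windows, which matches the “included when available” behavior described above.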
2. Clinical Event Candidate Retrieval
Due to both computational cost and LLM hallucination risk, not all text chunks are passed to the LLM for annotation. Instead, a hybrid retrieval pipeline selects a high-recall candidate set through two complementary mechanisms:
- Contextual BM25: The “Brief Hospital Course” section, denoted $q$, serves as a query against all chunks $c_j$. Chunks are scored by the standard BM25 ranking function:

$$\mathrm{BM25}(q, c_j) = \sum_{w \in q} \mathrm{IDF}(w) \cdot \frac{f(w, c_j)\,(k_1 + 1)}{f(w, c_j) + k_1 \left(1 - b + b \, \frac{|c_j|}{\mathrm{avgdl}}\right)}$$

where $f(w, c_j)$ is the frequency of term $w$ in chunk $c_j$ and $\mathrm{avgdl}$ is the average chunk length, with standard BM25 parameters $k_1$ and $b$; the top $k = 100$ chunks are retained.
- Contextual Semantic Search: Both the query $q$ and every chunk $c_j$ are embedded into $\mathbb{R}^{1024}$ via “BAAI/bge-large-en”, and cosine similarity is computed:

$$\mathrm{sim}(q, c_j) = \frac{\mathbf{e}_q \cdot \mathbf{e}_{c_j}}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_{c_j} \rVert}$$

Chunks with $\mathrm{sim}(q, c_j) \ge 0.75$ form a secondary candidate set.
The deduplicated union of both retrieval methods (typically 200–400 chunks per summary) is passed to the Llama-3.1-8B model for extraction and annotation.
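The two retrievers and their deduplicated union can be sketched end to end in self-contained Python. The BM25 scorer below is a from-scratch (Lucene-style IDF) implementation and the embeddings are tiny stand-in vectors; the released pipeline uses full BM25 machinery and BAAI/bge-large-en:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, chunks, k1=1.5, b=0.75):
    """Score each tokenized chunk against the query with BM25
    (Lucene-style non-negative IDF)."""
    N = len(chunks)
    avgdl = sum(len(c) for c in chunks) / N
    df = Counter()
    for c in chunks:
        df.update(set(c))           # document frequency per term
    scores = []
    for c in chunks:
        tf = Counter(c)
        s = 0.0
        for w in query_tokens:
            if w not in tf:
                continue
            idf = math.log(1 + (N - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(c) / avgdl))
        scores.append(s)
    return scores

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

chunks = [["fever", "and", "cough"], ["discharged", "home"], ["started", "on", "antibiotics"]]
query = ["fever", "cough", "antibiotics"]
bm25_hits = {i for i, s in enumerate(bm25_scores(query, chunks)) if s > 0}

# Stand-in 2-d embeddings (the pipeline uses 1024-d bge-large-en vectors);
# indices 0 and 2 are constructed to be similar to the query.
embs = {"q": [1.0, 0.2], 0: [0.9, 0.3], 1: [0.0, 1.0], 2: [0.8, 0.1]}
sem_hits = {i for i in range(len(chunks)) if cosine(embs["q"], embs[i]) >= 0.75}

candidates = bm25_hits | sem_hits   # deduplicated union of both retrievers
```

The union-of-sets step is the key design point: either retriever alone can miss events phrased with non-overlapping vocabulary (BM25) or with weak semantic similarity (embeddings), so recall is maximized by combining both before the LLM pass.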
3. Temporal Information Extraction and Event Annotation
Event–timestamp extraction relies on a chain-of-thought prompt given to Llama-3.1-8B. Event guidance instructs the model to identify all health-related actions, symptoms, or clinical states in the chunk, explicitly splitting conjunctive mentions (e.g., “fever and cough” → separate events). Timestamp guidance treats hospital admission (“Admission to hospital”) as time zero; events prior to admission receive negative hours, post-admission events positive hours, using explicit or inferred cues. For event $e_i$ with inferred absolute timestamp $\tau_i$, the relative time is computed as:

$$t_i = \tau_i - \tau_{\mathrm{adm}}$$

with $\tau_{\mathrm{adm}} = 0$ by convention. When explicit temporal data are absent, the model's judgment translates natural-language intervals into signed hours (e.g., “a few weeks ago” into a negative hour offset).
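For intuition, a fixed lookup table like the one below reproduces the kind of phrase-to-hours mapping described; this table is purely hypothetical, since the released pipeline delegates the translation to Llama-3.1-8B's judgment rather than hard-coded rules:

```python
def relative_hours(phrase):
    """Map a natural-language temporal cue to signed hours relative to
    admission (time zero); negative values precede admission.
    Hypothetical fixed table for illustration only."""
    table = {
        "on admission": 0,
        "two days prior to admission": -2 * 24,
        "a few weeks ago": -3 * 7 * 24,   # assume "a few" means ~3 weeks
        "hospital day 3": 2 * 24,         # hospital day 1 is the admission day
    }
    return table[phrase]

offset = relative_hours("a few weeks ago")  # a negative pre-admission offset
```

In the actual dataset, vague phrases are resolved by the LLM in context, which is also the main source of the timestamp noise discussed in the limitations section.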
4. Dataset Composition and Structure
The dataset comprises 267,284 summaries (those yielding at least one extracted event–time pair), 22,588,586 event–time pairs (mean 84 per summary, max 244), and short free-text event spans (max 299 tokens). The relative-time distribution is 36.99% pre-admission ($t < 0$), 51.19% during hospitalization, and 11.80% after discharge ($t$ beyond the stay). The schema is as follows:
| Column | Type | Description |
|---|---|---|
| hadm_id | INT | Unique hospital-admission identifier |
| event | TEXT | Untokenized free-text clinical span |
| time | INT | Relative timestamp (hours) |
| time_bin | INT | Discretized bin |
Time bins are defined by discretizing the relative timestamp $t$ (in hours) into fixed intervals.
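The binning step amounts to a sorted-edge lookup. The edges below are hypothetical (chosen at day/week/month scales for illustration); the released dataset defines its own boundaries:

```python
import bisect

# Hypothetical bin edges in hours relative to admission:
# -30d, -7d, -1d, admission, +1d, +7d, +30d.
BIN_EDGES = [-720, -168, -24, 0, 24, 168, 720]

def time_bin(t_hours):
    """Discretize a relative timestamp (hours from admission) into an
    integer bin index via binary search over the sorted edges."""
    return bisect.bisect_right(BIN_EDGES, t_hours)

bin_idx = time_bin(-500)  # falls between -30d and -7d
```

Integer bins let models treat coarse temporal position as a categorical feature (e.g., as an embedding index in Temporal BERT) rather than a raw continuous value.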
5. Benchmarking and Performance
MIMIC-IV-Ext-22MCTS supports several benchmark predictive and generative tasks:
- Causal-Correlation Classification: For event–time pairs $(e_i, t_i)$ and $(e_j, t_j)$, the model predicts whether $e_j$ is a consequence of $e_i$, a possible cause, or uncorrelated. Temporal BERT architectures (BERT-base-uncased with an added time embedding) are used. Training employs an 80/20 train–validation split, five epochs, and the Adam optimizer.
- Medical Question Answering: On PubMedQA, Temporal BERT fine-tuned on MIMIC-IV-Ext-22MCTS improves accuracy over baseline BERT by 10 percentage points absolute.
- Clinical Trial Matching: On TREC 2021/2022 benchmarks, NDCG@10 improves (33.28 → 36.53 for 2021; 29.43 → 29.94 for 2022), with similar gains in Precision@10 and Recall@100.
- Text Generation: GPT-2 fine-tuned on the dataset, with additional [TIME] and [EVENT] tokens, produces more clinically coherent outputs (e.g., correct drug-dose recommendations) than baseline GPT-2.
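To make the causal-correlation task concrete, the sketch below builds candidate event-pair instances from one summary's time-ordered events. The temporal-window heuristic and the label names are placeholders of my own, not the paper's labeling procedure, which relies on model predictions over consequence / possible cause / uncorrelated:

```python
from itertools import combinations

def make_pairs(events, window_hours=72):
    """Build candidate (e_i, e_j) classification instances from a list of
    (event, relative_hours) tuples. Heuristic labels for illustration:
    a pair within `window_hours` is a causal candidate, else uncorrelated."""
    events = sorted(events, key=lambda x: x[1])   # time-order the events
    pairs = []
    for (ei, ti), (ej, tj) in combinations(events, 2):
        gap = tj - ti
        label = "candidate" if 0 < gap <= window_hours else "uncorrelated"
        pairs.append(((ei, ti), (ej, tj), label))
    return pairs

events = [("fever", -12), ("antibiotics started", 4), ("discharged", 120)]
pairs = make_pairs(events)  # 3 ordered pairs from 3 events
```

With a mean of 84 events per summary, exhaustive pairing grows quadratically, so a temporal window of this kind is a natural way to bound the instance count per summary.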
6. Usage Considerations and Best Practices
For retrieval, the “Brief Hospital Course” summary should serve as the query. Sharding should be performed with the provided code, using five-token chunks plus ±5-token context. Both BM25 and embedding-based search are recommended to maximize retrieval recall. Following LLM annotation, minimal postprocessing is advised to remove duplicates and invalid timestamps. For supervised training, event–time sequences are concatenated into a single input string, e.g. `[TIME] t_1 [EVENT] e_1 [TIME] t_2 [EVENT] e_2 …`, using the special tokens introduced for text generation.
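A minimal serializer for this concatenation step might look as follows; the exact template is an assumption based on the `[TIME]`/`[EVENT]` special tokens described for the text-generation benchmark:

```python
def linearize(events):
    """Serialize (event, relative_hours) pairs into one training string
    using [TIME]/[EVENT] special tokens, in chronological order.
    The template is an assumed format, not the released one."""
    parts = []
    for event, t in sorted(events, key=lambda x: x[1]):
        parts.append(f"[TIME] {t} [EVENT] {event}")
    return " ".join(parts)

seq = linearize([("antibiotics started", 4), ("fever", -12)])
```

Sorting by timestamp before serialization keeps the generated sequence consistent with the chronological reading order a decoder-only model like GPT-2 is trained to continue.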
Sample pseudo-API (abridged from original):
```python
from mimic_ext_22mcts import load_events

# Load the released event–time pairs.
df = load_events(split='train')

# Segment a discharge summary into 5-token chunks with ±5-token context.
chunks = chunk_summary(my_summary, chunk_size=5, ctx=5)

# Candidate retrieval 1: BM25 with the Brief Hospital Course as query.
bm25 = BM25(corpus=chunks)
hits = bm25.query(bhc_summary, k=100)

# Candidate retrieval 2: embedding similarity with threshold 0.75.
embs_q = embedder.encode(bhc_summary)
embs_c = embedder.encode(chunks)
sem_hits = [c for c, s in zip(chunks, sim(embs_q, embs_c)) if s >= 0.75]

# Deduplicated union of both retrievers, then LLM annotation.
candidates = set(hits) | set(sem_hits)
annotations = llama_annotate(candidates, prompt=standard_prompt)
```
7. Limitations and Prospective Extensions
Several limitations are inherent in the current release:
- Domain Bias: The dataset is limited to adult ICU patients at a single U.S. center; generalizability to pediatric, outpatient, or non-U.S. settings remains untested.
- Event Granularity: Extracted events are free text, lacking alignment to ontologies such as UMLS/SNOMED; structured coding is left to downstream mapping.
- Timestamp Noise: Relative times rest on LLM inference; occasional clinically implausible or hallucinated values are possible.
- Metadata: Demographics and structured EHR (labs, vitals) are absent.
Future work could integrate structured EHR (e.g., labs, vitals), align events post hoc to ontology codes, extend the framework to additional note types (progress, radiology), and incorporate uncertainty quantification on timestamps (e.g., confidence intervals). Model fine-tuning at larger scale (Llama 3 70B, GPT-4-class) is also a suggested avenue.
In summary, MIMIC-IV-Ext-22MCTS establishes a unique resource at scale for fine-grained, temporally explicit modeling of clinical risk, supporting benchmarking and development of advanced temporal analysis algorithms for healthcare research (Wang et al., 1 May 2025).