
MIMIC-IV-Ext-22MCTS Dataset

Updated 17 February 2026
  • MIMIC-IV-Ext-22MCTS is a temporally annotated clinical time-series dataset containing over 22 million event–timestamp pairs from discharge summaries.
  • It employs a hybrid retrieval pipeline using BM25 and semantic search to efficiently extract short-span clinical events with explicit or inferred temporal cues.
  • The dataset improves benchmark results in risk prediction, medical question answering, clinical trial matching, and clinical text generation, making it a practical resource for healthcare research.

The MIMIC-IV-Ext-22MCTS dataset is a large-scale, temporally annotated clinical time-series corpus constructed for risk prediction and related machine learning tasks in healthcare. Containing 22,588,586 event–timestamp pairs from 267,284 discharge summaries, it systematically addresses the limitations of unstructured clinical narrative data in MIMIC-IV-Note by extracting short-span clinical events with inferred or explicit relative temporal information. The dataset supports benchmarking for causal-correlation classification, medical question answering, clinical trial matching, and clinical text generation, and has been demonstrated to yield substantial improvements in standard model performance (Wang et al., 1 May 2025).

1. Data Origination and Preprocessing

MIMIC-IV-Ext-22MCTS is derived from the “discharge summary” section of MIMIC-IV-Note, comprising 331,794 free-text narratives (mean length 2,267 ± 914 tokens) documenting hospitalizations. The principal challenges in utilizing this corpus for machine learning are document length exceeding transformer context limitations (e.g., BERT's 512 tokens), and the sparse, implicit temporal cues for clinical events. To mitigate these issues, each summary is segmented into atomic “chunks”—blocks of at most five consecutive tokens plus five-token left/right context windows. For a token sequence T = [t_1, …, t_N], chunk j is defined as C_j = [t_{p_j}, …, t_{p_j+4}] with p_{j+1} = p_j + 5 and p_1 = 1. Context tokens [t_{p_j−5}, …, t_{p_j−1}] and [t_{p_j+5}, …, t_{p_j+9}] are included when available, yielding 100–400 chunks per summary.
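The segmentation scheme above can be sketched in a few lines; the function name and signature here are illustrative, not the released code's actual API:

```python
def chunk_summary(tokens, chunk_size=5, ctx=5):
    """Segment a token list into fixed-size chunks with left/right
    context windows, as described above. Name and signature are
    illustrative, not the released API."""
    chunks = []
    for p in range(0, len(tokens), chunk_size):
        core = tokens[p:p + chunk_size]
        left = tokens[max(0, p - ctx):p]                     # left context, when available
        right = tokens[p + chunk_size:p + chunk_size + ctx]  # right context, when available
        chunks.append({"core": core, "context": left + core + right})
    return chunks
```

A 2,267-token summary thus yields roughly 450 non-overlapping cores; the 100–400 figure reported above presumably reflects filtering of empty or trivial chunks.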

2. Clinical Event Candidate Retrieval

Due to both computational cost and the risk of LLM hallucination, not all text chunks are passed to the LLM for annotation. Instead, a hybrid retrieval pipeline selects a high-recall candidate set through two complementary mechanisms:

  • Contextual BM25: The “Brief Hospital Course” section, denoted q, serves as a query against all chunks {C_j}. Chunks are scored by the standard BM25 ranking function:

\mathrm{score}_{\mathrm{BM25}}(q, d) = \sum_{t\in q} \mathrm{IDF}(t)\,\frac{f(t,d)\,(k_1+1)}{f(t,d) + k_1 \left(1 - b + b\,\frac{|d|}{\bar L}\right)}

with k_1 = 1.2, b = 0.75, and the top K_bm25 = 100 chunks retained.

  • Contextual Semantic Search: Both the query q and every chunk d are embedded into ℝ^1024 via “BAAI/bge-large-en”, and cosine similarity is computed:

\mathrm{sim}(q, d) = \frac{e_q \cdot e_d}{\|e_q\|\,\|e_d\|}

Chunks with sim(q, d) ≥ 0.75 form a secondary candidate set.

The deduplicated union of both retrieval methods (typically 200–400 chunks per summary) is passed to the Llama-3.1-8B model for extraction and annotation.
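The two retrieval paths and their deduplicated union can be sketched end to end. This is a minimal self-contained version, assuming tokenized chunks and precomputed embeddings; `bm25_scores` and `hybrid_candidates` are illustrative names, and the released pipeline may instead use off-the-shelf BM25 and embedding libraries:

```python
import math
from collections import Counter

import numpy as np

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score tokenized docs against a tokenized query with BM25."""
    N = len(docs)
    avg_len = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in set(query):
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(s)
    return scores

def hybrid_candidates(query_tokens, chunk_tokens, q_emb, chunk_embs,
                      k_bm25=100, sim_threshold=0.75):
    """Deduplicated union of BM25 top-k and cosine-similarity hits,
    returned as sorted chunk indices."""
    bm25 = bm25_scores(query_tokens, chunk_tokens)
    top_k = sorted(range(len(chunk_tokens)),
                   key=lambda i: bm25[i], reverse=True)[:k_bm25]
    C = np.asarray(chunk_embs, dtype=float)
    q = np.asarray(q_emb, dtype=float)
    sims = C @ q / (np.linalg.norm(C, axis=1) * np.linalg.norm(q))
    sem = [i for i, s in enumerate(sims) if s >= sim_threshold]
    return sorted(set(top_k) | set(sem))
```

Taking the union rather than the intersection favors recall, matching the stated goal of a high-recall candidate set before LLM annotation.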

3. Temporal Information Extraction and Event Annotation

Event–timestamp extraction relies on a chain-of-thought prompt given to Llama-3.1-8B. Event guidance instructs the model to identify all health-related actions, symptoms, or clinical states in the chunk, explicitly splitting conjunctive mentions (e.g., “fever and cough” → separate events). Timestamp guidance treats hospital admission (“Admission to hospital”) as time zero; events prior to admission receive negative hours and post-admission events positive hours, using explicit or inferred cues. For event E_i with inferred timestamp t_i, the relative time is computed as:

\Delta t_i = t_i - t_{\mathrm{admission}}

with t_admission = 0 by convention. When explicit temporal data are absent, the model translates natural-language intervals using its own judgment (e.g., “a few weeks ago” → −336 hours).
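The convention can be illustrated with a small lookup of cue-to-hours mappings. Only “a few weeks ago” → −336 h appears in the source; the other entries below are hypothetical examples of the same convention:

```python
# Illustrative mapping from natural-language cues to relative hours.
# Only "a few weeks ago" -> -336 h is given in the source; the other
# entries are hypothetical examples of the same convention.
RELATIVE_CUES = {
    "on admission": 0,
    "two days ago": -48,      # 2 days * 24 h before admission
    "a few weeks ago": -336,  # 2 weeks * 7 days * 24 h before admission
}

def to_relative_hours(cue):
    """Return the relative timestamp in hours, or None if unmapped."""
    return RELATIVE_CUES.get(cue.strip().lower())
```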

4. Dataset Composition and Structure

The dataset comprises 267,284 summaries (each contributing at least one extracted event–time pair), 22,588,586 event–time pairs (mean 84 per summary, max 244), and an average event-span length of 3 ± 2 tokens (max 299). Relative times fall 36.99% before admission (Δt < 0), 51.19% during hospitalization, and 11.80% after discharge. The schema is as follows:

Column    Type  Description
hadm_id   INT   Unique hospital-admission identifier
event     TEXT  Untokenized free-text clinical span
time      INT   Relative timestamp (hours)
time_bin  INT   Discretized bin ∈ {0, …, 8}

Time bins are defined by

[-\infty,-60),\ [-60,-30),\ [-30,-15),\ [-15,0),\ [0,15),\ [15,30),\ [30,60),\ [60,120),\ [120,+\infty)
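Because every interval is closed on the left and open on the right, the bin index for a given Δt can be computed directly with `bisect_right` over the interior edges; this helper is a sketch of that convention, not the released code:

```python
import bisect

# Interior bin edges in hours; bin 0 is (-inf, -60) and bin 8 is [120, +inf).
EDGES = [-60, -30, -15, 0, 15, 30, 60, 120]

def time_bin(delta_t):
    """Map a relative timestamp (hours) to its bin index in {0, ..., 8}.

    bisect_right makes each interval closed on the left, matching the
    [lo, hi) convention above (e.g., -60 falls in bin 1, not bin 0).
    """
    return bisect.bisect_right(EDGES, delta_t)
```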

5. Benchmarking and Performance

MIMIC-IV-Ext-22MCTS supports several benchmark predictive and generative tasks:

  • Causal-Correlation Classification: For pairs (E_a, Δt_a), (E_b, Δt_b), the model predicts whether E_b is a consequence of, a possible cause of, or uncorrelated with E_a. Temporal BERT architectures (BERT-base-uncased with an added time embedding) are used. Training employs an 80/20 train–val split, five epochs, and Adam (lr = 2×10⁻⁵).
  • Medical Question Answering: On PubMedQA, baseline BERT achieves 47.8% ± 7.6% accuracy, while Temporal BERT fine-tuned on MIMIC-IV-Ext-22MCTS achieves 54.1% ± 16.7% (a 6.3-percentage-point absolute gain).
  • Clinical Trial Matching: On the TREC 2021/2022 benchmarks, NDCG@10 improves (33.28 → 36.53 for 2021; 29.43 → 29.94 for 2022), with similar gains in Precision@10 and Recall@100.
  • Text Generation: GPT-2 fine-tuned on the dataset, with additional [TIME] and [EVENT] tokens, produces more clinically coherent outputs (e.g., correct drug-dose recommendations) than baseline GPT-2.
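The Temporal BERT setup above is described only at a high level. A minimal NumPy sketch of one plausible realization — a learned vector per time bin added to the token embeddings of the corresponding event span — is shown below; the paper's actual architecture and initialization may differ:

```python
import numpy as np

HIDDEN = 768  # BERT-base hidden size
N_BINS = 9    # discretized time bins 0..8

rng = np.random.default_rng(0)
# Hypothetical: one learnable vector per time bin, added to the token
# embeddings of the corresponding event span before the transformer layers.
time_embedding = rng.normal(scale=0.02, size=(N_BINS, HIDDEN))

def add_time_embedding(token_embs, bin_idx):
    """token_embs: (seq_len, HIDDEN) array for one event span."""
    return token_embs + time_embedding[bin_idx]
```

In a real implementation the `time_embedding` matrix would be a trainable parameter updated jointly with the BERT weights.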

6. Usage Considerations and Best Practices

For retrieval, the “Brief Hospital Course” section should serve as the query. Summaries should be sharded with the provided code into five-token chunks plus ±5-token context windows. Using both BM25 and embedding-based search maximizes retrieval recall. After LLM annotation, minimal postprocessing is advised to remove duplicates and invalid timestamps. For supervised training, event–time sequences are concatenated as:

[\texttt{[TIME]}~\Delta t_1;\ \texttt{[EVENT]}~E_1;\ \ldots;\ \texttt{[TIME]}~\Delta t_m;\ \texttt{[EVENT]}~E_m]
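A small helper can produce this serialization. The [TIME]/[EVENT] token names come from the paper; the function itself is an illustrative sketch, not the released API:

```python
def serialize_events(pairs, time_tok="[TIME]", event_tok="[EVENT]"):
    """Serialize (delta_t, event) pairs into the training format above.

    Illustrative helper; [TIME]/[EVENT] follow the paper, the function
    itself is an assumption.
    """
    parts = []
    for dt, ev in pairs:
        parts += [f"{time_tok} {dt}", f"{event_tok} {ev}"]
    return "; ".join(parts)
```

For generation experiments, both special tokens would also need to be registered with the tokenizer so they are not split into sub-words.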

Sample pseudo-API (abridged from original):

# Function and class names below are illustrative pseudo-API,
# not the released package's actual interface.
from mimic_ext_22mcts import load_events

df = load_events(split='train')                      # event/time/time_bin rows

# 1. Shard the summary into 5-token chunks with +/-5-token context.
chunks = chunk_summary(my_summary, chunk_size=5, ctx=5)

# 2. Lexical retrieval: BM25 over chunks, Brief Hospital Course as query.
bm25 = BM25(corpus=chunks)
hits = bm25.query(bhc_summary, k=100)

# 3. Semantic retrieval: keep chunks with cosine similarity >= 0.75.
embs_q = embedder.encode(bhc_summary)
embs_c = embedder.encode(chunks)
sem_hits = [c for c, s in zip(chunks, sim(embs_q, embs_c)) if s >= 0.75]

# 4. Deduplicated union of both candidate sets -> LLM annotation.
candidates = set(hits) | set(sem_hits)
annotations = llama_annotate(candidates, prompt=standard_prompt)

7. Limitations and Prospective Extensions

Several limitations are inherent in the current release:

  • Domain Bias: The dataset is limited to adult ICU patients at a single U.S. center; generalizability to pediatric, outpatient, or non-U.S. settings remains untested.
  • Event Granularity: Extracted events are free text, lacking alignment to ontologies such as UMLS/SNOMED; structured coding is left to downstream mapping.
  • Timestamp Noise: Relative times rest on LLM inference; occasional clinically implausible or hallucinated values are possible.
  • Metadata: Demographics and structured EHR (labs, vitals) are absent.

Future work could integrate structured EHR (e.g., labs, vitals), align events post hoc to ontology codes, extend the framework to additional note types (progress, radiology), and incorporate uncertainty quantification on timestamps (e.g., confidence intervals). Model fine-tuning at larger scale (Llama 3 70B, GPT-4-class) is also a suggested avenue.

In summary, MIMIC-IV-Ext-22MCTS establishes a unique resource at scale for fine-grained, temporally explicit modeling of clinical risk, supporting benchmarking and development of advanced temporal analysis algorithms for healthcare research (Wang et al., 1 May 2025).
