MIMIC-IV-Ext-22MCTS Dataset
- MIMIC-IV-Ext-22MCTS is a temporally annotated clinical time-series dataset containing over 22 million event–timestamp pairs from discharge summaries.
- It employs a hybrid retrieval pipeline using BM25 and semantic search to efficiently extract short-span clinical events with explicit or inferred temporal cues.
- The dataset improves benchmark performance in risk prediction, medical Q&A, clinical trial matching, and text generation, providing actionable insights for healthcare research.
The MIMIC-IV-Ext-22MCTS dataset is a large-scale, temporally annotated clinical time-series corpus constructed for risk prediction and related machine learning tasks in healthcare. Containing 22,588,586 event–timestamp pairs from 267,284 discharge summaries, it systematically addresses the limitations of unstructured clinical narrative data in MIMIC-IV-Note by extracting short-span clinical events with inferred or explicit relative temporal information. The dataset supports benchmarking for causal-correlation classification, medical question answering, clinical trial matching, and clinical text generation, and has been demonstrated to yield substantial improvements in standard model performance (Wang et al., 1 May 2025).
1. Data Origination and Preprocessing
MIMIC-IV-Ext-22MCTS is derived from the “discharge summary” section of MIMIC-IV-Note, comprising 331,794 free-text narratives (mean length 2,267 ± 914 tokens) documenting hospitalizations. The principal challenges in utilizing this corpus for machine learning are document lengths exceeding transformer context limits (e.g., BERT's 512 tokens) and the sparse, implicit temporal cues attached to clinical events. To mitigate these issues, each summary is segmented into atomic “chunks”: blocks of at most five consecutive tokens plus five-token left/right context windows. For a token sequence $T = (t_1, \dots, t_N)$, the $j$-th chunk is $c_j = (t_{5(j-1)+1}, \dots, t_{5j})$; the left-context tokens $(t_{5(j-1)-4}, \dots, t_{5(j-1)})$ and right-context tokens $(t_{5j+1}, \dots, t_{5j+5})$ are included when available. This yields 100–400 chunks per summary.
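The chunking scheme can be sketched in a few lines of pure Python. This is an illustrative implementation, not the released code; the function name `chunk_summary` mirrors the pseudo-API shown later in this article:

```python
def chunk_summary(tokens, chunk_size=5, ctx=5):
    """Split a token sequence into consecutive `chunk_size`-token blocks,
    each padded with up to `ctx` context tokens on either side."""
    chunks = []
    for start in range(0, len(tokens), chunk_size):
        core = tokens[start:start + chunk_size]
        left = tokens[max(0, start - ctx):start]                      # left context, when available
        right = tokens[start + chunk_size:start + chunk_size + ctx]   # right context, when available
        chunks.append(left + core + right)
    return chunks

tokens = ["pt", "admitted", "with", "fever", "and", "cough", "started", "on", "abx"]
chunks = chunk_summary(tokens)  # two chunks for a 9-token summary
```

Chunks near the document boundaries simply receive shorter context windows, which matches the “included when available” behavior described above.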
2. Clinical Event Candidate Retrieval
Due to both computational cost and LLM hallucination risk, not all text chunks are passed to the LLM for annotation. Instead, a hybrid retrieval pipeline selects a high-recall candidate set through two complementary mechanisms:
- Contextual BM25: The “Brief Hospital Course” section, denoted $q$, serves as a query against all chunks $c_j$. Chunks are scored by the standard BM25 ranking function:

$$\mathrm{BM25}(q, c_j) = \sum_{w \in q} \mathrm{IDF}(w) \cdot \frac{f(w, c_j)\,(k_1 + 1)}{f(w, c_j) + k_1 \left(1 - b + b \, \frac{|c_j|}{\mathrm{avgdl}}\right)}$$

where $f(w, c_j)$ is the frequency of term $w$ in chunk $c_j$ and $\mathrm{avgdl}$ is the average chunk length, with standard BM25 parameters $k_1$ and $b$; the top $k = 100$ chunks are retained.
- Contextual Semantic Search: Both the query $q$ and every chunk $c_j$ are embedded into $\mathbb{R}^{1024}$ via “BAAI/bge-large-en”, and cosine similarity is computed:

$$\mathrm{sim}(q, c_j) = \frac{\mathbf{e}_q \cdot \mathbf{e}_{c_j}}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_{c_j} \rVert}$$

Chunks with $\mathrm{sim}(q, c_j) \ge 0.75$ form a secondary candidate set.
The deduplicated union of both retrieval methods (typically 200–400 chunks per summary) is passed to the Llama-3.1-8B model for extraction and annotation.
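The two retrievers and their deduplicated union can be sketched end to end in self-contained Python. The BM25 scorer below is a from-scratch (Lucene-style IDF) implementation and the embeddings are tiny stand-in vectors; the released pipeline uses full BM25 machinery and BAAI/bge-large-en:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, chunks, k1=1.5, b=0.75):
    """Score each tokenized chunk against the query with BM25
    (Lucene-style non-negative IDF)."""
    N = len(chunks)
    avgdl = sum(len(c) for c in chunks) / N
    df = Counter()
    for c in chunks:
        df.update(set(c))           # document frequency per term
    scores = []
    for c in chunks:
        tf = Counter(c)
        s = 0.0
        for w in query_tokens:
            if w not in tf:
                continue
            idf = math.log(1 + (N - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(c) / avgdl))
        scores.append(s)
    return scores

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

chunks = [["fever", "and", "cough"], ["discharged", "home"], ["started", "on", "antibiotics"]]
query = ["fever", "cough", "antibiotics"]
bm25_hits = {i for i, s in enumerate(bm25_scores(query, chunks)) if s > 0}

# Stand-in 2-d embeddings (the pipeline uses 1024-d bge-large-en vectors);
# indices 0 and 2 are constructed to be similar to the query.
embs = {"q": [1.0, 0.2], 0: [0.9, 0.3], 1: [0.0, 1.0], 2: [0.8, 0.1]}
sem_hits = {i for i in range(len(chunks)) if cosine(embs["q"], embs[i]) >= 0.75}

candidates = bm25_hits | sem_hits   # deduplicated union of both retrievers
```

The union-of-sets step is the key design point: either retriever alone can miss events phrased with non-overlapping vocabulary (BM25) or with weak semantic similarity (embeddings), so recall is maximized by combining both before the LLM pass.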
3. Temporal Information Extraction and Event Annotation
Event–timestamp extraction relies on a chain-of-thought prompt given to Llama-3.1-8B. Event guidance instructs the model to identify all health-related actions, symptoms, or clinical states in the chunk, explicitly splitting conjunctive mentions (e.g., “fever and cough” → separate events). Timestamp guidance treats hospital admission (“Admission to hospital”) as time zero; events prior to admission receive negative hours, post-admission events positive hours, using explicit or inferred cues. For event $e_i$ with inferred absolute timestamp $\tau_i$, the relative time is computed as:

$$t_i = \tau_i - \tau_{\mathrm{adm}}$$

with $\tau_{\mathrm{adm}} = 0$ by convention. When explicit temporal data are absent, the model's judgment translates natural-language intervals into signed hours (e.g., “a few weeks ago” into a negative hour offset).
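For intuition, a fixed lookup table like the one below reproduces the kind of phrase-to-hours mapping described; this table is purely hypothetical, since the released pipeline delegates the translation to Llama-3.1-8B's judgment rather than hard-coded rules:

```python
def relative_hours(phrase):
    """Map a natural-language temporal cue to signed hours relative to
    admission (time zero); negative values precede admission.
    Hypothetical fixed table for illustration only."""
    table = {
        "on admission": 0,
        "two days prior to admission": -2 * 24,
        "a few weeks ago": -3 * 7 * 24,   # assume "a few" means ~3 weeks
        "hospital day 3": 2 * 24,         # hospital day 1 is the admission day
    }
    return table[phrase]

offset = relative_hours("a few weeks ago")  # a negative pre-admission offset
```

In the actual dataset, vague phrases are resolved by the LLM in context, which is also the main source of the timestamp noise discussed in the limitations section.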
4. Dataset Composition and Structure
The dataset comprises 267,284 summaries (those yielding at least one extracted event–time pair), 22,588,586 event–time pairs (mean 84 per summary, max 244), and short free-text event spans (max 299 tokens). The relative-time distribution is 36.99% pre-admission ($t < 0$), 51.19% during hospitalization, and 11.80% after discharge ($t$ beyond the stay). The schema is as follows:
| Column | Type | Description |
|---|---|---|
| hadm_id | INT | Unique hospital-admission identifier |
| event | TEXT | Untokenized free-text clinical span |
| time | INT | Relative timestamp (hours) |
| time_bin | INT | Discretized bin |
Time bins are defined by discretizing the relative timestamp $t$ (in hours) into fixed intervals.
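The binning step amounts to a sorted-edge lookup. The edges below are hypothetical (chosen at day/week/month scales for illustration); the released dataset defines its own boundaries:

```python
import bisect

# Hypothetical bin edges in hours relative to admission:
# -30d, -7d, -1d, admission, +1d, +7d, +30d.
BIN_EDGES = [-720, -168, -24, 0, 24, 168, 720]

def time_bin(t_hours):
    """Discretize a relative timestamp (hours from admission) into an
    integer bin index via binary search over the sorted edges."""
    return bisect.bisect_right(BIN_EDGES, t_hours)

bin_idx = time_bin(-500)  # falls between -30d and -7d
```

Integer bins let models treat coarse temporal position as a categorical feature (e.g., as an embedding index in Temporal BERT) rather than a raw continuous value.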
5. Benchmarking and Performance
MIMIC-IV-Ext-22MCTS supports several benchmark predictive and generative tasks:
- Causal-Correlation Classification: For event–time pairs $(e_i, t_i)$ and $(e_j, t_j)$, the model predicts whether $e_j$ is a consequence of $e_i$, a possible cause, or uncorrelated. Temporal BERT architectures (BERT-base-uncased with an added time embedding) are used. Training employs an 80/20 train–validation split, five epochs, and the Adam optimizer.
- Medical Question Answering: On PubMedQA, Temporal BERT fine-tuned on MIMIC-IV-Ext-22MCTS improves accuracy over baseline BERT by 10 percentage points absolute.
- Clinical Trial Matching: On TREC 2021/2022 benchmarks, NDCG@10 improves (33.28 → 36.53 for 2021; 29.43 → 29.94 for 2022), with similar gains in Precision@10 and Recall@100.
- Text Generation: GPT-2 fine-tuned on the dataset, with additional [TIME] and [EVENT] tokens, produces more clinically coherent outputs (e.g., correct drug-dose recommendations) than baseline GPT-2.
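To make the causal-correlation task concrete, the sketch below builds candidate event-pair instances from one summary's time-ordered events. The temporal-window heuristic and the label names are placeholders of my own, not the paper's labeling procedure, which relies on model predictions over consequence / possible cause / uncorrelated:

```python
from itertools import combinations

def make_pairs(events, window_hours=72):
    """Build candidate (e_i, e_j) classification instances from a list of
    (event, relative_hours) tuples. Heuristic labels for illustration:
    a pair within `window_hours` is a causal candidate, else uncorrelated."""
    events = sorted(events, key=lambda x: x[1])   # time-order the events
    pairs = []
    for (ei, ti), (ej, tj) in combinations(events, 2):
        gap = tj - ti
        label = "candidate" if 0 < gap <= window_hours else "uncorrelated"
        pairs.append(((ei, ti), (ej, tj), label))
    return pairs

events = [("fever", -12), ("antibiotics started", 4), ("discharged", 120)]
pairs = make_pairs(events)  # 3 ordered pairs from 3 events
```

With a mean of 84 events per summary, exhaustive pairing grows quadratically, so a temporal window of this kind is a natural way to bound the instance count per summary.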
6. Usage Considerations and Best Practices
For retrieval, the “Brief Hospital Course” summary should serve as the query. Sharding should be performed with the provided code, using five-token chunks plus ±5-token context. Both BM25 and embedding-based search are recommended to maximize retrieval recall. Following LLM annotation, minimal postprocessing is advised to remove duplicates and invalid timestamps. For supervised training, event–time sequences are concatenated into a single input string, e.g. `[TIME] t_1 [EVENT] e_1 [TIME] t_2 [EVENT] e_2 …`, using the special tokens introduced for text generation.
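A minimal serializer for this concatenation step might look as follows; the exact template is an assumption based on the `[TIME]`/`[EVENT]` special tokens described for the text-generation benchmark:

```python
def linearize(events):
    """Serialize (event, relative_hours) pairs into one training string
    using [TIME]/[EVENT] special tokens, in chronological order.
    The template is an assumed format, not the released one."""
    parts = []
    for event, t in sorted(events, key=lambda x: x[1]):
        parts.append(f"[TIME] {t} [EVENT] {event}")
    return " ".join(parts)

seq = linearize([("antibiotics started", 4), ("fever", -12)])
```

Sorting by timestamp before serialization keeps the generated sequence consistent with the chronological reading order a decoder-only model like GPT-2 is trained to continue.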
Sample pseudo-API (abridged from original):
```python
from mimic_ext_22mcts import load_events

# Load the released event–time pairs.
df = load_events(split='train')

# Segment a discharge summary into 5-token chunks with ±5-token context.
chunks = chunk_summary(my_summary, chunk_size=5, ctx=5)

# Candidate retrieval 1: BM25 with the Brief Hospital Course as query.
bm25 = BM25(corpus=chunks)
hits = bm25.query(bhc_summary, k=100)

# Candidate retrieval 2: embedding similarity with threshold 0.75.
embs_q = embedder.encode(bhc_summary)
embs_c = embedder.encode(chunks)
sem_hits = [c for c, s in zip(chunks, sim(embs_q, embs_c)) if s >= 0.75]

# Deduplicated union of both retrievers, then LLM annotation.
candidates = set(hits) | set(sem_hits)
annotations = llama_annotate(candidates, prompt=standard_prompt)
```
7. Limitations and Prospective Extensions
Several limitations are inherent in the current release:
- Domain Bias: The dataset is limited to adult ICU patients at a single U.S. center; generalizability to pediatric, outpatient, or non-U.S. settings remains untested.
- Event Granularity: Extracted events are free text, lacking alignment to ontologies such as UMLS/SNOMED; structured coding is left to downstream mapping.
- Timestamp Noise: Relative times rest on LLM inference; occasional clinically implausible or hallucinated values are possible.
- Metadata: Demographics and structured EHR (labs, vitals) are absent.
Future work could integrate structured EHR (e.g., labs, vitals), align events post hoc to ontology codes, extend the framework to additional note types (progress, radiology), and incorporate uncertainty quantification on timestamps (e.g., confidence intervals). Model fine-tuning at larger scale (Llama 3 70B, GPT-4-class) is also a suggested avenue.
In summary, MIMIC-IV-Ext-22MCTS establishes a unique resource at scale for fine-grained, temporally explicit modeling of clinical risk, supporting benchmarking and development of advanced temporal analysis algorithms for healthcare research (Wang et al., 1 May 2025).