
MIMIC-CXR-LT Dataset Overview

Updated 8 December 2025
  • MIMIC-CXR-LT is a curated family of large-scale chest radiograph datasets designed for long-tailed, multi-label, and zero-shot disease classification.
  • It leverages NLP annotation pipelines such as RadText and the CheXpert labeler to generate multi-hot labels, preserving co-occurrence patterns and managing label noise.
  • Robust preprocessing, stratified patient-level splits, and data augmentation strategies enable standardized benchmarking and fairness analysis in clinical imaging.

The MIMIC-CXR-LT dataset refers to a family of large-scale chest radiograph collections, derived from the MIMIC-CXR archive, explicitly curated for research in long-tailed, multi-label, and (in the most recent releases) zero-shot disease classification. These datasets exemplify clinically realistic imbalances, noise sources, and co-occurrence patterns, and form the basis for a range of MICCAI/ICCV/ICMI benchmarks and challenges in automated chest X-ray (CXR) interpretation. This entry reviews the technical design, label ontologies, data curation, split statistics, downstream benchmarks, and methodological consequences of the MIMIC-CXR-LT series and directly related variants.

1. Dataset Genesis, Curation, and Versions

MIMIC-CXR-LT is constructed from MIMIC-CXR-JPG (v2.0.0), a de-identified corpus of over 377,000 chest radiographs (JPEG format) acquired at Beth Israel Deaconess Medical Center (2011–2016), paired with corresponding radiology reports (Rubin et al., 2018, Holste et al., 2023, Lin et al., 9 Jun 2025, Lai et al., 2023, Huang et al., 16 Nov 2024). The dataset family is heterogeneous by design, reflecting evolving community best practices in handling long-tailed and multi-label disease distributions:

  • Original CXR-LT Challenge (2023): 377,110 images (frontal and lateral), annotated for 26 clinical findings via rule-based NLP (RadText), partitioned into train/dev/test by patient (≈ 70/10/20%). Every image possesses at least one positive label. The label set includes common thoracic findings (pleural effusion, support devices, etc.) and 12 rare diseases extracted via RadText mapping.
  • CXR-LT 2024 Expansion: 377,110 images, 45 labels (14 original, 12 from 2023, 19 newly added from PadChest/Fleischner glossary). Five of these (Bulla, Cardiomyopathy, Hilum, Osteopenia, Scoliosis) are withheld entirely from training for dedicated zero-shot benchmarking. Gold-standard evaluation on 406 hand-annotated studies is provided.
  • Pruned and Specialized Derivatives: The Pruned MIMIC-CXR-LT (Huang et al., 16 Nov 2024) applies strict view, age, and label filters, yielding 257,018 frontal studies with 19 class labels and demographic metadata for fairness analysis. LTML-MIMIC-CXR (Lai et al., 2023) expands the ontology to 39 labels (13 common, 26 rare), focused on rare-disease sensitivity.

All dataset variants enforce patient-level splits to prevent data leakage and stratify them to preserve the empirical long-tailed distribution, as sketched below.
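
As a minimal sketch of such a split (assuming a metadata DataFrame with a hypothetical `subject_id` column identifying the patient), a grouped split guarantees that no patient crosses partitions; exact stratification over multi-hot labels would additionally require iterative stratification, omitted here for brevity:

```python
# Minimal sketch of a patient-level split. Assumes a DataFrame with a
# hypothetical "subject_id" column (one patient may have many images).
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(df: pd.DataFrame, test_size: float = 0.2, seed: int = 0):
    """Split rows so that no patient appears in both partitions."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df["subject_id"]))
    return df.iloc[train_idx], df.iloc[test_idx]
```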

2. Label Taxonomy and NLP Annotation Pipelines

MIMIC-CXR-LT’s label ontology is built to capture both high-prevalence findings and very rare abnormalities. Annotation relies on automatic NLP methods:

  • NLP Tools: RadText (primary for CXR-LT challenges), CheXpert labeler (for Pruned/other variants), and NegBio (Rubin et al., 2018, Holste et al., 2023, Lin et al., 9 Jun 2025, Lai et al., 2023, Huang et al., 16 Nov 2024).
  • Label Expansion: The original 14 MIMIC-CXR classes are extended—first to 26 (including rare findings) and then to 45 in CXR-LT 2024 using external ontologies (PadChest, Fleischner glossary, Gamuts), always permitting multi-label (multi-hot) annotation.
  • Positive Label Assignment: A label is positive if affirmed in the report and neither negated nor marked uncertain by the NLP output. “No Finding” is marked only when no pathologic finding (apart from “Support Devices”) is extracted.
  • Co-Occurrence: Co-occurrence statistics are preserved and presented as conditional frequency matrices, never decorrelated; each image may carry multiple abnormal findings (see the sketch below).
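
A conditional co-occurrence matrix of this kind can be computed directly from the multi-hot labels; the following sketch assumes a hypothetical binary array `Y` of shape `[num_images, num_classes]`:

```python
# Sketch: conditional co-occurrence from a multi-hot label matrix Y, where
# entry [i, j] approximates P(label j present | label i present).
import numpy as np

def conditional_cooccurrence(Y: np.ndarray) -> np.ndarray:
    counts = Y.T @ Y                            # [i, j] = #images with both i and j
    per_class = np.diag(counts).astype(float)   # #images carrying label i
    return counts / np.maximum(per_class, 1.0)[:, None]  # row-normalize by label i
```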

In some variants (e.g., LTML-MIMIC-CXR), rule-based outputs are manually reviewed on test splits, and data-driven denoising strategies are implemented for improved rare disease fidelity (Lai et al., 2023, Lin et al., 9 Jun 2025).

3. Data Splits, Class Distribution, and Imbalance Properties

All MIMIC-CXR-LT datasets enforce patient-level stratification. Split sizes and class structure from canonical releases are summarized below.

| Version / Split | Image Count | Label Set Size | Head Classes* | Tail Classes* |
|---|---|---|---|---|
| CXR-LT (2023/2024), train | 258,871 | 40/45 | 9 | 15–16 (<1%) |
| CXR-LT, dev / test | 39,293 / 78,946 | 40/45 | — | — |
| Pruned MIMIC-CXR-LT (train/val/test) | 182,380 / 20,360 / 54,268 | 19 | 9 (>10,000) | 5 (≤1,000) |
| LTML-MIMIC-CXR (total) | 206,477 | 39 | 13 | 12 (<1,000) |

*Bucket definitions follow published papers: “head” (>10%), “medium” (1–10%), “tail” (<1%) prevalence (Holste et al., 2023, Lin et al., 9 Jun 2025, Lai et al., 2023, Huang et al., 16 Nov 2024).

  • Imbalance Ratio: In CXR-LT 2024, $N_{\max}/N_{\min} > 400$, with some labels appearing in tens of thousands of studies and others in only a few hundred. Rank–frequency curves follow approximate Zipf-like decay (Holste et al., 2023, Lin et al., 9 Jun 2025); a sketch of these statistics follows this list.
  • Class Distribution: For CXR-LT, the two most common classes (“Support Devices,” “Pleural Effusion”) each appear in >60% of training images; the rarest classes are present in <0.2%. In the Pruned variant, the head–tail split spans nearly three orders of magnitude (104,364 vs. 553 samples) (Huang et al., 16 Nov 2024).
  • Co-occurrence: Multi-label co-occurrence patterns (e.g., Edema↔Effusion) are central to data modeling and evaluation (Holste et al., 2023).
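
These prevalence statistics are straightforward to reproduce; a minimal sketch, again assuming a binary multi-hot label matrix `Y` of shape `[num_images, num_classes]`:

```python
# Sketch: per-class prevalence, imbalance ratio N_max / N_min, and the
# rank-frequency (Zipf-style) curve from a multi-hot label matrix Y.
import numpy as np

def imbalance_stats(Y: np.ndarray):
    counts = Y.sum(axis=0)                        # positive studies per class
    ratio = counts.max() / max(counts.min(), 1)   # imbalance ratio
    rank_freq = np.sort(counts)[::-1]             # plot on log-log axes for Zipf decay
    return counts, ratio, rank_freq
```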

4. Preprocessing, Standardization, and Noise Handling

Image preprocessing pipelines vary by release, and label-noise handling is a central design concern.

Successful teams employ generative denoising models, Noisy Student self-training, and manual review to mitigate label noise, reporting a near doubling in mAP on the gold-standard labels versus the full, noisier test set (Lin et al., 9 Jun 2025, Lai et al., 2023).

5. Benchmark Tasks, Metrics, and Modeling Methodologies

MIMIC-CXR-LT supports a range of research tasks, all formulated as multi-label classification:

  • Long-Tailed Multi-Label Benchmark: Standard task is to predict a multi-hot vector of findings per image, emphasizing robustness to severe class imbalance.
  • Zero-Shot Classification: In CXR-LT 2024, five labels are withheld entirely from training; evaluation quantifies a model’s ability to generalize to unseen diseases (Lin et al., 9 Jun 2025).
  • Gold-Standard Evaluation: On the hand-annotated subset, performance is measured under near noise-free conditions.

Metrics:

  • Macro-averaged AP (mAP):

$$\mathrm{mAP} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{AP}_c$$

where $\mathrm{AP}_c$ is the area under the precision–recall curve for class $c$ (Holste et al., 2023, Lin et al., 9 Jun 2025).
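
A direct implementation using scikit-learn's per-class average precision (a step-wise approximation of the PR-curve area) is sketched below; `y_true` and `y_score` are assumed to be arrays of shape `[num_images, C]`:

```python
# Sketch: macro-averaged AP over C classes. y_true holds multi-hot ground
# truth; y_score holds per-class prediction scores.
import numpy as np
from sklearn.metrics import average_precision_score

def macro_map(y_true: np.ndarray, y_score: np.ndarray) -> float:
    aps = [average_precision_score(y_true[:, c], y_score[:, c])
           for c in range(y_true.shape[1])]   # per-class AP, useful for reporting
    return float(np.mean(aps))
```

The explicit per-class loop also yields the per-class breakdown recommended in Section 6; `average_precision_score(y_true, y_score, average="macro")` computes the same aggregate in a single call.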

Methodological Innovations:

  • Loss Reweighting / Negative Logits Handling: ANR-BCE (adaptive negative regularization for binary cross-entropy) and large loss reconsideration (LLR) are key for boosting recall in rare-disease detection (Lai et al., 2023); a generic reweighting baseline is sketched after this list.
  • Pretraining and Multimodal Methods: Transfer from CheXpert, NIH-CXR, VinDr-CXR, and use of ConvNeXt, EfficientNet, ViT, and ML-Decoder backbones (Lin et al., 9 Jun 2025, Huang et al., 16 Nov 2024).
  • Vision-Language Fusion: Multimodal approaches with ML-Decoder, CLIP, and prompt-based vision-LLMs dominate recent benchmarks, particularly for zero-shot generalization (Lin et al., 9 Jun 2025).
  • Synthetic Augmentation: Denoising diffusion and tailored resampling are effective for underrepresented classes (Lin et al., 9 Jun 2025).
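
As a generic illustration of loss reweighting (not the ANR-BCE or LLR formulations from the cited papers), a class-frequency-weighted binary cross-entropy upweights positives of rare classes:

```python
# Illustrative reweighted BCE for long-tailed multi-label training; this is a
# common baseline, not the ANR-BCE / LLR methods from the papers.
import torch
import torch.nn.functional as F

def reweighted_bce(logits, targets, class_counts, num_images, cap=100.0):
    counts = class_counts.float()               # positives per class, shape [C]
    neg = float(num_images) - counts
    # Per-class positive weight ~ negatives/positives, capped for stability.
    pos_weight = torch.clamp(neg / torch.clamp(counts, min=1.0), max=cap)
    return F.binary_cross_entropy_with_logits(logits, targets, pos_weight=pos_weight)
```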

6. Usage Protocols, Access Control, and Practical Recommendations

Access to all MIMIC-CXR-LT variants is gated via PhysioNet; compliance with the data use agreement and prior MIMIC-CXR credentials (human subject training, IRB approval) is required (Holste et al., 2023, Huang et al., 16 Nov 2024). Usage guidelines and best practices include:

  • Subject-level Split Enforcement: Always split by patient to avoid leakage.
  • Label Quality Auditing: Validate NLP-derived labels on a manual subset before clinical deployment.
  • Fairness Assessment: Employ subgroup metrics such as equalized odds (EO) and false-negative rate (FNR) per demographic group, and stratify results to quantify potential algorithmic bias (Huang et al., 16 Nov 2024); a minimal FNR sketch follows this list.
  • Input Resolution: Train at resolutions ≥300×300; multiple studies train at 512×512 or higher (Holste et al., 2023).
  • Reporting: Always report macro-averaged metrics, but also provide per-class breakdown (especially for rare pathologies and clinical risk stratification) (Holste et al., 2023, Lin et al., 9 Jun 2025, Huang et al., 16 Nov 2024).
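
A minimal sketch of a per-subgroup false-negative rate for a single finding, assuming hypothetical binary arrays `y_true`, `y_pred` and a parallel `groups` array of subgroup identifiers:

```python
# Sketch: false-negative rate (FNR) per demographic subgroup for one finding.
import numpy as np

def fnr_by_subgroup(y_true, y_pred, groups):
    out = {}
    for g in np.unique(groups):
        m = groups == g
        pos = y_true[m] == 1
        if pos.sum() == 0:
            continue                     # no positives in this subgroup
        out[g] = float(((y_pred[m] == 0) & pos).sum() / pos.sum())
    return out
```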

7. Limitations, Critiques, and Prospective Directions

Known limitations range from inherent biases (age and sex missing in most public releases; ICU-driven skew toward AP over PA views) to residual errors from NLP labeling (false negatives, especially on rare classes) and a lack of multimodal clinical context (Rubin et al., 2018, Huang et al., 16 Nov 2024, Lin et al., 9 Jun 2025). MIMIC-CXR-LT does not incorporate longitudinal data or non-pixel context.

Recent recommendations emphasize:

  • The incorporation of additional gold-standard annotations via active learning.
  • Multi-institutional validation to counteract site bias.
  • Release of richer subgroup metadata.
  • Fusion with LLM-based relabeling to scale high-fidelity labels (Lin et al., 9 Jun 2025).

MIMIC-CXR-LT remains a pivotal open benchmark for research into real-world long-tailed, multi-label, and zero-shot medical image classification, enabling systematic development and evaluation of robust, fair, and generalizable diagnostic models (Holste et al., 2023, Lin et al., 9 Jun 2025, Lai et al., 2023, Huang et al., 16 Nov 2024).
