MIMIC-CXR-JPG Dataset
- MIMIC-CXR-JPG is a large-scale, publicly available chest radiograph dataset with 377,110 high-resolution images from 227,835 studies of 65,379 patients.
- It employs a rigorous image processing pipeline including normalization, intensity inversion, contrast enhancement, and JPEG compression at quality 95 to convert 16-bit DICOM files into 8-bit JPEGs.
- The dataset features multi-label extraction using NLP tools like CheXpert and NegBio, supporting research in disease classification, long-tailed learning, and vision-language model benchmarking.
MIMIC-CXR-JPG is a large-scale, publicly available dataset of de-identified chest radiographs (CXR) designed to facilitate algorithmic research in medical computer vision. Released by the MIT Laboratory for Computational Physiology in 2019, this resource provides high-resolution JPEG images and associated structured labels derived from free-text radiology reports acquired at Beth Israel Deaconess Medical Center (BIDMC) between 2011 and 2016. MIMIC-CXR-JPG has become foundational for benchmarking image-based methods in multi-label disease classification, interpretability, and long-tailed learning in CXR analysis (Johnson et al., 2019, Madhipati et al., 25 Jul 2025, Williams et al., 2022).
1. Dataset Composition and Demographics
With 377,110 chest X-ray images from 227,835 imaging studies and 65,379 unique patients, MIMIC-CXR-JPG is, by design, both large and demographically diverse. Images were generated from the parent MIMIC-CXR (DICOM) dataset and were de-identified via automated, HIPAA-compliant procedures; any suspected protected health information (PHI) was masked. The view distribution in the dataset is predominantly frontal (251,714 images, 66.8%), with the remainder comprising lateral views (122,538, 32.5%) and a small number labeled as other projections (858). Demographic attributes such as age, gender, and race are available by cross-referencing the parent database, but were not released directly in the earliest JPG-specific papers (Johnson et al., 2019).
2. Image Processing and Data Format
The dataset's images are 8-bit single-channel JPEG files, derived from original 16-bit DICOM images. The processing pipeline encompasses:
- Linear normalization of pixel intensities: $I' = 255 \cdot \frac{I - \min(I)}{\max(I) - \min(I)}$, where $I$ is the raw DICOM array, and $I'$ is the result cast to uint8.
- Intensity inversion where dictated by the DICOM PhotometricInterpretation attribute.
- Contrast enhancement using histogram equalization (OpenCV's cv::equalizeHist).
- JPEG compression at quality=95, preserving the original spatial resolution. No resizing is performed at this stage (Johnson et al., 2019).
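The first three steps can be sketched in NumPy as follows. This is an illustrative reconstruction, not the released conversion script: the function names are hypothetical, rounding to the nearest integer is an assumption, and `equalize_hist` is a plain NumPy stand-in for OpenCV's cv::equalizeHist that may differ from it in rounding details.

```python
import numpy as np


def normalize_to_uint8(raw: np.ndarray) -> np.ndarray:
    """Linearly rescale raw pixel values to [0, 255] and cast to uint8."""
    raw = raw.astype(np.float64)
    lo, hi = raw.min(), raw.max()
    if hi == lo:  # constant image: avoid division by zero
        return np.zeros(raw.shape, dtype=np.uint8)
    return np.round(255.0 * (raw - lo) / (hi - lo)).astype(np.uint8)


def invert_if_monochrome1(img: np.ndarray, photometric: str) -> np.ndarray:
    """Flip intensities when the DICOM header says MONOCHROME1."""
    return 255 - img if photometric == "MONOCHROME1" else img


def equalize_hist(img: np.ndarray) -> np.ndarray:
    """Global histogram equalization of an 8-bit image via a lookup table."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # CDF value at the first occupied bin
    total = img.size
    if total == cdf_min:  # single gray level: nothing to equalize
        return img.copy()
    lut = np.round(255.0 * (cdf - cdf_min) / (total - cdf_min))
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[img]


def dicom_pixels_to_8bit(raw: np.ndarray, photometric: str) -> np.ndarray:
    """Chain the three steps in the order listed above."""
    img = normalize_to_uint8(raw)
    img = invert_if_monochrome1(img, photometric)
    return equalize_hist(img)
```

The final JPEG encoding step is left out because it depends on the image library in use; with OpenCV it would typically be a `cv2.imwrite` call with the `cv2.IMWRITE_JPEG_QUALITY` parameter set to 95.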
Data consumers resize images to required input dimensions in downstream tasks; for example, 224×224 for ResNet50 and CLIP ViT-B/32 (Williams et al., 2022, Madhipati et al., 25 Jul 2025).
3. Labeling and Ground Truth Extraction
MIMIC-CXR-JPG's primary labels encompass 14 radiographic findings, adopted from the CheXpert vocabulary. These include Atelectasis, Cardiomegaly, Pleural Effusion, Consolidation, Edema, “No Finding,” and others relevant to thoracic pathology. Labels are derived via an ensemble NLP approach using the CheXpert labeler and NegBio. The system detects disease mentions through dictionaries and context rules, assigning each mention a status: Positive, Uncertain, Negative, or Disagreement (if CheXpert and NegBio diverge) (Johnson et al., 2019).
The label extraction system was evaluated on a manually annotated subset, achieving F₁ scores ≥0.93 for common findings such as pneumothorax and pleural effusion. The pipeline was optimized for mention extraction and contextual negation in report text, so the labels reflect what the report states rather than explicit visual ground truth.
Subsequent extensions, such as those supporting the MICCAI “CXR-LT” challenge, employ a radiology report parsing pipeline to yield 40 multi-label findings. Labels are treated as independent binary targets in multi-label learning settings (Madhipati et al., 25 Jul 2025).
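A minimal sketch of turning per-finding label statuses into independent binary targets is shown below. It assumes the common numeric encoding of 1.0 for positive, 0.0 for negative, -1.0 for uncertain, and a missing value for findings that are not mentioned; the function names and the `uncertain_as` policy switch are hypothetical, and how to treat uncertain mentions is a modeling choice that the text above does not prescribe.

```python
from typing import Dict, List, Optional

# Assumed numeric encoding of label statuses; None means "not mentioned".
POSITIVE, NEGATIVE, UNCERTAIN = 1.0, 0.0, -1.0


def to_binary_target(value: Optional[float], uncertain_as: int = 0) -> int:
    """Map one label status to a 0/1 target.

    uncertain_as=0 treats uncertain mentions as negative,
    uncertain_as=1 treats them as positive.
    """
    if value is None:
        return 0  # finding not mentioned in the report
    if value == UNCERTAIN:
        return uncertain_as
    return 1 if value == POSITIVE else 0


def encode_study(
    labels: Dict[str, Optional[float]],
    findings: List[str],
    uncertain_as: int = 0,
) -> List[int]:
    """Build a multi-label target vector in a fixed finding order."""
    return [to_binary_target(labels.get(f), uncertain_as) for f in findings]


findings = ["Atelectasis", "Cardiomegaly", "Edema", "Pleural Effusion"]
study = {"Atelectasis": 1.0, "Cardiomegaly": 0.0, "Edema": -1.0}
print(encode_study(study, findings))                  # uncertain -> 0
print(encode_study(study, findings, uncertain_as=1))  # uncertain -> 1
```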
4. Split Protocols and Recommended Evaluation
MIMIC-CXR-JPG provides patient-disjoint splits:
- Train: 368,960 images (222,758 studies, 64,586 patients)
- Validate: 2,991 images (1,808 studies, 500 patients)
- Test: 5,159 images (3,269 studies, 293 patients)
The test set is held out and pathologically enriched via stratified sampling. All splits are provided at the patient level to prevent leakage.
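The split sizes above can be checked against the dataset totals, and patient-level disjointness can be verified with a short helper. The sketch below assumes split assignments are available as (subject_id, split) pairs; the function name and record layout are illustrative, not part of the dataset's tooling.

```python
from collections import defaultdict

# (images, studies, patients) per split, as listed above.
SPLIT_COUNTS = {
    "train": (368_960, 222_758, 64_586),
    "validate": (2_991, 1_808, 500),
    "test": (5_159, 3_269, 293),
}

# Column-wise sums should reproduce the dataset totals.
totals = tuple(sum(column) for column in zip(*SPLIT_COUNTS.values()))


def find_leaking_patients(records):
    """Return patients that appear in more than one split.

    records: iterable of (subject_id, split) pairs.
    """
    splits_by_patient = defaultdict(set)
    for subject_id, split in records:
        splits_by_patient[subject_id].add(split)
    return {p: s for p, s in splits_by_patient.items() if len(s) > 1}


# Toy records with made-up subject IDs; patient 3 violates disjointness.
records = [(1, "train"), (1, "train"), (2, "validate"), (3, "train"), (3, "test")]
print(totals)
print(find_leaking_patients(records))
```

An empty result from `find_leaking_patients` on the real split file is what a patient-disjoint protocol requires.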
Suggested metrics for evaluation include AUROC per finding (multi-label, one-vs-rest), sensitivity (recall), specificity, and precision-recall curves for low-prevalence pathologies. F₁ scores are specifically reported for mention extraction and context classification tasks in NLP-based report understanding (Johnson et al., 2019, Madhipati et al., 25 Jul 2025).
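As a rough illustration of per-finding AUROC and its macro average, the sketch below computes AUROC as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The function names are hypothetical, and the quadratic pairwise loop is chosen for clarity; at dataset scale a library routine would be the practical choice.

```python
def auroc(y_true, y_score):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_finding_auroc(y_true_rows, y_score_rows, findings):
    """One-vs-rest AUROC for each finding (one column per finding)."""
    true_cols = zip(*y_true_rows)
    score_cols = zip(*y_score_rows)
    return {
        name: auroc(list(t), list(s))
        for name, t, s in zip(findings, true_cols, score_cols)
    }


def macro_auroc(per_finding):
    """Unweighted mean over findings whose AUROC is defined."""
    values = [v for v in per_finding.values() if v is not None]
    return sum(values) / len(values) if values else None


y_true = [[1, 0], [0, 0], [1, 1], [0, 1]]
y_score = [[0.9, 0.2], [0.1, 0.3], [0.8, 0.7], [0.3, 0.6]]
scores = per_finding_auroc(y_true, y_score, ["Edema", "Pneumothorax"])
print(scores, macro_auroc(scores))
```

Because the macro average weights every finding equally, rare findings influence it as much as common ones, which is why it is often preferred for imbalanced label sets.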
5. Derived Subsets and Specialized Cohorts
Linkage to the MIMIC-IV clinical data allows specialized cohorts to be defined on top of MIMIC-CXR-JPG. For example, one heart failure study filtered for ICD-10 codes indicating either reduced (HFrEF: EF<40%) or preserved (HFpEF: EF>50%) ejection fraction, yielding 3,488 images: 2,010 HFrEF and 1,478 HFpEF. Demographics for this heart failure cohort include a median age of 71 years (IQR 61–81), a sex distribution of 1,579 female to 1,909 male, and a race/ethnicity breakdown of White 60%, Black 19.4%, Asian 3%, and Hispanic/Latino 5.4%. Patients are not allowed to overlap between data splits, and only studies with clear ICD-10-based phenotypes are included (Williams et al., 2022).
6. Class Distribution, Imbalance, and Long-Tailed Phenomena
The multi-label structure yields a markedly imbalanced class distribution. Of the 40-class label set established for CXR-LT and “long-tail” research, 11 classes are “common” (>10,000 images each), 17 “medium” (1,000–10,000), and 12 “rare” (<1,000, ≈2% of the total). The most frequent class may exhibit >40× the sample size of the rarest label. This pronounced long tail has motivated methodology such as class-weighting and latent cluster alignment (Gaussian Mixture Model plus Student's t-distribution) to address representation collapse and to improve rare class performance (Madhipati et al., 25 Jul 2025).
| Frequency Category | Images per Class | No. of Classes |
|---|---|---|
| Common | >10,000 | 11 |
| Medium | 1,000–10,000 | 17 |
| Rare | <1,000 | 12 |
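The bucket boundaries in the table, together with one simple form of class weighting, can be sketched as follows. The weighting shown (ratio of negative to positive examples per class) is a common choice for multi-label binary losses and is an assumption here; the cited work's exact weighting scheme is not specified above, and the class counts in the example are made up.

```python
def frequency_bucket(n_images: int) -> str:
    """Assign a class to the common / medium / rare bucket by image count."""
    if n_images > 10_000:
        return "common"
    if n_images >= 1_000:
        return "medium"
    return "rare"


def positive_class_weights(class_counts: dict, n_total: int) -> dict:
    """Weight each class by (negatives / positives).

    Rare classes receive larger weights, so their positive examples
    contribute more to a weighted binary cross-entropy loss.
    """
    return {
        name: (n_total - count) / count
        for name, count in class_counts.items()
        if count > 0
    }


# Hypothetical counts for illustration only.
counts = {"finding_a": 50_000, "finding_b": 4_000, "finding_c": 500}
print({name: frequency_bucket(n) for name, n in counts.items()})
print(positive_class_weights(counts, n_total=100_000))
```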
The base label set's prevalence in training splits includes: “no finding” 33.0%, atelectasis 19.8%, cardiomegaly 17.2%, pleural effusion 23.3%, support devices 28.8%, and others as detailed in the dataset documentation (Johnson et al., 2019).
7. Applications, Performance, and Research Directions
MIMIC-CXR-JPG has established itself as a reference corpus for:
- Multi-label disease classification, including rare pathologies
- Benchmarking vision-language models (e.g., CLIP) and self-supervised learning
- Evaluating long-tail and metric-learning strategies (GMM/t-distributions, triplet loss)
- Structured cohort selection for clinical studies (e.g., heart failure phenotyping based on ejection fraction)
Recent work reports macro-AUCs between 0.64 and 0.72 on the 40-class zero-shot task, with rare-class performance substantially improved by cluster-weighted contrastive learning (Madhipati et al., 25 Jul 2025). Data augmentation techniques such as random rotation (θ ∼ U[−10°, +10°]) and random cropping also yield performance gains, as observed in specialized heart failure phenotyping (Williams et al., 2022).
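The augmentation parameters mentioned above could be sampled along the following lines. This sketch only draws the rotation angle and crop box; applying them to pixels is left to whichever image library is in use. The function name, the 256-pixel source size, and the 224-pixel crop size are assumptions for illustration, since the cited study's crop dimensions are not given here.

```python
import random


def sample_augmentation(width, height, crop_w, crop_h, max_angle=10.0, rng=None):
    """Draw a rotation angle and a crop box for one training image.

    Returns (angle_degrees, (left, top, right, bottom)).
    """
    if crop_w > width or crop_h > height:
        raise ValueError("crop must fit inside the image")
    rng = rng or random.Random()
    angle = rng.uniform(-max_angle, max_angle)  # theta ~ U[-10, +10] by default
    left = rng.randint(0, width - crop_w)       # inclusive bounds
    top = rng.randint(0, height - crop_h)
    return angle, (left, top, left + crop_w, top + crop_h)


rng = random.Random(0)  # fixed seed so the draw is reproducible
angle, box = sample_augmentation(256, 256, 224, 224, rng=rng)
print(angle, box)
```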
All data are accessible for bona-fide research via PhysioNet, subject to data use agreements prohibiting re-identification and requiring the sharing of derived code (Johnson et al., 2019).
References
- "MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs" (Johnson et al., 2019)
- "CXR-CML: Improved zero-shot classification of long-tailed multi-label diseases in Chest X-Rays" (Madhipati et al., 25 Jul 2025)
- "Predicting Ejection Fraction from Chest X-rays Using Computer Vision for Diagnosing Heart Failure" (Williams et al., 2022)