
CheXpert Plus: Radiology AI Dataset

Updated 14 February 2026
  • CheXpert Plus is a comprehensive multimodal radiology dataset featuring 223,228 chest X-ray studies linked to detailed radiology reports and rich demographic metadata.
  • The dataset supports diverse tasks such as disease classification, report generation, and fairness analysis, leveraging both rule-based and neural labeling pipelines.
  • It incorporates advanced de-identification, RadGraph annotations, and standardized benchmarking protocols to drive state-of-the-art radiology AI development.

CheXpert Plus is a large-scale, multimodal public dataset designed to advance the development of robust, fair, and clinically meaningful machine learning models for chest radiograph analysis. A comprehensive extension of the original CheXpert dataset, CheXpert Plus adds richly annotated chest X-ray images at full DICOM fidelity, detailed radiology reports, granular demographic and socio-economic metadata, and multiple image formats, together supporting a spectrum of radiology AI tasks including disease classification, report generation, and fairness analysis (Chambon et al., 2024).

1. Dataset Composition and Structure

CheXpert Plus comprises 223,228 de-identified chest X-ray studies available in both original DICOM and PNG formats, covering 64,725 unique patient records. Each imaging study is linked to a corresponding radiology report, yielding a total of 187,711 free-text, section-parsed documents. The text corpus includes 36.47 million tokens (13.35 million from Impression sections), making CheXpert Plus the largest public text dataset in radiology. Each image–report pair is indexed through robust privacy-preserving identifiers and is accompanied by 47 DICOM header fields (e.g., PixelSpacing, WindowCenter, PhotometricInterpretation, ViewPosition), enabling pixel-level and metadata-informed analyses.

Reports are parsed into up to 11 standardized subsections: Narrative, Clinical History, History, Comparison, Technique, Procedure Comments, Findings, Impression, End of Impression, Summary, and Accession Number. The Impression section is present in every report (mean 71 tokens, median 67, σ ≈ 30), with Findings (mean ≈ 102 tokens when present) and Clinical History (mean ≈ 14 tokens) commonly available.
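
As a hedged illustration, the subsection structure above can be recovered with a simple header-based splitter; the header spellings and the regex approach here are assumptions, not the official CheXpert Plus parsing code:

```python
import re

# Hypothetical sketch: split a CheXpert Plus-style report into named
# subsections. Header spellings and layout are assumptions.
SECTION_NAMES = [
    "NARRATIVE", "CLINICAL HISTORY", "HISTORY", "COMPARISON", "TECHNIQUE",
    "PROCEDURE COMMENTS", "FINDINGS", "IMPRESSION", "END OF IMPRESSION",
    "SUMMARY", "ACCESSION NUMBER",
]

# Longer names precede their prefixes (e.g. CLINICAL HISTORY before
# HISTORY) so the alternation matches the most specific header first.
_HEADER_RE = re.compile(
    r"^(%s):\s*" % "|".join(re.escape(n) for n in SECTION_NAMES),
    re.MULTILINE,
)

def parse_report(text: str) -> dict:
    """Return {section_name: body} for each header found in `text`."""
    sections = {}
    matches = list(_HEADER_RE.finditer(text))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1)] = text[m.end():end].strip()
    return sections
```

In practice a parser like this would need to tolerate header variants and missing sections; the sketch only shows the core splitting logic.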

2. Annotations: Labels, Demographics, and RadGraph

Fourteen thoracic findings are exhaustively annotated per image–report pair: Atelectasis, Cardiomegaly, Consolidation, Edema, Enlarged Cardiomediastinum, Fracture, Lung Lesion, Lung Opacity, No Finding, Pleural Effusion, Pleural Other, Pneumonia, Pneumothorax, and Support Devices. Each finding carries one of four mutually exclusive labels: Positive (1), Uncertain (−1), Negative (0), or Not Mentioned. CheXpert Plus distributes both the canonical rule-based CheXpert labels and transformer-based CheXbert labels, computed for full reports, Impression and Findings sections, and their concatenations.
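
A minimal sketch of this four-way encoding, assuming the common CheXpert CSV convention of numeric cells with Not Mentioned left blank (an assumption, not a guarantee about the release files):

```python
# Four-way CheXpert label encoding described above. Treating
# "Not Mentioned" as an empty cell is an assumed CSV convention.
LABEL_CODES = {
    "positive": 1,
    "uncertain": -1,
    "negative": 0,
    "not_mentioned": None,
}

def decode_label(cell: str) -> str:
    """Map a raw CSV cell to one of the four label categories."""
    cell = cell.strip()
    if cell == "":
        return "not_mentioned"
    value = float(cell)
    for name, code in LABEL_CODES.items():
        if code is not None and code == value:
            return name
    raise ValueError(f"unexpected label value: {cell!r}")
```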

Paired structured metadata includes eight demographic and socio-economic variables: age, sex, race (White, Black, Asian, Pacific Islander, Native, Other, Unknown, Refused), ethnicity (Hispanic, Non-Hispanic, Unknown, Refused), insurance type (Medicare, Medicaid, Private, Other, Unknown), body-mass index, interpreter requirement, and deceased status. This extensive metadata enables subgroup analysis, fairness-aware auditing, and context-sensitive modeling.

CheXpert Plus also includes RadGraph annotations: over 5 million graph entities and relations mapping radiology findings and anatomical observations. Entities are classified as Anatomy, Observation: Definitely Present, Observation: Uncertain, or Observation: Definitely Absent; relations are typed as Modify, Located At, or Suggestive Of.
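
A toy representation of this entity–relation schema might look like the following; the field names and helper function are illustrative assumptions, not the release format:

```python
from dataclasses import dataclass, field

# Minimal sketch of the RadGraph-style schema described above: typed
# entities connected by typed relations. Illustrative only.
ENTITY_TYPES = {
    "Anatomy",
    "Observation: Definitely Present",
    "Observation: Uncertain",
    "Observation: Definitely Absent",
}
RELATION_TYPES = {"modify", "located_at", "suggestive_of"}

@dataclass
class Entity:
    tokens: str  # surface span in the report text
    label: str   # one of ENTITY_TYPES
    relations: list = field(default_factory=list)  # (relation_type, target)

def relate(source: Entity, relation: str, target: Entity) -> None:
    """Attach a typed edge from `source` to `target`."""
    assert relation in RELATION_TYPES, relation
    source.relations.append((relation, target))
```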

3. Data Curation and De-identification

Public release of CheXpert Plus is underpinned by a stringent multi-phase de-identification pipeline. For imaging data, an eight-step DICOM scrubbing removes or replaces all PHI-bearing header fields while retaining original pixel arrays. Radiology reports undergo four sequential stages of PHI removal:

  • Transformer-based token classification into HIPAA-defined PHI categories
  • “Hide in Plain Sight” substitution of detected spans with synthetic surrogates
  • Human annotation/review for missed PHI spans (with a fully missed rate of 0.002% and a partially missed rate of 0.01% across 853,878 true PHI spans)
  • Final board-certified radiologist review of synthetic PHI mappings

These procedures constitute the largest de-identification effort applied to a public radiology dataset to date.
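
The "Hide in Plain Sight" substitution step can be sketched as follows, assuming PHI spans have already been detected upstream; the surrogate pools and span format are hypothetical simplifications of the actual pipeline:

```python
import random

# Illustrative "Hide in Plain Sight" step: replace detected PHI spans
# with synthetic surrogates of the same category. Surrogate pools and
# the (start, end, category) span format are assumptions; the real
# pipeline (with its transformer-based detector) is far more elaborate.
SURROGATES = {
    "NAME": ["Alex Morgan", "Sam Lee"],
    "DATE": ["03/14/2019", "07/02/2021"],
}

def hide_in_plain_sight(text: str, spans, seed: int = 0) -> str:
    """spans: list of (start, end, category); returns de-identified text."""
    rng = random.Random(seed)
    mapping = {}  # same original value -> same surrogate, for consistency
    out, cursor = [], 0
    for start, end, category in sorted(spans):
        original = text[start:end]
        if (original, category) not in mapping:
            mapping[(original, category)] = rng.choice(SURROGATES[category])
        out.append(text[cursor:start])
        out.append(mapping[(original, category)])
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

Reusing one surrogate per original value keeps the de-identified narrative internally consistent, which is the point of the "in plain sight" approach.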

4. Modeling and Benchmarking on CheXpert Plus

The dataset has catalyzed the development of both classical and state-of-the-art models, including released “model zoo” baselines: fine-tuned LLaMA for report generation, CLIP for image–text retrieval, VQ-GAN for image synthesis, self-supervised DINOv2 for representation learning, and architecture-agnostic report-generation/summarization networks (Chambon et al., 2024). Recent benchmarking on medical report generation, as detailed in CXPMRG-Bench, compares 21 mainstream MRG algorithms and multiple LLMs/VLMs. For instance, the MambaXray-VL-Large model achieved state-of-the-art results (BLEU-4 = 0.112, ROUGE-L = 0.276, METEOR = 0.157, CIDEr = 0.139, clinical efficacy F1 = 0.335). Models benefiting from multi-stage pre-training (self-supervised vision, image–text contrastive, supervised fine-tuning) show significant gains in both language metrics and clinical accuracy (Wang et al., 2024).

MetaCheX demonstrates performance augmentation via patient metadata integration, achieving AUROC improvements (EfficientNet-B3: 0.85538 → 0.88205; ResNet-50: 0.86165 → 0.87998; VGG-16: 0.85201 → 0.87263) over image-only baselines. Incorporation of demographics narrows fairness disparities (e.g., male vs. female equalized odds difference from 0.12 to 0.07), improves demographic parity, and increases robustness to domain shift (He et al., 15 Sep 2025).

5. Labeling Pipelines: CheXpert++ and CheXbert

CheXpert Plus natively supports both the original rule-based CheXpert labeler and its high-fidelity neural approximation CheXpert++ (McDermott et al., 2020). The CheXpert++ architecture is BERT-based, trained on MIMIC-CXR sentences to output 14 clinical labels via multitask classification heads; it achieves 99.81% parity with CheXpert and offers differentiability and calibrated probability outputs. On a held-out gold set of 540 sentences, CheXpert++ improved average accuracy from 71.6% (CheXpert) to 79.1% after one iteration of active learning, with clinicians preferring its labels in 60% of sampled disagreements. This positions CheXpert++ as a drop-in replacement, enabling differentiable and probabilistic label supervision in downstream pipelines.
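
The multitask-head design can be illustrated structurally, with toy dimensions and random weights standing in for the BERT encoder and the trained heads (a sketch of the shape of the computation, not the model itself):

```python
import math
import random

# Structural sketch (untrained): a shared sentence embedding feeds 14
# independent 4-way classification heads, one per finding, mirroring the
# multitask design described above. Dimensions and init are arbitrary.
FINDINGS = 14   # one head per CheXpert finding
CLASSES = 4     # positive / uncertain / negative / not mentioned
EMBED_DIM = 8   # toy stand-in for the BERT [CLS] embedding size

rng = random.Random(0)
heads = [[[rng.gauss(0, 0.1) for _ in range(EMBED_DIM)]
          for _ in range(CLASSES)]
         for _ in range(FINDINGS)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(embedding):
    """Return one probability vector per finding (14 x 4)."""
    out = []
    for head in heads:
        logits = [sum(w * x for w, x in zip(row, embedding)) for row in head]
        out.append(softmax(logits))
    return out
```

The per-head softmax is what yields the calibrated probability outputs and differentiability that make CheXpert++ usable as supervision inside a larger pipeline.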

RadGraph adds fine-grained ontology annotations by mapping report sections to structured entity–relation graphs, providing a framework for advanced information extraction and factuality checking.

6. Data Accessibility, Limitations, and Research Impact

All CheXpert Plus data are available under Stanford’s Research Use Agreement, with models and code distributed via the Stanford-AIMI and partner repositories (Chambon et al., 2024). Compared to other public datasets (e.g., MIMIC-CXR), CheXpert Plus nearly doubles the number of English chest X-ray/report pairs, though it skews against rare findings and places greater emphasis on Impression sections.

CheXpert Plus uniquely facilitates cross-institutional training, analysis of subgroup bias, robustness evaluation, and large-scale vision-language modeling. Its combination of pixel-level fidelity, granular text, and detailed patient context establishes it as a primary resource for both diagnostic AI development and fairness-driven research in radiology.

7. Applications and Benchmarking Protocols

CheXpert Plus supports a broad range of applications including disease detection, automated report generation, cross-modal retrieval, information extraction, and fairness analysis. The CXPMRG-Bench protocol recommends a patient-level split (train:val:test = 7:1:2), using the Findings section as the reference text, and assessing both language quality (BLEU-4, ROUGE-L, METEOR, CIDEr) and clinical efficacy (F1 on pathology tags). Integration of structured knowledge graphs (e.g., RadGraph) into model objectives is identified as crucial for improving factual consistency in generated radiology reports. Multi-stage pre-training that leverages unpaired image self-supervision, image–text contrastive learning, and supervised fine-tuning is recommended for domain adaptation and rare-finding detection (Wang et al., 2024).
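
The recommended patient-level 7:1:2 split can be sketched as follows; the function name and input format are our assumptions, with the key property being that no patient's studies cross partitions:

```python
import random
from collections import defaultdict

# Sketch of a patient-level 7:1:2 split as recommended by CXPMRG-Bench:
# patients, not studies, are shuffled and partitioned, so every study
# from a given patient lands in exactly one of train/val/test.
def patient_level_split(study_to_patient: dict, seed: int = 0) -> dict:
    by_patient = defaultdict(list)
    for study, patient in study_to_patient.items():
        by_patient[patient].append(study)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n = len(patients)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    return {name: [s for p in group for s in by_patient[p]]
            for name, group in groups.items()}
```

Splitting at the patient level avoids the leakage that study-level random splits introduce when one patient contributes multiple studies.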

CheXpert Plus’ position as the largest multimodal radiology resource with granular demographic data, extensive label infrastructure, and validated de-identification practices ensures its centrality in the evaluation and development of next-generation radiology AI systems (Chambon et al., 2024, McDermott et al., 2020, He et al., 15 Sep 2025, Wang et al., 2024).
