CheXpert Plus: Radiology Multimodal Dataset
- CheXpert Plus is a multimodal chest X-ray dataset that provides high-quality images, detailed radiology reports, and comprehensive patient metadata.
- It integrates diverse formats including DICOM and PNG images, pathology labels, and RadGraph entity/relation annotations to support cross-institutional AI research.
- Its extensive de-identification process and longitudinal data structure enable robust model training, fairness audits, and advanced clinical decision support.
CheXpert Plus is a multimodal chest X-ray dataset designed to advance clinical artificial intelligence through expanded data scale, richer modalities, extensive de-identification, and detailed annotation. It pairs radiology reports, high-quality DICOM and PNG images, patient demographic metadata, comprehensive pathology labels, and RadGraph entity/relation annotations. CheXpert Plus provides the largest publicly released text corpus in radiology (36 million tokens, of which 13 million are Impression section tokens) and represents the most extensive de-identification effort reported, with nearly one million PHI spans anonymized. By combining these resources, CheXpert Plus supports cross-institutional, robust, and fair model development for next-generation radiology AI. Data and models are publicly available for research use (Chambon et al., 29 May 2024).
1. Dataset Composition and Scope
CheXpert Plus consists of 223,228 unique chest X-ray images (DICOM format, with PNG alternatives), each paired with a radiology report parsed into up to 11 structured sections (e.g., Narrative, Impression, Findings, Clinical History, Comparison, Technique, Procedure Comments, End of Impression, Summary, and Accession Number). The collection includes 187,711 unique reports spanning 64,725 patients, with each patient annotated with detailed metadata (age, sex, race, ethnicity, insurance type, BMI, deceased status, interpreter need). The total token count reaches 36 million, with Impression texts comprising approximately 13 million tokens.
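As a practical illustration, the report sections and patient metadata can be filtered with ordinary dataframe operations. This is a minimal sketch: the file name and all column names (`path_to_image`, `section_findings`, `section_impression`, demographic fields) are assumptions for illustration, not the official CheXpert Plus schema, and should be checked against the released CSV headers.

```python
# Minimal sketch: loading paired report sections and patient metadata with pandas.
# File and column names are illustrative assumptions, not the official schema.
import pandas as pd

reports = pd.read_csv("chexpert_plus_reports.csv")  # hypothetical filename

# Keep studies whose Findings and Impression sections were both parsed,
# along with a few demographic fields for cohort analysis.
cols = ["path_to_image", "section_findings", "section_impression", "age", "sex", "race"]
subset = reports.loc[
    reports["section_findings"].notna() & reports["section_impression"].notna(), cols
]
print(f"{len(subset)} studies with both Findings and Impression sections")
```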
Pathology labeling covers 14 thoracic conditions, including both CheXpert and CheXbert annotation pipelines. All report sections and image files are linked via consistent identifiers and are chronologically ordered per patient (via the patient_report_date_order field), supporting longitudinal research on disease progression and outcome prediction.
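The per-patient chronological ordering can be reconstructed directly from the patient_report_date_order field mentioned above. The sketch below assumes hypothetical `patient_id` and `section_impression` column names and an illustrative file name.

```python
# Minimal sketch: rebuilding each patient's chronological study sequence via
# the patient_report_date_order field. Other column names are assumptions.
import pandas as pd

reports = pd.read_csv("chexpert_plus_reports.csv")  # hypothetical filename

impressions_over_time = (
    reports.sort_values(["patient_id", "patient_report_date_order"])
           .groupby("patient_id")["section_impression"]
           .apply(list)  # ordered Impression texts, one list per patient
)
```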
RadGraph annotations are integrated for the Findings and Impression sections, providing over 5.5 million clinical entity–relationship pairs. Entities are classified (e.g., “Anatomy,” “Observation: Definitely Present”), and relations (e.g., “Modify,” “Located at,” “Suggestive of”) are precisely tagged, supporting graph-based representation learning and detailed information extraction.
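A minimal sketch of walking such entity/relation annotations is shown below. The JSON layout mirrors the original RadGraph release (entities keyed by id, each carrying token text, a label, and outgoing relations); the file name and the exact schema used in the CheXpert Plus release are assumptions to verify.

```python
# Minimal sketch: iterating RadGraph-style entity/relation annotations.
# File name and exact schema are assumptions; layout follows the original RadGraph format.
import json

with open("radgraph_annotations.json") as f:  # hypothetical filename
    annotations = json.load(f)

study_id, ann = next(iter(annotations.items()))
entities = ann["entities"]
for ent_id, ent in entities.items():
    # e.g. label "ANAT-DP" (Anatomy) or "OBS-DP" (Observation: Definitely Present)
    print(ent["tokens"], ent["label"])
    for rel_type, target_id in ent.get("relations", []):
        # e.g. ("opacity", "located_at", "lobe")
        print("  ", ent["tokens"], rel_type, entities[target_id]["tokens"])
```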
2. Image Formats, Metadata, and De-identification
All chest X-rays are released in the gold-standard DICOM format, preserving high fidelity and clinical compatibility, with PNG conversions provided for accessibility. DICOM metadata is comprehensive: up to 47 elements are available (such as PixelData, BitsAllocated, Rows, Columns, WindowCenter, Manufacturer, SliceThickness), allowing precise downstream preprocessing. Image quality and content are preserved through rigorous conversion protocols.
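A typical preprocessing step uses these DICOM elements directly. The sketch below reads a study with pydicom, applies the stored window settings, and writes a PNG with Pillow; the file path is illustrative, and the windowing logic is a simplified assumption rather than the dataset's official conversion protocol.

```python
# Minimal sketch: DICOM -> windowed PNG with pydicom and Pillow.
# Path is hypothetical; windowing is simplified, not the official conversion protocol.
import numpy as np
import pydicom
from PIL import Image

def first_float(value, default):
    """DICOM window elements may be multi-valued; take the first entry."""
    if value is None:
        return default
    try:
        return float(value[0])
    except (TypeError, IndexError):
        return float(value)

ds = pydicom.dcmread("patient00001_study1_frontal.dcm")  # hypothetical path
pixels = ds.pixel_array.astype(np.float32)

center = first_float(ds.get("WindowCenter"), pixels.mean())
width = first_float(ds.get("WindowWidth"), pixels.max() - pixels.min())
lo, hi = center - width / 2, center + width / 2
windowed = np.clip((pixels - lo) / max(hi - lo, 1e-6), 0.0, 1.0)

if getattr(ds, "PhotometricInterpretation", "") == "MONOCHROME1":
    windowed = 1.0 - windowed  # MONOCHROME1 stores inverted intensities

Image.fromarray((windowed * 255).astype(np.uint8)).save("frontal.png")
```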
The text data reflects the largest radiology de-identification effort to date: nearly 1 million PHI spans are removed or anonymized. A two-step de-identification pipeline, followed by expert human review, ensures near-complete privacy protection while retaining clinical value, enabling dataset sharing across institutions and cross-dataset training at scale.
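For intuition only, the snippet below sketches span-based PHI masking: given character-offset spans flagged by a de-identification model, each span is replaced by a category placeholder. This is not the paper's pipeline, and the span format is an assumption.

```python
# Minimal sketch of span-based PHI masking (not the paper's de-identification pipeline).
def redact(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, category) character span with a placeholder token."""
    out, cursor = [], 0
    for start, end, category in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"[{category.upper()}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

report = "Comparison: Radiograph from January 3 read by Dr. Smith."
spans = [(28, 37, "date"), (50, 55, "name")]  # hypothetical model output
print(redact(report, spans))
# -> "Comparison: Radiograph from [DATE] read by Dr. [NAME]."
```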
3. Pathology Labeling and RadGraph Annotation Methodology
Pathology labels for the 14 target conditions are assigned automatically via the CheXpert and CheXbert pipelines, which parse the reports and extract assertions from key sections, particularly Impression and Findings. Each label is classified as positive, negative, uncertain, or no mention. RadGraph annotation further augments this process by segmenting report text into granular entities and mapping their relationships, providing >1.5 million entity annotations for Findings and >4 million for Impressions, alongside over 3.8 million relation annotations.
Such detail enables the development of supervised and self-supervised models for disease classification, report generation, and multi-task learning. It directly supports fairness research by making demographic cohorts explicit and enables broad transfer learning due to the multimodal and richly structured data.
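The four-way assertion scheme can be mapped from numeric label codes as sketched below. The 1 / 0 / -1 / blank convention follows the original CheXpert release and is assumed to carry over; verify it against the CheXpert Plus label files.

```python
# Minimal sketch: mapping CheXpert/CheXbert-style numeric codes to assertion classes.
# The 1/0/-1/blank convention follows the original CheXpert release (assumption here).
import math

def assertion_class(value) -> str:
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "no mention"
    return {1.0: "positive", 0.0: "negative", -1.0: "uncertain"}[float(value)]

print(assertion_class(-1.0))          # "uncertain"
print(assertion_class(float("nan")))  # "no mention"
```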
4. Applications and Intended Model Use
CheXpert Plus was designed for AI model development in image-text localization, multimodal vision-language modeling, radiology report generation and summarization, and robust pathology classification. The paired image-text and metadata structure supports training of models with architectural components such as CLIP, VQ-GAN, DINOv2, LLaMA-based language heads, and transformer-based vision backbones.
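To make the paired image-text training concrete, the sketch below shows a CLIP-style symmetric contrastive loss over a batch of image and report embeddings. The encoders are stand-in random tensors, not the architectures named above; this is an illustration of the objective, not the released models' training code.

```python
# Minimal sketch: CLIP-style symmetric contrastive (InfoNCE) loss over paired
# image/report embeddings. Embeddings here are random stand-ins for encoder outputs.
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +       # image -> text
                  F.cross_entropy(logits.t(), targets))    # text -> image

batch, dim = 8, 512
loss = clip_loss(torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```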
Longitudinal ordering allows models to learn temporal patterns, advancing research on disease progression. Rich patient metadata supports training and evaluation on bias, fairness, and generalizability across socio-economic and clinical subpopulations. Enhanced annotation (RadGraph, pathological assertion, structured report sections) informs intermediate supervision and hierarchical multitask setups.
Benchmark frameworks such as CXPMRG-Bench have retrained and standardized over 35 models on CheXpert Plus, enabling rigorous performance comparison (Wang et al., 1 Oct 2024). State-of-the-art models driven by multimodal pretraining (e.g., the autoregressive and contrastive pretraining used in MambaXray-VL) achieve strong performance on both natural language generation metrics (BLEU-4 ≈ 0.112) and clinical efficacy metrics (F1 ≈ 0.335).
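The two evaluation axes cited above can be computed as sketched below: BLEU-4 via NLTK for surface-level generation quality, and an F1 over per-pathology labels for clinical efficacy. The inputs are toy placeholders, and the exact scoring protocol of the benchmark may differ.

```python
# Minimal sketch: BLEU-4 (NLTK) and a clinical-efficacy F1 (scikit-learn) on toy inputs.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sklearn.metrics import f1_score

reference = "no focal consolidation pleural effusion or pneumothorax".split()
hypothesis = "no focal consolidation or pleural effusion".split()
bleu4 = sentence_bleu([reference], hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)

# Clinical efficacy: compare pathology labels extracted from generated vs.
# ground-truth reports (toy binary vectors standing in for the 14 conditions).
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1]
clinical_f1 = f1_score(y_true, y_pred)

print(f"BLEU-4: {bleu4:.3f}  clinical F1: {clinical_f1:.3f}")
```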
5. Research Impact, Accessibility, and Licensing
As the largest English paired image–text radiology dataset, CheXpert Plus nearly doubles the available corpus compared to MIMIC or previous CheXpert releases. Its extensive multimodal annotation, cross-institutional patient cohort, and detailed documentation support robust, fair model development and comparison. The dataset underpins recent advances in radiographic report generation, context-aware diagnosis, transfer learning, and bias mitigation (He et al., 15 Sep 2025).
Data is publicly accessible via Stanford AIMI’s shared resource portal, subject to a research use agreement prohibiting commercial application and redistribution. Pretrained model weights for several architectures (LLaMA-based LLMs, CLIP, VQ-GAN, DINOv2) are available through open-source repositories, enabling immediate integration and experimentation.
6. Contextual Significance and Future Directions
CheXpert Plus sets a new benchmark for research in radiology AI, supporting clinical bias and fairness audits, evaluation of multimodal learning algorithms, and next-generation decision support systems. The scale and granularity of report annotation, combined with longitudinal and demographic metadata, empower researchers to develop context-aware models replicating nuanced clinical decision-making.
Future work can leverage CheXpert Plus for:
- Improved foundation model performance through large-scale self-supervised or stagewise pretraining.
- Enhanced cross-institutional generalization by training on diverse patient cohorts.
- Fine-grained error analysis with entity-relation graphs and pathology assertion labels.
- Fairness audits and subgroup performance evaluations across demographic axes.
By providing both technical rigor and clinical breadth, CheXpert Plus serves as a foundation for comprehensive radiology AI research, model validation, and translation to measurable clinical impact.