
ReXGradient-160K Chest X-ray Dataset

Updated 17 February 2026
  • ReXGradient-160K is a large, multi-institutional chest radiograph dataset comprising 160K studies and paired radiology reports, enabling comprehensive imaging research.
  • It employs standardized preprocessing (DICOM conversion, normalization, and downsampling) and canonical report segmentation to ensure data consistency and reproducibility.
  • The dataset is optimized for advanced applications such as report generation, abnormality detection, and domain generalization, with carefully designed training, validation, and test splits.

ReXGradient-160K is a large-scale, multi-institutional chest radiograph dataset comprising 160,000 studies paired with free-text radiology reports, collected from 109,487 unique patients across three U.S. health systems and 79 distinct medical sites. As of its release, it is the largest publicly available chest X-ray database by patient count, specifically structured to support research in medical imaging AI—including radiology report generation, abnormality detection, and domain generalization. ReXGradient-160K provides comprehensive demographics, multi-view imaging, tokenized reports segmented into canonical sections, and robust data splits. Distribution, licensing, and privacy protocols are designed to ensure rigorous, HIPAA-compliant research accessibility (Zhang et al., 1 May 2025).

1. Dataset Composition and Demographics

ReXGradient-160K contains 160,000 chest radiograph studies corresponding to 273,004 unique images, with each study paired to its respective unaltered radiology report (aside from PHI removal and section segmentation). The dataset encompasses longitudinal records, with individual patients (N=109,487) possibly contributing multiple studies.

Study and patient demographic distributions are summarized in the following table:

| Split | Studies | Images | Patients | 0–20 (%) | 20–40 (%) | 40–60 (%) | 60–80 (%) | 80+ (%) | Male (%) | Female (%) | Unknown (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Train | 140,000 | 238,968 | 95,716 | 16.9 | 19.4 | 25.1 | 26.0 | 12.6 | 49.0 | 49.5 | 1.5 |
| Validation | 10,000 | 17,007 | 6,964 | 17.9 | 20.3 | 25.5 | 24.0 | 12.2 | 48.3 | 50.4 | 1.3 |
| Test | 10,000 | 17,029 | 6,807 | 17.3 | 19.6 | 25.4 | 24.5 | 13.2 | 47.9 | 50.3 | 1.8 |

The sex distribution is balanced (≈49% male, 50% female), and age ranges span pediatric to elderly, with about half the population in the 40–80-year category. Geographic limitation to U.S. health systems introduces demographic and clinical biases inherent to these populations.

2. Imaging Modalities and Data Preprocessing

The repository aggregates images from three main radiographic views: lateral (~35%), posteroanterior (PA, ~31%), and anteroposterior (AP, ~29%), with proportions consistent across splits. Original studies are transformed as follows:

  • DICOM to PNG conversion using pydicom.
  • MONOCHROME1 images are inverted for visual consistency.
  • Min-max normalization is performed to a 16-bit pixel range [0, 65535].
  • Images are downsampled to 25% of their original dimensions using cubic interpolation with anti-aliasing.
  • Images are organized into per-patient and per-study subfolders identified by pseudonymized IDs.

This standardization facilitates reproducible preprocessing pipelines and consistency in downstream ML and CV applications.
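The preprocessing steps above can be sketched in code. This is an illustrative reconstruction, not the authors' released pipeline; the function name and the resize step are assumptions (the paper specifies cubic interpolation with anti-aliasing, e.g. via `skimage.transform.resize`, which is omitted here to keep the sketch dependency-light).

```python
import numpy as np

def preprocess_pixels(pixels: np.ndarray, photometric: str) -> np.ndarray:
    """Normalize a raw DICOM pixel array to 16-bit, inverting MONOCHROME1.

    Illustrative sketch of the described preprocessing; downsampling to 25%
    of the original dimensions (cubic, anti-aliased) would follow this step.
    """
    pixels = pixels.astype(np.float64)
    if photometric == "MONOCHROME1":  # bright-background convention: invert
        pixels = pixels.max() - pixels
    lo, hi = pixels.min(), pixels.max()
    scaled = (pixels - lo) / (hi - lo) * 65535  # min-max to [0, 65535]
    return scaled.astype(np.uint16)
```

The inversion before normalization ensures MONOCHROME1 and MONOCHROME2 studies share a consistent "dark background" appearance after scaling.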

3. Report Structure, Annotation, and Metadata

Radiology reports are segmented into four canonical sections: Indication, Comparison, Findings, and Impression. This segmentation is automated using a GPT-4o-based pipeline followed by rule-based post-processing; no manual inter-annotator adjudication is applied. No structured annotation schema (e.g., bounding boxes) is provided.
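The rule-based post-processing stage can be illustrated with a simple header-splitting pass. This is a simplified sketch under the assumption that section headers appear as labeled lines; the dataset's actual pipeline relies on GPT-4o first, with rules as a refinement.

```python
import re

# Illustrative splitter for the four canonical sections; header spellings
# and formatting rules here are assumptions, not the released pipeline.
SECTIONS = ["Indication", "Comparison", "Findings", "Impression"]
_HEADER = re.compile(r"(?im)^\s*(INDICATION|COMPARISON|FINDINGS|IMPRESSION)\s*:\s*")

def segment_report(report: str) -> dict:
    out = {s: "" for s in SECTIONS}
    parts = _HEADER.split(report)
    # re.split with one capture group yields [preamble, hdr1, body1, hdr2, ...]
    for header, body in zip(parts[1::2], parts[2::2]):
        out[header.capitalize()] = body.strip()
    return out
```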

Mean token counts per report section for the training set are:

| Section | Mean Tokens |
|---|---|
| Indication | 5.11 |
| Comparison | 2.65 |
| Findings | 32.27 |
| Impression | 11.17 |

JSON-formatted metadata accompanies each study, comprising patient and study identifiers (pseudonymized), sex, ethnicity, age, weight, study date (shifted ±365 days), institution, manufacturer, and the segmented report sections.
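A per-study metadata record along these lines might look as follows. All field names and values here are hypothetical, matching the description above; the released files may use different key spellings.

```python
import json

# Hypothetical metadata record; key names are illustrative only.
record = json.loads("""
{
  "patient_id": "PAT000123",
  "study_id": "STU000456",
  "sex": "F",
  "ethnicity": "Unknown",
  "age": 57,
  "weight_kg": 68.0,
  "study_date": "2019-03-14",
  "institution": "SITE_017",
  "manufacturer": "GE",
  "report": {"Indication": "...", "Comparison": "...",
             "Findings": "...", "Impression": "..."}
}
""")
```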

4. Data Splits, Accessibility, and Benchmarking

ReXGradient-160K is partitioned into four splits:

  • Training: 140,000 studies (238,968 images, 95,716 patients)
  • Validation (public): 10,000 studies (17,007 images, 6,964 patients)
  • Public test: 10,000 studies (17,029 images, 6,807 patients)
  • Private test (ReXrank benchmark): 10,000 studies from 67 sites, reserved for benchmark evaluation

The split design ensures methodological robustness for generalization and benchmarking. The ReXrank platform (https://rexrank.ai) provides a public leaderboard for automated report generation and related tasks. Access to the dataset requires agreement to a HIPAA-compliant license; all patient health information is de-identified or pseudonymized, including temporal (date shifting) and pixel-level redactions where necessary.

5. Evaluation Metrics

While ReXGradient-160K does not introduce new evaluation protocols, it employs standard metrics for text generation and classification in medical imaging:

  • BLEU-n: Computed as

$$\mathrm{BLEU}\text{-}n = \exp\!\left(\min\!\left(1 - \frac{r}{c},\, 0\right)\right)\,\prod_{i=1}^{n} p_i^{1/n}$$

where $p_i$ denotes the modified precision for $i$-grams, $r$ is the reference length, and $c$ is the candidate length.
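A minimal BLEU-n implementation following this formula (brevity penalty times the geometric mean of clipped n-gram precisions) can be sketched as below; for actual benchmarking, established tools such as sacreBLEU or NLTK should be preferred.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_n(candidate, reference, n=4):
    """BLEU-n for a single candidate/reference pair of token lists."""
    precisions = []
    for k in range(1, n + 1):
        cand, ref = Counter(ngrams(candidate, k)), Counter(ngrams(reference, k))
        overlap = sum((cand & ref).values())        # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    bp = math.exp(min(1 - len(reference) / len(candidate), 0))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / n)
```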

  • ROUGE-L: Based on the longest common subsequence (LCS); its F-score is defined as

$$R_{\mathrm{LCS}} = \frac{\mathrm{LCS}(C, R)}{|R|}, \qquad P_{\mathrm{LCS}} = \frac{\mathrm{LCS}(C, R)}{|C|}, \qquad \mathrm{ROUGE}\text{-}L = \frac{(1 + \beta^2)\, R_{\mathrm{LCS}}\, P_{\mathrm{LCS}}}{R_{\mathrm{LCS}} + \beta^2 P_{\mathrm{LCS}}}$$

where $C$ and $R$ denote the candidate and reference token sequences, respectively.
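This definition can be sketched directly from the LCS recurrence. The default $\beta = 1.2$ here is a common choice in report-generation work, not one fixed by the dataset.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-score for token lists, per the formula above."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    r, p = lcs / len(reference), lcs / len(candidate)
    return (1 + beta**2) * r * p / (r + beta**2 * p)
```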

Clinical concept F1 scores, computed by external tools, are often employed to complement these NLG metrics in benchmark reporting.

6. Research Applications and Limitations

Principal applications include:

  • Radiology report generation (especially using encoder-decoder models)
  • Abnormality detection and image classification
  • Longitudinal disease progression analysis (tracking change across a single patient’s studies)
  • Domain generalization and robustness analysis across health systems, hardware platforms, and sites

Notable limitations:

  • Exclusivity to U.S. health systems imposes geographic and clinical distribution biases.
  • Absence of structured labels, bounding boxes, or manual report annotation may necessitate downstream labeling for tasks beyond text generation or high-level classification.
  • De-identification protocols, including date shifting and pixel-level redaction, may obfuscate rare clinical or technical cues such as the presence of hardware or specific acquisition parameters.
  • Pediatric cases are underrepresented relative to adult populations given cohort characteristics (Zhang et al., 1 May 2025).

7. Data Access and Licensing

Full dataset access is provided via Hugging Face at https://huggingface.co/datasets/rajpurkarlab/ReXGradient-160K, contingent on acceptance of a legally binding data-use and privacy license. All distributed data satisfy HIPAA requirements, leveraging pseudonymization strategies for names and identifiers, consistent date shifting per patient (±365 days), and removal of protected health information from reports and images.
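Consistent per-patient date shifting can be sketched as below. The offset derivation (hashing the pseudonymized patient ID) is an assumption for illustration; the key property, which the dataset's protocol shares, is that one fixed offset in [−365, +365] applies to all of a patient's studies, so inter-study intervals are preserved.

```python
import hashlib
from datetime import date, timedelta

def shift_date(study_date: date, patient_id: str) -> date:
    """Shift a study date by a deterministic per-patient offset.

    Illustrative only: the offset is derived from a hash of the patient ID,
    which is NOT the dataset's actual mechanism, but yields the same
    guarantee of a consistent shift within each patient.
    """
    digest = hashlib.sha256(patient_id.encode()).digest()
    offset = int.from_bytes(digest[:4], "big") % 731 - 365  # in [-365, 365]
    return study_date + timedelta(days=offset)
```

Because the offset depends only on the patient ID, longitudinal analyses (Section 6) remain valid on the shifted dates.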

ReXGradient-160K’s unprecedented scale, multi-site representation, and standardized report segmentation render it a substantial contribution for developing and benchmarking next-generation medical imaging machine learning models, particularly within natural language generation and robustness testing contexts (Zhang et al., 1 May 2025).
