ReXGradient-160K Chest X-ray Dataset
- ReXGradient-160K is a large, multi-institutional chest radiograph dataset comprising 160K studies and paired radiology reports, enabling comprehensive imaging research.
- It employs standardized preprocessing (DICOM conversion, normalization, and downsampling) and canonical report segmentation to ensure data consistency and reproducibility.
- The dataset is optimized for advanced applications such as report generation, abnormality detection, and domain generalization, with carefully designed training, validation, and test splits.
ReXGradient-160K is a large-scale, multi-institutional chest radiograph dataset comprising 160,000 studies paired with free-text radiology reports, collected from 109,487 unique patients across three U.S. health systems and 79 distinct medical sites. As of its release, it is the largest publicly available chest X-ray database by patient count, and it is structured to support research in medical imaging AI, including radiology report generation, abnormality detection, and domain generalization. ReXGradient-160K provides patient demographics, multi-view imaging, tokenized reports segmented into canonical sections, and predefined training, validation, and test splits. Distribution, licensing, and privacy protocols are designed to keep research access HIPAA-compliant (Zhang et al., 1 May 2025).
1. Dataset Composition and Demographics
ReXGradient-160K contains 160,000 chest radiograph studies corresponding to 273,004 unique images, with each study paired to its respective unaltered radiology report (aside from PHI removal and section segmentation). The dataset encompasses longitudinal records, with individual patients (N=109,487) possibly contributing multiple studies.
Study and patient demographic distributions are summarized in the following table:
| Split | Studies | Images | Patients | 0–20 (%) | 20–40 (%) | 40–60 (%) | 60–80 (%) | 80+ (%) | Male (%) | Female (%) | Unknown (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Train | 140,000 | 238,968 | 95,716 | 16.9 | 19.4 | 25.1 | 26.0 | 12.6 | 49.0 | 49.5 | 1.5 |
| Validation | 10,000 | 17,007 | 6,964 | 17.9 | 20.3 | 25.5 | 24.0 | 12.2 | 48.3 | 50.4 | 1.3 |
| Test | 10,000 | 17,029 | 6,807 | 17.3 | 19.6 | 25.4 | 24.5 | 13.2 | 47.9 | 50.3 | 1.8 |
The sex distribution is balanced (≈49% male, 50% female), and age ranges span pediatric to elderly, with about half the population in the 40–80-year category. Geographic limitation to U.S. health systems introduces demographic and clinical biases inherent to these populations.
2. Imaging Modalities and Data Preprocessing
The repository aggregates images from three main radiographic views: lateral (~35%), posteroanterior (PA, ~31%), and anteroposterior (AP, ~29%), with proportions consistent across the training, validation, and test splits. Original studies are transformed as follows:
- DICOM to PNG conversion using pydicom.
- MONOCHROME1 images are inverted for visual consistency.
- Min-max normalization is performed to a 16-bit pixel range [0, 65535].
- Images are downsampled to 25% of their original dimensions using cubic interpolation with anti-aliasing.
- Images are organized into per-patient and per-study subfolders identified by pseudonymized IDs.
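The steps listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' released pipeline: the function names are hypothetical, "25% of their original dimensions" is read as 25% of each side length, and scikit-image is assumed for the cubic, anti-aliased resize (the source names only pydicom).

```python
import numpy as np


def normalize_to_uint16(pixels: np.ndarray, photometric: str) -> np.ndarray:
    """Invert MONOCHROME1 data, then min-max scale to [0, 65535]."""
    data = pixels.astype(np.float64)
    if photometric == "MONOCHROME1":
        # MONOCHROME1 displays low values as bright; flip so low values are dark.
        data = data.max() - data
    lo, hi = data.min(), data.max()
    if hi == lo:
        # Constant image: avoid division by zero.
        return np.zeros(data.shape, dtype=np.uint16)
    scaled = (data - lo) / (hi - lo) * 65535.0
    return np.round(scaled).astype(np.uint16)


def downsample_quarter(image: np.ndarray) -> np.ndarray:
    """Resize to 25% of each side with cubic interpolation and anti-aliasing."""
    from skimage.transform import resize  # lazy import; assumes scikit-image

    out_shape = (max(1, image.shape[0] // 4), max(1, image.shape[1] // 4))
    resized = resize(image.astype(np.float64), out_shape, order=3,
                     anti_aliasing=True, preserve_range=True)
    # Cubic interpolation can overshoot the input range, so clip before casting.
    return np.clip(np.round(resized), 0, 65535).astype(np.uint16)


def dicom_to_array(path: str) -> np.ndarray:
    """Read one DICOM file and apply the inversion, scaling, and resize steps."""
    from pydicom import dcmread  # lazy import; requires pydicom

    ds = dcmread(path)
    # Assumption: treat a missing tag as MONOCHROME2 (no inversion).
    photometric = str(ds.get("PhotometricInterpretation", "MONOCHROME2"))
    return downsample_quarter(normalize_to_uint16(ds.pixel_array, photometric))
```

The resulting 16-bit array can then be written to PNG with any writer that supports 16-bit grayscale output. Clipping after the resize matters because cubic kernels may produce values slightly outside the normalized range.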
This standardization facilitates reproducible preprocessing pipelines and consistency in downstream ML and CV applications.
3. Report Structure, Annotation, and Metadata
Radiology reports are segmented into four canonical sections: Indication, Comparison, Findings, and Impression. This segmentation is automated using a GPT-4o-based pipeline followed by rule-based post-processing; no manual inter-annotator adjudication is applied. No structured annotation schema (e.g., bounding boxes) is provided.
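To illustrate what a rule-based step over the four canonical sections might look like, the sketch below splits a report on header keywords. It is an assumption-laden example only: the GPT-4o stage is not reproduced, the header pattern and the `split_report` name are hypothetical, and real reports will often lack such clean headers.

```python
import re

SECTIONS = ("Indication", "Comparison", "Findings", "Impression")

# Hypothetical header pattern: a section name at the start of a line, then a colon.
HEADER = re.compile(
    r"^\s*(indication|comparison|findings|impression)\s*:\s*",
    re.IGNORECASE | re.MULTILINE,
)


def split_report(text: str) -> dict[str, str]:
    """Split a report into the four canonical sections by header keywords."""
    sections = {name: "" for name in SECTIONS}
    matches = list(HEADER.finditer(text))
    for i, match in enumerate(matches):
        # A section body runs until the next header or the end of the report.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        name = match.group(1).capitalize()
        # Collapse internal whitespace and line breaks into single spaces.
        sections[name] = " ".join(text[match.end():end].split())
    return sections
```

Sections that never appear in the text are returned as empty strings, so downstream code can rely on all four keys being present.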
Mean token counts per report section for the training set are:
| Section | Mean Tokens |
|---|---|
| Indication | 5.11 |
| Comparison | 2.65 |
| Findings | 32.27 |
| Impression | 11.17 |
JSON-formatted metadata accompanies each study, comprising patient and study identifiers (pseudonymized), sex, ethnicity, age, weight, study date (shifted ±365 days), institution, manufacturer, and the segmented report sections.
4. Data Splits, Accessibility, and Benchmarking
ReXGradient-160K comprises three public splits, accompanied by a separately held private test set:
- Training: 140,000 studies (238,968 images, 95,716 patients)
- Validation (public): 10,000 studies (17,007 images, 6,964 patients)
- Public test: 10,000 studies (17,029 images, 6,807 patients)
- Private test (ReXrank benchmark): 10,000 studies from 67 sites, reserved for benchmark evaluation
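As a quick arithmetic check on the counts above, the short Python sketch below sums the three public splits. The dictionary layout and the `totals` helper are illustrative; the numbers are taken directly from the list.

```python
# Counts for the three public splits, copied from the list above.
PUBLIC_SPLITS = {
    "train": {"studies": 140_000, "images": 238_968, "patients": 95_716},
    "validation": {"studies": 10_000, "images": 17_007, "patients": 6_964},
    "test": {"studies": 10_000, "images": 17_029, "patients": 6_807},
}


def totals(splits):
    """Sum studies, images, and patients across all splits."""
    keys = ("studies", "images", "patients")
    return {k: sum(s[k] for s in splits.values()) for k in keys}


print(totals(PUBLIC_SPLITS))
# {'studies': 160000, 'images': 273004, 'patients': 109487}
```

The public splits alone account for all 160,000 studies and 273,004 images, so the private test studies are additional to those totals. The per-split patient counts also sum exactly to the 109,487 unique patients, which is consistent with no patient appearing in more than one public split.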
The split design ensures methodological robustness for generalization and benchmarking. The ReXrank platform (https://rexrank.ai) provides a public leaderboard for automated report generation and related tasks. Access to the dataset requires agreement to a HIPAA-compliant license; all patient health information is de-identified or pseudonymized, including temporal (date shifting) and pixel-level redactions where necessary.
5. Evaluation Metrics
While ReXGradient-160K does not introduce new evaluation protocols, it employs standard metrics for text generation and classification in medical imaging:
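- BLEU-n: Computed as
$$\mathrm{BLEU}\text{-}n = \mathrm{BP} \cdot \exp\left(\frac{1}{n}\sum_{k=1}^{n} \log p_k\right), \qquad \mathrm{BP} = \min\left(1,\; e^{1 - r/c}\right),$$
where $p_k$ denotes modified precision for $k$-grams, $r$ is the reference length, and $c$ is the candidate length.
- ROUGE-L: Based on the longest common subsequence (LCS); its F-score is defined as
$$F_{\mathrm{LCS}} = \frac{(1+\beta^2)\, R_{\mathrm{LCS}}\, P_{\mathrm{LCS}}}{R_{\mathrm{LCS}} + \beta^2 P_{\mathrm{LCS}}}, \qquad R_{\mathrm{LCS}} = \frac{\mathrm{LCS}(X, Y)}{|Y|}, \qquad P_{\mathrm{LCS}} = \frac{\mathrm{LCS}(X, Y)}{|X|},$$
where $X$ and $Y$ denote the candidate and reference token sequences, respectively, and $\beta$ weights recall relative to precision.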
Clinical concept F1 scores, computed by external tools, are often employed to complement these NLG metrics in benchmark reporting.
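The BLEU-n and ROUGE-L metrics named in this section can be sketched in plain Python as follows. This is a simplified sentence-level illustration, assuming a single reference, uniform n-gram weights, no smoothing, and pre-tokenized input; benchmark implementations typically differ in tokenization, smoothing, and corpus-level aggregation.

```python
import math
from collections import Counter


def ngrams(tokens, k):
    """Count all k-grams in a token list."""
    return Counter(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))


def bleu(candidate, reference, n=4):
    """Sentence-level BLEU-n with uniform weights and a single reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    log_sum = 0.0
    for k in range(1, n + 1):
        cand, ref = ngrams(candidate, k), ngrams(reference, k)
        total = sum(cand.values())
        # Modified precision: clip each k-gram count by its reference count.
        clipped = sum(min(count, ref[g]) for g, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision gives BLEU = 0
        log_sum += math.log(clipped / total) / n
    brevity = 1.0 if c >= r else math.exp(1.0 - r / c)
    return brevity * math.exp(log_sum)


def lcs_length(x, y):
    """Length of the longest common subsequence via row-wise dynamic programming."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score; beta > 1 weights recall more heavily than precision."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

For example, with candidate `"no acute findings".split()` and reference `"no acute cardiopulmonary findings".split()`, the LCS has length 3, giving precision 1, recall 0.75, and a ROUGE-L F-score of 6/7 at `beta=1.0`.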
6. Research Applications and Limitations
Principal applications include:
- Radiology report generation (especially using encoder-decoder models)
- Abnormality detection and image classification
- Longitudinal disease progression analysis (tracking change across a single patient’s studies)
- Domain generalization and robustness analysis across health systems, hardware platforms, and sites
Notable limitations:
- Exclusivity to U.S. health systems imposes geographic and clinical distribution biases.
- Absence of structured labels, bounding boxes, or manual report annotation may necessitate downstream labeling for tasks beyond text generation or high-level classification.
- De-identification protocols, including date shifting and pixel-level redaction, may obfuscate rare clinical or technical cues such as the presence of hardware or specific acquisition parameters.
- Pediatric cases are underrepresented relative to adult populations given cohort characteristics (Zhang et al., 1 May 2025).
7. Data Access and Licensing
Full dataset access is provided via Hugging Face at https://huggingface.co/datasets/rajpurkarlab/ReXGradient-160K, contingent on acceptance of a legally binding data-use and privacy license. All distributed data satisfy HIPAA requirements through pseudonymization of names and identifiers, date shifting of ±365 days applied consistently per patient, and removal of protected health information from reports and images.
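One common way to achieve consistent per-patient date shifting is to derive a fixed offset from a keyed hash of the patient identifier. The sketch below illustrates that general idea; it is not the dataset's documented procedure, and the function names, the secret key, and the choice of HMAC-SHA256 are all assumptions made for illustration.

```python
import hashlib
import hmac
from datetime import date, timedelta


def patient_offset_days(patient_id: str, secret: bytes, max_shift: int = 365) -> int:
    """Derive a stable offset in [-max_shift, max_shift] from a keyed hash."""
    digest = hmac.new(secret, patient_id.encode("utf-8"), hashlib.sha256).digest()
    # Map the first 8 bytes onto the 2 * max_shift + 1 possible offsets.
    return int.from_bytes(digest[:8], "big") % (2 * max_shift + 1) - max_shift


def shift_date(original: date, patient_id: str, secret: bytes) -> date:
    """Apply the patient's fixed offset to a study date."""
    return original + timedelta(days=patient_offset_days(patient_id, secret))
```

Because every study of a given patient receives the same offset, intervals between that patient's studies are preserved, which is what keeps longitudinal analysis possible after de-identification.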
ReXGradient-160K’s unprecedented scale, multi-site representation, and standardized report segmentation render it a substantial contribution for developing and benchmarking next-generation medical imaging machine learning models, particularly within natural language generation and robustness testing contexts (Zhang et al., 1 May 2025).