ReXGradient-160K: A Large-Scale Publicly Available Dataset of Chest Radiographs with Free-text Reports (2505.00228v2)

Published 1 May 2025 in cs.CV and eess.IV

Abstract: We present ReXGradient-160K, representing the largest publicly available chest X-ray dataset to date in terms of the number of patients. This dataset contains 160,000 chest X-ray studies with paired radiological reports from 109,487 unique patients across 3 U.S. health systems (79 medical sites). This comprehensive dataset includes multiple images per study and detailed radiology reports, making it particularly valuable for the development and evaluation of AI systems for medical imaging and automated report generation models. The dataset is divided into training (140,000 studies), validation (10,000 studies), and public test (10,000 studies) sets, with an additional private test set (10,000 studies) reserved for model evaluation on the ReXrank benchmark. By providing this extensive dataset, we aim to accelerate research in medical imaging AI and advance the state-of-the-art in automated radiological analysis. Our dataset will be open-sourced at https://huggingface.co/datasets/rajpurkarlab/ReXGradient-160K.

Summary

  • The paper introduces ReXGradient-160K, a large-scale dataset of 160,000 chest X-ray studies paired with free-text reports from three health systems, intended to improve AI generalization.
  • Its multi-institutional scope provides a diverse benchmark, crucial for evaluating AI models' ability to generalize across varied clinical environments and patient demographics.
  • The dataset includes standardized training, validation, and test splits, preprocessed images, and segmented reports created using GPT-4 for structured analysis and evaluation.

ReXGradient-160K: Implications for AI in Medical Imaging

The paper presents ReXGradient-160K, a publicly available dataset of chest radiographs paired with detailed free-text reports. The dataset encompasses 160,000 imaging studies from 109,487 unique patients, collected across three U.S. health systems spanning 79 medical sites. The collection supports the development and evaluation of AI models that generate radiology reports, and it addresses a known limitation of single-institution datasets: models built on them may not generalize across diverse clinical environments.

Contributions to the Field

The dataset's scale and diversity address the demand for standardized benchmarks with consistent data splits and evaluation metrics, a critical need in medical AI development. Previous initiatives such as MIMIC-CXR and CheXpert Plus laid the groundwork for publicly available resources, offering considerable volumes of data from single institutions. In contrast, ReXGradient-160K draws on multiple institutions, offering a broader evaluation surface on which to measure how well AI systems generalize.

This multi-institutional scope makes it possible to assess AI models across varied clinical settings, geographical locations, and demographic distributions, which gives a more robust picture of model behavior. Each study can include multiple images, allowing more nuanced analysis, and de-identification processes protect patient privacy.
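One simple way to examine cross-site generalization is to break an evaluation metric down by institution and look at the spread. The following is a minimal sketch under stated assumptions: the record fields `site` and `score` and both function names are hypothetical, chosen for illustration, and are not taken from the paper or from the dataset's actual schema.

```python
from collections import defaultdict
from statistics import mean


def per_site_means(records):
    """Group per-study scores by site and return {site: mean score}.

    Each record is assumed (hypothetically) to be a dict with a
    'site' identifier and a numeric 'score' for one study.
    """
    by_site = defaultdict(list)
    for rec in records:
        by_site[rec["site"]].append(rec["score"])
    return {site: mean(scores) for site, scores in by_site.items()}


def generalization_gap(site_means):
    """Difference between the best and worst site-level mean score."""
    return max(site_means.values()) - min(site_means.values())


# Toy input; the values are made up for illustration only.
records = [
    {"site": "A", "score": 0.75},
    {"site": "A", "score": 0.25},
    {"site": "B", "score": 0.25},
]
site_means = per_site_means(records)
gap = generalization_gap(site_means)
```

A large gap between sites would suggest that an aggregate score hides uneven performance, which is the kind of effect a multi-institutional benchmark is meant to surface.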

Technical Details

The 160,000 studies are divided into training (140,000), validation (10,000), and public test (10,000) sets, with an additional private test set of 10,000 studies reserved for the ReXrank benchmark. A standardized preprocessing pipeline converted the original DICOM images to PNG format for uniformity. Reports are segmented into four sections (Indication, Comparison, Findings, and Impression) using GPT-4-assisted extraction and validation to ensure completeness across all entries.
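To make the four-section structure concrete, here is a rough sketch of splitting a report into Indication, Comparison, Findings, and Impression. Note that the paper describes GPT-4-assisted extraction; this example swaps in a plain regular-expression heuristic that only works when the report carries explicit section headers. The function name `split_report` and the sample report text are hypothetical.

```python
import re

SECTION_NAMES = ("Indication", "Comparison", "Findings", "Impression")

# Matches a section header such as "FINDINGS:" at the start of a line.
_HEADER = re.compile(
    r"^\s*(" + "|".join(SECTION_NAMES) + r")\s*:\s*",
    flags=re.IGNORECASE | re.MULTILINE,
)


def split_report(text: str) -> dict[str, str]:
    """Split a free-text report into the four sections.

    Sections without a header in the text map to an empty string.
    This is a header-based heuristic, not the GPT-4-assisted method
    described in the paper.
    """
    sections = {name: "" for name in SECTION_NAMES}
    matches = list(_HEADER.finditer(text))
    for i, m in enumerate(matches):
        # A section runs until the next header or the end of the report.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        name = m.group(1).capitalize()
        # Collapse internal line breaks and repeated whitespace.
        sections[name] = " ".join(text[m.end():end].split())
    return sections


# Made-up sample report for illustration only.
sample = (
    "INDICATION: Cough.\n"
    "COMPARISON: None.\n"
    "FINDINGS: Lungs are clear.\nNo effusion.\n"
    "IMPRESSION: No acute disease."
)
parsed = split_report(sample)
```

Real reports often omit headers or vary their wording, which is presumably why a language-model-assisted approach with validation was used instead of fixed patterns.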

The demographic analysis reveals consistent representation, with approximately half of the dataset comprising individuals aged between 40 and 80. The sex distribution is balanced, which matters for unbiased model training and evaluation. This consistency extends to radiographic view types, which keeps model assessments comparable.

Future Directions

The release of ReXGradient-160K sets a precedent for comprehensive dataset availability, serving as a pivotal resource for accelerating advancements in automated radiological analysis. Future research can leverage this dataset to explore intricate aspects of AI in healthcare, such as enhancing model performance with diverse patient inputs, fine-tuning algorithms for different imaging modalities, and customizing AI outputs per clinical requirements.

The ReXGradient initiative underscores the potential for cross-institution collaboration in AI-driven healthcare and the importance of multi-site data aggregation for medical research. Future developments may include expanding beyond chest radiographs to other imaging types and conditions, which would broaden the dataset's utility for medical imaging research.

Overall, ReXGradient-160K is positioned to advance research and applications in AI-assisted medical imaging by supporting efficiency, access to expert-level interpretation, and standardized model evaluation across the healthcare sector.
