CheXpert Plus: Augmenting a Large Chest X-ray Dataset with Text Radiology Reports, Patient Demographics and Additional Image Formats (2405.19538v2)

Published 29 May 2024 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: Since the release of the original CheXpert paper five years ago, CheXpert has become one of the most widely used and cited clinical AI datasets. The emergence of vision LLMs has sparked an increase in demands for sharing reports linked to CheXpert images, along with a growing interest among AI fairness researchers in obtaining demographic data. To address this, CheXpert Plus serves as a new collection of radiology data sources, made publicly available to enhance the scaling, performance, robustness, and fairness of models for all subsequent machine learning tasks in the field of radiology. CheXpert Plus is the largest text dataset publicly released in radiology, with a total of 36 million text tokens, including 13 million impression tokens. To the best of our knowledge, it represents the largest text de-identification effort in radiology, with almost 1 million PHI spans anonymized. It is only the second time that a large-scale English paired dataset has been released in radiology, thereby enabling, for the first time, cross-institution training at scale. All reports are paired with high-quality images in DICOM format, along with numerous image and patient metadata covering various clinical and socio-economic groups, as well as many pathology labels and RadGraph annotations. We hope this dataset will boost research for AI models that can further assist radiologists and help improve medical care. Data is available at the following URL: https://stanfordaimi.azurewebsites.net/datasets/5158c524-d3ab-4e02-96e9-6ee9efc110a1 Models are available at the following URL: https://github.com/Stanford-AIMI/chexpert-plus

Summary

The paper introduces CheXpert Plus, a comprehensive dataset that integrates high-quality chest X-ray images, radiology reports, and patient demographics to enhance AI model training.
The release pairs images in multiple formats with sectioned report text and large-scale PHI de-identification to protect patient privacy.
The dataset offers 223,228 images, 187,711 reports, and demographic data on 64,725 unique patients to support fair and reliable AI development.

CheXpert Plus: Augmenting a Large Chest X-ray Dataset with Text Radiology Reports, Patient Demographics and Additional Image Formats

Overview

The paper introduces CheXpert Plus, an extension of the original CheXpert dataset that adds radiology reports, patient demographics, and additional image formats to the existing chest X-ray images. Combining these sources in a single release gives researchers a richer basis for developing robust, fair, and high-performing AI models in radiology.

Dataset Composition and Features

The CheXpert Plus dataset encompasses:

Images: 223,228 chest X-rays in both DICOM and PNG formats, including 47 DICOM metadata elements.
Reports: 187,711 radiology reports pre-processed into sections (e.g., History, Findings, Impression), totaling 36 million text tokens, of which 13 million are impression tokens, making it the largest publicly available text dataset in radiology.
Demographics: Detailed clinical and socio-economic information on 64,725 unique patients, aiding the development of fair and unbiased models.
Pathology Labels: Labels for 14 pathologies derived from CheXbert, improving upon previous label extraction methods.
RadGraph Annotations: Detailed RadGraph annotations for impressions and findings sections, enhancing NLP application in radiology.
Pretrained Models: Models pre-trained on this dataset for key machine learning tasks are also made available.
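As a rough illustration of what "pre-processed into sections" can mean, the sketch below splits a report on section headers. The header names, the `HEADER:` layout, and the example report are assumptions made for illustration only; the actual CheXpert Plus preprocessing and tokenization may differ.

```python
import re

# Hypothetical section headers; real reports may use other names or layouts.
SECTION_NAMES = ("HISTORY", "FINDINGS", "IMPRESSION")
HEADER_RE = re.compile(
    r"^\s*(" + "|".join(SECTION_NAMES) + r")\s*:\s*",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(report: str) -> dict[str, str]:
    """Split a report into sections keyed by lower-cased header name."""
    sections = {}
    matches = list(HEADER_RE.finditer(report))
    for i, match in enumerate(matches):
        # A section body runs from the end of its header to the next header.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        body = report[match.end():end]
        sections[match.group(1).lower()] = " ".join(body.split())
    return sections


# Invented example text, not taken from the dataset.
example = """HISTORY: Cough and fever.
FINDINGS: Lungs are clear. No pleural effusion.
IMPRESSION: No acute cardiopulmonary process."""

sections = split_sections(example)

# Whitespace splitting is only a crude stand-in for a real tokenizer.
impression_tokens = len(sections["impression"].split())
```

Keeping each section under its own key makes it straightforward to train or evaluate on impressions alone, which is how the token counts above are broken down.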

Key Numerical Results

To the authors' knowledge, CheXpert Plus is the largest text de-identification effort in radiology, with nearly 1 million PHI spans anonymized, a prerequisite for releasing the reports for research. The dataset contains 36 million text tokens, more than any other publicly released radiology text dataset.
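To make the notion of anonymizing a "PHI span" concrete, here is a minimal sketch that replaces character spans with placeholder tags while leaving the rest of the report untouched. The span format, the label names, and the example sentence are hypothetical; this summary does not describe how the authors detect spans or which replacement strategy they use.

```python
def redact(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) character span with a [LABEL] tag."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping PHI spans")
        pieces.append(text[cursor:start])  # keep non-PHI text unchanged
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Invented example sentence; spans are located by hand here, whereas a real
# pipeline would obtain them from a detection model or human annotators.
report = "Compared with the study of 01/02/2020, read by Dr. Smith."
date, name = "01/02/2020", "Smith"
spans = [
    (report.index(date), report.index(date) + len(date), "DATE"),
    (report.index(name), report.index(name) + len(name), "NAME"),
]

redacted = redact(report, spans)
```

Because only the flagged spans are rewritten, the surrounding wording and section layout of the report survive de-identification.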

Comparative Analysis

Compared with other major datasets such as MIMIC-CXR, OpenI, and PadChest, CheXpert Plus is notable for combining images, reports, and patient metadata in one release. PadChest, although substantial, provides its reports in Spanish and is smaller than CheXpert Plus in both images and text. MIMIC-CXR is the closest comparable dataset; however, the greater textual detail in CheXpert Plus, particularly in impression sections, and the careful preservation of report structure during de-identification set it apart.

Implications and Future Developments

The comprehensive nature of CheXpert Plus offers several potential advancements:

Performance and Robustness: Given its scale and diversity, CheXpert Plus can significantly enhance model performance and generalization across different institutions.
Fairness in AI: The inclusion of varied demographic data facilitates the creation of models that are more equitable across different population subgroups, promoting fairness in AI.
Benchmarking and Validation: The extensive dataset provides a robust foundation for benchmarking and validating new algorithms, particularly those that bridge the gap between NLP and computer vision.
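One simple way demographic attributes support fairness analysis is by comparing a metric across subgroups. The sketch below computes per-group accuracy and the largest gap between groups; the labels, predictions, and group names are made-up toy values, and accuracy gap is only one of many possible fairness measures, not one attributed to the paper.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return a mapping from group name to accuracy within that group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Toy data: binary labels for one pathology and a hypothetical group column.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

per_group = accuracy_by_group(y_true, y_pred, groups)

# Largest difference in accuracy between any two subgroups.
gap = max(per_group.values()) - min(per_group.values())
```

A large gap flags a subgroup on which the model underperforms and which may warrant closer inspection.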

Speculative Future Developments

Vision-language models (VLMs) and other deep learning approaches can use CheXpert Plus to explore new directions in radiology AI. Potential areas include multi-modal learning frameworks, real-time diagnostic assistance systems, and improved unsupervised learning techniques. The dataset's detailed demographic attributes also open avenues for research into health disparities and biases in diagnostic tools.

Conclusion

CheXpert Plus is a large, curated resource for radiology AI that combines images, text reports, and patient data. By supporting the development of more accurate, fair, and robust models, it is positioned to advance the integration of AI into radiological practice, with implications for patient care and medical research.

In summary, the paper presents CheXpert Plus as a multifaceted dataset intended to improve the precision, fairness, and scope of AI research in radiology, and as a reference point for future medical AI dataset releases.
