MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs (1901.07042v5)

Published 21 Jan 2019 in cs.CV, cs.LG, and eess.IV

Abstract: Chest radiography is an extremely powerful imaging modality, allowing for a detailed inspection of a patient's thorax, but requiring specialized training for proper interpretation. With the advent of high performance general purpose computer vision algorithms, the accurate automated analysis of chest radiographs is becoming increasingly of interest to researchers. However, a key challenge in the development of these techniques is the lack of sufficient data. Here we describe MIMIC-CXR-JPG v2.0.0, a large dataset of 377,110 chest x-rays associated with 227,827 imaging studies sourced from the Beth Israel Deaconess Medical Center between 2011 - 2016. Images are provided with 14 labels derived from two natural language processing tools applied to the corresponding free-text radiology reports. MIMIC-CXR-JPG is derived entirely from the MIMIC-CXR database, and aims to provide a convenient processed version of MIMIC-CXR, as well as to provide a standard reference for data splits and image labels. All images have been de-identified to protect patient privacy. The dataset is made freely available to facilitate and encourage a wide range of research in medical computer vision.

Summary

The paper presents a large, de-identified dataset of 377,110 chest x-rays labeled via NLP for improved automated pathology detection.
It uses two open-source NLP tools, NegBio and CheXpert, to extract 14 labels from the radiology reports, and a subset of the labels is validated by expert review.
The dataset’s JPEG format enhances accessibility compared to DICOM, enabling standardized benchmarks for training and evaluating AI models in medical imaging.

MIMIC-CXR-JPG: A Comprehensive Dataset for Advancing Computer Vision in Medical Imaging

The paper "MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs" introduces a large dataset intended to support research on the automated analysis of chest radiographs. Developed by researchers from MIT and collaborating institutions, the dataset consists of 377,110 chest x-ray images from 227,827 imaging studies performed at Beth Israel Deaconess Medical Center between 2011 and 2016. It addresses the shortage of large labeled datasets in medical imaging by providing 14 labels derived with natural language processing tools applied to the associated radiology reports.

Dataset Characteristics

MIMIC-CXR-JPG provides JPEG versions of images that were originally stored in DICOM, the format commonly used in clinical settings. The conversion makes the data more accessible to researchers, particularly those outside medicine who may find DICOM files complex to work with. The converted images are compressed and can be analyzed with general-purpose computer vision tools. All images are de-identified to protect patient privacy, in line with HIPAA requirements.
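One step that any DICOM-to-JPEG conversion typically needs is mapping the wider raw intensity range of a radiograph onto the 8-bit range that JPEG stores. The sketch below shows a plain min-max rescaling on a tiny nested list standing in for pixel data. It is an illustration of the general idea only: the function name `rescale_to_uint8` and the toy values are assumptions, and the summary does not describe the authors' actual conversion pipeline.

```python
from typing import List


def rescale_to_uint8(pixels: List[List[int]]) -> List[List[int]]:
    """Linearly map raw intensities onto the 0-255 range of an 8-bit image.

    Hypothetical helper for illustration; not the authors' conversion code.
    """
    flat = [value for row in pixels for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant image carries no contrast, so map everything to 0.
        return [[0 for _ in row] for row in pixels]
    scale = 255.0 / (hi - lo)
    return [[int(round((value - lo) * scale)) for value in row] for row in pixels]


# Toy 2x2 "image" with 12-bit style intensities (made-up values).
raw = [[0, 1024], [2048, 4095]]
eight_bit = rescale_to_uint8(raw)
```

A real pipeline would read the pixel array with a DICOM library and write the result with an image library; those steps are left out here so the example stays self-contained.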

Labeling and Validation

The labels are derived from the radiology reports with two open-source NLP tools, NegBio and CheXpert. The 14 labels cover common findings visible on the radiographs, including cardiomegaly and pleural effusion. To validate the labels, the authors compared them against a subset of 687 reports that were manually reviewed by an expert radiologist. This check gives users a measure of how far the automated labels can be trusted when training AI models.
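As a rough illustration of what such report-derived labels look like to a user, the sketch below decodes a small label table using the CheXpert convention, in which 1.0 marks a positive mention, 0.0 a negated mention, -1.0 an uncertain mention, and a blank cell means the finding was not mentioned. The toy rows, the identifiers, and the helper names `decode` and `decode_labels` are assumptions made for this example, not part of the released files.

```python
import csv
import io

# Toy label table (made-up rows); only two of the 14 labels are shown.
TOY_CSV = """subject_id,study_id,Cardiomegaly,Pleural Effusion
1,101,1.0,
1,102,0.0,-1.0
2,201,,1.0
"""

ID_COLUMNS = ("subject_id", "study_id")
MEANING = {1.0: "positive", 0.0: "negative", -1.0: "uncertain"}


def decode(value: str) -> str:
    """Translate one label cell into a readable category."""
    if value.strip() == "":
        return "not mentioned"
    return MEANING[float(value)]


def decode_labels(csv_text: str) -> list:
    """Return one dict per study with label cells decoded."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append(
            {k: (v if k in ID_COLUMNS else decode(v)) for k, v in row.items()}
        )
    return rows


decoded_rows = decode_labels(TOY_CSV)
```

How uncertain and unmentioned cells are treated during training (ignored, mapped to negative, or modeled explicitly) is a design choice left to the user.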

Technical Evaluation

Tables in the paper report the performance of the NLP tools for each label, evaluated separately for positive, uncertain, and negated mentions. Precision for common pathologies such as pneumothorax and pleural effusion shows high agreement with the manual annotations, which supports using the automated labels as training targets for AI models.
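For readers unfamiliar with how agreement of this kind is usually quantified, the following minimal sketch computes precision, recall, and F1 for a single label by comparing automated labels against manual annotations. The two lists are invented toy data and the function name `precision_recall_f1` is an assumption; the numbers have no relation to the results reported in the paper.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    predicted: Sequence[int], reference: Sequence[int]
) -> Tuple[float, float, float]:
    """Compare binary automated labels with binary manual annotations."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # both say present
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # tool only
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # annotator only
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: 1 = finding marked present, 0 = not present (made-up values).
nlp_labels = [1, 1, 0, 1, 0, 0]
manual_labels = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(nlp_labels, manual_labels)
```

The same function could be applied once per label and once per mention type (positive, uncertain, negated) by binarizing the labels accordingly.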

Dataset Splits and Licenses

To facilitate consistent model evaluation, the data is partitioned into training, validation, and test sets. Because the split is distributed with the dataset as a standard reference, researchers can replicate experimental conditions and compare results fairly. Usage of the dataset is granted under specific terms, consistent with ethical research practice.
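This summary does not say how studies were assigned to each partition. A common precaution in medical imaging, sketched below under that caveat, is to split at the patient level so that no patient contributes studies to more than one set. The function name `split_by_patient`, the toy identifiers, the partition sizes, and the partition names are all assumptions for illustration.

```python
import random
from typing import Dict, List, Tuple


def split_by_patient(
    studies: List[Tuple[int, int]],
    n_val_patients: int,
    n_test_patients: int,
    seed: int = 0,
) -> Dict[int, str]:
    """Assign each study to a partition based on its patient.

    `studies` holds (subject_id, study_id) pairs. Hypothetical helper,
    not the authors' procedure.
    """
    patients = sorted({subject_id for subject_id, _ in studies})
    random.Random(seed).shuffle(patients)  # seeded for reproducibility
    test = set(patients[:n_test_patients])
    val = set(patients[n_test_patients:n_test_patients + n_val_patients])
    assignment = {}
    for subject_id, study_id in studies:
        if subject_id in test:
            assignment[study_id] = "test"
        elif subject_id in val:
            assignment[study_id] = "validate"
        else:
            assignment[study_id] = "train"
    return assignment


# Toy (subject_id, study_id) pairs; some patients have several studies.
toy_studies = [(1, 101), (1, 102), (2, 201), (3, 301), (3, 302), (4, 401)]
toy_split = split_by_patient(toy_studies, n_val_patients=1, n_test_patients=1)
```

When a reference split is published with a dataset, using it directly is preferable to re-splitting, since results are then comparable across papers.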

Implications and Future Directions

The release of MIMIC-CXR-JPG is likely to support progress in automated radiograph analysis. By providing chest radiographs in JPEG format together with reference splits and labels, it offers a common benchmark for developing and comparing algorithms across medical computer vision tasks. The dataset is well suited to deep learning approaches for detecting pathologies, which could improve diagnostic efficiency and accuracy.

As computer vision and AI continue to evolve, this dataset will provide a foundational resource, encouraging novel methodologies and promoting improvements in health outcomes. Future work could focus on expanding the dataset with additional modalities and investigating the integration of multi-modal data for comprehensive AI-driven diagnostic tools. This paper underscores the critical intersection of data availability and healthcare innovation, propelling forward the domain of medical imaging research.
