VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations (2012.15029v3)

Published 30 Dec 2020 in eess.IV

Abstract: Most of the existing chest X-ray datasets include labels from a list of findings without specifying their locations on the radiographs. This limits the development of machine learning algorithms for the detection and localization of chest abnormalities. In this work, we describe a dataset of more than 100,000 chest X-ray scans that were retrospectively collected from two major hospitals in Vietnam. Out of this raw data, we release 18,000 images that were manually annotated by a total of 17 experienced radiologists with 22 local labels of rectangles surrounding abnormalities and 6 global labels of suspected diseases. The released dataset is divided into a training set of 15,000 and a test set of 3,000. Each scan in the training set was independently labeled by 3 radiologists, while each scan in the test set was labeled by the consensus of 5 radiologists. We designed and built a labeling platform for DICOM images to facilitate these annotation procedures. All images are made publicly available (https://www.physionet.org/content/vindr-cxr/1.0.0/) in DICOM format along with the labels of both the training set and the test set.

Authors (24)
  1. Ha Q. Nguyen (23 papers)
  2. Khanh Lam (3 papers)
  3. Linh T. Le (4 papers)
  4. Hieu H. Pham (35 papers)
  5. Dat Q. Tran (6 papers)
  6. Dung B. Nguyen (3 papers)
  7. Dung D. Le (20 papers)
  8. Chi M. Pham (1 paper)
  9. Hang T. T. Tong (1 paper)
  10. Diep H. Dinh (1 paper)
  11. Cuong D. Do (7 papers)
  12. Luu T. Doan (1 paper)
  13. Cuong N. Nguyen (4 papers)
  14. Binh T. Nguyen (49 papers)
  15. Que V. Nguyen (1 paper)
  16. Au D. Hoang (1 paper)
  17. Hien N. Phan (1 paper)
  18. Anh T. Nguyen (4 papers)
  19. Phuong H. Ho (1 paper)
  20. Dat T. Ngo (4 papers)
Citations (270)

Summary

VinDr-CXR: An In-Depth Exploration of a New Chest X-ray Dataset

In computer-aided diagnosis (CAD) for chest radiographs, progress depends heavily on well-annotated datasets. "VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations" introduces a manually annotated dataset aimed at improving the detection and localization of thoracic diseases and abnormalities. The paper describes how the dataset was created, what it contains, and how it is intended to be used, positioning it as a resource for machine learning research in medical imaging.

Dataset Composition and Annotation Process

The VinDr-CXR dataset comprises 18,000 manually annotated chest X-ray images drawn from a pool of more than 100,000 radiographs collected retrospectively from two major hospitals in Vietnam. A total of 17 experienced radiologists annotated the images with both local and global labels. The 22 local labels mark specific abnormalities with rectangular bounding boxes, while the 6 global labels record suspected diseases for the image as a whole. This dual-labeling scheme distinguishes VinDr-CXR from datasets that provide image-level findings only, and it supports algorithms that perform both localization and classification.
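To make the distinction between local and global labels concrete, the following is a minimal sketch of how one annotated image could be represented in memory. The class names, field names, identifiers, and coordinates are hypothetical and chosen for illustration only; they are not the schema of the released label files, and the finding names are examples rather than the full label list.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class LocalLabel:
    """One rectangle drawn by one radiologist around an abnormality (hypothetical schema)."""
    finding: str          # example: "Cardiomegaly"
    x_min: float          # pixel coordinates of the rectangle
    y_min: float
    x_max: float
    y_max: float
    radiologist_id: str   # hypothetical reader identifier

    def area(self) -> float:
        # Width times height, clamped at zero for degenerate boxes.
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)


@dataclass
class AnnotatedImage:
    """An image with box-level (local) and image-level (global) labels."""
    image_id: str
    local_labels: List[LocalLabel] = field(default_factory=list)
    global_labels: Set[str] = field(default_factory=set)

    def findings(self) -> Set[str]:
        # Distinct local findings present on this image, regardless of reader.
        return {label.finding for label in self.local_labels}


# Illustrative record with made-up identifiers and coordinates.
example = AnnotatedImage(
    image_id="img_0001",
    local_labels=[LocalLabel("Cardiomegaly", 100.0, 200.0, 400.0, 500.0, "R1")],
    global_labels={"Other diseases"},
)
```

Keeping the radiologist identifier on each box, rather than on the image, preserves the fact that several readers may annotate the same image independently.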

The authors split the released images into a training set of 15,000 images and a test set of 3,000 images to support algorithm development and evaluation. Each training image was labeled independently by three radiologists, while each test image was labeled by the consensus of five radiologists. The stricter consensus procedure for the test set is intended to yield more reliable reference labels and to reduce the influence of any single reader's judgment.
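Because each training image carries three independent sets of annotations, users typically need some way to compare or combine them. The summary does not say how this should be done, so the sketch below shows only one possible approach, under stated assumptions: intersection-over-union (IoU) to measure how well two readers' boxes agree, and a simple majority vote over image-level findings. The function names and the two-vote threshold are assumptions for illustration, not the authors' procedure.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned rectangles."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Overlap width and height, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def majority_findings(per_reader: Iterable[Set[str]], min_votes: int = 2) -> Set[str]:
    """Keep findings reported by at least `min_votes` readers (assumed threshold)."""
    votes: Counter = Counter()
    for findings in per_reader:
        votes.update(set(findings))  # each reader counts once per finding
    return {finding for finding, n in votes.items() if n >= min_votes}


# Three hypothetical readers annotating the same training image.
readers = [{"Cardiomegaly"}, {"Cardiomegaly", "Pleural effusion"}, set()]
agreed = majority_findings(readers)
overlap = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

A majority vote discards findings seen by only one reader, which trades recall for precision; other strategies, such as keeping every box or fusing overlapping boxes, may suit a given task better.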

Comparison with Existing Datasets

VinDr-CXR addresses two common limitations of existing public datasets: automatically extracted labels and image-level labels without localization. Datasets such as CheXpert and MIMIC-CXR rely heavily on automated rule-based methods to extract labels from radiology reports, which can produce inconsistent or erroneous labels. VinDr-CXR instead relies on manual annotation by radiologists to improve label precision.

Moreover, while datasets such as ChestX-ray14 provide large numbers of images but largely lack manual annotations, VinDr-CXR emphasizes quality over quantity. Its localized annotations can help CAD systems learn to pinpoint and interpret thoracic abnormalities rather than only flag their presence. VinDr-CXR therefore complements the scale of other datasets and fills a gap that matters for real-world clinical applications.

Implications and Future Directions

VinDr-CXR sets a new benchmark for public datasets and underscores the need for high-quality, localized annotations in medical imaging. Its open availability supports global research efforts and may catalyze innovation in CAD systems for thoracic disease detection. The authors also explicitly invite users to share their results and methods, which could foster the collaboration needed to address open challenges in medical imaging.

The dataset has many potential applications. With radiologist-verified labels and bounding-box annotations, researchers can explore more complex models aimed at improving diagnostic accuracy and operational efficiency. Beyond image classification and abnormality detection, the dataset can also serve as a testbed for related techniques such as transfer learning and unsupervised learning.

As the field moves toward more refined and holistic CAD systems, VinDr-CXR offers a manually annotated foundation on which future work can build. This may help produce methods that not only perform well in academic evaluations but also translate into clinical settings and, ultimately, better patient care.