BIMCV COVID-19+: a large annotated dataset of RX and CT images from COVID-19 patients

Published 1 Jun 2020 in eess.IV, cs.CV, and cs.LG | (2006.01174v3)

Abstract: This paper describes BIMCV COVID-19+, a large dataset from the Valencian Region Medical ImageBank (BIMCV) containing chest X-ray images CXR (CR, DX) and computed tomography (CT) imaging of COVID-19+ patients along with their radiological findings and locations, pathologies, radiological reports (in Spanish), DICOM metadata, Polymerase chain reaction (PCR), Immunoglobulin G (IgG) and Immunoglobulin M (IgM) diagnostic antibody tests. The findings have been mapped onto standard Unified Medical Language System (UMLS) terminology and cover a wide spectrum of thoracic entities, unlike the considerably more reduced number of entities annotated in previous datasets. Images are stored in high resolution and entities are localized with anatomical labels and stored in a Medical Imaging Data Structure (MIDS) format. In addition, 10 images were annotated by a team of radiologists to include semantic segmentation of radiological findings. This first iteration of the database includes 1,380 CX, 885 DX and 163 CT studies from 1,311 COVID-19+ patients. This is, to the best of our knowledge, the largest COVID-19+ dataset of images available in an open format. The dataset can be downloaded from http://bimcv.cipf.es/bimcv-projects/bimcv-covid19.

Summary

The paper presents the creation of a large imaging dataset with over 2,400 COVID-19 positive studies including chest X-rays and CT scans.
It describes radiologist-led semantic segmentation of a small subset of images and the mapping of findings onto standard UMLS terminology, both intended to support AI-based diagnosis.
The open-access dataset, with ongoing updates, enables advanced AI research for automated diagnosis, prognosis, and treatment planning.

An Analysis of the BIMCV COVID-19+ Annotated Imaging Dataset for AI Applications

The paper delineates the creation and potential utility of the BIMCV COVID-19+ dataset, a comprehensive repository of radiological images (chest X-rays and CT scans) specifically curated to aid COVID-19 research. This dataset represents a significant step in bolstering research efforts in data-driven diagnosis and treatment of COVID-19, particularly using AI and machine learning models. The dataset is publicly available, promising to be a valuable resource for researchers worldwide.

Dataset Composition

The BIMCV COVID-19+ dataset comprises 1,380 computed radiography (CR), 885 digital radiography (DX), and 163 CT studies from 1,311 patients confirmed as COVID-19 positive. The imaging data are accompanied by radiological reports in Spanish, DICOM metadata, PCR and antibody (IgG, IgM) test results, and patient demographic information. High-resolution images, anatomical localization labels, and the mapping of findings onto Unified Medical Language System (UMLS) terminology make the dataset particularly rich in context and utility.
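The reported counts can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative and not part of the dataset.

```python
# Study counts as reported for the first iteration of the dataset.
studies = {"CR": 1380, "DX": 885, "CT": 163}
patients = 1311

total_studies = sum(studies.values())          # all imaging studies
xray_studies = studies["CR"] + studies["DX"]   # chest X-ray studies only
studies_per_patient = total_studies / patients

print(total_studies)                  # 2428
print(xray_studies)                   # 2265
print(round(studies_per_patient, 2))  # 1.85
```

On average there are close to two studies per patient, so many patients contribute more than one study.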

Radiologists also performed semantic segmentation on a small subset of the images (10 in this first iteration) to mark regions of interest, such as ground-glass opacities, that are associated with COVID-19 lung pathology. Beyond this subset, the annotations are not limited to a COVID-19+ label: they cover a broad spectrum of thoracic entities, which makes the dataset applicable to a wider range of thoracic pathology identification tasks using AI.

Implications for AI Research

This dataset offers considerable potential for advancing artificial intelligence applications in medical imaging. Its combination of images, reports, localized findings, and test results makes it a suitable resource for training machine learning models for tasks such as automated diagnosis, prognosis prediction, and treatment planning in respiratory diseases, including but not limited to COVID-19.

Key Aspects of the Dataset:

Scale and Accessibility: According to the authors, this was the largest COVID-19+ imaging dataset available in an open format at the time of publication. It provides both image data and metadata, which facilitates the development of AI models that generalize.
Detailed Annotation: The detailed and standardized annotation enhances the dataset's usability in supervised learning workflows, aiding models that depend on contextual information such as clinical findings and anatomical localizations.
Incremental Nature: The dataset is structured to grow over time. This continuous expansion allows models trained on the data to be iteratively improved and tested on new and evolving data distributions.
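Because the dataset holds more studies than patients, a supervised learning workflow would normally split data by patient instead of by study, so that images from one patient never appear in both training and test sets. The following is a minimal sketch of such a split, assuming a simple mapping from study ID to patient ID; the function name and the identifiers are hypothetical and do not reflect the dataset's actual naming.

```python
import random
from collections import defaultdict

def split_by_patient(study_to_patient, test_fraction=0.2, seed=0):
    """Split study IDs into train/test lists with no patient in both."""
    by_patient = defaultdict(list)
    for study_id, patient_id in study_to_patient.items():
        by_patient[patient_id].append(study_id)

    # Shuffle patients (not studies) with a fixed seed for reproducibility.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_patients = patient_ids[:n_test]
    train_patients = patient_ids[n_test:]

    train = [s for p in train_patients for s in by_patient[p]]
    test = [s for p in test_patients for s in by_patient[p]]
    return train, test

# Hypothetical identifiers, for illustration only.
studies = {
    "study-01": "patient-A", "study-02": "patient-A",
    "study-03": "patient-B",
    "study-04": "patient-C", "study-05": "patient-C",
    "study-06": "patient-D",
}
train, test = split_by_patient(studies, test_fraction=0.25)
```

As new releases add studies, the same patient-level grouping can be reapplied so that earlier test patients stay out of the training set.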

Technical and Ethical Considerations

The dataset meets essential ethical standards concerning privacy and data protection. An extensive process of ethical approval, data anonymization, and adherence to data protection laws ensures compliance with regulatory standards, giving users confidence in the ethical stewardship of the data.

The paper's technical validation includes neural networks for image preprocessing tasks, such as identifying the projection orientation of an image, which indicates that the dataset is ready for use in AI applications. In addition, neural networks performing Named Entity Recognition were applied to the Spanish radiological reports, increasing the dataset's utility by turning free-text reports into structured findings.
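To illustrate the general idea of turning report text into coded findings, the sketch below uses a plain dictionary lookup. This is not the neural Named Entity Recognition approach described in the paper; it is a much simpler stand-in. The term list and the codes are hypothetical placeholders, not real UMLS identifiers.

```python
import re
import unicodedata

# Hypothetical term-to-code table; the codes are placeholders, not UMLS CUIs.
TERM_TO_CODE = {
    "neumonia": "CODE_PNEUMONIA",
    "derrame pleural": "CODE_PLEURAL_EFFUSION",
    "vidrio deslustrado": "CODE_GROUND_GLASS",
}

def normalize(text):
    """Lowercase the text and strip diacritics (e.g. 'neumonía' -> 'neumonia')."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def tag_findings(report):
    """Return (term, code) pairs in order of first appearance in the report."""
    norm = normalize(report)
    hits = []
    for term, code in TERM_TO_CODE.items():
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", norm):
            hits.append((match.start(), term, code))
    return [(term, code) for _, term, code in sorted(hits)]

report = "Opacidades en vidrio deslustrado bilaterales. No se observa derrame pleural."
findings = tag_findings(report)
print(findings)
```

The lookup tags "derrame pleural" even though the report negates it ("No se observa"), which shows why context-aware models are preferable to simple string matching for this task.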

Future Directions

As the dataset is updated and expanded, further opportunities arise for more nuanced analyses, potentially extending to the study of post-COVID conditions and other comorbidities detectable via imaging. The open-access nature of BIMCV COVID-19+ not only facilitates immediate scientific exploration but also sets a precedent for future medical imaging datasets in other domains, fostering collaborative research efforts globally.

In conclusion, the BIMCV COVID-19+ dataset stands out as a highly valuable resource for researchers aiming to enhance AI capabilities in medical imaging, specifically for COVID-19 and thoracic disease diagnosis. Its comprehensive nature and open access promise to foster advancements across multiple areas of AI and medical research.
