COVID-CT-Dataset: A CT Scan Dataset about COVID-19 (2003.13865v3)

Published 30 Mar 2020 in cs.LG, cs.CV, eess.IV, and stat.ML

Abstract: During the outbreak time of COVID-19, computed tomography (CT) is a useful manner for diagnosing COVID-19 patients. Due to privacy issues, publicly available COVID-19 CT datasets are highly difficult to obtain, which hinders the research and development of AI-powered diagnosis methods of COVID-19 based on CTs. To address this issue, we build an open-sourced dataset -- COVID-CT, which contains 349 COVID-19 CT images from 216 patients and 463 non-COVID-19 CTs. The utility of this dataset is confirmed by a senior radiologist who has been diagnosing and treating COVID-19 patients since the outbreak of this pandemic. We also perform experimental studies which further demonstrate that this dataset is useful for developing AI-based diagnosis models of COVID-19. Using this dataset, we develop diagnosis methods based on multi-task learning and self-supervised learning, that achieve an F1 of 0.90, an AUC of 0.98, and an accuracy of 0.89. According to the senior radiologist, models with such performance are good enough for clinical usage. The data and code are available at https://github.com/UCSD-AI4H/COVID-CT

Authors (6)
  1. Xingyi Yang (45 papers)
  2. Xuehai He (26 papers)
  3. Jinyu Zhao (23 papers)
  4. Yichen Zhang (157 papers)
  5. Shanghang Zhang (173 papers)
  6. Pengtao Xie (86 papers)
Citations (817)

Summary

COVID-CT-Dataset: A CT Scan Dataset about COVID-19

"COVID-CT-Dataset: A CT Scan Dataset about COVID-19" is an empirical research paper that presents a publicly available dataset intended to support AI research on diagnosing COVID-19 from computed tomography (CT) images. The dataset, referred to as COVID-CT, comprises 349 COVID-19-positive CT images obtained from 216 patients and 463 non-COVID-19 CT images. The work addresses the shortage of accessible medical datasets during the COVID-19 pandemic. The resource was reviewed by a senior radiologist, and experimental studies support its utility for AI model development.
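As a quick orientation, the class balance implied by these counts can be worked out directly. The sketch below uses only the three numbers quoted above; the variable names are illustrative, and the majority-class baseline is computed over the whole dataset, which may differ from the balance of whatever test split the paper used.

```python
# Image and patient counts as stated in the abstract.
covid_images = 349
non_covid_images = 463
covid_patients = 216

total_images = covid_images + non_covid_images
positive_fraction = covid_images / total_images

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(covid_images, non_covid_images) / total_images

# Average number of COVID-19 images contributed per COVID-19 patient.
images_per_patient = covid_images / covid_patients

print(f"total images:        {total_images}")
print(f"positive fraction:   {positive_fraction:.3f}")
print(f"majority baseline:   {majority_baseline:.3f}")
print(f"images per patient:  {images_per_patient:.2f}")
```

The dataset holds 812 images, roughly 43% of them COVID-19-positive, so a model that always answers "non-COVID" would already score about 57% accuracy on data with this balance. That is a useful floor to keep in mind when reading the accuracy figures reported later.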

Introduction

COVID-19 posed a major challenge to global healthcare, partly because of inadequate testing infrastructure. Reverse transcription polymerase chain reaction (RT-PCR), the gold standard for diagnosis, was in limited supply, which created a need for alternative diagnostic methods. CT scans emerged as an auxiliary diagnostic tool, particularly useful at the height of the outbreak, when pneumonia was highly likely to stem from COVID-19. However, the development of AI methods for CT-based diagnosis has been held back by privacy concerns and the resulting lack of available data. To fill this gap, the authors released the COVID-CT dataset to facilitate research and development in AI-assisted diagnosis.

The COVID-CT Dataset

The authors compiled the dataset by extracting COVID-19-related CT images from 760 preprints on medRxiv and bioRxiv. Images were extracted automatically with PyMuPDF and then verified manually, yielding a dataset that contains both COVID-19-positive and non-COVID-19 images. The images were annotated with clinical information, including patient demographics and medical history.
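The automated extraction step can be sketched roughly as follows. This is a minimal illustration, not the authors' script: it assumes PyMuPDF is installed (imported as `fitz`), and the function names, the output naming scheme, and the directory layout are hypothetical. It pulls every embedded image out of a PDF, so the manual verification described above would still be needed to pick out the CT images.

```python
from pathlib import Path


def image_filename(pdf_path: str, page_number: int, xref: int, ext: str) -> str:
    """Build a traceable output name: <pdf stem>_p<page>_x<xref>.<ext> (hypothetical scheme)."""
    return f"{Path(pdf_path).stem}_p{page_number}_x{xref}.{ext}"


def extract_images(pdf_path: str, out_dir: str) -> list[str]:
    """Save every image embedded in a PDF and return the paths written."""
    import fitz  # PyMuPDF; imported lazily so image_filename works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for img in page.get_images(full=True):
                xref = img[0]  # first tuple entry is the image's xref number
                info = doc.extract_image(xref)  # dict with "image" bytes and "ext"
                target = out / image_filename(pdf_path, page_number, xref, info["ext"])
                target.write_bytes(info["image"])
                written.append(str(target))
    return written
```

Calling `extract_images("preprint.pdf", "extracted/")` on a local file would write one image file per embedded image. Keeping the source file name and page number in each output name makes it easier to trace an image back to the preprint and its caption during manual review.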

Validation and Utility Concerns

Clinical practitioners raised two concerns: that extracting images from PDFs may degrade image quality, and that each sample is a single slice rather than a full CT series. The authors addressed these concerns by consulting a senior radiologist and through empirical validation. The validation used original CT images to train deep learning models and indicated that the loss in image quality does not significantly impair diagnostic accuracy.

Experimental Studies

Study I: Utility Verification

A series of experiments was conducted to confirm the dataset’s utility for developing effective AI models, with comparative evaluation on DenseNet-169 and ResNet-50 architectures. Models trained on the full COVID-CT dataset achieved considerably higher performance (accuracy of 79.5%, F1 of 76.0%, and AUC of 90.1%) than those trained on smaller, original CT datasets, supporting the dataset’s value for improving model performance.
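For readers less familiar with the three reported metrics, the following sketch shows how accuracy, F1, and AUC are computed for a binary classifier. It is plain Python with made-up labels and scores, not the paper's evaluation code; the 0.5 decision threshold is an assumption for the example.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for binary labels, with 1 as the positive (COVID-19) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: six scans, three of them positive (illustrative values only).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}  F1={f1:.3f}  AUC={auc(y_true, scores):.3f}")
```

Accuracy and F1 depend on the chosen threshold, while AUC is computed from the raw scores and summarizes ranking quality across all thresholds. This is why a model can report an AUC noticeably higher than its accuracy, as in the figures above.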

Study II: Performance Enhancement

To improve on these results and reach clinically useful accuracy, the authors applied multi-task learning and contrastive self-supervised learning (CSSL). Incorporating lung and lesion masks significantly improved performance by guiding the model’s focus to relevant anatomical regions. CSSL fine-tuning on top of pre-trained models improved results further, achieving an F1 score of 0.90, an AUC of 0.98, and an accuracy of 0.89. According to the consulting radiologist, models with this level of performance are good enough for clinical usage.
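The idea behind contrastive self-supervised learning can be illustrated with an InfoNCE-style loss, a common objective in this family. The summary does not specify the exact loss the authors used, so the following is a generic sketch under that assumption; the function names, the temperature value, and the toy two-dimensional embeddings are all illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: cross-entropy of picking the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# Toy embeddings: two augmented views of one image should lie close together.
anchor = [1.0, 0.0]
same_image_view = [0.9, 0.1]
other_images = [[0.0, 1.0], [-1.0, 0.0]]

good = info_nce(anchor, same_image_view, other_images)
bad = info_nce(anchor, other_images[0], [same_image_view, other_images[1]])
print(f"loss when views agree: {good:.4f}, when they do not: {bad:.4f}")
```

The loss is low when the two views of the same image are embedded close together and other images are pushed away, and high otherwise. Minimizing it requires no labels, which is what makes this kind of pre-training attractive when labeled CT images are scarce.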

Implications and Future Directions

The COVID-CT dataset has both practical and methodological implications. Practically, it supports rapid development of machine learning methods tailored to COVID-19, which could improve diagnostic accuracy in clinical settings where radiological expertise is stretched thin. Methodologically, it suggests that lower-quality data sourced from publications can be used to train high-performance models, a useful insight for dataset acquisition in future crises. Looking forward, enlarging the dataset, adding longitudinal data, and validating across diverse populations could yield models with greater generalizability and diagnostic accuracy.

Conclusion

This paper provides the computational and clinical research communities with a valuable tool in the fight against COVID-19. The COVID-CT dataset is a public resource, reviewed by a senior radiologist and supported by experiments, that illustrates how AI can augment diagnosis during healthcare crises. The experimental results and the gains obtained with multi-task and self-supervised learning provide a basis for future AI-driven diagnostic methods.
