Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans (2008.06388v4)

Published 14 Aug 2020 in cs.LG, cs.CV, eess.IV, and stat.ML

Abstract: Machine learning methods offer great promise for fast and accurate detection and prognostication of COVID-19 from standard-of-care chest radiographs (CXR) and computed tomography (CT) images. Many articles have been published in 2020 describing new machine learning-based models for both of these tasks, but it is unclear which are of potential clinical utility. In this systematic review, we search EMBASE via OVID, MEDLINE via PubMed, bioRxiv, medRxiv and arXiv for published papers and preprints uploaded from January 1, 2020 to October 3, 2020 which describe new machine learning models for the diagnosis or prognosis of COVID-19 from CXR or CT images. Our search identified 2,212 studies, of which 415 were included after initial screening and, after quality screening, 61 studies were included in this systematic review. Our review finds that none of the models identified are of potential clinical use due to methodological flaws and/or underlying biases. This is a major weakness, given the urgency with which validated COVID-19 models are needed. To address this, we give many recommendations which, if followed, will solve these issues and lead to higher quality model development and well documented manuscripts.

Citations (733)

Summary

  • The paper systematically reviews ML pitfalls in COVID-19 imaging, highlighting issues such as dataset bias and inadequate integration of multi-source data.
  • It applies rigorous evaluation frameworks like RQS and CLAIM to analyze both traditional machine learning and deep learning methodologies.
  • The study recommends external validation, comprehensive documentation, and proper demographic matching to enhance model reliability in clinical settings.

Critical Analysis of Common Pitfalls and Recommendations for Using Machine Learning to Detect and Prognosticate for COVID-19 using Chest Radiographs and CT Scans

Introduction

The COVID-19 pandemic has spurred the development of machine learning (ML) models for rapid and accurate diagnosis and prognosis from medical imaging, chiefly chest radiographs (CXR) and computed tomography (CT) scans. This proliferation reflects the urgency of supporting clinical decision-making during the global health crisis. The paper is a systematic review that identifies common pitfalls in the existing literature and proposes recommendations to address them.

Methodological Overview

The review evaluates 61 studies drawn from 2,212 initially identified records: 415 passed initial screening, and 61 remained after quality screening. The included studies cover both traditional machine learning methods and deep learning (DL) techniques. Quality was assessed with established frameworks: the Radiomic Quality Score (RQS) for traditional methods and the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) for deep learning.

Critical Findings on Model Validity

None of the models evaluated was deemed suitable for clinical deployment, owing to methodological flaws and underlying biases, including but not limited to:

  1. Bias in Small Datasets: High prevalence of small sample sizes without adequate diversity to ensure generalizability.
  2. Variability in Internationally-Sourced Datasets: Inconsistent data acquisition and lack of standardized protocols.
  3. Suboptimal Integration of Multi-Stream Data: Incomplete and unreliable combination of imaging with other clinical data.
  4. Insufficient Documentation: Many studies failed to document critical methodological aspects, such as preprocessing steps, model training configurations, and demographic distributions.

Numerical Metrics and Performance Limitations

Deep learning models typically reported AUCs between 0.70 and 1.00. This wide range is attributable to differing class definitions, data partitions, and validation methods. Traditional machine learning approaches showed similar heterogeneity in performance, and many of these studies omitted proper feature reduction and validation.

Recommendations

Dataset Utilization:

  • Publicly available datasets should be used with caution, since they can introduce bias through problems at the source and through duplicated images.
  • Researchers should match demographics across the cohorts they compare, avoiding illogical inclusions such as pediatric images in adult-focused studies.
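
The two dataset checks above can be sketched in a few lines of Python. This is a minimal illustration, not the review's own procedure: the image IDs, the byte contents, the record fields ("id", "age"), and the 18-year cutoff are all hypothetical. Hashing raw bytes only finds exact copies; re-encoded or resized duplicates would need a perceptual comparison, which is not shown here.

```python
import hashlib
from collections import defaultdict


def find_duplicate_images(images):
    """Group image IDs by the SHA-256 digest of their raw bytes.

    `images` maps a (hypothetical) image ID to its file contents.
    Only groups with more than one ID are returned, i.e. exact copies.
    """
    groups = defaultdict(list)
    for image_id, content in images.items():
        groups[hashlib.sha256(content).hexdigest()].append(image_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


def filter_adults(records, min_age=18):
    """Keep records with a known age of at least `min_age` (assumed cutoff)."""
    return [r for r in records if r.get("age") is not None and r["age"] >= min_age]


# Toy data standing in for images pooled from several public sources.
images = {
    "site_a/001.png": b"\x89PNG-scan-1",
    "site_b/copy_of_001.png": b"\x89PNG-scan-1",
    "site_a/002.png": b"\x89PNG-scan-2",
}
duplicates = find_duplicate_images(images)

records = [
    {"id": "p1", "age": 54},
    {"id": "p2", "age": 7},      # pediatric: excluded from an adult study
    {"id": "p3", "age": None},   # unknown age: excluded rather than guessed
]
adults = filter_adults(records)

print(duplicates)
print([r["id"] for r in adults])
```

Dropping records with unknown age is a conservative choice made for this sketch; a real study would need to document whichever rule it applies.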

Model Evaluation:

  • External validation using well-curated datasets is imperative to assess generalizability.
  • Performance metrics should be reported with confidence intervals to convey uncertainty, which is especially important given the small sample sizes typical of COVID-19 datasets.
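
One common way to attach a confidence interval to an AUC is a percentile bootstrap over the test set. The sketch below assumes this technique; the review does not prescribe a specific method, and the function names, the toy labels and scores, and the choice of 2,000 resamples are illustrative only.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_resamples=2000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(labels)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains one class only; AUC is undefined
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return auc(labels, scores), lower, upper


# Toy test set: 4 negatives followed by 4 positives.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.3, 0.6, 0.7, 0.9]
point, lower, upper = bootstrap_auc_ci(labels, scores)
print(f"AUC = {point:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

With only eight cases the interval will be very wide, which is the point of reporting it: a single AUC from a small test set says little on its own.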

Reproducibility:

  • Ensure detailed reporting of preprocessing techniques, training protocols, and exact dataset partitions. Use specific version references for datasets and code.
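
Exact partitions are easier to report when they are derived deterministically. The sketch below is one possible approach, assuming each image carries a patient identifier: it hashes the patient ID with a versioned salt, so the same patient always lands in the same partition and all of a patient's images stay together. The salt string, the 70/15/15 fractions, and the IDs are hypothetical.

```python
import hashlib

# Assumed split fractions; they should sum to 1.
FRACTIONS = (("train", 0.70), ("val", 0.15), ("test", 0.15))


def assign_partition(patient_id, salt="split-v1", fractions=FRACTIONS):
    """Map a patient ID to a partition name, deterministically.

    Changing `salt` produces a different but equally reproducible split,
    so the salt can be cited as the split version in a manuscript.
    """
    digest = hashlib.sha256(f"{salt}:{patient_id}".encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # pseudo-uniform value in [0, 1)
    cumulative = 0.0
    for name, fraction in fractions:
        cumulative += fraction
        if u < cumulative:
            return name
    return fractions[-1][0]  # guard against floating-point rounding


# Toy (image ID, patient ID) pairs; patient "A" has two images.
images = [("img1", "A"), ("img2", "A"), ("img3", "B"), ("img4", "C")]
manifest = {image_id: assign_partition(pid) for image_id, pid in images}
print(manifest)
```

Publishing the resulting manifest alongside the code and the dataset version lets others reconstruct the same training, validation, and test sets.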

Manuscript Documentation:

  • Follow established frameworks (RQS, CLAIM, TRIPOD, PROBAST) to ensure comprehensive reporting.
  • Provide explicit details regarding data preprocessing, training configurations, sensitivity analyses, and demographics of dataset partitions.
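
Reporting the demographics of each partition can be as simple as tabulating a few fields per split. The following sketch assumes a list of per-patient records with hypothetical fields ("partition", "age", "sex", "label"); the values are toy data and the set of summary fields is only a suggestion.

```python
from collections import Counter, defaultdict
from statistics import mean


def summarize_partitions(records):
    """Return size, mean age, and sex/label counts for each partition."""
    by_partition = defaultdict(list)
    for record in records:
        by_partition[record["partition"]].append(record)
    summary = {}
    for name, rows in by_partition.items():
        summary[name] = {
            "n": len(rows),
            "mean_age": mean(r["age"] for r in rows),
            "sex_counts": dict(Counter(r["sex"] for r in rows)),
            "label_counts": dict(Counter(r["label"] for r in rows)),
        }
    return summary


records = [
    {"partition": "train", "age": 60, "sex": "F", "label": "covid"},
    {"partition": "train", "age": 40, "sex": "M", "label": "control"},
    {"partition": "train", "age": 50, "sex": "M", "label": "covid"},
    {"partition": "test", "age": 70, "sex": "F", "label": "control"},
]
summary = summarize_partitions(records)
for name, stats in summary.items():
    print(name, stats)
```

A table like this makes mismatches visible at a glance, for example a test set that is markedly older than the training set.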

Peer Review Process:

  • Utilize combined expertise from clinical and machine learning backgrounds to identify biases and methodological flaws more effectively.
  • Employ robust checklists to ensure manuscripts meet high standards before publication.

Implications and Future Directions

Greater transparency and methodological rigor are essential for advancing the clinical utility of machine learning models for COVID-19 diagnosis and prognosis. This synthesis provides benchmarks for future studies that aim to improve the reliability and applicability of AI-driven solutions in medical imaging. Closer collaboration between clinicians and data scientists, together with better dataset quality and availability, is pivotal in translating these technologies into clinical practice.

The review emphasizes that, despite significant enthusiasm for and potential in applying AI to COVID-19 imaging, the path to clinical application remains fraught with challenges. Addressing these challenges systematically can support the development of robust, unbiased, and clinically relevant models and strengthen the role of AI in future pandemic responses.