
Confounding variables can degrade generalization performance of radiological deep learning models (1807.00431v2)

Published 2 Jul 2018 in cs.CV, cs.LG, and stat.ML

Abstract: Early results in using convolutional neural networks (CNNs) on x-rays to diagnose disease have been promising, but it has not yet been shown that models trained on x-rays from one hospital or one group of hospitals will work equally well at different hospitals. Before these tools are used for computer-aided diagnosis in real-world clinical settings, we must verify their ability to generalize across a variety of hospital systems. A cross-sectional design was used to train and evaluate pneumonia screening CNNs on 158,323 chest x-rays from NIH (n=112,120 from 30,805 patients), Mount Sinai (42,396 from 12,904 patients), and Indiana (n=3,807 from 3,683 patients). In 3 / 5 natural comparisons, performance on chest x-rays from outside hospitals was significantly lower than on held-out x-rays from the original hospital systems. CNNs were able to detect where an x-ray was acquired (hospital system, hospital department) with extremely high accuracy and calibrate predictions accordingly. The performance of CNNs in diagnosing diseases on x-rays may reflect not only their ability to identify disease-specific imaging findings on x-rays, but also their ability to exploit confounding information. Estimates of CNN performance based on test data from hospital systems used for model training may overstate their likely real-world performance.

Citations (1,093)

Summary

  • The paper demonstrates that CNNs achieve high internal AUC yet suffer notable performance declines on external datasets.
  • It reveals that models can almost perfectly identify a radiograph’s origin due to confounding variables unrelated to pathology.
  • Engineered-prevalence experiments show that models leaning on dataset-level prevalence statistics gain internal AUC at the cost of genuine pathological feature extraction.

Confounding Variables Can Degrade Generalization Performance of Radiological Deep Learning Models

The paper "Confounding variables can degrade generalization performance of radiological deep learning models" addresses the critical issue of model generalizability in the context of convolutional neural networks (CNNs) used for diagnosing pneumonia from chest X-rays. The paper is a comprehensive cross-sectional analysis involving data from three distinct hospital systems—NIH, Mount Sinai, and Indiana University (IU)—with a dataset comprising 158,323 chest X-rays.

Key Findings

  1. Dataset and Model Performance:
    • The paper evaluated the performance of pneumonia screening CNNs trained on data from individual hospital systems and a combined dataset.
    • Internal performance, measured by area under the receiver operating characteristic curve (AUC), was consistently stronger than external performance. For instance, a model trained on Mount Sinai data achieved an AUC of 0.802 on internal testing, but performance dropped significantly to 0.717 and 0.756 when tested on NIH and IU data, respectively.
    • A jointly trained model combining NIH and Mount Sinai data achieved the highest internal AUC of 0.931, outperforming models trained on either dataset alone. That performance nonetheless dropped significantly to 0.815 on external IU data.
  2. Confounding Variables:
    • The paper revealed that CNNs could detect the origin of the radiograph (hospital system) with near-perfect accuracy, using image subregions unrelated to pathology.
    • External validation performance was adversely affected by such confounding variables, resulting in miscalibrated predictions when evaluated on data from different hospital systems.
  3. Engineered Prevalence Experiment:
    • By artificially varying pneumonia prevalence between training datasets, the authors showed that internal AUC can rise at the expense of external generalizability.
    • Models trained under large prevalence imbalances relied on prevalence statistics rather than on actual pathological features, corroborating the findings on confounding effects.
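The internal-versus-external comparison described above can be sketched with a small, self-contained harness. AUC is computed here directly from its rank (Mann-Whitney) interpretation; the `model` callable and site dictionary are hypothetical stand-ins, not the paper's actual pipeline.

```python
def auc(labels, scores):
    """AUC via its rank interpretation: the probability that a randomly
    chosen positive case outscores a randomly chosen negative case
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def evaluate_across_sites(model, sites):
    """Score a trained model on each site's held-out set.
    `sites` maps a site name to (inputs, labels); `model` is any
    callable returning one probability per input."""
    return {name: auc(labels, [model(x) for x in inputs])
            for name, (inputs, labels) in sites.items()}
```

A gap between the training site's AUC and the external sites' AUCs, like the 0.931 to 0.815 drop reported above, is the signature that a model may be exploiting site-specific confounders rather than pathology alone.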

Implications and Future Directions

The implications of this research are substantial for both theoretical and practical aspects of deploying deep learning models in clinical settings. From a practical perspective, the robustness of radiological CNNs across various clinical environments is paramount. The paper underscores the necessity for thorough external validation before deploying diagnostic models in real-world settings. Theoretical implications involve understanding the generalization capabilities of CNNs and the influence of confounding variables inherent in medical imaging datasets.

Practical Applications and Recommendations

To address the identified issues, several strategies may be employed:

  • External Validation: Introducing a standardized framework for external validation across multiple institutions can help assess the true performance and generalization capability of models.
  • Confounder Mitigation: Techniques such as domain adaptation and transfer learning tailored to mitigate confounding variables should be explored in depth.
  • Higher Resolution Models: Developing CNN architectures specifically designed for high-resolution medical images, rather than downsampled versions pre-trained on general datasets like ImageNet, could enhance the reliability of pathological predictions.
  • Integration of Clinical Data: Combining clinical context and patient history with imaging data might provide a more holistic approach, akin to radiologists' practices, thereby improving diagnostic accuracy.
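One lightweight check implied by the paper's calibration finding is to compare each site's mean predicted probability against its observed disease prevalence. The helper below is an illustrative sketch (the function name and inputs are assumptions, not from the paper); a near-zero gap at the training site alongside large gaps at external sites suggests the model has keyed on site-level prevalence rather than per-image findings.

```python
from collections import defaultdict

def calibration_gap_by_site(predictions, labels, site_ids):
    """For each site, return (mean predicted probability - observed
    prevalence). Large nonzero gaps at external sites indicate
    site-specific miscalibration."""
    preds_by_site = defaultdict(list)
    labels_by_site = defaultdict(list)
    for p, y, s in zip(predictions, labels, site_ids):
        preds_by_site[s].append(p)
        labels_by_site[s].append(y)
    return {s: sum(preds_by_site[s]) / len(preds_by_site[s])
               - sum(labels_by_site[s]) / len(labels_by_site[s])
            for s in preds_by_site}
```

Run on pooled held-out data from several hospitals, this kind of audit would surface the miscalibrated external predictions the paper reports before a model reaches clinical use.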

Conclusion

The paper provides a rigorous assessment of the generalizability challenges faced by CNNs in radiological diagnostics. By highlighting the impact of confounding variables and the differential performance across internal and external datasets, the research calls for more cautious and comprehensive validation methodologies. A concerted effort towards these improvements can pave the way for more reliable and deployable diagnostic AI systems in healthcare.
