- The paper finds that while tasks like Cardiomegaly, Edema, and Effusion generalize well, Pneumonia and Infiltration do not.
- The study employs DenseNet architectures and ensemble methods across multiple datasets, revealing significant inter-dataset disagreement via Cohen's Kappa.
- It identifies concept shift arising from subjective labeling as a major hurdle and advocates a task-specific approach to deploying AI models in clinical settings.
On the Limits of Cross-Domain Generalization in Automated X-ray Prediction
The paper "On the Limits of Cross-Domain Generalization in Automated X-ray Prediction" addresses the challenges of generalizing deep learning models trained on chest X-ray datasets across various domains. The research scrutinizes the extent to which prediction tasks maintain reliability and consistency when applied to different datasets labeled by distinct institutions.
Core Investigation
The authors conduct a thorough analysis of the generalization performance of X-ray prediction models, focusing on discrepancies arising not from shifts in image data but from shifts in labeling conventions. The primary aim is to elucidate which diagnostic tasks are consistent across multiple datasets and which are not.
Key Findings
- Generalization Challenges: The paper finds that tasks such as Cardiomegaly, Edema, and Effusion generalize well across datasets, whereas Pneumonia and Infiltration do not. This indicates that while some medical conditions can be detected reliably by AI models across different datasets, others are impacted significantly by differences in labeling approaches.
- Inter-Dataset Agreement: Using Cohen's Kappa, the research uncovers substantial disagreement among the predictions of models trained on different datasets, even when each model achieves a high AUC. In other words, models can rank cases well individually yet disagree on which specific images are positive, raising questions about their interpretability and reliability.
- Concept Shift vs. Covariate Shift: The paper emphasizes the importance of concept shift—disparities in what is considered the "ground truth" due to subjective labeling—over covariate shift as a major hurdle. This implies that tasks defined by datasets might represent different underlying concepts, challenging the generalization capabilities of deep learning models.
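The gap between per-model AUC and inter-model agreement can be made concrete with a small sketch. The numbers below are synthetic (not from the paper): two hypothetical models each reach 75% accuracy against the labels, yet their chance-corrected agreement (Cohen's Kappa) is near zero because they err on different cases.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Synthetic binarized predictions from two models on the same 12 X-rays.
labels  = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
model_a = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1])  # 9/12 correct
model_b = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0])  # 9/12 correct

# Both models look equally good against the labels...
print("accuracy a:", (model_a == labels).mean())   # 0.75
print("accuracy b:", (model_b == labels).mean())   # 0.75

# ...but they rarely agree on the same individual cases.
print("kappa(a, b):", round(cohen_kappa_score(model_a, model_b), 3))
```

This is the phenomenon the paper quantifies at scale: aggregate discrimination metrics can mask case-level disagreement between models trained under different labeling conventions.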
Experimental Framework
The research utilizes DenseNet architectures and ensemble methods to train models across several prominent datasets, including NIH, PadChest, CheXpert, and MIMIC-CXR, encompassing over 200,000 unique chest X-rays. Models are evaluated based on their AUC performance and inter-rater agreement when tested on datasets other than their training set.
Implications
The paper argues that standardizing automatic labelers is unlikely to resolve these generalization issues, given the inherent subjectivity and variability of medical interpretation. It recommends treating models as task-specific, with predictions contextualized by the origin of their training data; this would help prevent the misapplication of AI tools across different clinical settings.
Future Directions
Future work could explore strategies to reduce interobserver variability, such as advanced consensus-building methods among human annotators, and investigate methods to train models that can adaptively adjust to diverse clinical understandings of medical conditions. Continued refinement of AI model training practices and improved dataset annotation protocols are essential to facilitate better generalization and deployment in varied healthcare environments.
In conclusion, this paper provides a comprehensive assessment of the limitations of cross-domain generalization in automated X-ray prediction, highlighting both the technical and conceptual challenges that remain. It calls for a deeper understanding of the task-specific nature of these models and advocates for a nuanced deployment strategy in clinical practice.