A Benchmark of Medical Out of Distribution Detection (2007.04250v2)

Published 8 Jul 2020 in cs.LG, cs.CV, and stat.ML

Abstract: Motivation: Deep learning models deployed for use on medical tasks can be equipped with Out-of-Distribution Detection (OoDD) methods in order to avoid erroneous predictions. However, it is unclear which OoDD method should be used in practice. Specific Problem: Systems trained for one particular domain of images cannot be expected to perform accurately on images of a different domain. These images should be flagged by an OoDD method prior to diagnosis. Our approach: This paper defines 3 categories of OoD examples and benchmarks popular OoDD methods in three domains of medical imaging: chest X-ray, fundus imaging, and histology slides. Results: Our experiments show that despite methods yielding good results on some categories of out-of-distribution samples, they fail to recognize images close to the training distribution. Conclusion: We find a simple binary classifier on the feature representation has the best accuracy and AUPRC on average. Users of diagnostic tools which employ these OoDD methods should still remain vigilant that images very close to the training distribution yet not in it could yield unexpected results.

Citations (54)

Summary

A Benchmark of Medical Out-of-Distribution Detection

Medical diagnostics heavily rely on machine learning models trained on specific datasets to predict medical conditions accurately. These models often fail when exposed to data outside their training distribution, necessitating the deployment of Out-of-Distribution Detection (OoDD) systems to prevent erroneous predictions, which could be fatal in medical applications. This paper establishes a benchmark to assess the efficacy of various OoDD methods within medical imaging domains, primarily focusing on chest X-rays, fundus imaging, and histological slides.

Methodology

The authors categorize OoD examples into three distinct use-cases:

  1. Unrelated Inputs: Inputs from completely different image domains that could be processed incorrectly by a model.
  2. Incorrectly Prepared Inputs: Inputs affected by acquisition errors, viewpoint changes, or preprocessing differences (simulated in the sketch after this list).
  3. Selection-Bias Induced Inputs: Inputs with variations not represented within the training data, such as unseen diseases.
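
To make the second category concrete, here is a minimal sketch of how incorrectly prepared inputs might be simulated from in-distribution images when building such a benchmark. The specific corruptions (rotation, intensity inversion, contrast reduction) are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch: simulating "incorrectly prepared" inputs by
# corrupting in-distribution images. The transforms below are
# illustrative assumptions, not the paper's exact corruption protocol.
import numpy as np
from PIL import Image

def corrupt(img: Image.Image, mode: str) -> Image.Image:
    if mode == "rotated":          # viewpoint change
        return img.rotate(90, expand=True)
    if mode == "inverted":         # wrong intensity preprocessing
        return Image.fromarray(255 - np.asarray(img))
    if mode == "low_contrast":     # acquisition error
        arr = np.asarray(img).astype(np.float32)
        return Image.fromarray((0.3 * arr + 0.7 * 128).astype(np.uint8))
    raise ValueError(f"unknown mode: {mode}")
```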

The paper evaluates several OoDD methods:

  • Data-Only Methods: K-Nearest Neighbors (KNN) applied directly to the raw images.
  • Classifier-Only Methods: Approaches that reuse a pre-trained classifier, such as probability thresholds, Score SVMs, and Binary Classifiers on feature representations (sketched below).
  • Methods with Auxiliary Models: Approaches that train an additional model, such as an autoencoder, and use its outputs (e.g., reconstruction quality) as an OoD signal.
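
The binary-classifier approach, which the paper finds strongest on average, can be sketched as a lightweight classifier trained on feature representations to separate in-distribution data from known outliers. The choice of logistic regression and the extract_features helper below are assumptions for illustration; the paper's exact classifier and feature extraction may differ.

```python
# Minimal sketch of a binary classifier on feature representations,
# assuming features come from the penultimate layer of a trained network.
# LogisticRegression is an illustrative choice, not necessarily the
# paper's exact setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ood_classifier(feats_in: np.ndarray, feats_out: np.ndarray) -> LogisticRegression:
    """feats_*: (N, D) arrays of features for in- and out-of-distribution data."""
    X = np.concatenate([feats_in, feats_out])
    y = np.concatenate([np.zeros(len(feats_in)), np.ones(len(feats_out))])
    return LogisticRegression(max_iter=1000).fit(X, y)

# At test time (extract_features is a hypothetical helper):
# ood_prob = clf.predict_proba(extract_features(images))[:, 1]  # higher = more OoD
```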

Findings

Across the three image domains, methods showed varying degrees of success. The Binary Classifier on feature representations emerged as the most consistent, surpassing more complex models in average accuracy and AUPRC. However, all methods struggled to flag unseen disease states lying close to the training distribution, so users of diagnostic tools employing these OoDD methods should remain vigilant when encountering such near-distribution inputs.
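
As a reading aid, a detector's quality can be summarized with AUPRC (area under the precision-recall curve), treating out-of-distribution as the positive class. A toy computation with scikit-learn might look like this; the labels and scores are fabricated for illustration only.

```python
# Toy AUPRC computation for an OoD detector; data is made up.
import numpy as np
from sklearn.metrics import average_precision_score

labels = np.array([0, 0, 0, 1, 1])            # 1 = out-of-distribution
scores = np.array([0.1, 0.3, 0.2, 0.8, 0.6])  # detector's OoD scores
print(average_precision_score(labels, scores))  # 1.0: both OoD samples ranked first
```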

Computational Considerations

From a computational standpoint, the paper also highlights efficiency. Classifier-only methods and data-only KNN were inexpensive in setup time and computational cost, while methods integrating auxiliary models demanded greater resources. In particular, the Mahalanobis method and its variants achieved strong accuracy but required more computational power than simpler alternatives; the core scoring idea is sketched below.
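
For reference, Mahalanobis-style scoring (after Lee et al., 2018) fits class-conditional Gaussians with a shared covariance on training features and scores a test point by its minimum Mahalanobis distance. The sketch below is a simplification that omits details such as input preprocessing and multi-layer ensembling from the original method.

```python
# Simplified Mahalanobis OoD scoring: class-conditional means, a shared
# (tied) covariance, and min-distance scoring over classes.
import numpy as np

def fit_gaussians(feats: np.ndarray, labels: np.ndarray):
    classes = np.unique(labels)
    means = {c: feats[labels == c].mean(axis=0) for c in classes}
    centered = np.concatenate([feats[labels == c] - means[c] for c in classes])
    cov = np.cov(centered, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
    return means, np.linalg.inv(cov)

def mahalanobis_score(x: np.ndarray, means: dict, precision: np.ndarray) -> float:
    # Larger minimum distance => more likely out-of-distribution.
    return min(float((x - m) @ precision @ (x - m)) for m in means.values())
```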

Implications and Future Directions

This research underscores the need for reliable OoDD systems to ensure the safety and accuracy of medical diagnostic tools. While current methods provide some level of detection, their limitations on near-distribution, unseen cases reveal gaps that need further exploration. Future work could focus on improving robustness against selection bias and unseen diseases through new methodologies or additional data sources. The paper lays groundwork for ongoing discussion on strengthening OoDD methods and adapting them to real-world medical scenarios.

Overall, the benchmark provides a comprehensive view into the current capabilities and limitations of OoDD techniques in medical imaging applications, emphasizing the need for continuous improvement and adaptation to ensure patient safety and diagnostic accuracy.
