Benchmarking Bias Mitigation in Medical Imaging: Insights from MEDFAIR
Introduction
The increasing integration of machine learning (ML) models into medical diagnostics raises concerns about potential biases against specific patient subgroups. Such biases can undermine the fairness and ethical soundness of automated decision-making in healthcare. To critically evaluate the effectiveness of existing bias mitigation strategies in medical imaging, we introduce MEDFAIR, a comprehensive framework for benchmarking fairness across a diverse set of algorithms, datasets, and sensitive attributes in medical applications.
Fairness in Medicine
Fairness in ML applications for healthcare is an evolving field concerned with the equity of model performance across different patient subgroups. Biases in medical imaging can arise from various sources, including data imbalance, class imbalance, and label noise, leading to performance disparities between subgroups. MEDFAIR evaluates three widely recognized model selection strategies under both in-distribution and out-of-distribution settings: Overall Performance-based Selection, Minimax Pareto Selection, and Distance to Optimal (DTO)-based Selection. Furthermore, it assesses fairness through the lenses of group fairness and Max-Min fairness, providing a nuanced understanding of the trade-offs involved in optimizing for fairness metrics.
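To make the selection strategies concrete, the sketch below implements a DTO-style criterion, assuming each candidate checkpoint is summarized by its overall validation AUC (utility) and its worst-subgroup AUC (Max-Min fairness). The utopia point and Euclidean distance used here are one illustrative reading of DTO; the function name and toy numbers are invented for this example and are not MEDFAIR's API.

    import numpy as np

    def dto_select(overall_auc, worst_group_auc):
        """DTO-style model selection: a minimal, illustrative sketch.

        Each candidate checkpoint is summarized by two validation metrics:
        overall AUC (utility) and worst-subgroup AUC (Max-Min fairness).
        The 'utopia point' takes the best observed value of each metric;
        the checkpoint closest to it (Euclidean distance) is selected.
        """
        overall = np.asarray(overall_auc, dtype=float)
        worst = np.asarray(worst_group_auc, dtype=float)
        utopia = np.array([overall.max(), worst.max()])
        points = np.stack([overall, worst], axis=1)
        dto = np.linalg.norm(points - utopia, axis=1)
        return int(np.argmin(dto)), dto

    # Toy usage: three checkpoints with (overall AUC, worst-subgroup AUC).
    best_idx, distances = dto_select([0.86, 0.84, 0.85], [0.78, 0.82, 0.80])
    print(best_idx, distances)  # checkpoint 1 balances utility and fairness best

In contrast, Overall Performance-based Selection would simply pick the checkpoint with the highest overall AUC, ignoring the worst-subgroup metric entirely.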
MEDFAIR Framework
The framework encompasses eleven bias mitigation algorithms from different categories and ten medical imaging datasets spanning modalities such as X-ray, CT, and MRI, together with sensitive attributes including age, sex, race, and skin type. MEDFAIR supports the evaluation of bias mitigation strategies under both in-distribution and out-of-distribution scenarios, addressing the domain shift between the datasets models are developed on and the settings they are deployed in. Through rigorously designed experiments, including statistical analysis across metrics such as AUC, Max-Min fairness, and group fairness, MEDFAIR quantifies the extent of bias present in empirical risk minimization (ERM) models and the effectiveness of bias mitigation strategies.
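To ground the evaluation protocol, the following minimal sketch summarizes per-subgroup performance for one candidate model, assuming binary labels, predicted scores, and a single sensitive attribute per sample. The function name, attribute values, and toy numbers are illustrative only, not part of MEDFAIR's interface.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def subgroup_auc_report(y_true, y_score, group):
        """Per-subgroup AUC plus two summary numbers referenced in the text:
        the worst-subgroup AUC (Max-Min fairness) and the AUC gap between
        subgroups (a simple group-fairness measure). `group` holds the
        sensitive attribute (e.g. sex or an age bin) for each sample.
        """
        y_true, y_score, group = map(np.asarray, (y_true, y_score, group))
        aucs = {g: roc_auc_score(y_true[group == g], y_score[group == g])
                for g in np.unique(group)}
        worst = min(aucs.values())
        gap = max(aucs.values()) - worst
        return aucs, worst, gap

    # Toy usage with a binary sensitive attribute (values are made up).
    aucs, worst_auc, auc_gap = subgroup_auc_report(
        y_true=[0, 1, 1, 0, 1, 0, 1, 0],
        y_score=[0.2, 0.9, 0.7, 0.4, 0.6, 0.3, 0.55, 0.45],
        group=["F", "F", "F", "F", "M", "M", "M", "M"],
    )
    print(aucs, worst_auc, auc_gap)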
Results and Observations
Key findings from over 7,000 models trained using MEDFAIR highlight that bias pervasively affects ERM models across modalities, evidenced by gaps in predictive performance between subgroups. Surprisingly, conventional bias mitigation techniques do not significantly outperform ERM, challenging the presumption that they enhance fairness in practice. The analyses also underline the critical impact of the model selection criterion on fairness outcomes, suggesting that comparative methodology in fairness research deserves reexamination. Nonetheless, domain generalization methods, in particular SWAD, emerge as a promising direction, even though they do not outperform ERM with statistical significance.
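The significance statements above rest on paired comparisons against ERM across datasets. A minimal sketch of one such comparison, using a Wilcoxon signed-rank test on hypothetical paired worst-subgroup AUCs (the numbers below are invented for illustration and are not results from the paper):

    import numpy as np
    from scipy.stats import wilcoxon

    # Hypothetical paired results: worst-subgroup AUC for ERM and for one
    # candidate mitigation method on the same datasets.
    erm    = np.array([0.780, 0.810, 0.740, 0.800, 0.770, 0.830, 0.790])
    method = np.array([0.790, 0.805, 0.755, 0.813, 0.768, 0.841, 0.786])

    # Two-sided Wilcoxon signed-rank test on the paired differences; a large
    # p-value means the method is not statistically distinguishable from ERM.
    stat, p_value = wilcoxon(method, erm)
    print(f"statistic={stat:.2f}, p={p_value:.3f}")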
Discussion and Implications
The absence of a clear winner among bias mitigation algorithms prompts a deeper inquiry into the sources and characteristics of bias in medical imaging data. Biases appear to stem from multifaceted, and sometimes unobservable, sources, which can render narrowly targeted mitigation strategies insufficient. This calls for a broader approach to improving model robustness and fairness, potentially shifting the focus toward domain generalization techniques.
MEDFAIR establishes a comprehensive and adaptable benchmark for fairness in medical imaging that future research can build on. By exposing the limitations of current bias mitigation strategies and underscoring the intricacies of fairness in healthcare applications, it paves the way for more nuanced and effective approaches to fair machine learning in medicine.