Benchmarking Bias Mitigation in Medical Imaging: Insights from MEDFAIR
Introduction
The increasing integration of machine learning (ML) models into medical diagnostics raises concerns about potential biases against specific patient subgroups. Such biases can undermine the fairness and ethical soundness of automated decision-making in healthcare. To critically evaluate the effectiveness of existing bias mitigation strategies in medical imaging, we introduce MEDFAIR, a comprehensive framework for benchmarking fairness across a diverse set of algorithms, datasets, and sensitive attributes in medical applications.
Fairness in Medicine
Fairness in ML applications for healthcare is an evolving field concerned with the equity of model performance across different patient subgroups. Biases in medical imaging can arise from various sources, including data imbalance, class imbalance, and label noise, leading to performance disparities between subgroups. MEDFAIR evaluates three widely recognized model selection strategies under both in-distribution and out-of-distribution settings: Overall Performance-based Selection, Minimax Pareto Selection, and Distance to Optimal (DTO)-based Selection. Furthermore, it assesses fairness through the lenses of group fairness and Max-Min fairness, providing a nuanced understanding of the trade-offs involved in optimizing for fairness metrics.
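To make the selection strategies concrete, the sketch below implements a DTO-style criterion, assuming each candidate checkpoint is summarized by its overall validation AUC (utility) and its worst-subgroup AUC (Max-Min fairness). The utopia point and Euclidean distance used here are one illustrative reading of DTO; the function name and toy numbers are invented for this example and are not MEDFAIR's API.

    import numpy as np

    def dto_select(overall_auc, worst_group_auc):
        """DTO-style model selection: a minimal, illustrative sketch.

        Each candidate checkpoint is summarized by two validation metrics:
        overall AUC (utility) and worst-subgroup AUC (Max-Min fairness).
        The 'utopia point' takes the best observed value of each metric;
        the checkpoint closest to it (Euclidean distance) is selected.
        """
        overall = np.asarray(overall_auc, dtype=float)
        worst = np.asarray(worst_group_auc, dtype=float)
        utopia = np.array([overall.max(), worst.max()])
        points = np.stack([overall, worst], axis=1)
        dto = np.linalg.norm(points - utopia, axis=1)
        return int(np.argmin(dto)), dto

    # Toy usage: three checkpoints with (overall AUC, worst-subgroup AUC).
    best_idx, distances = dto_select([0.86, 0.84, 0.85], [0.78, 0.82, 0.80])
    print(best_idx, distances)  # checkpoint 1 balances utility and fairness best

In contrast, Overall Performance-based Selection would simply pick the checkpoint with the highest overall AUC, ignoring the worst-subgroup metric entirely.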
MEDFAIR Framework
The framework encompasses eleven bias mitigation algorithms from different categories and ten medical imaging datasets spanning modalities such as X-ray, CT, and MRI, together with sensitive attributes including age, sex, race, and skin type. MEDFAIR supports the evaluation of bias mitigation strategies under both in-distribution and out-of-distribution scenarios, addressing the domain shift between the datasets models are developed on and the settings they are deployed in. Through rigorously designed experiments, including statistical analysis across metrics such as AUC, Max-Min fairness, and group fairness, MEDFAIR quantifies the extent of bias present in empirical risk minimization (ERM) models and the effectiveness of bias mitigation strategies.
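To ground the evaluation protocol, the following minimal sketch summarizes per-subgroup performance for one candidate model, assuming binary labels, predicted scores, and a single sensitive attribute per sample. The function name, attribute values, and toy numbers are illustrative only, not part of MEDFAIR's interface.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def subgroup_auc_report(y_true, y_score, group):
        """Per-subgroup AUC plus two summary numbers referenced in the text:
        the worst-subgroup AUC (Max-Min fairness) and the AUC gap between
        subgroups (a simple group-fairness measure). `group` holds the
        sensitive attribute (e.g. sex or an age bin) for each sample.
        """
        y_true, y_score, group = map(np.asarray, (y_true, y_score, group))
        aucs = {g: roc_auc_score(y_true[group == g], y_score[group == g])
                for g in np.unique(group)}
        worst = min(aucs.values())
        gap = max(aucs.values()) - worst
        return aucs, worst, gap

    # Toy usage with a binary sensitive attribute (values are made up).
    aucs, worst_auc, auc_gap = subgroup_auc_report(
        y_true=[0, 1, 1, 0, 1, 0, 1, 0],
        y_score=[0.2, 0.9, 0.7, 0.4, 0.6, 0.3, 0.55, 0.45],
        group=["F", "F", "F", "F", "M", "M", "M", "M"],
    )
    print(aucs, worst_auc, auc_gap)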
Results and Observations
Key findings from over 7,000 models trained using MEDFAIR highlight that bias pervasively affects ERM models across modalities, evidenced by gaps in predictive performance between subgroups. Surprisingly, conventional bias mitigation techniques do not significantly outperform ERM, challenging the presumption that they enhance fairness in practice. The analyses also underline the critical impact of the model selection criterion on fairness outcomes, suggesting that comparative methodology in fairness research deserves reexamination. Nonetheless, domain generalization methods, in particular SWAD, emerge as a promising direction, even though they do not outperform ERM with statistical significance.
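The significance statements above rest on paired comparisons against ERM across datasets. A minimal sketch of one such comparison, using a Wilcoxon signed-rank test on hypothetical paired worst-subgroup AUCs (the numbers below are invented for illustration and are not results from the paper):

    import numpy as np
    from scipy.stats import wilcoxon

    # Hypothetical paired results: worst-subgroup AUC for ERM and for one
    # candidate mitigation method on the same datasets.
    erm    = np.array([0.780, 0.810, 0.740, 0.800, 0.770, 0.830, 0.790])
    method = np.array([0.790, 0.805, 0.755, 0.813, 0.768, 0.841, 0.786])

    # Two-sided Wilcoxon signed-rank test on the paired differences; a large
    # p-value means the method is not statistically distinguishable from ERM.
    stat, p_value = wilcoxon(method, erm)
    print(f"statistic={stat:.2f}, p={p_value:.3f}")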
Discussion and Implications
The absence of a clear winner among bias mitigation algorithms prompts a deeper inquiry into the sources and characteristics of bias in medical imaging data. Biases appear to stem from multifaceted, and sometimes unobservable, sources, which can render narrowly targeted mitigation strategies insufficient. This calls for a broader approach to improving model robustness and fairness, potentially shifting the focus toward domain generalization techniques.
MEDFAIR establishes a comprehensive and adaptable benchmark for fairness in medical imaging that future research can build on. By exposing the limitations of current bias mitigation strategies and underscoring the intricacies of fairness in healthcare applications, it paves the way for more nuanced and effective approaches to fair machine learning in medicine.