Diagnosing failures of fairness transfer across distribution shift in real-world medical settings
Abstract: Diagnosing and mitigating changes in model fairness under distribution shift is an important component of the safe deployment of machine learning in healthcare settings. The success of any mitigation strategy depends strongly on the structure of the shift. Despite this, there has been little discussion of how to empirically assess the structure of a distribution shift encountered in practice. In this work, we adopt a causal framing to motivate conditional independence tests as a key tool for characterizing distribution shifts. Using our approach in two medical applications, we show that this knowledge can help diagnose failures of fairness transfer, including cases where real-world shifts are more complex than is often assumed in the literature. Based on these results, we discuss potential remedies at each step of the machine learning pipeline.
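To make the idea of a conditional independence test concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes a binary environment indicator `env` (for example, source versus target site), a label `y`, and a discrete conditioning variable `z` (for example, a demographic group); the function name and the choice of test statistic are hypothetical. It checks whether `env` is independent of `y` given `z` by permuting `env` within each stratum of `z`.

```python
import numpy as np


def conditional_permutation_test(env, y, z, n_permutations=1000, seed=0):
    """Permutation test of env independent of y given z.

    Assumes env is coded 0/1 and z is discrete. Illustrative sketch only.
    Returns the observed statistic and a permutation p-value.
    """
    env = np.asarray(env)
    y = np.asarray(y, dtype=float)
    z = np.asarray(z)
    rng = np.random.default_rng(seed)
    # Row indices belonging to each stratum of the conditioning variable.
    strata = [np.flatnonzero(z == value) for value in np.unique(z)]

    def statistic(e):
        # Size-weighted average, over strata, of the absolute gap in the
        # mean of y between the two environments.
        total = 0.0
        for idx in strata:
            e_s, y_s = e[idx], y[idx]
            if e_s.min() == e_s.max():
                continue  # stratum observed in one environment only
            gap = abs(y_s[e_s == 1].mean() - y_s[e_s == 0].mean())
            total += len(idx) * gap
        return total / len(e)

    observed = statistic(env)
    exceed = 0
    for _ in range(n_permutations):
        permuted = env.copy()
        # Shuffling env inside each stratum preserves its relation to z
        # while breaking any remaining relation to y.
        for idx in strata:
            permuted[idx] = rng.permutation(env[idx])
        exceed += statistic(permuted) >= observed
    return observed, (1 + exceed) / (1 + n_permutations)
```

A small p-value suggests that the distribution of `y` within strata of `z` differs between environments. A large p-value is only a failure to detect such a difference with this statistic, not evidence that the two environments match.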