Imputation Strategies Under Clinical Presence: Impact on Algorithmic Fairness (2208.06648v3)
Abstract: Machine learning risks reinforcing biases present in data, and, as we argue in this work, in what is absent from data. In healthcare, biases have marked medical history, leading to unequal care affecting marginalised groups. Patterns in missing data often reflect these group discrepancies, but the algorithmic fairness implications of group-specific missingness are not well understood. Despite its potential impact, imputation is often an overlooked preprocessing step, with attention placed on the reduction of reconstruction error and overall performance, ignoring how imputation can affect groups differently. Our work studies how imputation choices affect reconstruction errors across groups and algorithmic fairness properties of downstream predictions.
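To make "reconstruction errors across groups" concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a single numeric covariate, a binary group label, and a known missingness mask. The helper names (`mean_impute`, `groupwise_mae`) and the toy values are hypothetical, chosen so that the two groups have different typical values and the smaller group contributes fewer observations.

```python
import numpy as np

def mean_impute(x, missing, group=None):
    """Fill masked entries with the mean of the observed entries.

    With group=None the mean is population-wide; otherwise each group
    is filled with its own observed mean.
    """
    filled = x.astype(float).copy()
    if group is None:
        filled[missing] = x[~missing].mean()
        return filled
    for g in np.unique(group):
        in_g = group == g
        filled[in_g & missing] = x[in_g & ~missing].mean()
    return filled

def groupwise_mae(x_true, x_imputed, missing, group):
    """Mean absolute reconstruction error on masked entries, per group."""
    abs_err = np.abs(x_true - x_imputed)
    return {int(g): float(abs_err[missing & (group == g)].mean())
            for g in np.unique(group)}

# Toy data (assumed for illustration): group 1 is smaller and has higher values.
x_true = np.array([1.0, 1.0, 1.0, 1.0, 5.0, 5.0])
group = np.array([0, 0, 0, 0, 1, 1])
missing = np.array([False, False, False, True, False, True])

pop_err = groupwise_mae(x_true, mean_impute(x_true, missing), missing, group)
grp_err = groupwise_mae(x_true, mean_impute(x_true, missing, group), missing, group)
print("population mean imputation:", pop_err)
print("group-specific mean imputation:", grp_err)
```

On this contrived data the population mean is pulled toward the larger group, so the smaller group is reconstructed less accurately. That pattern is a property of the toy values only and should not be read as a general ranking of imputation strategies.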