
In the Name of Fairness: Assessing the Bias in Clinical Record De-identification (2305.11348v2)

Published 18 May 2023 in cs.LG, cs.CL, cs.CR, and cs.CY

Abstract: Data sharing is crucial for open science and reproducible research, but the legal sharing of clinical data requires the removal of protected health information from electronic health records. This process, known as de-identification, is often achieved through the use of machine learning algorithms by many commercial and open-source systems. While these systems have shown compelling results on average, the variation in their performance across different demographic groups has not been thoroughly examined. In this work, we investigate the bias of de-identification systems on names in clinical notes via a large-scale empirical analysis. To achieve this, we create 16 name sets that vary along four demographic dimensions: gender, race, name popularity, and the decade of popularity. We insert these names into 100 manually curated clinical templates and evaluate the performance of nine public and private de-identification methods. Our findings reveal that there are statistically significant performance gaps along a majority of the demographic dimensions in most methods. We further illustrate that de-identification quality is affected by polysemy in names, gender context, and clinical note characteristics. To mitigate the identified gaps, we propose a simple and method-agnostic solution by fine-tuning de-identification methods with clinical context and diverse names. Overall, it is imperative to address the bias in existing methods immediately so that downstream stakeholders can build high-quality systems to serve all demographic parties fairly.
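The evaluation pipeline the abstract describes (inserting names drawn from demographically varied name sets into clinical note templates, then measuring how often each de-identification system redacts them) can be sketched in a few lines. The snippet below is a minimal illustration under stated assumptions, not the authors' code: the name sets, templates, and the regex-based `toy_deidentifier` are placeholders standing in for the paper's 16 curated name sets, 100 manually curated templates, and nine real public and private systems.

```python
import re
from collections import defaultdict

# Hypothetical name sets keyed by demographic group. The paper builds 16 such
# sets (from SSA and Census name data) varying gender, race, name popularity,
# and decade of popularity; these two small groups are illustrative only.
NAME_SETS = {
    "group_a": ["Emily Walsh", "Greg Baker"],
    "group_b": ["Lakisha Washington", "Jamal Jackson"],
}

# Stand-in clinical templates; the paper uses 100 manually curated ones.
TEMPLATES = [
    "Patient {name} was admitted with chest pain.",
    "{name} tolerated the procedure well and was discharged.",
]

def toy_deidentifier(text: str) -> str:
    """Placeholder for a real de-identification system (e.g., a commercial
    API or an open-source tagger). Naively redacts runs of capitalized
    words, purely for illustration."""
    return re.sub(r"\b(?:[A-Z][a-z]+ )+[A-Z][a-z]+\b", "[NAME]", text)

def per_group_recall() -> dict:
    """Insert each name into every template and measure the fraction of
    insertions in which the system fully redacts the name (name recall)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, names in NAME_SETS.items():
        for name in names:
            for template in TEMPLATES:
                note = template.format(name=name)
                totals[group] += 1
                if name not in toy_deidentifier(note):
                    hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}

if __name__ == "__main__":
    for group, recall in per_group_recall().items():
        print(f"{group}: recall = {recall:.2%}")
```

In the paper's setting, per-group recall gaps produced by a harness like this are then tested for statistical significance across systems and demographic dimensions.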
