Evaluating Biases in Context-Dependent Health Questions (2403.04858v1)
Abstract: Chat-based LLMs have the opportunity to empower individuals who lack access to high-quality healthcare with personalized information across a variety of topics. However, users may ask underspecified questions that require additional context for a model to answer correctly. We study how LLM biases surface through these contextual questions in the healthcare domain. To do so, we curate a dataset of sexual and reproductive healthcare questions whose answers depend on age, sex, and location attributes. We compare models' outputs with and without demographic context to determine group alignment among our contextual questions. Our experiments reveal biases in each of these attributes, with young adult female users favored.
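The abstract's core measurement is group alignment: which demographic group's context-conditioned answer the model's context-free answer most resembles. Below is a minimal sketch of one way such a comparison could be implemented, assuming the answers have already been collected; the Sentence-BERT model name, cosine-similarity measure, and the `closest_group` helper with its example answers are illustrative assumptions, not the paper's exact pipeline.

```python
# Illustrative sketch: find which demographic group's contextual answer is
# most similar to the answer the model gives without demographic context.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def closest_group(context_free_answer: str, contextual_answers: dict[str, str]) -> str:
    """Return the demographic group whose context-conditioned answer has the
    highest cosine similarity to the context-free answer."""
    base_emb = encoder.encode(context_free_answer, convert_to_tensor=True)
    scores = {}
    for group, answer in contextual_answers.items():
        group_emb = encoder.encode(answer, convert_to_tensor=True)
        scores[group] = util.cos_sim(base_emb, group_emb).item()
    return max(scores, key=scores.get)


# Hypothetical model outputs for an age-dependent question.
answers_by_age = {
    "teen": "Talk with a parent, school nurse, or trusted adult first ...",
    "young adult": "You can visit a local clinic or pharmacy to ...",
    "older adult": "Discuss your options with your primary care physician ...",
}
print(closest_group("You can visit a local clinic or pharmacy to ...", answers_by_age))
```

Aggregating this per-question alignment over many questions and attribute values is one way to surface the kind of group-level favoritism the abstract reports.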