Evaluating Biases in Context-Dependent Health Questions (2403.04858v1)

Published 7 Mar 2024 in cs.CL

Abstract: Chat-based LLMs have the opportunity to empower individuals lacking high-quality healthcare access to receive personalized information across a variety of topics. However, users may ask underspecified questions that require additional context for a model to correctly answer. We study how LLM biases are exhibited through these contextual questions in the healthcare domain. To accomplish this, we curate a dataset of sexual and reproductive healthcare questions that are dependent on age, sex, and location attributes. We compare models' outputs with and without demographic context to determine group alignment among our contextual questions. Our experiments reveal biases in each of these attributes, where young adult female users are favored.


Summary

  • The paper finds that chat-based LLMs exhibit demographic biases, with responses most aligned to young adult and female users.
  • The paper employs a curated dataset from Planned Parenthood and Go Ask Alice, analyzing responses using cosine similarity and % Win metrics across two LLMs.
  • The paper emphasizes the need for refining LLM training to ensure unbiased and equitable health information delivery across diverse demographics.

Unveiling Bias in Chat-based LLMs' Responses to Contextual Health Questions

Introduction to the Study

Recent advances have positioned chat-based LLMs at the forefront of accessible, personalized information delivery, with significant implications for resource-constrained sectors, notably healthcare. Given this growing reliance on LLMs for health-related inquiries, it is paramount to understand the biases in their responses, particularly when they answer contextual health questions that omit demographic information. In this paper, researchers from Johns Hopkins University and Northeastern Illinois University examine the issue by evaluating biases in chat-based LLMs, focusing on sexual and reproductive healthcare questions that require additional context related to age, sex, and location.

Methodology and Data

The research began with the curation of a dataset of contextual questions about sexual and reproductive health whose correct answers depend on personal attributes. Sourced from prominent health advisory platforms, Planned Parenthood and Go Ask Alice, the dataset was carefully filtered to retain questions contingent on age, location, or sex. The resulting questions reflect the diverse inquiries individuals might have and served as the basis for probing two leading LLMs, gpt-3.5-turbo and llama-2-70b-chat, for biases in their response patterns; a sketch of how such prompt pairs might be constructed follows.
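
The core probing setup pairs each question with a context-free prompt and with prompts that add one demographic attribute at a time. The snippet below is a minimal sketch of that pairing; the attribute values and prompt wording are illustrative assumptions, not the authors' exact templates.

```python
# Illustrative sketch (not the authors' exact prompts): pair each health
# question with a context-free prompt and with context-conditioned prompts,
# one per demographic attribute value. Values and wording are assumptions.
question = "Can I get the HPV vaccine?"

attributes = {
    "age": ["18-30", "31-45", "46-60"],
    "sex": ["female", "male"],
    "location": ["Massachusetts", "Texas"],
}

contextless_prompt = question
contextual_prompts = {
    (attribute, value): f"My {attribute} is {value}. {question}"
    for attribute, values in attributes.items()
    for value in values
}

# Each prompt would then be sent to a chat model (e.g. gpt-3.5-turbo or
# llama-2-70b-chat), and the answers compared as described below.
```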

For a comparative analysis, each model was queried both with questions augmented with demographic context and with the standalone inquiries. The analysis then measured how closely the context-free responses aligned with the context-conditioned responses for different demographic groups, using metrics such as average cosine similarity and the percentage of questions for which a group's answer was the closest match ('% Win'), as sketched below.
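
A minimal sketch of how these alignment metrics might be computed is given below, assuming Sentence-BERT embeddings via the sentence-transformers library. The data layout, encoder choice, and aggregation details are illustrative assumptions rather than the authors' released code.

```python
# Illustrative sketch (not the authors' code): measure how closely a model's
# context-free answer aligns with its context-conditioned answers, then
# aggregate average cosine similarity and "% Win" (how often a group's
# answer is the closest match) per demographic group.
from collections import Counter

from sentence_transformers import SentenceTransformer, util

# Hypothetical data layout: per question, the answer without added context
# and the answers produced when the prompt names each demographic group.
questions = [
    {
        "contextless_answer": "Emergency contraception is available at most pharmacies...",
        "group_answers": {
            "18-30": "At your age, emergency contraception is available over the counter...",
            "46-60": "In your age range, it is best to review options with your clinician...",
        },
    },
    # ... more questions
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

similarity_totals = Counter()  # running sum of cosine similarity per group
win_counts = Counter()         # how often each group's answer is the closest

for q in questions:
    base = encoder.encode(q["contextless_answer"], convert_to_tensor=True)
    scores = {}
    for group, answer in q["group_answers"].items():
        emb = encoder.encode(answer, convert_to_tensor=True)
        scores[group] = util.cos_sim(base, emb).item()
        similarity_totals[group] += scores[group]
    win_counts[max(scores, key=scores.get)] += 1  # this group "wins" the question

num_questions = len(questions)
for group in similarity_totals:
    avg_sim = similarity_totals[group] / num_questions
    pct_win = 100 * win_counts[group] / num_questions
    print(f"{group}: avg cosine similarity = {avg_sim:.3f}, % Win = {pct_win:.1f}%")
```

Under this reading, the group with consistently higher average similarity or % Win is the one the context-free model output is implicitly tailored to.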

Key Findings and Observations

The results were revealing across several dimensions:

  • Age Bias: Both LLMs exhibited a discernible alignment towards the 18-30 age demographic, suggesting that the health inquiries of older individuals may be overlooked.
  • Sex Bias: Responses skewed towards female demographics, indicating an inherent bias in addressing sexual and reproductive health questions, which may marginalize male-related health queries.
  • Location-Based Bias: Minor fluctuations were noted in responses related to users' locations, with a slight tendency towards assuming the user resides in Massachusetts. This might reflect broader societal biases or model training data biases.

A human evaluation corroborated these findings, showing substantial agreement between the quantitative bias measures and human judgments, especially for the age and sex attributes.

Implications and Future Directions

The research contributes to the discourse on AI ethics, emphasizing the urgency of mitigating biases in health information dissemination. The implications span practical access to healthcare information and broader questions about privacy, since users outside the favored groups may need to disclose sensitive demographic details to obtain accurate advice. For future AI research and LLM development, building models that offer comprehensive, unbiased responses across all demographics could help protect privacy and improve the quality and reliability of AI-driven health advisories.

Limitations and Ethical Considerations

The paper meticulously outlines its limitations, including its focus on an American-centric dataset, reliance on binary sex categories, and the dynamic nature of healthcare laws affecting location-dependent questions. Additionally, the ethical considerations surrounding the collection and use of healthcare-related inquiries underscore the sensitivity and responsibility demanded in conducting such research.

Concluding Remarks

The paper offers a rigorous examination of biases inherent in LLMs when tasked with health-related contextual inquiries. By shedding light on the predispositions favoring certain demographics over others, the research invites a critical reassessment of how LLMs are trained, evaluated, and deployed, advocating for a future where AI can serve diverse global populations equitably and sensitively in health matters and beyond.
