- The paper evaluates semantic entropy (SE) against perplexity for detecting large language model (LLM) hallucinations in women's health using a clinical RCOG exam dataset.
- Key findings show that semantic entropy significantly outperformed perplexity in uncertainty discrimination, achieving an AUROC of 0.76 compared to 0.62.
- The study suggests semantic entropy is a promising tool for improving LLM safety in clinical applications, particularly in sensitive areas like women's health.
The paper "Reducing LLM Safety Risks in Women's Health using Semantic Entropy" explores the use of semantic entropy (SE) to detect hallucinations in LLMs within the context of women's health. The paper leverages a clinically validated dataset from the UK Royal College of Obstetricians and Gynaecologists (RCOG) MRCOG examinations to compare SE with perplexity in identifying uncertain responses generated by the GPT-4o model.
The authors compiled 1,824 MRCOG questions from eight distinct sources, each reviewed by certified clinical experts in obstetrics and gynaecology (O&G). After filtering for compatibility with short-answer formats and excluding questions that required image or table interpretation, the final dataset comprised 1,644 questions, divided into Part One (knowledge retrieval) and Part Two (clinical reasoning) categories.
Key findings include:
- Superior Performance of Semantic Entropy: SE significantly outperformed perplexity in uncertainty discrimination, achieving an area under the receiver operating characteristic curve (AUROC) of 0.76 (95% CI: 0.75–0.78) versus 0.62 (95% CI: 0.60–0.65) for perplexity; a sketch of this comparison appears after this list. Accuracy was consistent across metrics at approximately 50%.
- Subgroup Analysis: Comparing Part One (knowledge retrieval) with Part Two (clinical reasoning) questions, GPT-4o was more accurate on Part One, indicating better-calibrated uncertainty measures for knowledge retrieval tasks. SE showed better uncertainty calibration than perplexity across both question types.
- Response Length: Shorter responses achieved significantly higher accuracy and AUROC across all metrics, indicating both better correctness and better uncertainty discrimination. On longer responses, SE still outperformed perplexity.
- Temperature Effects: Raising the model temperature from 0.2 to 1.0 improved AUROC for all uncertainty metrics, indicating that greater response variability sharpens uncertainty discrimination.
- Clinical Expert Validation: Three O&G specialists evaluated a subset of 105 randomly selected MRCOG questions and responses. Responses that formed a single semantic cluster achieved the highest accuracy (90.48%), with accuracy declining as the cluster count increased. Semantic clustering was fully successful in only 30% of cases.
- Correctness Definition: The definition of a correct response affected the model's accuracy but did not significantly impact the uncertainty discrimination of the SE metric.
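As referenced in the first finding above, the AUROC comparison can be illustrated with a minimal, self-contained sketch. The labels, scores, and noise levels below are hypothetical placeholders, not the paper's data; they merely show how such a comparison would be computed.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical data: 1 = model answered correctly, 0 = hallucination.
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=500)
# Placeholder uncertainty scores: higher should mean "more likely wrong".
# SE is given less noise here purely to illustrate a sharper detector.
se_scores = rng.normal(loc=1.0 - correct, scale=0.6)
ppl_scores = rng.normal(loc=1.0 - correct, scale=1.2)

# Score each detector on ranking incorrect answers above correct ones.
print("SE AUROC:        ", roc_auc_score(1 - correct, se_scores))
print("Perplexity AUROC:", roc_auc_score(1 - correct, ppl_scores))
```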
The methodology involved adapting domain-specific questions from single best answer (SBA) and extended matching questions (EMQ) formats of the MRCOG exams. Prompts were designed following established best practices, and output randomness was controlled using the temperature parameter. SE was computed by generating multiple responses for each prompt, clustering these responses based on semantic similarity using bidirectional entailment, and calculating entropy based on the distribution of responses across clusters.
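A minimal sketch of this SE pipeline follows. The `entails` function is a crude stand-in for an NLI-based entailment model, and the sampled responses are hard-coded rather than drawn from an LLM; only the clustering and entropy steps reflect the procedure described above.

```python
import math

def entails(a: str, b: str) -> bool:
    """Placeholder for an NLI entailment check; a real pipeline would call a model."""
    return a.strip().lower() == b.strip().lower()  # crude stand-in

def semantically_equivalent(a: str, b: str) -> bool:
    # Bidirectional entailment: each response must entail the other.
    return entails(a, b) and entails(b, a)

def semantic_entropy(responses: list[str]) -> float:
    # Greedily cluster responses by semantic equivalence.
    clusters: list[list[str]] = []
    for r in responses:
        for cluster in clusters:
            if semantically_equivalent(r, cluster[0]):
                cluster.append(r)
                break
        else:
            clusters.append([r])
    # Entropy of the empirical distribution of responses over clusters.
    m = len(responses)
    return -sum((len(c) / m) * math.log(len(c) / m) for c in clusters)

# Example: three sampled answers, two of which agree semantically.
print(semantic_entropy(["oestrogen", "Oestrogen", "progesterone"]))
```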
The authors measured correctness by bidirectional entailment between the model’s response and the reference answer. They applied two correctness criteria: for perplexity, the response with the lowest perplexity was deemed correct if bidirectionally entailed by the reference answer; for SE, the largest semantic cluster represented the highest-confidence meaning, and correctness was determined by bidirectional entailment of the lowest perplexity response within this cluster.
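A sketch of the two correctness criteria under the same placeholder entailment check; the responses, perplexity values, and reference answer below are toy values chosen to show the two criteria disagreeing.

```python
def semantically_equivalent(a: str, b: str) -> bool:
    # Crude stand-in for bidirectional NLI entailment (see the previous sketch).
    return a.strip().lower() == b.strip().lower()

# Toy (response, perplexity) pairs and a reference answer.
responses = [("oestrogen", 4.2), ("Oestrogen", 5.1), ("progesterone", 3.8)]
reference = "oestrogen"

# Perplexity criterion: the single lowest-perplexity response must be
# bidirectionally entailed by the reference answer.
best_text, _ = min(responses, key=lambda r: r[1])
ppl_correct = semantically_equivalent(best_text, reference)

# SE criterion: find the largest semantic cluster, then judge the
# lowest-perplexity response *within* that cluster against the reference.
clusters: list[list[tuple[str, float]]] = []
for r in responses:
    for cluster in clusters:
        if semantically_equivalent(r[0], cluster[0][0]):
            cluster.append(r)
            break
    else:
        clusters.append([r])
largest = max(clusters, key=len)
best_in_cluster, _ = min(largest, key=lambda r: r[1])
se_correct = semantically_equivalent(best_in_cluster, reference)

print(ppl_correct, se_correct)  # -> False True with these toy values
```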
The discussion highlights SE's potential to mitigate model bias through continuous review of data and outputs. It also addresses the heightened risks in women's health, particularly in O&G, where misinformation can have severe consequences.
In conclusion, the paper supports the promise of SE as a tool for improving LLM safety in clinical applications, with future research aimed at developing domain-specific LLMs tailored to women's health and creating robust toolkits for auditing outputs and mitigating biases.
Definitions for the LaTeX formulas are as follows:
- M = number of responses generated for a given prompt
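- K = number of semantic clusters formed from the M responses (symbol introduced here for illustration)
- N_k = number of responses assigned to cluster k (likewise illustrative)

Under these definitions, a plausible sketch of the discrete semantic entropy estimator, assuming the standard formulation rather than the paper's exact notation, is:

$$\mathrm{SE}(x) \;\approx\; -\sum_{k=1}^{K} \frac{N_k}{M}\,\log\frac{N_k}{M}$$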