- The paper evaluates semantic entropy (SE) against perplexity for detecting large language model (LLM) hallucinations in women's health using a clinical RCOG exam dataset.
- Key findings show that semantic entropy significantly outperformed perplexity in uncertainty discrimination, achieving an AUROC of 0.76 compared to 0.62.
- The study suggests semantic entropy is a promising tool for improving LLM safety in clinical applications, particularly in sensitive areas like women's health.
The paper "Reducing LLM Safety Risks in Women's Health using Semantic Entropy" explores the use of semantic entropy (SE) to detect hallucinations in LLMs within the context of women's health. The paper leverages a clinically validated dataset from the UK Royal College of Obstetricians and Gynaecologists (RCOG) MRCOG examinations to compare SE with perplexity in identifying uncertain responses generated by the GPT-4o model.
The authors compiled 1,824 MRCOG questions from eight distinct sources, each reviewed by certified clinical experts in obstetrics and gynaecology (O&G). After filtering for compatibility with short-answer formats and excluding questions that required image or table interpretation, the final dataset comprised 1,644 questions, divided into Part One (knowledge retrieval) and Part Two (clinical reasoning) categories.
Key findings include:
- Superior Performance of Semantic Entropy: SE significantly outperformed perplexity in uncertainty discrimination, achieving an area under the receiver operating characteristic curve (AUROC) of 0.76 (95% CI: 0.75–0.78) versus 0.62 (95% CI: 0.60–0.65) for perplexity; a sketch of this comparison appears after this list. Accuracy was consistent across metrics at approximately 50%.
- Subgroup Analysis: Comparing Part One (knowledge retrieval) with Part Two (clinical reasoning) questions, GPT-4o was more accurate on Part One, indicating better-calibrated uncertainty measures for knowledge retrieval tasks. SE showed better uncertainty calibration than perplexity across both question types.
- Response Length: Shorter responses achieved significantly higher accuracy and AUROC across all metrics, indicating both better correctness and better uncertainty discrimination. On longer responses, SE still outperformed perplexity.
- Temperature Effects: Raising the model temperature from 0.2 to 1.0 improved AUROC for all uncertainty metrics, indicating that greater response variability sharpens uncertainty discrimination.
- Clinical Expert Validation: Three O&G specialists evaluated a subset of 105 randomly selected MRCOG questions and responses. Responses that formed a single semantic cluster achieved the highest accuracy (90.48%), with accuracy declining as the cluster count increased. Semantic clustering was fully successful in only 30% of cases.
- Correctness Definition: The definition of a correct response affected the model's accuracy but did not significantly impact the uncertainty discrimination of the SE metric.
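As referenced in the first finding above, the AUROC comparison can be illustrated with a minimal, self-contained sketch. The labels, scores, and noise levels below are hypothetical placeholders, not the paper's data; they merely show how such a comparison would be computed.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical data: 1 = model answered correctly, 0 = hallucination.
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=500)
# Placeholder uncertainty scores: higher should mean "more likely wrong".
# SE is given less noise here purely to illustrate a sharper detector.
se_scores = rng.normal(loc=1.0 - correct, scale=0.6)
ppl_scores = rng.normal(loc=1.0 - correct, scale=1.2)

# Score each detector on ranking incorrect answers above correct ones.
print("SE AUROC:        ", roc_auc_score(1 - correct, se_scores))
print("Perplexity AUROC:", roc_auc_score(1 - correct, ppl_scores))
```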
The methodology involved adapting domain-specific questions from single best answer (SBA) and extended matching questions (EMQ) formats of the MRCOG exams. Prompts were designed following established best practices, and output randomness was controlled using the temperature parameter. SE was computed by generating multiple responses for each prompt, clustering these responses based on semantic similarity using bidirectional entailment, and calculating entropy based on the distribution of responses across clusters.
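A minimal sketch of this SE pipeline follows. The `entails` function is a crude stand-in for an NLI-based entailment model, and the sampled responses are hard-coded rather than drawn from an LLM; only the clustering and entropy steps reflect the procedure described above.

```python
import math

def entails(a: str, b: str) -> bool:
    """Placeholder for an NLI entailment check; a real pipeline would call a model."""
    return a.strip().lower() == b.strip().lower()  # crude stand-in

def semantically_equivalent(a: str, b: str) -> bool:
    # Bidirectional entailment: each response must entail the other.
    return entails(a, b) and entails(b, a)

def semantic_entropy(responses: list[str]) -> float:
    # Greedily cluster responses by semantic equivalence.
    clusters: list[list[str]] = []
    for r in responses:
        for cluster in clusters:
            if semantically_equivalent(r, cluster[0]):
                cluster.append(r)
                break
        else:
            clusters.append([r])
    # Entropy of the empirical distribution of responses over clusters.
    m = len(responses)
    return -sum((len(c) / m) * math.log(len(c) / m) for c in clusters)

# Example: three sampled answers, two of which agree semantically.
print(semantic_entropy(["oestrogen", "Oestrogen", "progesterone"]))
```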
The authors measured correctness by bidirectional entailment between the model’s response and the reference answer. They applied two correctness criteria: for perplexity, the response with the lowest perplexity was deemed correct if bidirectionally entailed by the reference answer; for SE, the largest semantic cluster represented the highest-confidence meaning, and correctness was determined by bidirectional entailment of the lowest perplexity response within this cluster.
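A sketch of the two correctness criteria under the same placeholder entailment check; the responses, perplexity values, and reference answer below are toy values chosen to show the two criteria disagreeing.

```python
def semantically_equivalent(a: str, b: str) -> bool:
    # Crude stand-in for bidirectional NLI entailment (see the previous sketch).
    return a.strip().lower() == b.strip().lower()

# Toy (response, perplexity) pairs and a reference answer.
responses = [("oestrogen", 4.2), ("Oestrogen", 5.1), ("progesterone", 3.8)]
reference = "oestrogen"

# Perplexity criterion: the single lowest-perplexity response must be
# bidirectionally entailed by the reference answer.
best_text, _ = min(responses, key=lambda r: r[1])
ppl_correct = semantically_equivalent(best_text, reference)

# SE criterion: find the largest semantic cluster, then judge the
# lowest-perplexity response *within* that cluster against the reference.
clusters: list[list[tuple[str, float]]] = []
for r in responses:
    for cluster in clusters:
        if semantically_equivalent(r[0], cluster[0][0]):
            cluster.append(r)
            break
    else:
        clusters.append([r])
largest = max(clusters, key=len)
best_in_cluster, _ = min(largest, key=lambda r: r[1])
se_correct = semantically_equivalent(best_in_cluster, reference)

print(ppl_correct, se_correct)  # -> False True with these toy values
```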
The discussion highlights SE's potential to mitigate model bias through continuous review of data and outputs. It also addresses the heightened risks in women's health, particularly in O&G, where misinformation can have severe consequences.
In conclusion, the paper supports the promise of SE as a tool for improving LLM safety in clinical applications, with future research aimed at developing domain-specific LLMs tailored to women's health and creating robust toolkits for auditing outputs and mitigating biases.
Definitions for the LaTeX formulas are as follows:
- M = number of responses generated for a given prompt
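- K = number of semantic clusters formed from the M responses (symbol introduced here for illustration)
- N_k = number of responses assigned to cluster k (likewise illustrative)

Under these definitions, a plausible sketch of the discrete semantic entropy estimator, assuming the standard formulation rather than the paper's exact notation, is:

$$\mathrm{SE}(x) \;\approx\; -\sum_{k=1}^{K} \frac{N_k}{M}\,\log\frac{N_k}{M}$$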