Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs (2406.15927v1)

Published 22 Jun 2024 in cs.CL, cs.AI, and cs.LG

Abstract: We propose semantic entropy probes (SEPs), a cheap and reliable method for uncertainty quantification in LLMs. Hallucinations, which are plausible-sounding but factually incorrect and arbitrary model generations, present a major challenge to the practical adoption of LLMs. Recent work by Farquhar et al. (2024) proposes semantic entropy (SE), which can detect hallucinations by estimating uncertainty in the space of semantic meaning for a set of model generations. However, the 5-to-10-fold increase in computation cost associated with SE computation hinders practical adoption. To address this, we propose SEPs, which directly approximate SE from the hidden states of a single generation. SEPs are simple to train and do not require sampling multiple model generations at test time, reducing the overhead of semantic uncertainty quantification to almost zero. We show that SEPs retain high performance for hallucination detection and generalize better to out-of-distribution data than previous probing methods that directly predict model accuracy. Our results across models and tasks suggest that model hidden states capture SE, and our ablation studies give further insights into the token positions and model layers for which this is the case.

The paper introduces Semantic Entropy Probes (SEPs), a method for uncertainty quantification in LLMs to detect hallucinations in a cost-efficient manner. Hallucinations are a significant impediment to the adoption of LLMs, and detecting them is crucial for safe deployment.

The paper addresses the computational cost associated with semantic entropy (SE), which has been shown to be effective for hallucination detection. While SE can detect hallucinations by estimating uncertainty in the space of semantic meaning for a set of model generations, it requires multiple model generations, increasing the computation cost by 5 to 10 times. SEPs directly approximate SE from the hidden states of a single generation, reducing the computational overhead.

The contributions of this work are:

  • Proposing SEPs, linear probes trained on the hidden states of LLMs to capture semantic entropy.
  • Demonstrating that semantic entropy is encoded in the hidden states of a single model generation and can be extracted using probes.
  • Ablation studies examining SEP performance across models, tasks, layers, and token positions.
  • Demonstrating that SEPs can predict hallucinations and generalize better than probes directly trained for accuracy.

Related work includes sampling-based hallucination detection methods, retrieval-based methods, and techniques for understanding hidden states. Sampling-based methods sample multiple model completions and quantify the semantic difference between them. Retrieval-based methods rely on external knowledge bases to verify the factuality of model responses. Recent work has shown that simple operations on LLM hidden states can qualitatively change model behavior.

The paper uses the semantic entropy measure proposed by Farquhar et al., which involves sampling model completions, aggregating the generations into clusters of equivalent semantic meaning, and calculating semantic entropy by aggregating uncertainties within each cluster. The probability of a semantic cluster C given an input context x is given by:

p(C \mid x) = \sum\nolimits_{s \in C} p(s \mid x)

  • p(C \mid x): the probability of semantic cluster C given input context x
  • s: a generation
  • x: an input context
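The cluster probability above is simply a sum of sequence probabilities over the generations assigned to that cluster. A minimal sketch, with made-up sequence log-probabilities and cluster assignments (all values hypothetical; clustering itself, e.g. via bidirectional entailment, is assumed to have already happened):

```python
import math

# Hypothetical per-generation sequence log-probabilities log p(s | x)
# and the semantic cluster each sampled generation was assigned to.
log_probs = [-1.2, -0.9, -2.5, -1.1]
clusters = [0, 0, 1, 0]

# p(C | x) = sum over generations s in C of p(s | x)
cluster_prob = {}
for lp, c in zip(log_probs, clusters):
    cluster_prob[c] = cluster_prob.get(c, 0.0) + math.exp(lp)

# Normalize so the cluster distribution sums to one (the sampled
# generations cover only a subset of all possible outputs).
total = sum(cluster_prob.values())
p_C = {c: p / total for c, p in cluster_prob.items()}
```

Here cluster 0 absorbs most of the probability mass, since three of the four sampled generations share its meaning.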

The uncertainty associated with the distribution over semantic clusters is the semantic entropy, defined as:

H[C \mid x] = \mathbb{E}_{p(C \mid x)}[-\log p(C \mid x)]

  • H[C \mid x]: the semantic entropy given input context x
  • p(C \mid x): the probability of semantic cluster C given input context x

In practice, semantic entropy is estimated using Monte Carlo sampling.
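As an illustrative sketch of such a Monte Carlo estimate, one can compute the entropy of the empirical cluster distribution over sampled generations. This is a simplified discrete estimator, not necessarily the paper's exact implementation (which may weight clusters by sequence likelihood):

```python
import math
from collections import Counter

def semantic_entropy(cluster_ids):
    """Monte Carlo estimate of semantic entropy from sampled generations.

    cluster_ids: the semantic cluster assigned to each of the N sampled
    generations. Empirical cluster frequencies stand in for p(C | x), and
    the entropy is taken over that discrete distribution.
    """
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# All generations semantically agree -> zero entropy;
# an even split across two meanings -> entropy log 2.
low = semantic_entropy([0, 0, 0, 0])
high = semantic_entropy([0, 1, 0, 1])
```

Intuitively, many sampled answers with the same meaning (even if worded differently) signal confidence, while answers scattered across clusters signal likely hallucination.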

SEPs are trained as linear logistic regression models on the hidden states of LLMs to predict semantic entropy. A dataset of (h_p^l(x), H_{\text{SE}}(x)) pairs is created, where x is an input query, h_p^l(x) \in \mathbb{R}^d is the model hidden state at token position p and layer l, d is the hidden state dimension, and H_{\text{SE}}(x) \in \mathbb{R} is the semantic entropy. The semantic entropy scores are converted into binary labels indicating whether semantic entropy is high or low.
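Such a linear probe is just a logistic regression over hidden-state vectors. A minimal sketch with synthetic hidden states and labels standing in for real (h_p^l(x), binarized SE) pairs, and hand-rolled gradient descent in place of a library solver (all data and hyperparameters here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: n hidden states of dimension d, with binary labels
# (1 = semantic entropy above a chosen threshold, 0 = below). Real labels
# would come from Monte Carlo SE estimates on the training queries.
d, n = 16, 500
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)  # synthetic "high-SE" labels

# Logistic-regression probe trained with plain gradient descent on the
# cross-entropy loss.
w, b = np.zeros(d), 0.0
lr = 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted p(high SE | h)
    w -= lr * (X.T @ (p - y) / n)
    b -= lr * float(np.mean(p - y))

train_acc = float(np.mean((p > 0.5) == (y == 1)))
```

At test time the probe needs only the single hidden state of one generation, which is why the overhead over ordinary decoding is near zero.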

The probes are evaluated on four datasets: TriviaQA, SQuAD, BioASQ, and NQ Open, in both short- and long-form settings. The models used include Llama-2 7B and 70B, Mistral 7B, and Phi-3 Mini. The baselines include ground truth semantic entropy, accuracy probes supervised with model correctness labels, naive entropy, log likelihood, and the p(\text{True}) method. The area under the receiver operating characteristic curve (AUROC) is used to evaluate the performance.
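AUROC is normally computed with a standard library call; for illustration, here is a self-contained version via the pairwise (Mann-Whitney U) formulation, applied to toy scores and labels rather than the paper's data:

```python
import numpy as np

def auroc(scores, labels):
    """AUROC as the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one (ties count 1/2).

    scores: uncertainty scores (e.g. probe outputs); labels: 1/0 targets.
    The O(n^2) pairwise comparison is fine for illustration.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Scores that perfectly separate the classes give AUROC 1.0;
# uninformative scores give 0.5.
perfect = auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

An AUROC of 0.5 corresponds to chance, which is why the paper's reported range of roughly 0.7 to 0.95 indicates a usable hallucination signal.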

The experiments show that SEPs can capture semantic entropy across different models and tasks. In general, AUROC values increase for later layers in the model, reaching values between 0.7 and 0.95 depending on the scenario. SEPs can capture semantic entropy even before generation, with performance slightly below the second-last-token experiments. In a counterfactual context addition experiment, the distribution of p(\text{high SE}) from the SEP is concentrated around 0.9 without context. However, when context is provided, p(\text{high SE}) decreases, as shown by the shift in distribution.

The paper also explores the use of SEPs to predict hallucinations, comparing them to accuracy probes and other baselines. In-distribution, accuracy probes outperform SEPs across most layers and tasks. However, when evaluating probe generalization to new tasks, SEPs consistently outperform accuracy probes. While SEPs cannot match the performance of other, costlier baselines, they represent a cost-effective approach to uncertainty quantification in LLMs.

Authors (6)
  1. Jannik Kossen
  2. Jiatong Han
  3. Muhammed Razzak
  4. Lisa Schut
  5. Shreshth Malik
  6. Yarin Gal