The paper introduces Semantic Entropy Probes (SEPs), a method for uncertainty quantification in LLMs to detect hallucinations in a cost-efficient manner. Hallucinations are a significant impediment to the adoption of LLMs, and detecting them is crucial for safe deployment.
The paper addresses the computational cost associated with semantic entropy (SE), which has been shown to be effective for hallucination detection. While SE can detect hallucinations by estimating uncertainty in the space of semantic meaning over a set of model generations, it requires multiple model generations, increasing computational cost by a factor of 5 to 10. SEPs directly approximate SE from the hidden states of a single generation, greatly reducing this overhead.
The contributions of this work are:
- Proposing SEPs, linear probes trained on the hidden states of LLMs to capture semantic entropy.
- Demonstrating that semantic entropy is encoded in the hidden states of a single model generation and can be extracted using probes.
- Ablation studies to investigate SEP performance across models, tasks, layers, and token positions.
- Demonstrating that SEPs can predict hallucinations and generalize better than probes directly trained for accuracy.
Related work includes sampling-based hallucination detection methods, retrieval-based methods, and techniques for understanding hidden states. Sampling-based methods sample multiple model completions and quantify the semantic difference between them. Retrieval-based methods rely on external knowledge bases to verify the factuality of model responses. Recent work has shown that simple operations on LLM hidden states can qualitatively change model behavior.
The paper uses the semantic entropy measure proposed by Farquhar et al., which involves sampling model completions, aggregating the generations into clusters of equivalent semantic meaning, and calculating semantic entropy by aggregating uncertainties within each cluster. The probability of a semantic cluster given an input context is given by:
$$p(C \mid x) = \sum_{s \in C} p(s \mid x)$$

- $p(C \mid x)$: the probability of semantic cluster $C$ given input context $x$
- $s$: a generation
- $x$: an input context
The uncertainty associated with the distribution over semantic clusters is the semantic entropy, defined as:

$$H_{\text{SE}}(x) = -\sum_{C} p(C \mid x) \log p(C \mid x)$$

- $H_{\text{SE}}(x)$: the semantic entropy given input context $x$
- $p(C \mid x)$: the probability of semantic cluster $C$ given input context $x$
In practice, semantic entropy is estimated using Monte Carlo sampling.
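The Monte Carlo estimate above can be sketched as follows, assuming the semantic clustering of sampled generations (e.g. via bidirectional entailment) has already been done upstream; `semantic_entropy` is an illustrative helper name, not from the paper:

```python
import math
from collections import Counter

def semantic_entropy(cluster_ids):
    """Monte Carlo estimate of semantic entropy from N sampled generations.

    cluster_ids: one cluster index per sampled generation, obtained by
    grouping generations with equivalent semantic meaning.
    """
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    # p(C|x) is approximated by the fraction of samples falling in cluster C
    probs = [count / n for count in counts.values()]
    return -sum(p * math.log(p) for p in probs)

# e.g. five of six samples share one meaning, one differs -> low but nonzero SE
semantic_entropy([0, 0, 0, 0, 0, 1])
```

If all samples land in one cluster the estimate is zero, matching the intuition that a model answering consistently is semantically certain.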
SEPs are trained as linear logistic regression models on the hidden states of LLMs to predict semantic entropy. A dataset of pairs $(h_p^l(x), H_{\text{SE}}(x))$ is created, where $x$ is an input query, $h_p^l(x) \in \mathbb{R}^d$ is the model hidden state at token position $p$ and layer $l$, $d$ is the hidden state dimension, and $H_{\text{SE}}(x) \in \mathbb{R}$ is the semantic entropy. The semantic entropy scores are converted into binary labels indicating whether semantic entropy is high or low.
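A minimal sketch of such a probe with scikit-learn, using synthetic hidden states and SE scores (the random data and the median split are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64                                            # hidden state dimension
hidden_states = rng.normal(size=(500, d))         # stand-in for h_p^l(x)
se_scores = rng.gamma(2.0, 1.0, size=500)         # stand-in SE values H_SE(x)

# Binarize SE into high/low labels; a median split is assumed here for illustration.
labels = (se_scores > np.median(se_scores)).astype(int)

# The SEP itself: a linear logistic regression probe on hidden states.
probe = LogisticRegression(max_iter=1000).fit(hidden_states, labels)
p_high_se = probe.predict_proba(hidden_states)[:, 1]  # probe score per input
```

At inference time the probe needs only the hidden state of a single generation, which is what makes SEPs cheap relative to sampling-based SE.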
The probes are evaluated on four datasets: TriviaQA, SQuAD, BioASQ, and NQ Open, in both short- and long-form settings. The models used include Llama-2 7B and 70B, Mistral 7B, and Phi-3 Mini. The baselines include ground truth semantic entropy, accuracy probes supervised with model correctness labels, naive entropy, log likelihood, and the $p(\text{True})$ method. The area under the receiver operating characteristic curve (AUROC) is used to evaluate performance.
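As a sketch of the evaluation metric, AUROC compares a method's scores against binary labels; the scores and labels below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical probe scores vs. binary high-SE / hallucination labels.
labels = np.array([0, 0, 1, 1, 1, 0])
scores = np.array([0.1, 0.3, 0.8, 0.7, 0.9, 0.2])

# AUROC = probability a randomly chosen positive outscores a random negative;
# here every positive outscores every negative, so it is 1.0.
auroc = roc_auc_score(labels, scores)
```

AUROC is threshold-free, which suits comparing uncertainty scores that live on different scales (entropies, log likelihoods, probe probabilities).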
The experiments show that SEPs can capture semantic entropy across different models and tasks. In general, AUROC values increase for later layers in the model, reaching values between 0.7 and 0.95 depending on the scenario. SEPs can capture semantic entropy even before generation, with performance only slightly below the second-to-last-token experiments. In a counterfactual context addition experiment, the SEP-predicted probability of high semantic entropy is concentrated around 0.9 when no context is given. However, when context that resolves the query is provided, this predicted probability decreases, as shown by the shift in the distribution.
The paper also explores the use of SEPs to predict hallucinations, comparing them to accuracy probes and other baselines. In-distribution, accuracy probes outperform SEPs across most layers and tasks. However, when evaluating probe generalization to new tasks, SEPs consistently outperform accuracy probes. While SEPs cannot match the performance of other, costlier baselines, they represent a cost-effective approach to uncertainty quantification in LLMs.