The paper introduces Semantic Entropy Probes (SEPs), a method for uncertainty quantification in LLMs to detect hallucinations in a cost-efficient manner. Hallucinations are a significant impediment to the adoption of LLMs, and detecting them is crucial for safe deployment.
The paper addresses the computational cost associated with semantic entropy (SE), which has been shown to be effective for hallucination detection. While SE can detect hallucinations by estimating uncertainty in the space of semantic meaning over a set of model generations, it requires multiple model generations, increasing computational cost by a factor of 5 to 10. SEPs directly approximate SE from the hidden states of a single generation, greatly reducing this overhead.
The contributions of this work are:
- Proposing SEPs, linear probes trained on the hidden states of LLMs to capture semantic entropy.
- Demonstrating that semantic entropy is encoded in the hidden states of a single model generation and can be extracted using probes.
- Ablation studies to investigate SEP performance across models, tasks, layers, and token positions.
- Demonstrating that SEPs can predict hallucinations and generalize better than probes directly trained for accuracy.
Related work includes sampling-based hallucination detection methods, retrieval-based methods, and techniques for understanding hidden states. Sampling-based methods sample multiple model completions and quantify the semantic difference between them. Retrieval-based methods rely on external knowledge bases to verify the factuality of model responses. Recent work has shown that simple operations on LLM hidden states can qualitatively change model behavior.
The paper uses the semantic entropy measure proposed by Farquhar et al., which involves sampling model completions, aggregating the generations into clusters of equivalent semantic meaning, and calculating semantic entropy by aggregating uncertainties within each cluster. The probability of a semantic cluster given an input context is given by:
$$p(C \mid x) = \sum_{s \in C} p(s \mid x)$$

- $p(C \mid x)$: the probability of semantic cluster $C$ given input context $x$
- $s$: a generation
- $x$: an input context
The uncertainty associated with the distribution over semantic clusters is the semantic entropy, defined as:

$$H_{\text{SE}}(x) = -\sum_{C} p(C \mid x) \log p(C \mid x)$$

- $H_{\text{SE}}(x)$: the semantic entropy given input context $x$
- $p(C \mid x)$: the probability of semantic cluster $C$ given input context $x$
In practice, semantic entropy is estimated using Monte Carlo sampling.
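The Monte Carlo estimate above can be sketched as follows, assuming the semantic clustering of sampled generations (e.g. via bidirectional entailment) has already been done upstream; `semantic_entropy` is an illustrative helper name, not from the paper:

```python
import math
from collections import Counter

def semantic_entropy(cluster_ids):
    """Monte Carlo estimate of semantic entropy from N sampled generations.

    cluster_ids: one cluster index per sampled generation, obtained by
    grouping generations with equivalent semantic meaning.
    """
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    # p(C|x) is approximated by the fraction of samples falling in cluster C
    probs = [count / n for count in counts.values()]
    return -sum(p * math.log(p) for p in probs)

# e.g. five of six samples share one meaning, one differs -> low but nonzero SE
semantic_entropy([0, 0, 0, 0, 0, 1])
```

If all samples land in one cluster the estimate is zero, matching the intuition that a model answering consistently is semantically certain.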
SEPs are trained as linear logistic regression models on the hidden states of LLMs to predict semantic entropy. A dataset of pairs $(h_p^l(x), H_{\text{SE}}(x))$ is created, where $x$ is an input query, $h_p^l(x) \in \mathbb{R}^d$ is the model hidden state at token position $p$ and layer $l$, $d$ is the hidden state dimension, and $H_{\text{SE}}(x) \in \mathbb{R}$ is the semantic entropy. The semantic entropy scores are converted into binary labels indicating whether semantic entropy is high or low.
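A minimal sketch of such a probe with scikit-learn, using synthetic hidden states and SE scores (the random data and the median split are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64                                            # hidden state dimension
hidden_states = rng.normal(size=(500, d))         # stand-in for h_p^l(x)
se_scores = rng.gamma(2.0, 1.0, size=500)         # stand-in SE values H_SE(x)

# Binarize SE into high/low labels; a median split is assumed here for illustration.
labels = (se_scores > np.median(se_scores)).astype(int)

# The SEP itself: a linear logistic regression probe on hidden states.
probe = LogisticRegression(max_iter=1000).fit(hidden_states, labels)
p_high_se = probe.predict_proba(hidden_states)[:, 1]  # probe score per input
```

At inference time the probe needs only the hidden state of a single generation, which is what makes SEPs cheap relative to sampling-based SE.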
The probes are evaluated on four datasets: TriviaQA, SQuAD, BioASQ, and NQ Open, in both short- and long-form settings. The models used include Llama-2 7B and 70B, Mistral 7B, and Phi-3 Mini. The baselines include ground truth semantic entropy, accuracy probes supervised with model correctness labels, naive entropy, log likelihood, and the $p(\text{True})$ method. The area under the receiver operating characteristic curve (AUROC) is used to evaluate performance.
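As a sketch of the evaluation metric, AUROC compares a method's scores against binary labels; the scores and labels below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical probe scores vs. binary high-SE / hallucination labels.
labels = np.array([0, 0, 1, 1, 1, 0])
scores = np.array([0.1, 0.3, 0.8, 0.7, 0.9, 0.2])

# AUROC = probability a randomly chosen positive outscores a random negative;
# here every positive outscores every negative, so it is 1.0.
auroc = roc_auc_score(labels, scores)
```

AUROC is threshold-free, which suits comparing uncertainty scores that live on different scales (entropies, log likelihoods, probe probabilities).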
The experiments show that SEPs can capture semantic entropy across different models and tasks. In general, AUROC values increase for later layers in the model, reaching values between 0.7 and 0.95 depending on the scenario. SEPs can capture semantic entropy even before generation, with performance only slightly below the second-to-last-token experiments. In a counterfactual context addition experiment, the SEP-predicted probability of high semantic entropy is concentrated around 0.9 when no context is given. However, when context that resolves the query is provided, this predicted probability decreases, as shown by the shift in the distribution.
The paper also explores the use of SEPs to predict hallucinations, comparing them to accuracy probes and other baselines. In-distribution, accuracy probes outperform SEPs across most layers and tasks. However, when evaluating probe generalization to new tasks, SEPs consistently outperform accuracy probes. While SEPs cannot match the performance of other, costlier baselines, they represent a cost-effective approach to uncertainty quantification in LLMs.