Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation (2302.09664v3)

Published 19 Feb 2023 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce a method to measure uncertainty in LLMs. For tasks like question answering, it is essential to know when we can trust the natural language outputs of foundation models. We show that measuring uncertainty in natural language is challenging because of "semantic equivalence" -- different sentences can mean the same thing. To overcome these challenges we introduce semantic entropy -- an entropy which incorporates linguistic invariances created by shared meanings. Our method is unsupervised, uses only a single model, and requires no modifications to off-the-shelf LLMs. In comprehensive ablation studies we show that the semantic entropy is more predictive of model accuracy on question answering data sets than comparable baselines.

Citations (185)

Summary

  • The paper introduces Semantic Entropy to quantify uncertainty by clustering semantically equivalent outputs in language generation.
  • It details a three-step process involving sampling, bidirectional NLI-based semantic clustering, and Shannon entropy computation.
  • Experimental results on CoQA and TriviaQA with OPT models show Semantic Entropy outperforms traditional predictive entropy methods.

The paper "Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation" (Kuhn et al., 2023 ) introduces Semantic Entropy (SE) as a novel measure for quantifying uncertainty in the outputs of LLMs engaged in natural language generation (NLG) tasks, particularly free-form question answering (QA). The core challenge addressed is that standard uncertainty metrics, often adapted from classification tasks, struggle with the phenomenon of semantic equivalence in language, where multiple distinct surface forms (sequences of tokens) can convey identical meanings. Consequently, a model might exhibit high uncertainty at the sequence level (e.g., distributing probability mass across "Paris is the capital of France" and "France's capital city is Paris") even when it is highly certain about the underlying semantic content. SE aims to provide a more meaningful uncertainty estimate by operating at the level of semantic meanings rather than lexical sequences.

Problem Formulation and Limitations of Existing Methods

Traditional methods for uncertainty estimation in generative models often rely on metrics like Predictive Entropy (PE), calculated directly over the model's output distribution $p(s|x)$ for sequences $s$ given an input $x$. PE is defined as $H[p(s|x)] = - \sum_s p(s|x) \log p(s|x)$, typically approximated using Monte Carlo sampling: $\hat{H} \approx - \frac{1}{N} \sum_{i=1}^N \log p(s_i|x)$, where $s_i \sim p(s|x)$.
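
A minimal sketch of this Monte Carlo estimator (assuming the per-sequence log-probabilities $\log p(s_i|x)$ have already been obtained from the model; the helper name is illustrative):

import numpy as np

def predictive_entropy(seq_log_probs):
  """Monte Carlo estimate of predictive entropy.

  seq_log_probs: log p(s_i | x) for each of N sampled sequences s_i.
  Returns the estimate -1/N * sum_i log p(s_i | x).
  """
  return -np.mean(np.asarray(seq_log_probs))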

However, PE suffers from several drawbacks in the context of NLG:

  1. Semantic Equivalence: As highlighted, PE conflates lexical variation with genuine semantic uncertainty. High PE can arise simply because the model considers multiple ways to phrase the same answer, not necessarily because it is unsure of the correct answer itself.
  2. High-Dimensional Output Space: The space of possible sequences is vast, making accurate estimation of PE via sampling difficult.
  3. Variable Sequence Length: Autoregressive models naturally assign lower probabilities to longer sequences. PE, calculated using sequence log-probabilities, can be biased towards shorter sequences, potentially misrepresenting uncertainty, especially if correct answers tend to be longer or shorter. Length-Normalized PE (LN-PE) attempts to mitigate this but doesn't address semantic equivalence.
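
A similarly minimal sketch of the length-normalized variant (assuming per-sequence log-probabilities and generated-token counts are available; names are illustrative):

import numpy as np

def length_normalized_predictive_entropy(seq_log_probs, seq_lengths):
  """LN-PE: predictive entropy over per-token (length-normalized) log-probabilities.

  seq_log_probs: log p(s_i | x) for each sampled sequence.
  seq_lengths: number of generated tokens in each sequence.
  """
  normalized = np.asarray(seq_log_probs) / np.asarray(seq_lengths)
  return -np.mean(normalized)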

Other approaches, like model ensembles or Monte Carlo dropout, require multiple model forward passes or specific architectures, whereas SE works with a single, off-the-shelf LLM. Self-evaluation methods (e.g., p(True)) require specific prompting strategies and may not capture inherent distributional uncertainty.

The Semantic Entropy Method

Semantic Entropy attempts to overcome these limitations by explicitly incorporating linguistic invariances. It estimates entropy over the distribution of meanings rather than sequences. The procedure involves three key steps:

  1. Sampling: Generate a set of $N$ candidate output sequences $\{s_1, ..., s_N\}$ from the LLM's predictive distribution $p(s|x)$ given the input context $x$. This is typically done using multinomial sampling with a specific temperature $T$. The probability of each sampled sequence, $p(s_i|x)$, is calculated as the product of the conditional probabilities of its constituent tokens: $p(s_i|x) = \prod_{j=1}^{|s_i|} p(t_{i,j} \mid t_{i,1}, ..., t_{i,j-1}, x)$.
  2. Semantic Clustering: Group the sampled sequences $\{s_i\}$ into clusters $\{c_k\}$ such that all sequences within a cluster share the same meaning relative to the input context $x$. The paper proposes a bidirectional entailment algorithm for this. Two sequences $s_i$ and $s_j$ are deemed semantically equivalent if and only if the combined input-output pair $(x, s_i)$ entails $(x, s_j)$, AND $(x, s_j)$ entails $(x, s_i)$. This check is performed using a pre-trained Natural Language Inference (NLI) model. Sequences satisfying this mutual entailment condition are assigned to the same semantic cluster.
  3. Entropy Calculation: First, estimate the probability mass associated with each semantic cluster $c_k$. This is done by summing the original sequence probabilities (or using counts from the samples) for all sequences belonging to that cluster:

    $p(c_k|x) = \sum_{s_i \in c_k} p(s_i|x)$

    The probability distribution over clusters is then normalized: $\hat{p}(c_k|x) = \frac{p(c_k|x)}{\sum_{j} p(c_j|x)}$. Finally, Semantic Entropy is computed as the Shannon entropy over this distribution of semantic clusters:

    $SE(x) = -\sum_k \hat{p}(c_k|x) \log \hat{p}(c_k|x)$

    In practice, using Monte Carlo samples, the cluster probability can be approximated by the proportion of samples falling into that cluster: $\hat{p}(c_k|x) \approx \frac{|c_k|}{N}$, where $|c_k|$ is the number of samples in cluster $k$. The paper uses the sum-of-probabilities formulation, which may be more stable. Optionally, sequence log-probabilities $\log p(s_i|x)$ can be length-normalized before summation and entropy calculation to mitigate length bias, analogous to LN-PE.
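
    As a toy illustration (numbers invented for exposition, not taken from the paper): suppose $N = 5$ samples fall into two semantic clusters with normalized probabilities $\hat{p}(c_1|x) = 0.8$ and $\hat{p}(c_2|x) = 0.2$; then $SE(x) = -(0.8 \log 0.8 + 0.2 \log 0.2) \approx 0.50$ nats. If instead all five samples were lexically distinct paraphrases of the same answer, they would collapse into a single cluster and $SE(x) = 0$, even though sequence-level PE could still be high.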

Implementation Details and Considerations

Implementing Semantic Entropy involves several practical choices:

  • NLI Model: The paper uses DeBERTa-large fine-tuned on MNLI. The choice of NLI model impacts the quality of the semantic clustering. The NLI model needs to reliably predict entailment between pairs like (premise=x+s_i, hypothesis=x+s_j). The accuracy and calibration of the NLI model are crucial.
  • Clustering Algorithm: The bidirectional entailment check requires $O(N^2)$ NLI model inferences for $N$ samples, which can be computationally expensive for large $N$. Potential optimizations might involve approximate nearest-neighbor search in a semantic embedding space as a pre-filtering step, followed by NLI checks only for close candidates, although this is not explored in the paper.
  • Sampling Strategy: Multinomial sampling is preferred over beam search as it provides greater diversity, which is beneficial for exploring the distribution and identifying distinct semantic modes. The sampling temperature $T$ is a critical hyperparameter. The paper finds intermediate temperatures (e.g., $T=0.5$) often perform best, balancing diversity and plausibility. Too high a temperature leads to noisy, low-probability samples, while too low a temperature collapses the distribution, hiding uncertainty. The number of samples $N$ directly impacts the quality of the Monte Carlo approximation; higher $N$ improves estimation but increases computational cost (both sampling and clustering).
  • Sequence Probability Calculation: Requires access to the token-level probabilities from the LLM during generation. For API-based models that don't provide token probabilities, SE cannot be directly computed using the probability-summation approach, though the count-based approximation $\hat{p}(c_k|x) \approx |c_k|/N$ remains feasible if sampling is possible.
  • Length Normalization: The paper notes that applying length normalization (dividing log-probabilities by sequence length, or applying a penalty of the form $((\alpha L + \beta) / (\beta + 1))^{\gamma}$) can be beneficial, particularly when sequence lengths vary significantly. The optimal normalization strategy may be task-dependent.
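
A brief sketch of the penalty-based option above (the penalty form follows the expression given in the bullet; the default alpha, beta, and gamma values are illustrative placeholders, not values from the paper):

def length_penalized_log_prob(seq_log_prob, seq_len, alpha=1.0, beta=5.0, gamma=0.65):
  """Divide a sequence log-probability by a penalty of the form ((alpha*L + beta) / (beta + 1))**gamma."""
  # alpha, beta, gamma are placeholder hyperparameters for illustration only.
  penalty = ((alpha * seq_len + beta) / (beta + 1.0)) ** gamma
  return seq_log_prob / penalty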

Pseudocode for Semantic Entropy Calculation:

import math

from transformers import pipeline


def semantic_entropy(model, tokenizer, context, nli_pipeline=None,
                     num_samples=50, temperature=0.5, max_len=50):
  """
  Calculates Semantic Entropy for a given context using the count-based
  approximation of cluster probabilities.

  Args:
    model: The LLM (e.g., a Hugging Face causal LM).
    tokenizer: The tokenizer for the model.
    context: The input string/prompt.
    nli_pipeline: Optional text-classification pipeline for NLI; loaded if None.
    num_samples: Number of sequences to sample (N).
    temperature: Sampling temperature (T).
    max_len: Maximum number of newly generated tokens per sample.

  Returns:
    Semantic entropy value (in nats).
  """
  # 0. NLI model for the bidirectional entailment check
  #    (the paper uses DeBERTa-large fine-tuned on MNLI).
  if nli_pipeline is None:
    nli_pipeline = pipeline("text-classification",
                            model="microsoft/deberta-large-mnli")

  # 1. Sampling: draw N sequences with multinomial sampling.
  inputs = tokenizer(context, return_tensors="pt")
  outputs = model.generate(
      **inputs,
      do_sample=True,
      num_return_sequences=num_samples,
      temperature=temperature,
      max_new_tokens=max_len,
      output_scores=True,           # kept in case the probability-weighted variant is needed
      return_dict_in_generate=True,
      pad_token_id=tokenizer.eos_token_id,
  )

  # For the count-based approximation only the decoded continuations are needed.
  # (The probability-weighted variant would additionally accumulate per-sequence
  # log-probabilities from outputs.scores.)
  prompt_len = inputs.input_ids.shape[1]
  decoded_sequences = tokenizer.batch_decode(
      outputs.sequences[:, prompt_len:], skip_special_tokens=True
  )

  def entails(premise, hypothesis):
    """True if the NLI model predicts ENTAILMENT for (premise, hypothesis)."""
    res = nli_pipeline({"text": premise, "text_pair": hypothesis})
    if isinstance(res, list):
      res = res[0]
    return res["label"].upper() == "ENTAILMENT"

  # 2. Semantic clustering via greedy bidirectional entailment.
  clusters = []             # list of lists, each inner list is a cluster of sample indices
  clustered_indices = set()

  for i in range(num_samples):
    if i in clustered_indices:
      continue
    current_cluster = [i]
    clustered_indices.add(i)
    for j in range(i + 1, num_samples):
      if j in clustered_indices:
        continue
      text_i = context + " " + decoded_sequences[i]
      text_j = context + " " + decoded_sequences[j]
      # Two answers share a meaning iff entailment holds in both directions.
      if entails(text_i, text_j) and entails(text_j, text_i):
        current_cluster.append(j)
        clustered_indices.add(j)
    clusters.append(current_cluster)

  # 3. Entropy calculation (count-based approximation of cluster probabilities).
  if len(clusters) <= 1:
    return 0.0  # all samples share one meaning: no semantic uncertainty

  cluster_probs = [len(cluster) / num_samples for cluster in clusters]
  return -sum(p * math.log(p) for p in cluster_probs if p > 0)
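
A hypothetical usage sketch (the model name and prompt are illustrative only; any Hugging Face causal LM that supports sampling should work):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # illustrative; the paper evaluates OPT models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: What is the capital of France? A:"
se = semantic_entropy(model, tokenizer, prompt, num_samples=10, temperature=0.5)
print(f"Semantic entropy: {se:.3f} nats")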

Experimental Evaluation and Results

The effectiveness of SE was evaluated on free-form QA using CoQA (conversational) and TriviaQA (open-domain) datasets with OPT models ranging from 125M to 30B parameters. Uncertainty was measured by its ability to predict correctness (Rouge-L > 0.3 threshold) via the Area Under the Receiver Operating Characteristic curve (AUROC).
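
A minimal sketch of this evaluation protocol, assuming per-question uncertainty scores and correctness labels (e.g., Rouge-L > 0.3 against the reference) have already been computed; scikit-learn is used here for convenience:

import numpy as np
from sklearn.metrics import roc_auc_score

def uncertainty_auroc(uncertainties, correct):
  """AUROC of an uncertainty score as a predictor of incorrect answers.

  uncertainties: per-question uncertainty values (e.g., semantic entropy).
  correct: per-question booleans indicating whether the answer was judged correct.
  """
  incorrect = ~np.asarray(correct, dtype=bool)
  return roc_auc_score(incorrect, np.asarray(uncertainties))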

  • Performance: SE consistently achieved higher AUROC scores than baselines including PE, LN-PE, p(True), and Lexical Similarity (average Rouge-L within samples). On TriviaQA with OPT-30B, SE achieved an AUROC of 0.78, compared to ~0.72 for LN-PE and ~0.65 for PE (Figure 1a). Similar gains were observed on CoQA.
  • Scaling: The performance gap between SE and baselines generally widened with increasing model size (Figure 1b, Figure 5). This suggests SE becomes increasingly valuable as models become more capable but potentially harder to interpret.
  • Mechanism: Analysis confirmed that questions answered incorrectly tended to yield more distinct semantic clusters among the samples than questions answered correctly (Table 2), supporting the intuition that semantic dispersion correlates with error. While simply counting clusters showed some predictive power, SE (which weights clusters by probability) performed better.
  • Sample Efficiency: SE demonstrated better utilization of samples compared to PE/LN-PE, with its AUROC improving more steeply as the number of samples $N$ increased from 10 to 100 (Figure 6a).
  • Hyperparameters: Optimal performance was found at intermediate sampling temperatures (e.g., $T=0.5$), significantly outperforming higher temperatures ($T=1.0$) often used in prior work evaluating PE (Figure 6b). Multinomial sampling yielded better results than multinomial beam search (Table 4).

Practical Applications and Limitations

Semantic Entropy provides a more nuanced understanding of LLM uncertainty in generation tasks. Potential applications include:

  • Reliability Assessment: Identifying potentially incorrect or unreliable answers in QA systems or chatbots. High SE could flag outputs needing verification.
  • Selective Generation: In critical applications, the system could abstain from answering or request clarification if SE exceeds a certain threshold.
  • Hybrid Systems: High SE outputs could be deferred to a human expert or a different, perhaps more conservative, system.
  • Model Diagnostics: Analyzing patterns in SE across different inputs or domains can help diagnose systematic weaknesses or biases in the LLM.

Limitations include:

  • Computational Cost: The $O(N^2)$ complexity of the pairwise NLI clustering step can be prohibitive for large $N$.
  • NLI Model Dependency: The quality of SE hinges on the accuracy and robustness of the chosen NLI model. Errors or biases in the NLI model will propagate to the uncertainty estimate. The NLI model must also handle the potentially varied phrasing and length generated by the LLM appropriately within the context $x$.
  • Sampling Artifacts: Like all sampling-based methods, the estimate is sensitive to the number of samples $N$ and the sampling temperature $T$. Poor choices can lead to inaccurate uncertainty estimates.

Conclusion

Semantic Entropy offers a principled approach to uncertainty estimation in NLG by explicitly addressing the challenge of semantic equivalence. By clustering sampled outputs based on meaning using bidirectional NLI entailment and calculating entropy over these semantic clusters, SE provides a measure that correlates better with model accuracy on QA tasks compared to standard sequence-level entropy measures, particularly for large-scale models. While computationally more intensive than basic predictive entropy due to the clustering step, its improved performance suggests it is a valuable tool for assessing the reliability of LLM-generated text in practical applications.