Semantic Nearest Neighbor Entropy (SNNE)

Updated 25 May 2026
  • SNNE is an uncertainty metric that measures semantic diversity in LLM outputs by evaluating pairwise similarities among generated answers.
  • It leverages a nearest-neighbor framework with log-sum-exp aggregation to capture both intra-cluster coherence and inter-cluster variation.
  • Empirical results in LLM and VQA applications indicate that SNNE improves hallucination detection and safety-critical decision-making.

Semantic Nearest Neighbor Entropy (SNNE) is an uncertainty quantification metric for LLMs that generalizes cluster-based semantic entropy measures by leveraging pairwise semantic similarity in a nearest-neighbor framework. SNNE overcomes limitations in previous entropy metrics by simultaneously capturing both intra-cluster and inter-cluster semantic variation, making it particularly suited for detecting hallucination and semantic ambiguity in LLMs, as well as for safety-critical tasks such as surgical visual question answering (VQA).

1. Formal Definition

SNNE is conceptually inspired by continuous-space nearest-neighbor entropy estimators, adapted for discrete answer sets produced by LLMs. For a prompt $q$ and a set of $n$ sampled answers $\{a^1, \ldots, a^n\}$, a pairwise similarity function $f(a^i, a^j \mid q)$ quantifies the semantic closeness between answers, with higher values indicating greater semantic proximity.

The black-box SNNE for prompt $q$ is defined as:

$$\mathrm{SNNE}(q) = -\frac{1}{n}\sum_{i=1}^n \log\left[ \sum_{j=1}^n \exp\left( \frac{1}{\tau} f(a^i, a^j \mid q) \right) \right]$$

where $\tau > 0$ is a temperature parameter controlling the softness of the nearest-neighbor aggregation. The log-sum-exp operator yields a "soft" neighbor count, aggregating both strong (intra-cluster) and weak (inter-cluster) semantic relations.
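The black-box definition can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `logsumexp` and `snne` and the two toy similarity matrices are assumptions introduced here, and the similarity values are made up.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if math.isinf(m):  # all entries -inf (or some entry +inf)
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def snne(sim, tau=1.0):
    """Black-box SNNE from an n x n matrix with sim[i][j] = f(a^i, a^j | q)."""
    n = len(sim)
    # One log-sum-exp ("soft neighbor count") per answer, then the negated mean.
    return -sum(logsumexp([s / tau for s in row]) for row in sim) / n

# Toy matrices (hypothetical values): three answers that all agree,
# and three answers that share nothing with each other.
agree = [[1.0, 1.0, 1.0] for _ in range(3)]
diverse = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
```

With these toy inputs, `snne(agree)` is lower than `snne(diverse)`: semantically diverse answer sets receive the higher uncertainty score.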

A white-box extension (WSNNE) incorporates model-estimated answer sequence probabilities $\tilde P(a^i \mid q)$, yielding:

$$\mathrm{WSNNE}(q) = -\sum_{i=1}^n \bar P(a^i \mid q) \log\left[ \sum_{j=1}^n \exp\left( \frac{1}{\tau} f(a^i, a^j \mid q) \right) \right]$$

where $\bar P(a^i \mid q)$ is the normalized probability over the $n$ samples.
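A hedged sketch of the white-box variant follows. It assumes the sequence probabilities arrive as log-probabilities and that normalization means dividing each by the sum over the $n$ samples; whether the source applies any further adjustment (for example length normalization) is not stated here. The names `normalize_probs` and `wsnne` are illustrative.

```python
import math

def _logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if math.isinf(m):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def normalize_probs(seq_logprobs):
    """Turn per-answer sequence log-probabilities into weights that sum to 1."""
    z = _logsumexp(seq_logprobs)
    return [math.exp(lp - z) for lp in seq_logprobs]

def wsnne(sim, seq_logprobs, tau=1.0):
    """WSNNE: probability-weighted version of the SNNE aggregation."""
    weights = normalize_probs(seq_logprobs)
    return -sum(
        w * _logsumexp([s / tau for s in row])
        for w, row in zip(weights, sim)
    )
```

When all answers have equal sequence probability, every weight is $1/n$ and the score coincides with black-box SNNE.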

For VQA and related applications, a variant definition operates on the entropy of a softmax-normalized vector of similarities between a candidate answer and its $k$ semantic nearest neighbors in embedding space, with the similarities computed, for example, via cosine similarity in text embedding space (Nguyen et al., 30 May 2025, Pierantozzi et al., 3 Nov 2025).
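One plausible reading of this variant is sketched below. The exact formula is not given here, so the details are assumptions: neighbors are chosen by cosine similarity, the softmax uses a temperature `tau` that defaults to 1, and the names `cosine` and `knn_softmax_entropy` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def knn_softmax_entropy(candidate, others, k=3, tau=1.0):
    """Entropy of the softmax over similarities to the k nearest neighbors.

    `candidate` is one answer embedding, `others` a non-empty list of the
    remaining answer embeddings.
    """
    # Keep the k largest similarities (the k nearest neighbors).
    sims = sorted((cosine(candidate, e) for e in others), reverse=True)[:k]
    m = max(sims)
    exps = [math.exp((s - m) / tau) for s in sims]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)
```

Under this reading, the entropy reaches its maximum of $\log k$ when all $k$ neighbors are equally similar to the candidate, and drops as a few neighbors dominate.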

2. Algorithmic Pipeline

The standard SNNE computation involves the following steps:

  1. Sampling: Generate $n$ answers $\{a^1, \ldots, a^n\}$ by sampling the LLM at a chosen generation temperature for a fixed prompt $q$.
  2. Pairwise Similarity: For each pair $(a^i, a^j)$ compute $f(a^i, a^j \mid q)$ using a semantic similarity metric.
  3. Aggregation: For each $a^i$, compute $\log \sum_{j=1}^n \exp\left( \frac{1}{\tau} f(a^i, a^j \mid q) \right)$ and aggregate as specified above.
  4. White-box Extension: In settings with access to sequence probabilities, weight terms by normalized model probabilities.
  5. Output: The final SNNE or WSNNE score summarizes the model's semantic uncertainty for the prompt.
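The steps above can be strung together as in the following sketch. Everything model-specific is a placeholder: `sample_answers` is a hypothetical stand-in that returns canned strings instead of calling an LLM, `jaccard` is a simple token-overlap similarity used only to keep the example self-contained (it ignores the prompt), and the default values of `n`, `temperature`, and `tau` are illustrative rather than recommended settings.

```python
import math

def sample_answers(prompt, n, temperature):
    """Hypothetical sampler stub; a real pipeline would query the LLM here."""
    canned = ["Paris", "Paris", "It is Paris", "Lyon", "Paris"]
    return canned[:n]

def jaccard(a, b):
    """Token-set overlap in [0, 1]; a stand-in for a semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

def snne_pipeline(prompt, n=5, temperature=1.0, tau=1.0, similarity=jaccard):
    # Step 1: sample n answers for the fixed prompt.
    answers = sample_answers(prompt, n, temperature)
    # Step 2: pairwise similarity matrix.
    sim = [[similarity(ai, aj) for aj in answers] for ai in answers]
    # Step 3: per-answer log-sum-exp (computed stably).
    per_answer = []
    for row in sim:
        scaled = [s / tau for s in row]
        m = max(scaled)
        per_answer.append(m + math.log(sum(math.exp(v - m) for v in scaled)))
    # Step 5: negated mean is the black-box SNNE score.
    return -sum(per_answer) / len(answers)

score = snne_pipeline("What is the capital of France?")
```

Step 4 (the white-box weighting) is omitted because the stub sampler has no sequence probabilities to offer.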

For black-box VQA applications, SNNE typically computes embedding-based similarities for each answer against its $k$ nearest neighbors and then calculates the entropy of the resulting softmax distribution.

3. Theoretical Properties and Relationship to Prior Entropy Measures

SNNE is a strict generalization of Discrete Semantic Entropy (DSE) and Semantic Entropy (SE):

  • DSE Recovery: If answers are pre-clustered and $f(a^i, a^j \mid q)$ is set to a suitable constant intra-cluster and to $-\infty$ inter-cluster (so that cross-cluster pairs contribute nothing to the inner sum), SNNE reduces precisely to cluster-proportion entropy, i.e., DSE.
  • SE Recovery: With additional weighting by answer probabilities within clusters, WSNNE reduces to SE as formulated in prior LLM uncertainty literature.

These results show that for general $f$, SNNE/WSNNE smoothly interpolate between hard-cluster entropy and graded, pairwise semantic dispersion, capturing fine-grained intra- and inter-cluster variations that are not accessible to classical cluster-based metrics (Nguyen et al., 30 May 2025).
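The DSE recovery can be checked by direct substitution into the SNNE definition. The derivation below is a sketch under an assumed parameterization chosen here so that the reduction is exact: the intra-cluster value is $\tau \log(1/n)$ and the inter-cluster value is $-\infty$. The notation $C(a^i)$ for the cluster containing $a^i$ and $C_1, \ldots, C_M$ for the clusters is also introduced here; the source papers may state the condition differently.

```latex
\begin{aligned}
\sum_{j=1}^{n} \exp\!\left(\tfrac{1}{\tau} f(a^i, a^j \mid q)\right)
  &= \sum_{j:\, a^j \in C(a^i)} \exp\!\left(\log \tfrac{1}{n}\right)
   = \frac{|C(a^i)|}{n}, \\
\mathrm{SNNE}(q)
  &= -\frac{1}{n} \sum_{i=1}^{n} \log \frac{|C(a^i)|}{n}
   = -\sum_{k=1}^{M} \frac{|C_k|}{n} \log \frac{|C_k|}{n}
   = \mathrm{DSE}(q).
\end{aligned}
```

The second line groups the $n$ answers by cluster: each of the $|C_k|$ answers in cluster $C_k$ contributes the same term, which yields the cluster-proportion entropy.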

4. Semantic Similarity Function Choices

The selection of the similarity function $f$ is critical for SNNE's effectiveness. Practical instantiations include ROUGE-L for short, one-sentence answers and cosine similarity between text embeddings for longer or domain-specific outputs.

In clinical VQA, embedding spaces are often domain-adapted (e.g., PubMed-tuned sentence BERT); the similarity function can be further modulated to reflect question–answer relevance as in the QA-SNNE extension (Pierantozzi et al., 3 Nov 2025).
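As an illustration of a lexical similarity function, the sketch below computes a simplified ROUGE-L F1 score from the longest common subsequence of tokens. It is an approximation written for this example: it tokenizes on whitespace, lowercases, applies no stemming, and weights precision and recall equally, so its values can differ from those of standard ROUGE packages. The names `lcs_length` and `rouge_l_f1` are illustrative.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(a, b):
    """Simplified ROUGE-L F1 between two strings (symmetric in a and b)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    lcs = lcs_length(ta, tb)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(ta), lcs / len(tb)
    return 2 * precision * recall / (precision + recall)
```

Because the F1 form is symmetric, a score of this kind can be plugged in as $f(a^i, a^j \mid q)$ without regard to argument order, although it does not use the prompt $q$.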

5. Question-Aligned SNNE and Modifications

Question-Aligned Semantic Nearest Neighbor Entropy (QA-SNNE) extends SNNE for applications where relevance to a specific query is paramount (e.g., safety-critical VQA):

  • Alignment Score Computation: For each neighbor answer, compute an alignment score with respect to the question $q$, using methods such as embedding cosine similarity, NLI entailment logits, or cross-encoder relevance models.
  • Gating Mechanism: Convert the alignment scores into relevance weights via a sharp softmax; the similarities are then reweighted by these relevance weights.
  • QA-Conditioned Entropy: Compute a softmax over the gated similarities, which normalizes them into a distribution, and compute the Shannon entropy of that distribution.

This modification ensures that only semantically relevant neighboring answers, as determined by their alignment to the question, contribute to the uncertainty estimate. The method is robust to paraphrased or rephrased questions, maintaining uncertainty calibration in diverse clinical or high-stakes user inputs (Pierantozzi et al., 3 Nov 2025).
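The three QA-SNNE steps can be sketched as follows for a single answer and its neighbors. The precise formulas are not given here, so this is only one possible reading: the gate is assumed to multiply each similarity by its relevance weight, "sharp" is modeled as a softmax with a small temperature `gate_temp`, and all function and parameter names are hypothetical.

```python
import math

def softmax(xs, temp=1.0):
    """Softmax with temperature; smaller temp gives a sharper distribution."""
    m = max(xs)
    exps = [math.exp((x - m) / temp) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def shannon_entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def qa_gated_entropy(sims, align_scores, gate_temp=0.1):
    """Entropy of neighbor similarities after question-alignment gating.

    sims[j]         similarity of the answer to neighbor j
    align_scores[j] alignment of neighbor j with the question
    """
    gates = softmax(align_scores, temp=gate_temp)   # sharp relevance weights
    gated = [g * s for g, s in zip(gates, sims)]    # assumed multiplicative gate
    return shannon_entropy(softmax(gated))          # entropy of normalized sims
```

How the per-answer entropies are combined across answers (summed or averaged) is left open in this sketch.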

6. Empirical Results and Performance

SNNE and its variants have demonstrated competitive or superior performance in measuring LLM uncertainty:

  • Benchmarks: Across models such as Phi-3-mini, Llama-3.1-8B, MedGemma-4B, and Qwen2.5-VL, and on tasks including question answering, summarization, and translation, SNNE/WSNNE improve AUROC and PRR compared with SE and other baselines.
  • Clinical VQA: Application of (QA-)SNNE in surgical VQA increases AUROC for hallucination detection by 15–38 percentage points for zero-shot LVLMs, with robustness to out-of-template paraphrasing.
  • Similarity/Aggregation Choices: ROUGE-L is preferable for one-sentence QA, while domain-adapted embeddings perform best in medical applications. Results are reported to be stable across a range of values for the number of neighbors $k$ and the temperature $\tau$.
  • Ablations: Model uncertainty estimates are sensitive to the generation temperature and the number of generations $n$, among other settings, with diminishing returns as the number of samples grows (Nguyen et al., 30 May 2025, Pierantozzi et al., 3 Nov 2025).

| Metric / Setting | Baseline (DSE, SNNE, VL-U) | QA-SNNE (Best Variant) | AUROC Gain |
|---|---|---|---|
| Surgical VQA (in-template) | up to 0.685 | 0.789 (CrossE) | +15 pp |
| Paraphrase Robustness (Llama3) | 0.74 (SNNE) | 0.96 (Entail) | +22 pp |
| External VQA (MedGemma) | 0.540 (SNNE), 0.687 (VL-U) | 0.755 (Emb-based) | up to +21 pp |
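
The AUROC figures above measure how well an uncertainty score separates hallucinated from correct answers. The following sketch shows how such a number can be computed from per-question scores and binary labels using the pairwise-ranking definition of AUROC. The function name `auroc` and the label convention (1 for a hallucinated answer, 0 for a correct one) are assumptions made for this example.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative.

    scores: uncertainty score per question (e.g. SNNE)
    labels: 1 if the answer was a hallucination, 0 if it was correct
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to a score that carries no information about correctness, and 1.0 to a score that ranks every hallucinated answer above every correct one.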

7. Implementation Practices, Practical Recommendations, and Limitations

  • Practical Usage: The recommended defaults use ROUGE-L for short LLM generations and embedding-based cosine similarity for longer or multi-sentence outputs. When the model exposes sequence probabilities (white-box access), they should be used to compute WSNNE.
  • Efficiency: SNNE is $O(n^2)$ in the sample count $n$ but practically feasible at the recommended sample sizes. Precomputed embeddings and efficient similarity kernels can accelerate large-scale application.
  • Limitations: SNNE only interrogates the textual space of outputs and may not capture errors due to failures in image grounding for vision-LLMs. Further, hard gold-labeling with ROUGE-L may blur the distinction between semantic diversity and true hallucination, particularly under mild paraphrasing.
  • Future Work: Potential directions include multimodal SNNE variants that probe visual grounding, adaptive tuning of intrinsic hyperparameters for each question type, extension to multi-turn dialogue, and application to additional clinical domains (Nguyen et al., 30 May 2025, Pierantozzi et al., 3 Nov 2025).
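
To illustrate the efficiency point above, the sketch below builds the full cosine similarity matrix while normalizing each embedding only once and evaluating each unordered pair only once. It assumes a symmetric similarity and a self-similarity of 1, which holds for cosine similarity of non-zero vectors; the function names are illustrative.

```python
import math

def unit_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity_matrix(embeddings):
    """n x n cosine similarity matrix from a list of embedding vectors."""
    unit = [unit_normalize(e) for e in embeddings]  # precompute once
    n = len(unit)
    sim = [[1.0] * n for _ in range(n)]             # diagonal stays at 1
    for i in range(n):
        for j in range(i + 1, n):                   # upper triangle only
            s = sum(x * y for x, y in zip(unit[i], unit[j]))
            sim[i][j] = sim[j][i] = s               # mirror by symmetry
    return sim
```

This roughly halves the number of similarity evaluations but leaves the cost quadratic in the number of samples.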

In summary, Semantic Nearest Neighbor Entropy provides a principled, general, and empirically validated approach to semantic uncertainty estimation in LLMs and VQA systems, subsuming and strictly generalizing prior entropy-based metrics while enabling improved hallucination detection and safety-critical decision support.
