
Semantic Entropy: Quantifying Uncertainty

Updated 4 March 2026
  • Semantic Entropy is an information-theoretic measure that quantifies unpredictability by grouping outputs into semantic equivalence classes rather than relying solely on surface forms.
  • It employs methods like NLI-based and embedding-based clustering to aggregate probabilities over paraphrases, ensuring robust uncertainty assessment in various model outputs.
  • SE is applied across domains—from language models and video-QA to time series and semantic communications—to detect hallucinations and guide adaptive inference.

Semantic Entropy (SE) is a rigorously defined information-theoretic quantity that extends classical Shannon entropy to quantify uncertainty or diversity at the level of semantic equivalence rather than at the merely lexical, symbolic, or syntactic level. Across its major instantiations, SE has been used to assess model confidence, guide decision processes in neural networks, decompose linguistic redundancy, characterize time series, and optimize communication protocols. While definitions and operationalizations differ across domains, the unifying principle is the measurement of unpredictability, diversity, or ambiguity over units representing “meaning” rather than surface form.

1. Theoretical Foundations and Formal Definitions

The canonical definition of semantic entropy is as follows. Let $\mathcal{S}$ be the set of possible model outputs (token sequences), and define a semantic equivalence relation $E(s, s')$ that partitions $\mathcal{S}$ into a set of equivalence classes $\mathcal{C} = \{c_1, \ldots, c_K\}$, where each $c_k$ consists of all utterances with identical semantic content under $E$. For a model distribution $p(s \mid x)$ (input $x$), the probability mass assigned to class $c$ is

$$p(c \mid x) = \sum_{s \in c} p(s \mid x).$$

The semantic entropy is then
$$\mathrm{SE}(x) = -\sum_{c \in \mathcal{C}} p(c \mid x)\,\log p(c \mid x).$$
This construction ensures that paraphrases or alternative wordings have their probability masses aggregated, yielding a metric invariant to surface form (Kuhn et al., 2023).

In practice, $p(c \mid x)$ is approximated via Monte Carlo: sample $M$ outputs $s^{(1)}, \dots, s^{(M)}$ from $p(s \mid x)$, group them into semantic clusters $C_1, \dots, C_K$ via bidirectional entailment (or other clustering), estimate $p(C_k \mid x)$ by summing sequence probabilities, and compute

$$\widehat{\mathrm{SE}}(x) = -\sum_{k=1}^{K} \bigg(\sum_{s \in C_k} p(s \mid x)\bigg)\,\log\bigg(\sum_{s \in C_k} p(s \mid x)\bigg).$$
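As a concrete illustration, the Monte Carlo estimator above can be sketched in a few lines (a minimal sketch; `cluster_of` stands in for whatever semantic-clustering step is used):

```python
import math
from collections import defaultdict

def semantic_entropy(samples, cluster_of):
    """Monte Carlo semantic-entropy estimate over sampled generations.

    samples    : list of (sequence, probability) pairs drawn from p(s|x)
    cluster_of : maps a sequence to its semantic-cluster id
    """
    mass = defaultdict(float)
    for seq, prob in samples:
        mass[cluster_of(seq)] += prob        # p(C_k|x): sum of member probabilities
    total = sum(mass.values())               # renormalize over the sampled support
    return -sum((m / total) * math.log(m / total)
                for m in mass.values() if m > 0)

# Toy example: two paraphrases of one answer plus one distinct answer.
samples = [("Paris is the capital.", 0.5),
           ("The capital is Paris.", 0.3),
           ("It is Lyon.", 0.2)]
clusters = {"Paris is the capital.": 0, "The capital is Paris.": 0, "It is Lyon.": 1}
se = semantic_entropy(samples, clusters.get)  # entropy over masses {0.8, 0.2}, ~0.50 nats
```

Because the two paraphrases pool into one class, SE stays low even though three distinct strings were sampled; token-level entropy over the raw strings would be higher.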

Variants such as the discrete approximation (using cluster sample frequencies) and continuous versions (using model probabilities) are operationally important (Penny-Dimri et al., 1 Mar 2025).

Domain-specific forms exist: for time series, SE is the entropy of the empirical distribution of local geometric patterns (see Section 5) (Majumdar et al., 2016); for semantic chunking, SE is the entropy rate of a hierarchical tree of semantic units (Zhong et al., 13 Feb 2026); in communications, SE is the minimal expected code length such that semantic task performance is preserved (Rong et al., 2024).

2. Methodologies for Computing Semantic Entropy

In LLMs and multimodal systems, computing SE entails:

  1. Sampling Outputs: Draw multiple high-temperature generations for a given (text or video) prompt (Gautam et al., 13 Jan 2026, Kuhn et al., 2023).
  2. Clustering by Meaning: Group the samples into semantic equivalence classes, typically via bidirectional NLI entailment or embedding-based similarity.
  3. Estimating Cluster Probabilities: Sum model probabilities (possibly normalized by length) for all members of a cluster.
  4. Entropy Calculation: Compute $-\sum_k p_k \log p_k$ over the clusters.

In specialized applications, modifications may arise:

  • In video or medical VQA, outputs under perturbed inputs are clustered, and cluster probabilities adapt to reflect robustness or sensitivity to the input (Gautam et al., 13 Jan 2026, Liao et al., 26 Mar 2025).
  • In time-series, neighborhood patterns map directly to a finite symbol set, and SE is the entropy of observed configuration frequencies (Majumdar et al., 2016).

Algorithmic and computational considerations include the quadratic cost of NLI-based clustering ($O(M^2)$ for $M$ samples), the ability to parallelize embedding-based clustering, and tradeoffs between plug-in estimators and coverage-corrected or spectral methods in small-sample settings (McCabe et al., 17 Sep 2025).
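The quadratic entailment-based clustering step can be sketched greedily as follows (a sketch only: `entails` is a placeholder for a real NLI model call, and the toy stand-in below uses bag-of-words containment purely for illustration):

```python
def cluster_by_entailment(samples, entails):
    """Greedy bidirectional-entailment clustering of sampled outputs.

    `entails(a, b)` is assumed to wrap an NLI model returning True when a
    entails b; two outputs share a cluster iff entailment holds both ways.
    Worst case is O(M^2) entailment calls for M samples, as noted above.
    """
    reps, labels = [], []                  # one representative per cluster
    for s in samples:
        for k, rep in enumerate(reps):
            if entails(s, rep) and entails(rep, s):
                labels.append(k)
                break
        else:
            reps.append(s)
            labels.append(len(reps) - 1)
    return labels

# Toy stand-in for an NLI model: bag-of-words containment (illustration only).
toy_entails = lambda a, b: set(a.lower().split()) <= set(b.lower().split())
labels = cluster_by_entailment(
    ["Paris is the capital", "the capital is Paris", "It is Lyon"], toy_entails)
# → [0, 0, 1]: the two paraphrases merge, the distinct answer stays separate
```

Comparing each new sample only against one representative per existing cluster keeps the average cost well below the $M^2$ worst case when few clusters emerge.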

3. Practical Role: Uncertainty Quantification and Hallucination Detection

SE has gained traction as an intrinsic, unsupervised quantification of semantic uncertainty for language and multimodal models. SE measures how probability mass is dispersed across distinct semantic hypotheses, allowing it to detect hallucinations, distinguish genuine uncertainty about content from mere variation in wording, and guide adaptive inference.

Empirical results consistently show that SE outperforms token-level predictive entropy and self-evaluation baselines for hallucination detection, as measured by AUROC (e.g., AUROC ~0.83 vs 0.80 for normalized entropy on TriviaQA; ~0.76 for SE vs 0.62 for perplexity on clinical QA) (Kuhn et al., 2023, Penny-Dimri et al., 1 Mar 2025). SE remains robust with modest sample sizes (often $M < 20$), and its power grows with model size (Kuhn et al., 2023).

In safety-critical clinical settings, discrete or continuous SE achieves near-perfect uncertainty discrimination under expert review (AUROC ~0.97) even when clustering is imperfect (Penny-Dimri et al., 1 Mar 2025).
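The AUROC figures above can be reproduced for any scored dataset with a simple rank-based computation (a self-contained sketch; real evaluations typically use a library implementation such as scikit-learn's `roc_auc_score`):

```python
def auroc(scores, labels):
    """AUROC of an uncertainty score for predicting errors (label 1 = incorrect).

    Equals the probability that a randomly chosen incorrect answer receives
    a higher score than a randomly chosen correct one, counting ties as 1/2.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# SE is high on the two hallucinated answers, low on the two grounded ones:
auroc([1.2, 0.9, 0.1, 0.3], [1, 1, 0, 0])  # → 1.0 (perfect separation)
```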

4. Limitations, Extensions, and Generalizations

While SE is conceptually powerful, several limitations motivate recent extensions:

  • Degeneracy for Deterministic Outputs: If all samples coalesce into a single semantic cluster ($K = 1$), SE evaluates to zero regardless of correctness. This “single-cluster failure” means SE is sensitive only to aleatoric uncertainty, not epistemic uncertainty (model ignorance) (Ma et al., 20 Aug 2025).
  • Neglect of Intra- and Inter-Cluster Similarity: Hard clustering treats all clusters as maximally distinct, ignoring proximity between semantically similar clusters or spread within a cluster. This reduces effectiveness for one-sentence outputs or settings with near-unique generations (Nguyen et al., 30 May 2025, Nikitin et al., 2024).
  • Sample Coverage Bias: Plug-in estimators tend to underestimate true semantic entropy when the support (the “semantic alphabet”) is only partially sampled. Coverage correction using Good–Turing, spectral graph, or hybrid estimators improves bias and downstream performance (McCabe et al., 17 Sep 2025).
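To illustrate the direction of the coverage bias, the sketch below contrasts the plug-in estimator with the simpler Miller–Madow correction (used here only as a stand-in for the Good–Turing and spectral estimators cited above):

```python
import math
from collections import Counter

def plugin_entropy(cluster_labels):
    """Naive plug-in estimate from observed cluster frequencies (in nats)."""
    n = len(cluster_labels)
    return -sum((c / n) * math.log(c / n)
                for c in Counter(cluster_labels).values())

def miller_madow_entropy(cluster_labels):
    """Plug-in estimate plus the Miller–Madow bias correction (K - 1) / (2n).

    The correction pushes the estimate upward, in the same direction the
    coverage-corrected estimators move when clusters remain unobserved.
    """
    n = len(cluster_labels)
    k = len(set(cluster_labels))
    return plugin_entropy(cluster_labels) + (k - 1) / (2 * n)
```

With only a handful of samples over a large semantic alphabet, the plug-in value systematically underestimates the true entropy; any estimator of this family trades bias for variance.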

Key generalizations include:

  • Kernel Language Entropy (KLE): Replaces hard clusters with a positive-semidefinite semantic similarity kernel; uncertainty is quantified as von Neumann entropy, recovering SE as a special case for block-diagonal kernels (Nikitin et al., 2024).
  • Semantic Nearest Neighbor Entropy (SNNE): Dispenses with clustering, estimating entropy via LogSumExp of pairwise semantic similarities, smoothing over intra- and inter-cluster structure (Nguyen et al., 30 May 2025).
  • Structural Semantic Entropy (SeSE): Encodes semantic output space as a directed, sparsified semantic graph (using NLI entailment strengths), then computes graph-structural entropy over optimal hierarchical encoding trees; yields substantially improved detection especially for long-form outputs, outperforming both SE and KLE empirically (Zhao et al., 20 Nov 2025).
  • SE Probes (SEP): Linear probes trained on internal model states can predict entropy class (high/low SE) at negligible cost and with substantial generalization in out-of-distribution tasks (Kossen et al., 2024).
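The KLE reduction to SE for block-diagonal kernels can be checked directly (a sketch assuming a unit-normalized all-ones block per semantic cluster):

```python
import numpy as np

def von_neumann_entropy(K):
    """Von Neumann entropy of a PSD semantic kernel over M samples (in nats)."""
    K = K / np.trace(K)                  # normalize to unit trace
    eig = np.linalg.eigvalsh(K)
    eig = eig[eig > 1e-12]               # drop numerical zeros
    return float(-np.sum(eig * np.log(eig)))

# Block-diagonal kernel: one all-ones block per cluster (sizes 2 and 1).
# After trace normalization its nonzero eigenvalues are the cluster masses
# 2/3 and 1/3, so the result coincides with discrete SE over those masses.
K = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [0., 0., 1.]])
```

An off-block-diagonal entry between similar clusters would lower the entropy, which is exactly the inter-cluster sensitivity that hard clustering discards.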

5. Domain-Specific Instantiations

Language and Vision-LLMs

In NLG, SE operates on autoregressive LMs, multimodal LLMs, and video-VLMs:

  • Text LMs: SE clusters model generations by paraphrase equivalence; high entropy reflects uncertainty over “possible truths,” not over wording (Kuhn et al., 2023).
  • Medical VQA/Video VLMs: SE generalizes to spatiotemporal perturbations and visual contexts; in VideoHEDGE, cluster probabilities are computed from both clean and perturbed video-generated answers, capturing the effect of visual support on semantic stability (Gautam et al., 13 Jan 2026, Liao et al., 26 Mar 2025).
  • Compositional Reasoning/Inference: In multi-round parallel reasoning frameworks, SE is a “stop” signal, reflecting when the semantic diversity of candidate solutions drops, aiding adaptive compute allocation (Xu et al., 9 Jul 2025).

Time Series and Signal Analysis

In geometric signal frameworks, semantic entropy is computed over the frequency of local geometric configurations in the time series (13 possible patterns for 3-point neighborhoods) (Majumdar et al., 2016, Majumdar et al., 2018). SE quantifies “shape complexity”: regular signals or constant slopes yield $\mathrm{SE} = 0$, while maximal diversity (as in white noise) gives $\mathrm{SE} \approx \log_2 13$. The SE-to-information-power ratio characterizes phenomena such as synchrony in EEG (Majumdar et al., 2018).
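One concrete reading of the 13-pattern scheme identifies the patterns with the 13 weak orderings of three points; under that assumption, the computation sketches as:

```python
import math
from collections import Counter

def pattern(a, b, c):
    """Weak ordering of a 3-point window: signs of the pairwise comparisons.

    Exactly 13 consistent sign triples exist, matching the 13 local
    geometric configurations referenced above (one possible reading).
    """
    cmp = lambda x, y: (x > y) - (x < y)
    return (cmp(a, b), cmp(b, c), cmp(a, c))

def geometric_se(series):
    """Base-2 entropy of the empirical 3-point pattern distribution."""
    pats = Counter(pattern(*series[i:i + 3]) for i in range(len(series) - 2))
    n = sum(pats.values())
    return -sum((c / n) * math.log2(c / n) for c in pats.values())

geometric_se([1, 2, 3, 4, 5])   # constant slope: a single pattern, SE = 0
```

A maximally diverse signal that visits all 13 patterns equally often would approach the $\log_2 13$ ceiling noted above.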

Semantic Communications

In deep learning-based semantic communications, SE is the minimum expected number of “semantic symbols” required to achieve task-level fidelity, operationalizing compression and channel resource allocation (Rong et al., 2024). Adaptive channel assignment and semantic key generation for physical-layer security leverage SE as a guiding metric.

Statistical Structure of Natural Language

In a formal model of natural language, SE is the entropy rate of the random ensemble of hierarchical semantic chunkings, providing a direct, first-principles explanation of empirical redundancy rates in English (≈1 bit/character), modulated by a single parameter (the maximum branching factor $K$) controlling semantic complexity (Zhong et al., 13 Feb 2026).

6. Illustrative Table: SE Across Representative Domains

Domain | Input Objects | Semantic Unit / Cluster | SE Formula (example)
LLMs/QA (Kuhn et al., 2023) | Text strings (completions) | Paraphrase clusters (by NLI) | $-\sum_k p_k \log p_k$ over clusters
Video-VLMs (Gautam et al., 13 Jan 2026) | Answer texts (video QA) | Output groups (embedding/NLI) | $-\sum_j p_j \log p_j$ with $p_j$ from log-likelihood sums
Time series (Majumdar et al., 2016) | Signal, 3-point windows | 13 geometric configuration patterns | $-\sum_{i=1}^{13} p_i \log_2 p_i$
Semantic comms (Rong et al., 2024) | Feature maps | Chosen feature subset (by $w_i^c$) | Expected number of features $\lambda$ for task fidelity
Language structure (Zhong et al., 13 Feb 2026) | Doc tokens | Chunk/tree branches ($K$-ary) | Entropy rate $h_K$ of semantic tree ensemble

7. Impact, Benchmarks, and Empirical Behavior

Semantic Entropy has become an anchor metric in model reliability and uncertainty quantification research:

  • Benchmark performance: SE (white- or black-box) achieves strong discrimination of correct vs. incorrect predictions in QA, summarization, and translation (AUROC up to ~0.83 with few samples) (Kuhn et al., 2023, Penny-Dimri et al., 1 Mar 2025, McCabe et al., 17 Sep 2025).
  • Calibration: In held-out or expert settings, SE remains robust even when perplexity or token-level entropy does not correlate with real-world correctness (Penny-Dimri et al., 1 Mar 2025).
  • Video/vision: SE, while conceptually expressive, sometimes fails to flag high-confidence hallucinations when models output paraphrases of a single grounded (or ungrounded) answer; vision-amplified variants like VASE that explicitly contrast clean and perturbed inputs outperform plain SE (Gautam et al., 13 Jan 2026, Liao et al., 26 Mar 2025).
  • Algorithmic efficiency: Embedding-based clustering delivers SE estimates nearly matching NLI-based results but at orders of magnitude lower computational cost (Gautam et al., 13 Jan 2026); SE probes can infer high-vs-low SE at inference-time for zero extra sampling (Kossen et al., 2024).
  • Contextual limitations: In high-accuracy, short-generation settings, intra-cluster similarity and the possibility of “semantic collapse” (all outputs identical yet wrong) require richer generalizations (e.g., SNNE, KLE, SeSE) (Nguyen et al., 30 May 2025, Zhao et al., 20 Nov 2025, Nikitin et al., 2024).

8. Conclusion

Semantic entropy thus constitutes a central pillar in modern uncertainty quantification, with a growing set of variants designed to address its theoretical and practical limits. Empirical experience across diverse domains supports its value as an unsupervised, interpretable, and extensible metric attuned to the semantics of information rather than its surface form.
