
Uncertainty Quantification in LLMs

Updated 30 March 2026
  • Uncertainty quantification in LLMs is the process of estimating model reliability by decomposing total uncertainty into aleatoric and epistemic components.
  • Techniques such as entropy correction, Monte Carlo sampling, and attention-based methods provide practical frameworks for assessing output confidence.
  • These methods enable improved risk assessment, calibration, and detection of hallucinated or erroneous responses in diverse applications.

Uncertainty quantification (UQ) in LLMs refers to systematically estimating the confidence or reliability of model outputs, particularly in generative or open-set scenarios. UQ identifies when LLM predictions are likely to be erroneous, incomplete, or hallucinated, serving as a critical tool for downstream applications in risk-sensitive environments, selective deferral, and model comparison. Unlike traditional classification, uncertainty in LLMs often integrates both epistemic (model/parameter) and aleatoric (data/sampling) components, and must account for the vast and high-dimensional structure of autoregressive output distributions. This article synthesizes leading methodologies, theoretical underpinnings, and empirical findings from recent research, emphasizing both practical frameworks and open challenges.

1. Theoretical Foundations of Uncertainty in LLMs

Modern LLMs generate outputs by sampling from autoregressive distributions over large, discrete sequence spaces. Unlike standard neural classifiers, uncertainty estimation for LLMs must typically consider the exponentially large output space, the token-by-token factorization of sequence probabilities, and the entanglement of epistemic and aleatoric components.

Formally, let $p(s \mid x)$ denote the joint probability of a full output sequence $s$ conditioned on prompt $x$. The canonical measure of total uncertainty is the Shannon entropy $H^*(x) = -\sum_{s\in S} p(s \mid x)\log p(s \mid x)$. Sampling-based and structural approximations take this formulation as a starting point (Kunitomo-Jacquin et al., 6 Oct 2025).

2. Entropy-Based and Sampling-Driven UQ

The entropy of the LLM’s output sequence distribution is a standard metric for UQ but is intractable for large $S$; thus, it must be approximated from samples. The role of unobserved output sequences—those with nonzero model probability yet unsampled—proves crucial (Kunitomo-Jacquin et al., 6 Oct 2025):

  • Monte Carlo entropy estimate:

\hat{H}_\text{naive}(x) = -\sum_{s\in A} \hat{p}(s \mid x)\log \hat{p}(s \mid x)

where $A$ is the set of sequences observed at least once in $M$ draws and $\hat{p}(s \mid x)$ is the renormalized probability over $A$.

  • Correction for unobserved probability mass:

p_\text{unseen} = 1 - \sum_{s\in A} p(s \mid x)

Fold this into the entropy:

\hat{H}(x) = -\sum_{s\in A} p(s \mid x)\log p(s \mid x) - p_\text{unseen}\log p_\text{unseen}

Plug-in, Good–Turing, or Chao estimators adjust $p_\text{unseen}$ in low-sample regimes.
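
As a concrete illustration, here is a minimal sketch of the corrected estimator (function and argument names are hypothetical; it uses the simple plug-in estimate of $p_\text{unseen}$, where Good–Turing or Chao would substitute a different estimate):

```python
import math

def corrected_entropy(samples, seq_logprob):
    """Monte Carlo entropy with a missing-mass correction.

    samples: generated sequences (with repeats) drawn from the model.
    seq_logprob: maps each unique sequence to its model log-probability
                 log p(s|x), computed with the EOS token included.
    """
    observed = set(samples)
    # True model probability mass on the sequences we actually saw.
    p_seen = {s: math.exp(seq_logprob[s]) for s in observed}
    # Plug-in estimate of the unobserved probability mass.
    p_unseen = max(0.0, 1.0 - sum(p_seen.values()))
    # Entropy contribution of the observed sequences...
    h = -sum(p * math.log(p) for p in p_seen.values() if p > 0)
    # ...plus a single correction term for all never-sampled sequences.
    if p_unseen > 0:
        h -= p_unseen * math.log(p_unseen)
    return h
```

If the observed sequences cover all of the probability mass, the correction term vanishes and the estimate reduces to the plain entropy over the observed set.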

Experimental results on Falcon-40B and TriviaQA establish that correcting for missing mass (termed “EOS-UP”) stabilizes AUROC and improves calibration, especially for small MM (Kunitomo-Jacquin et al., 6 Oct 2025). Omitting EOS or length-normalizing degrades performance.

Practical recommendations: Always compute sequence probabilities with EOS, avoid length normalization unless separately scored, and incorporate missing-mass corrections for robust uncertainty even with few samples (Kunitomo-Jacquin et al., 6 Oct 2025).

3. Internal Model Cues and White-Box UQ

Internal model signals, especially from transformer attention maps, enable efficient, sampling-free UQ:

  • RAUQ (Recurrent Attention-based UQ) exploits the observation that, in certain transformer heads, attention to the immediately preceding token drops when the model is generating incorrect tokens (Vazhentsev et al., 26 May 2025). RAUQ automatically identifies these uncertainty-aware attention heads and propagates per-token confidence recurrently (as a weighted sum of the current token probability and the confidence carried over from the previous step), finally taking the maximum sequence-level uncertainty across layers.

This approach is entirely unsupervised, requiring only one forward pass and no additional labeling or sampling overhead. Empirically, RAUQ outperforms prior white- and black-box UQ baselines across QA, summarization, and translation tasks, with PRR improvements of up to +0.26 and less than 1% latency overhead (Vazhentsev et al., 26 May 2025).
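
A highly simplified sketch of the recurrent aggregation idea follows; the real method selects heads and mixing weights automatically, whereas here the per-token predecessor attentions and the fixed mixing weight `alpha` are illustrative assumptions:

```python
import math

def rauq_style_uncertainty(token_logprobs, prev_token_attn, alpha=0.5):
    """Recurrent confidence aggregation in the spirit of RAUQ (simplified).

    token_logprobs: log-probability of each generated token.
    prev_token_attn: attention each token pays to its immediate predecessor
                     in a selected "uncertainty-aware" head (illustrative).
    alpha: fixed mixing weight; the actual method derives weights per head.
    Returns a sequence-level uncertainty (higher = less confident).
    """
    conf = math.exp(token_logprobs[0])  # seed with the first token's probability
    for lp, attn in zip(token_logprobs[1:], prev_token_attn[1:]):
        # Mix the attention-discounted token probability with the carried
        # confidence; weak attention to the predecessor raises uncertainty.
        conf = alpha * attn * math.exp(lp) + (1 - alpha) * conf
    return -math.log(max(conf, 1e-12))
```

Note how the same token probabilities produce higher uncertainty when predecessor attention is low, mirroring the attention-drop signal the method relies on.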

  • Chain-of-Thought (CoT)-Enhanced UQ surfaces the model’s reasoning steps. CoT-UQ weights token- or keyword-level probabilities by importance scores along extracted CoT steps, then aggregates these into a response-level uncertainty (Zhang et al., 24 Feb 2025). This directly addresses the LLM’s tendency toward overconfidence by probing reasoning "weak points." AUROC improvements average +5.9% over standard UQ and are most pronounced on logical and mathematical reasoning tasks.
  • Intra- and Cross-Layer Pattern Analysis quantifies agreement between internal representations across layers. By computing directed layer-to-layer KL divergences of softmaxed hidden representations and flattening the resulting $L\times L$ matrices into uncertainty “signatures,” this method yields robust, transferable UQ that outperforms deep probes, especially under distribution and quantization shifts (Badash et al., 17 Mar 2026).
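
The cross-layer signature idea can be sketched as follows; the pooling of hidden states to one vector per layer and the softmax over hidden dimensions are simplifying assumptions of this illustration:

```python
import numpy as np

def layer_kl_signature(hidden_states, eps=1e-12):
    """Flattened L x L matrix of directed layer-to-layer KL divergences.

    hidden_states: (L, d) array, one pooled hidden vector per layer
                   (the pooling strategy is an assumption of this sketch).
    Each layer vector is softmaxed into a distribution over hidden
    dimensions, then KL(layer_i || layer_j) is taken for every ordered pair.
    """
    z = hidden_states - hidden_states.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)      # (L, d) row-wise distributions
    logp = np.log(p + eps)
    # KL(p_i || p_j) = sum_k p_i[k] * (log p_i[k] - log p_j[k])
    kl = (p[:, None, :] * (logp[:, None, :] - logp[None, :, :])).sum(axis=-1)
    return kl.ravel()                      # length L*L uncertainty "signature"
```

The flattened signature would then feed a lightweight downstream scorer or probe, as in the cited work.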

4. Semantic, Knowledge-Aware, and Geometric UQ

Semantic UQ moves beyond token probabilities, evaluating the diversity and structure of model generations:

  • Cluster and Tensor-Based Semantic Uncertainty. Outputs are clustered via semantic similarity (e.g. cosine similarity in embedding space) and/or entailment, then entropy is computed over clusters. The multi-dimensional MD-UQ further decomposes uncertainty by constructing pairwise semantic and knowledge-aware similarity matrices, stacking these into a tensor, and applying CP/Tucker decomposition. The sum of low-rank reconstruction errors calibrates how “structured” (low uncertainty) or “scattered” the model’s answer space is. Empirical results confirm that fusing knowledge and semantic dimensions systematically outperforms one-dimensional UQ across QA tasks (Chen et al., 24 Feb 2025).
  • Structural Information via Directed Semantic Graphs (SeSE) encodes sampled responses as nodes in a directed, sparsified entailment graph. Hierarchical semantic structural entropy, minimized via optimal encoding trees, reflects the compressibility—and thus uncertainty—of the semantic space (Zhao et al., 20 Nov 2025). SeSE exhibits substantial AUROC and AURAC gains (up to +15%) compared to semantic entropy and kernel-based baselines, and extends to fine-grained claim-level UQ in long-form text.
  • Geometric Dispersion via Convex Hulls. Embedding LLM outputs (via BERT or similar) and quantifying the area/volume of the convex hull (post dimensionality reduction and clustering) provides a continuous, model-agnostic UQ metric. Larger hull areas reflect greater output dispersion and uncertainty, scaling with prompt complexity and temperature (Catak et al., 2024).
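
A self-contained sketch of the convex-hull dispersion metric for response embeddings projected to two dimensions (the embedding model and the dimensionality reduction are assumed to happen upstream):

```python
import numpy as np

def hull_area(points):
    """Area of the 2-D convex hull of projected response embeddings.

    points: (n, 2) array, e.g. response embeddings after reduction to
            two dimensions (embedding and projection are assumptions).
    Larger area = more dispersed generations = higher uncertainty.
    Uses the Andrew monotone-chain hull plus the shoelace formula.
    """
    pts = sorted(map(tuple, points))
    if len(pts) < 3:
        return 0.0
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    def half(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and cross(h[-2], h[-1], p) <= 0:
                h.pop()
            h.append(p)
        return h[:-1]
    hull = half(pts) + half(reversed(pts))   # counter-clockwise hull vertices
    x = np.array([p[0] for p in hull])
    y = np.array([p[1] for p in hull])
    # Shoelace formula for polygon area.
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
```

Comparing hull areas across prompts (at fixed sample count and temperature) gives the continuous, model-agnostic dispersion signal described above.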

5. UQ under Model Adaptation, Ensembles, and Distillation

Parameter-efficient fine-tuning and ensemble-based approaches offer tractable posterior approximations and new axes of uncertainty estimates:

  • LoRA Deep Ensembles. Each ensemble member is a low-rank adapter trained from a shared pretrained base, approximating the Bayesian posterior over model parameters (Balabanov et al., 2024). Ensemble-averaged predictive entropy and mutual information (entropy minus mean ensemble-member entropy) decompose total, aleatoric, and epistemic uncertainty.

Empirical analysis on Mistral-7B, CommonsenseQA, and MMLU shows ensemble-based UQ lowers NLL and ECE relative to single-member models and surfaces retention of prior knowledge under overfitting.

  • Functional-Level UQ via Adapter-augmented MoE (UQ4CT). By hierarchically decomposing functional space via a mixture-of-experts during fine-tuning, epistemic uncertainty is defined as the expected variance between different expert outputs, weighted by gating probabilities. Functional-level calibration is enforced throughout training, yielding >25% ECE reduction across five QA tasks and maintaining robustness to distributional shift (Niu et al., 2024).
  • Distillation of Evidential UQ. Teacher ensembles or Bayesian prompt mixtures are distilled into single-pass student LLMs with softmax or Dirichlet (evidential) output heads. Dirichlet students directly model both the predictive mean and epistemic uncertainty via the concentration of their outputs. Such students preserve or surpass teacher accuracy and ECE while achieving a 17× speed-up at inference (Nemani et al., 24 Jul 2025).
  • Bayesian Prompt as Parameter (Textual Bayes). Prompts are interpreted as textual model parameters $\theta$ with an explicit (potentially language-defined) prior. Bayesian inference is performed using Metropolis–Hastings with LLM-driven proposals. The result is full posterior predictive uncertainty over both textual parameters and responses, applicable directly to black-box LLM APIs (Ross et al., 11 Jun 2025).
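
The entropy-based decomposition used by the ensemble approaches above can be sketched as follows (a minimal illustration for a fixed set of answer classes):

```python
import numpy as np

def ensemble_uncertainty(member_probs):
    """Decompose ensemble uncertainty into total / aleatoric / epistemic.

    member_probs: (K, C) array, each ensemble member's predictive
                  distribution over C answer classes (e.g. MC options).
    Total uncertainty = entropy of the ensemble-mean prediction.
    Aleatoric         = mean of the individual member entropies.
    Epistemic         = mutual information = total - aleatoric.
    """
    eps = 1e-12
    mean_p = member_probs.mean(axis=0)
    total = -(mean_p * np.log(mean_p + eps)).sum()
    aleatoric = -(member_probs * np.log(member_probs + eps)).sum(axis=1).mean()
    return total, aleatoric, total - aleatoric
```

When all members agree, the mutual-information term collapses to zero; strong disagreement between members (e.g. LoRA adapters) surfaces as large epistemic uncertainty even if each member is individually confident.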

6. Perturbation-, Calibration-, and Black-Box Techniques

Perturbation- and aggregation-based frameworks and new calibration paradigms underpin robust UQ, especially when model internals are inaccessible:

  • SPUQ (Sampling with Perturbation for UQ) exposes epistemic uncertainty via prompt and temperature perturbations. Aggregating across diversity of model outputs using inter-sample (similarity-based) or intra-sample methods, SPUQ achieves 30–70% reduction in ECE relative to standard sampling or likelihood UQ across both closed and open API LLMs, while maintaining strong accuracy-calibration alignment (Gao et al., 2024).
  • Inv-Entropy (Song et al., 11 Jun 2025) formalizes UQ via the conditional entropy $H(X \mid Y)$ under a learned inverse probabilistic mapping from output samples $Y$ back to perturbation-induced input distributions $X$. This fully probabilistic, random-walk-based method, combined with genetic-algorithm paraphrasing and engineered for trend monotonicity under temperature shifts, consistently achieves state-of-the-art calibration, PRR, and temperature sensitivity of uncertainty (TSU) across multiple QA benchmarks.
  • Data-Driven Conditional Dependency Learning (TAD). Model overconfidence due to neglected token dependencies is addressed by learning a regression model that predicts the gap between conditional and unconditional token probabilities, using attention features and previous confidence. This modulates token uncertainties in a single decoding pass, improving PRR by up to 0.5 across multiple generative tasks and yielding inexpensive, interpretable UQ (Vazhentsev et al., 2024).
  • Conformal and Calibration-Based UQ. Split conformal prediction using semantic clusters or negative log-likelihood scores provides distribution-free coverage guarantees for set-valued predictions. Empirical benchmarks across 45 model-task pairs confirm that even models with high raw accuracy can exhibit systematic over- or underconfidence, as reflected by predictive set size and calibration metrics (Kaur et al., 2024, Ye et al., 2024).
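
A minimal sketch of split conformal prediction with score-based sets (negative log-likelihood is one nonconformity choice named above; function names are illustrative):

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal: finite-sample quantile of calibration scores.

    cal_scores: nonconformity scores (e.g. negative log-likelihood of the
                reference answer) on a held-out calibration split.
    alpha: target miscoverage; the resulting prediction sets then contain
           the true answer with marginal probability >= 1 - alpha.
    """
    scores = np.sort(np.asarray(cal_scores, dtype=float))
    n = len(scores)
    # Index of the ceil((n+1)(1-alpha))-th smallest score, clipped to range.
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    return scores[k - 1]

def prediction_set(candidate_scores, threshold):
    """Indices of candidate answers whose score falls under the threshold."""
    return [i for i, s in enumerate(candidate_scores) if s <= threshold]
```

Large prediction sets on a given input then flag high uncertainty, and average set size across a benchmark serves as the calibration-aware metric referenced above.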

7. Benchmarks, Open Problems, and Future Directions

Standardized uncertainty metrics include AUROC, AURAC, PRR, ECE, Brier score, and conformal prediction set size and coverage. Calibration, robustness to model and task shift, and ability to disentangle sources of uncertainty (aleatoric vs epistemic, semantic vs knowledge) remain critical evaluation dimensions (Grewal et al., 2024, Huang et al., 25 Feb 2025).
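
As an example of one such metric, expected calibration error (ECE) can be computed with a simple binning scheme (the bin count and the mapping from uncertainty to confidence are conventions, not fixed by the papers cited):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: average |accuracy - confidence| gap, weighted by bin population.

    confidences: per-response confidence in [0, 1]
                 (e.g. 1 minus a normalized uncertainty score).
    correct: whether each response was actually correct (0/1 or bool).
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Gap between empirical accuracy and mean confidence in this bin,
            # weighted by the fraction of responses landing in the bin.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece
```

A perfectly calibrated scorer has ECE near zero; systematic over- or underconfidence of the kind reported in the conformal benchmarks shows up directly as a large weighted gap.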

Table: Representative UQ Frameworks, Methodology, and Domains

| Method | Main Principle | Applicability / Highlights |
| --- | --- | --- |
| Entropy + Missing Mass (Kunitomo-Jacquin et al., 6 Oct 2025) | Sequence entropy corrected for unobserved sequences | General; especially effective with few samples |
| RAUQ (Vazhentsev et al., 26 May 2025) | Recurrent attention-based; no sampling | White-box LLMs, real-time UQ |
| SPUQ (Gao et al., 2024) | Input perturbation + sample aggregation | Black-box or API-only LLMs |
| SeSE (Zhao et al., 20 Nov 2025) | Semantic graph + hierarchical encoding | Superior performance on hallucination detection |
| LoRA Ensemble (Balabanov et al., 2024) | Ensemble posterior over fine-tuned adapters | Parameter-efficient fine-tuning |
| Evidential Distillation (Nemani et al., 24 Jul 2025) | Dirichlet student from ensemble teacher | Single-pass, OOD detection |
| Inv-Entropy (Song et al., 11 Jun 2025) | Inverse conditional entropy via random walks | Theoretically grounded, trend-monotonic UQ |
| TAD (Vazhentsev et al., 2024) | Regression of dependency-modulated token confidence | Fast, compositional UQ in generation |
| MD-UQ (Chen et al., 24 Feb 2025) | Tensor decomposition of semantic + fact similarity | High-stakes, knowledge-rich domains |
| Benchmark (Ye et al., 2024) | Conformal prediction + accuracy/uncertainty trade-offs | Cross-model, cross-task benchmarking |

Current directions include extending semantic/structural UQ to long-form and open-ended generation, integrating structure-aware and attention-based UQ into production LLM APIs, adapting calibration strategies for prompt ensembles and multi-agent systems, and establishing richer benchmarks that decompose uncertainty along epistemic/aleatoric and semantic/knowledge axes.
