The Illusion of Certainty: Uncertainty quantification for LLMs fails under ambiguity (2511.04418v1)

Published 6 Nov 2025 in cs.LG and cs.CL

Abstract: Accurate uncertainty quantification (UQ) in LLMs is critical for trustworthy deployment. While real-world language is inherently ambiguous, reflecting aleatoric uncertainty, existing UQ methods are typically benchmarked against tasks with no ambiguity. In this work, we demonstrate that while current uncertainty estimators perform well under the restrictive assumption of no ambiguity, they degrade to close-to-random performance on ambiguous data. To this end, we introduce MAQA* and AmbigQA*, the first ambiguous question-answering (QA) datasets equipped with ground-truth answer distributions estimated from factual co-occurrence. We find this performance deterioration to be consistent across different estimation paradigms: using the predictive distribution itself, internal representations throughout the model, and an ensemble of models. We show that this phenomenon can be theoretically explained, revealing that predictive-distribution and ensemble-based estimators are fundamentally limited under ambiguity. Overall, our study reveals a key shortcoming of current UQ methods for LLMs and motivates a rethinking of current modeling paradigms.

Summary

The paper demonstrates that current UQ methods excel in zero-aleatoric settings but collapse when ambiguity introduces multiple plausible answers.
It introduces two novel QA datasets, MAQA* and AmbigQA*, to evaluate UQ estimators under realistic, ambiguous conditions using factual co-occurrence statistics.
The findings suggest that both post-hoc UQ approaches and internal representation probes are unreliable in ambiguous scenarios, highlighting the need for uncertainty-aware training.

Uncertainty Quantification in LLMs: Failure Modes Under Ambiguity

Introduction and Motivation

The paper "The Illusion of Certainty: Uncertainty quantification for LLMs fails under ambiguity" (2511.04418) rigorously investigates the reliability of uncertainty quantification (UQ) methods in LLMs when faced with ambiguous question-answering (QA) tasks. The central thesis is that while current UQ estimators—predictive variation, internal representation probes, and ensemble-based methods—perform adequately in settings with zero aleatoric uncertainty, their efficacy collapses in the presence of ambiguity, i.e., when multiple plausible answers exist for a given question. This is a critical issue for deploying LLMs in real-world, high-stakes domains where ambiguity is inherent.

Theoretical Foundations: Aleatoric vs. Epistemic Uncertainty

The paper formalizes total uncertainty (TU) in LLM predictions as the cross-entropy between the true answer distribution $p^*$ and the model's predicted distribution $p$. This decomposes into aleatoric uncertainty (AU), the entropy $H(p^*)$, and epistemic uncertainty (EU), the KL divergence $\mathrm{KL}(p^* \| p)$, so that TU = AU + EU. AU is irreducible and reflects intrinsic data ambiguity, while EU is reducible and reflects model ignorance.
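Written out for a discrete answer set, the decomposition follows by adding and subtracting $\log p^*(y)$ inside the sum. The summation index $y$ is a notational assumption made here; the summary above does not fix a symbol for individual answers.

```latex
\begin{align}
\mathrm{TU}
  &= -\sum_{y} p^*(y)\,\log p(y) \\
  &= -\sum_{y} p^*(y)\,\log p^*(y) \;+\; \sum_{y} p^*(y)\,\log\frac{p^*(y)}{p(y)} \\
  &= \underbrace{H(p^*)}_{\mathrm{AU}} \;+\; \underbrace{\mathrm{KL}(p^* \,\|\, p)}_{\mathrm{EU}}
\end{align}
```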

Figure 1: Theoretical insights on the 3-class simplex; under zero AU, high entropy in $p$ guarantees high EU, but under non-trivial AU, entropy in $p$ is uninformative about EU.

The theoretical analysis demonstrates that in the zero-AU regime, EU reduces to the negative log-likelihood of the correct answer, and predictive entropy or mutual information (MI) from ensembles are reliable proxies for EU. However, when AU is non-zero, the location of $p^*$ in the probability simplex is unconstrained, and no function of $p$ alone can reliably distinguish epistemic from aleatoric uncertainty.
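The decomposition and its zero-AU special case can be checked numerically. The following is a minimal sketch using only the Python standard library; the three-class distributions are made-up illustrative values, not data from the paper.

```python
import math

def entropy(q):
    """Shannon entropy in nats; zero-mass terms contribute nothing."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

def cross_entropy(p_star, p):
    """Cross-entropy of p under p_star (requires p > 0 wherever p_star > 0)."""
    return -sum(a * math.log(b) for a, b in zip(p_star, p) if a > 0)

def kl(p_star, p):
    """KL(p_star || p) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p_star, p) if a > 0)

def decompose(p_star, p):
    """Return (TU, AU, EU) = (cross-entropy, H(p_star), KL(p_star || p))."""
    return cross_entropy(p_star, p), entropy(p_star), kl(p_star, p)

# Hypothetical model prediction over three candidate answers.
p_model = [0.7, 0.2, 0.1]

# Zero-AU case: a single correct answer (index 0), so p_star is one-hot.
tu0, au0, eu0 = decompose([1.0, 0.0, 0.0], p_model)

# Ambiguous case: several answers carry ground-truth mass.
tu1, au1, eu1 = decompose([0.5, 0.3, 0.2], p_model)

print(f"zero-AU:   TU={tu0:.4f} AU={au0:.4f} EU={eu0:.4f}")
print(f"ambiguous: TU={tu1:.4f} AU={au1:.4f} EU={eu1:.4f}")
```

In the one-hot case AU vanishes and EU equals the negative log-likelihood of the correct answer, which is the reduction described above.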

Benchmarking Under Ambiguity: MAQA* and AmbigQA*

To enable principled evaluation of UQ under ambiguity, the authors introduce MAQA* and AmbigQA*, two QA datasets with explicit ground-truth answer distributions $p^*$, estimated via factual co-occurrence statistics in large corpora (primarily English Wikipedia). This frequentist approach is justified by empirical correlations between co-occurrence and LLM output probabilities, and by theoretical arguments that, in the infinite data limit, model predictions should converge to the pretraining distribution.
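The frequentist idea can be sketched as normalizing per-answer co-occurrence counts into a distribution. This is only an illustration: the summary does not describe the authors' retrieval pipeline, so the function name `answer_distribution`, the optional `smoothing` parameter, and the counts below are all hypothetical.

```python
def answer_distribution(cooccurrence_counts, smoothing=0.0):
    """Normalize co-occurrence counts into an estimate of p*.

    cooccurrence_counts: dict mapping each candidate answer to how often it
    co-occurs with the question's subject in a corpus (hypothetical input).
    smoothing: optional additive constant per answer (an assumption made for
    this sketch, not a detail taken from the paper).
    """
    total = sum(c + smoothing for c in cooccurrence_counts.values())
    if total <= 0:
        raise ValueError("counts must contain positive mass")
    return {a: (c + smoothing) / total for a, c in cooccurrence_counts.items()}

# Made-up counts for a question with three plausible answers.
counts = {"answer_a": 60, "answer_b": 30, "answer_c": 10}
p_star = answer_distribution(counts)
print(p_star)
```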

Figure 2: Left: Distribution of ground-truth entropy $H(p^*)$ across MAQA* and AmbigQA*; Right: JS divergence between different proxies for estimating $p^*$, indicating high alignment.

The datasets span a wide range of AU, enabling systematic study of UQ estimators in both unambiguous and ambiguous regimes.

Empirical Results: Collapse of UQ Estimators Under Ambiguity

The paper evaluates three families of UQ estimators:

Predictive Variation: Semantic Entropy (SE), Maximum Sentence Probability (MSP), Shifting Attention to Relevance (SAR), and Iterative Prompting (IP).
Internal Representations: Linear and MLP probes on residual stream activations.
Ensembles: MI computed over predictions from LLaMA3.1 8B, Gemma3 12B, and Qwen2.5 14B.
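For the ensemble family, mutual information is commonly computed as the entropy of the averaged prediction minus the average entropy of the members. The sketch below uses that standard definition and assumes each member's output has already been mapped onto a shared answer set; the member distributions are hypothetical and the paper's exact aggregation may differ.

```python
import math

def entropy(q):
    """Shannon entropy in nats; zero-mass terms contribute nothing."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

def ensemble_mutual_information(member_dists):
    """MI = H(mean of member predictions) - mean of member entropies."""
    n = len(member_dists)
    k = len(member_dists[0])
    mean = [sum(d[i] for d in member_dists) / n for i in range(k)]
    return entropy(mean) - sum(entropy(d) for d in member_dists) / n

# Members agree on a spread-out prediction: disagreement term is zero.
mi_agree = ensemble_mutual_information([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])

# Members are each confident but contradict each other: MI is maximal (log 2).
mi_disagree = ensemble_mutual_information([[1.0, 0.0], [0.0, 1.0]])

print(mi_agree, mi_disagree)
```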

In the zero-AU setting (e.g., TriviaQA), all estimators achieve high concordance ($AUC_c$) scores, reliably ranking samples by EU. However, in MAQA* and AmbigQA*, $AUC_c$ scores for all estimators degrade to near-random (0.5–0.6), indicating a failure to distinguish high and low EU.
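To make the ranking criterion concrete, the following sketch computes a pairwise concordance statistic: the fraction of sample pairs whose estimator scores are ordered the same way as their true EU. This is an assumption about what $AUC_c$ measures, since the summary does not give its exact definition; the function name `concordance` and the tie handling are choices made for this illustration.

```python
from itertools import combinations

def concordance(scores, true_eu):
    """Fraction of comparable pairs ranked in the same order by both inputs.

    Pairs with equal true EU are skipped; tied scores count as half.
    A value of 1.0 is a perfect ranking and 0.5 is chance level.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(scores)), 2):
        if true_eu[i] == true_eu[j]:
            continue  # no ground-truth ordering for this pair
        comparable += 1
        score_diff = scores[i] - scores[j]
        eu_diff = true_eu[i] - true_eu[j]
        if score_diff == 0:
            concordant += 0.5
        elif (score_diff > 0) == (eu_diff > 0):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("need at least one pair with distinct true EU")
    return concordant / comparable

# Toy values only: a perfectly aligned, a reversed, and a constant estimator.
perfect = concordance([0.1, 0.2, 0.3], [1.0, 2.0, 3.0])
reversed_rank = concordance([0.3, 0.2, 0.1], [1.0, 2.0, 3.0])
constant = concordance([0.5, 0.5, 0.5], [1.0, 2.0, 3.0])
print(perfect, reversed_rank, constant)
```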

Figure 3: Relationship between prediction-based estimators and true EU for Gemma 3-12B on MAQA*; correlation vanishes under non-trivial AU, and ROC curves approach random performance.

Theoretical results (Proposition: Non-Identifiability of EU) prove that for any function $f(p)$ of the predictive distribution, there exist ground-truth distributions $p^*_1$ and $p^*_2$ such that EU is zero under one and large under the other. Since $f(p)$ takes the same value in both cases, it is uninformative about EU under ambiguity. Similarly, MI from ensembles is shown to be unreliable: high MI does not imply high EU when AU is non-trivial.
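A small numerical illustration of the non-identifiability argument (not the paper's proof): hold the prediction $p$ fixed and pair it with two different ground-truth distributions. Any score computed from $p$ alone, such as its entropy, is identical in both cases, while the EU differs sharply. All distributions below are made-up values.

```python
import math

def entropy(q):
    """Shannon entropy in nats; zero-mass terms contribute nothing."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

def kl(p_star, p):
    """EU as KL(p_star || p) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p_star, p) if a > 0)

# One fixed model prediction over three answers.
p = [0.5, 0.3, 0.2]

# Case 1: the question really is ambiguous in exactly the way the model says.
p_star_match = [0.5, 0.3, 0.2]
# Case 2: the ground truth concentrates on the answer the model rates lowest.
p_star_far = [0.05, 0.05, 0.9]

score = entropy(p)                  # the same f(p) in both cases
eu_match = kl(p_star_match, p)      # 0: the model is perfectly calibrated
eu_far = kl(p_star_far, p)          # about 1.15 nats: the model is badly off

print(f"f(p)={score:.4f}  EU_match={eu_match:.4f}  EU_far={eu_far:.4f}")
```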

Internal Representation Probes: No Reliable Signal Under Ambiguity

Empirical analysis of linear and MLP probes on model activations reveals that, while deeper layers encode EU in the zero-AU regime, probe performance collapses under ambiguity. This suggests that model internals do not retain additional signal for EU beyond what is present in the predictive distribution.

Figure 4: MLP regression performance across layers; probe ranking capability collapses under non-trivial AU.

Figure 5: MLP classification performance across layers; separation capability collapses under non-trivial AU.

Robustness and Ablations

The findings are robust across different $p^*$ estimation strategies (Wikipedia, RedPajama-V1, The Pile), model sizes, and perturbations of $p^*$ via Dirichlet priors. Notably, instruct models exhibit entropy collapse, outputting near-deterministic answers even when AU is high, further degrading UQ estimator performance.
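One way such a perturbation check could look is sketched below: sample a noisy copy of $p^*$ from a Dirichlet centered on it, then measure how far the copy moved using the Jensen-Shannon divergence. The summary does not state how the authors parameterize the Dirichlet, so the `concentration` parameter, the seed, and the example distribution are assumptions for illustration.

```python
import math
import random

def kl(a, b):
    """KL(a || b) in nats."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

def js_divergence(a, b):
    """Jensen-Shannon divergence in nats; lies between 0 and log 2."""
    m = [0.5 * (x + y) for x, y in zip(a, b)]
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)

def dirichlet_perturb(p_star, concentration, rng):
    """Draw from Dirichlet(concentration * p_star) via normalized gammas.

    Larger concentration keeps the sample closer to p_star.
    Every entry of p_star must be strictly positive.
    """
    draws = [rng.gammavariate(concentration * pi, 1.0) for pi in p_star]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)              # fixed seed for repeatability
p_star = [0.6, 0.3, 0.1]            # hypothetical ground-truth distribution
perturbed = dirichlet_perturb(p_star, concentration=50.0, rng=rng)
shift = js_divergence(p_star, perturbed)
print(perturbed, shift)
```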

Figure 6: Entropy collapse of Instruct models on MAQA* and AmbigQA*.

Figure 7: Comparison of retrieved ground-truth distribution $p^*$ using different strategies; low JS divergence validates consistency.

Implications and Future Directions

The paper's results have significant implications:

Current UQ paradigms are fundamentally unreliable under ambiguity. This is both empirically and theoretically substantiated.
Post-hoc UQ methods are insufficient. Reliable EU estimation in the presence of AU requires models to be explicitly trained to encode uncertainty, potentially via higher-order or evidential approaches.
Benchmarking must account for ambiguity. The release of MAQA* and AmbigQA* enables rigorous evaluation of future UQ methods in realistic settings.

The authors suggest that future work should focus on training LLMs to model joint distributions over answers, or to learn second-order uncertainty representations, as in evidential deep learning or higher-order calibration frameworks.

Conclusion

This paper provides a comprehensive theoretical and empirical analysis of the failure modes of uncertainty quantification in LLMs under ambiguity. The introduction of new benchmarks and the demonstration of estimator collapse highlight a critical gap in current methodologies. The work motivates a paradigm shift toward uncertainty-aware training and evaluation, with direct implications for the safe and trustworthy deployment of LLMs in ambiguous, real-world tasks.