Why Uncertainty Estimation Methods Fall Short in RAG: An Axiomatic Analysis

Published 12 May 2025 in cs.IR | (2505.07459v2)

Abstract: LLMs are valued for their strong performance across various tasks, but they also produce inaccurate or misleading outputs. Uncertainty Estimation (UE) quantifies the model's confidence and helps users assess response reliability. However, existing UE methods have not been thoroughly examined in scenarios like Retrieval-Augmented Generation (RAG), where the input prompt includes non-parametric knowledge. This paper shows that current UE methods cannot reliably assess correctness in the RAG setting. We further propose an axiomatic framework to identify deficiencies in existing methods and guide the development of improved approaches. Our framework introduces five constraints that an effective UE method should meet after incorporating retrieved documents into the LLM's prompt. Experimental results reveal that no existing UE method fully satisfies all the axioms, explaining their suboptimal performance in RAG. We further introduce a simple yet effective calibration function based on our framework, which not only satisfies more axioms than baseline methods but also improves the correlation between uncertainty estimates and correctness.


Authors (3)

Summary

The paper shows that current uncertainty estimation methods in RAG lower uncertainty even with irrelevant evidence, violating expected behavior.
It introduces five axioms that formalize how retrieved evidence should affect uncertainty, enabling principled calibration.
Empirical calibration using NLI and semantic metrics restores discrimination power, improving AUROC across diverse datasets.

Axiomatic Analysis of Uncertainty Estimation Methods in Retrieval-Augmented Generation

Motivation and Problem Formulation

Research into Uncertainty Estimation (UE) for LLMs is motivated by the need to reliably quantify the confidence of generative outputs. This matters most in applications where correctness varies with the source of knowledge (parametric vs. non-parametric). In Retrieval-Augmented Generation (RAG), the prompt combines the user query with retrieved documents, so the facts available to the model change from query to query. This creates a distinct challenge for UE: the injected context can change not only the output but also what the response is grounded in, so prior UE methods, often tuned for closed-book setups, may become invalid or unreliable.

The paper undertakes a formal investigation of why current UE methods break down in RAG scenarios. The primary claim is that these methods lack principled handling of how retrieved evidence impacts uncertainty, and that their outputs often violate basic intuitive properties when external context is added.

Figure 1: Axes for comparing uncertainty with and without RAG, showing that additional supporting evidence should decrease uncertainty—a foundational principle for evaluating UE behavior.

Taxonomy and Limitations of Existing UE Approaches

The authors study both white-box and black-box UE approaches:

White-box: Predictive Entropy (PE) and Semantic Entropy (SE) rely on token probability distributions and cluster similarity over generations. Length normalization and meaning-awareness (MARS, TokenSAR) refine scaling for output length and token importance.
Black-box: Methods such as sum of Laplacian eigenvalues (EigV), degree matrix (Deg), and embedding eccentricity (ECC) quantify semantic dispersion in output samples.
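As a rough illustration of the two families, the sketch below computes a Monte Carlo predictive entropy (optionally length-normalized) from token log-probabilities, and a degree-matrix dispersion score from a pairwise similarity matrix over sampled answers. The function names, the input layout, and the formula trace(mI - D)/m^2 used for the degree score are assumptions made for illustration; this is not the paper's implementation.

```python
import math


def predictive_entropy(sample_token_logprobs, length_normalize=False):
    """White-box sketch: Monte Carlo estimate of predictive entropy.

    sample_token_logprobs holds one list of token log-probabilities per
    sampled answer (hypothetical input layout).
    """
    sequence_scores = []
    for token_logprobs in sample_token_logprobs:
        total = sum(token_logprobs)  # log p(answer | prompt)
        if length_normalize:
            total /= len(token_logprobs)  # average per-token log-probability
        sequence_scores.append(total)
    # Negative mean log-likelihood over the sampled answers.
    return -sum(sequence_scores) / len(sequence_scores)


def degree_uncertainty(similarity):
    """Black-box sketch: dispersion from an m x m pairwise similarity matrix.

    Assumed form: trace(m*I - D) / m^2, where D is the diagonal degree matrix.
    Returns 0 when all sampled answers are judged identical.
    """
    m = len(similarity)
    degrees = [sum(row) for row in similarity]
    return sum(m - d for d in degrees) / (m * m)


samples = [[math.log(0.5), math.log(0.5)], [math.log(0.25)]]
print(predictive_entropy(samples))
print(predictive_entropy(samples, length_normalize=True))
print(degree_uncertainty([[1.0, 1.0], [1.0, 1.0]]))
```

Both scores grow as the sampled answers become less likely or less similar to each other. Neither looks at whether a retrieved document actually supports the answer.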

A core empirical finding is that existing UE scores systematically decrease when any document is introduced, even an irrelevant one. This contradicts the expectation that irrelevant context should not lower uncertainty and that conflicting context should raise it. The effect is demonstrated in both synthetic and real retriever settings, and across multiple LLM families (Llama2-chat, Mistral-v0.3).

Figure 2: AUROC comparison for Llama2-chat showing a decline in uncertainty discrimination (AUROC) in the RAG setting, except when gold contexts are used.

The central critique is that these UE methods exhibit a bias toward artificially low uncertainty in RAG setups, reflecting their inability to properly reason over the external context's effect on output validity.
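AUROC is the discrimination measure used throughout. A minimal sketch of how it can be computed from uncertainty scores and correctness labels follows; the function name and the toy numbers are made up for illustration. The example shows why uniformly low uncertainty hurts: once the scores collapse, they no longer separate correct from incorrect answers.

```python
def auroc(uncertainties, is_correct):
    """Probability that a random incorrect answer receives higher uncertainty
    than a random correct one (ties count as half)."""
    wrong = [u for u, ok in zip(uncertainties, is_correct) if not ok]
    right = [u for u, ok in zip(uncertainties, is_correct) if ok]
    if not wrong or not right:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for u_wrong in wrong:
        for u_right in right:
            if u_wrong > u_right:
                wins += 1.0
            elif u_wrong == u_right:
                wins += 0.5
    return wins / (len(wrong) * len(right))


labels = [True, False, True, False]
closed_book = [0.2, 0.9, 0.3, 0.8]   # toy scores that separate the classes
rag_collapsed = [0.1, 0.1, 0.1, 0.1]  # toy scores lowered for every sample

print(auroc(closed_book, labels))    # 1.0
print(auroc(rag_collapsed, labels))  # 0.5
```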

Axiomatic Framework: Desiderata for UE in RAG

To characterize ideal UE behavior in RAG, the paper proposes five formal axioms governing the relationship between uncertainty, the LLM's parametric belief, and the entailed/contradicted status of the retrieved document:

Positively Consistent: If RAG confirms the original answer, uncertainty must decrease.
Negatively Consistent: If RAG contradicts the original answer, uncertainty must increase.
Positively Changed: If RAG causes a correction toward the truth, uncertainty must decrease.
Negatively Changed: If RAG causes a change toward error, uncertainty must increase.
Neutrally Consistent: If RAG is irrelevant and output is unchanged, uncertainty must not shift.

These axioms are instantiated via ground-truth reference matching (reference-based) or with entailment/contradiction/independence reasoning (reference-free using NLI models).
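The axioms can be read as a per-sample check on the sign of the uncertainty change. The sketch below is one hypothetical encoding: the names, the tolerance `tol`, and the rule that a changed answer is classified only by how the context relates to the RAG answer are simplifying assumptions, not the paper's exact definitions.

```python
# Expected direction of the uncertainty change for each axiom listed above.
EXPECTED_DIRECTION = {
    "positively_consistent": "decrease",
    "negatively_consistent": "increase",
    "positively_changed": "decrease",
    "negatively_changed": "increase",
    "neutrally_consistent": "equal",
}


def applicable_axiom(answer_unchanged, context_relation):
    """Pick the axiom that applies to one sample, or None if none applies.

    context_relation is "entails", "contradicts", or "neutral" and describes
    how the retrieved context relates to the RAG answer, e.g. as judged by an
    NLI model or by matching against a reference answer.
    """
    if answer_unchanged:
        return {
            "entails": "positively_consistent",
            "contradicts": "negatively_consistent",
            "neutral": "neutrally_consistent",
        }[context_relation]
    # Assumption: a changed answer with unrelated context is left unclassified.
    return {
        "entails": "positively_changed",
        "contradicts": "negatively_changed",
    }.get(context_relation)


def satisfies_axiom(axiom, u_no_rag, u_rag, tol=1e-6):
    """Check whether the change from u_no_rag to u_rag has the expected sign."""
    delta = u_rag - u_no_rag
    direction = EXPECTED_DIRECTION[axiom]
    if direction == "decrease":
        return delta < -tol
    if direction == "increase":
        return delta > tol
    return abs(delta) <= tol


# Same answer, contradicting context, yet uncertainty dropped: a violation.
axiom = applicable_axiom(answer_unchanged=True, context_relation="contradicts")
print(axiom, satisfies_axiom(axiom, u_no_rag=1.2, u_rag=0.7))
```

Averaging `satisfies_axiom` over the samples assigned to each axiom gives a per-axiom satisfaction rate for a UE method.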

Empirical measurements on representative datasets (NQ-open, TriviaQA, PopQA) show that no existing UE method satisfies all five axioms. Failures are systematic on the Negatively Consistent, Negatively Changed, and Neutrally Consistent axioms (axioms 2, 4, and 5 in the order listed above): uncertainty is incorrectly lowered even for contradicted or irrelevant contexts.

Calibration-Based Correction: Axiomatic Enhancement of UE

The theoretical structure is leveraged to construct a calibration function for UE scores, explicitly adjusting them according to axiom satisfaction. The calibrated UE score for RAG responses is defined as:

$\mathcal{U}^{\mathrm{cal}}(M_\theta(q, c), r_2) = (k_4 - \alpha_{\text{ax}}) \cdot \mathcal{U}(M_\theta(q, c), r_2)$

where $k_4$ is a calibration coefficient and $\alpha_{\text{ax}}$ combines factors for response equivalence and entailment/contradiction assessments. Three instantiations of the $\mathcal{R}(c, q, r)$ function, which scores the relation between context $c$ and response $r$ for query $q$, are explored: CTI (KL divergence between token distributions), NLI entailment, and MiniCheck fact-checking.
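A minimal sketch of the calibration step is shown below. The multiplicative form $(k_4 - \alpha_{\text{ax}}) \cdot \mathcal{U}$ comes from the formula above. The way $\alpha_{\text{ax}}$ is assembled here (a weighted sum with hypothetical coefficients `k1`, `k2`, `k3`), the input ranges, and all numeric values are assumptions made for illustration only.

```python
def alpha_ax(equivalence, relation_r1, relation_r2, k1, k2, k3):
    """Assumed linear combination of the axiom-related signals.

    equivalence: score for whether the no-RAG answer r1 and RAG answer r2 agree.
    relation_r1, relation_r2: R(c, q, r) scores for each answer, e.g. from an
    NLI or fact-checking model (positive = supported, negative = contradicted).
    """
    return k1 * equivalence + k2 * relation_r1 + k3 * relation_r2


def calibrated_uncertainty(u_rag, alpha, k4):
    """U_cal = (k4 - alpha_ax) * U, as in the formula above."""
    return (k4 - alpha) * u_rag


# Hypothetical coefficients; the source reports finding them by grid search.
coeffs = dict(k1=0.1, k2=0.2, k3=0.3)

supported = alpha_ax(1.0, 0.9, 0.9, **coeffs)
contradicted = alpha_ax(1.0, -0.9, -0.9, **coeffs)

print(calibrated_uncertainty(0.5, supported, k4=1.0))     # scaled down
print(calibrated_uncertainty(0.5, contradicted, k4=1.0))  # scaled up
```

With these toy values, a supported answer has its uncertainty reduced and a contradicted answer has it increased, which is the direction the consistency axioms require.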

Calibrated UE scores reliably align with axioms and improve global AUROC—often fully recovering or surpassing the discrimination power of closed-book (no doc) baselines.

Figure 3: After calibration, RAG AUROC either matches or exceeds closed-book baselines over TriviaQA, demonstrating successful correction.

Figure 4: AUROC calibration effects for Llama2-chat on NQ-open and PopQA datasets, showing significant improvements in uncertainty-correctness correlation.

Figure 5: Effect of calibration for Mistral-v0.3: AUROC of RAG settings improves or matches closed-book values after axiom-informed adjustment.

A consistent trend is observed: as the proportion of samples satisfying the axioms increases, so does AUROC, indicating that principled correction is not just theoretically sound but has practical effect on uncertainty discrimination.

Practical and Theoretical Implications

From an operational perspective, the findings reveal substantial risk in deploying uncalibrated or naively applied UE methods in RAG pipelines: systems may appear more confident than is justified, undermining trust and downstream decisions. For any real-world use where evidence integration matters (e.g., medical Q&A, adaptive retrieval, model abstention), UE methods should be checked against the axioms before deployment.

The calibration formula is model-agnostic and dataset-independent, requiring only validated NLI/semantic scoring components, and the hyperparameters are empirically stable across LLM/RAG variants. Resource requirements are moderate, with the main cost incurred during grid search for calibration coefficients ($\sim$250 GPU hours for the full experimental matrix). The framework also opens up the possibility for future direct “axiomatic” UE models, potentially bypassing calibration and directly satisfying desirable properties by architectural or training design.
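The grid search mentioned above can be pictured as follows. This is a deliberately reduced sketch: it collapses $\alpha_{\text{ax}}$ into a single hypothetical `support` score per sample scaled by one coefficient `k`, uses AUROC on a held-out set as the selection criterion, and runs on made-up data. The actual search space and criterion in the paper may differ.

```python
from itertools import product


def auroc(uncertainties, is_correct):
    """Pairwise AUROC: incorrect answers should get higher uncertainty."""
    wrong = [u for u, ok in zip(uncertainties, is_correct) if not ok]
    right = [u for u, ok in zip(uncertainties, is_correct) if ok]
    wins = sum(1.0 if w > r else 0.5 if w == r else 0.0
               for w in wrong for r in right)
    return wins / (len(wrong) * len(right))


def grid_search(u_rag, support, is_correct, k_values, k4_values):
    """Return (best_auroc, k, k4) for U_cal = (k4 - k * support) * U."""
    best = None
    for k, k4 in product(k_values, k4_values):
        calibrated = [(k4 - k * s) * u for u, s in zip(u_rag, support)]
        score = auroc(calibrated, is_correct)
        if best is None or score > best[0]:
            best = (score, k, k4)
    return best


# Toy validation set: raw RAG uncertainty has collapsed to one value.
u_rag = [0.2, 0.2, 0.2, 0.2]
support = [0.9, -0.8, 0.7, -0.6]  # hypothetical context-support scores
is_correct = [True, False, True, False]

print(grid_search(u_rag, support, is_correct,
                  k_values=[0.0, 0.5], k4_values=[1.0]))
```

Here `k = 0.0` leaves the scores untouched (AUROC 0.5), while `k = 0.5` lets the support signal separate correct from incorrect answers.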

Future Directions and Limitations

The authors highlight several open research directions:

Integration of axioms into trainable UE function design, potentially enabling first-principles estimation without post-hoc adjustment.
Extension to multimodal RAG, long-form responses, and adaptive RAG-based selection strategies.
Further expansion of axioms to cover complex semantics (e.g., multiple contradictory contexts, nuanced entailment relations).

Current limitations include reliance on the quality and coverage of NLI/fact-checking modules, and the focus on short-form factual QA tasks.

Conclusion

The paper provides a rigorous axiomatic foundation for evaluating UE approaches in RAG, demonstrating that existing methods fail to meet basic correctness constraints and offering a practical, effective calibration mechanism. As retrieval augmentation becomes central in knowledge-intensive systems, principled uncertainty reasoning will be critical for reliable, trustworthy LLM deployment.
