- The paper introduces a novel utility metric based on answer accuracy and entailment, demonstrating that high passage utility correlates with correct answers.
- It applies a lightweight Siamese BERT model with a contrastive ranking objective to efficiently predict passage utility without expensive LLM fine-tuning.
- Empirical results across multiple QA benchmarks show improved answer calibration, selective answering, and passage re-ranking performance.
Uncertainty Quantification in Retrieval Augmented Question Answering
Introduction and Motivation
The paper tackles the challenge of quantifying uncertainty in Retrieval Augmented Question Answering (RAQA) systems, which enhance QA model capabilities by incorporating externally retrieved textual evidence. Although these systems improve accuracy and reduce hallucinations, the field has lacked a systematic assessment of how useful retrieved passages actually are and how they affect answer correctness and uncertainty. The authors propose a principled framework centered on the utility of individual passages to estimate the uncertainty of the final QA output, facilitating error detection and enabling more reliable QA deployment.
Methodological Framework
Passage Utility Estimation
The core hypothesis is that the utility of each retrieved passage is a direct proxy for the confidence of the QA model: high-utility passages yield correct answers, while low-utility passages increase the likelihood of error. Utility is quantitatively defined by two main criteria:
- Answer Accuracy (a(y)): A binary indicator of correctness, judged by a dedicated LLM-based evaluator.
- Entailment (e(y)): A continuous score reflecting the degree to which the passage supports the generated answer, obtained with an NLI model.
Combined passage utility is calculated as:
$\upsilon_M(x, p) = \frac{a(y) + e(y)}{2}$
and ranges from 0 (useless) to 1 (maximally useful).
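To make the scoring rule concrete, here is a minimal Python sketch of the combined utility, assuming $a(y) \in \{0, 1\}$ and $e(y) \in [0, 1]$ as defined above (the function name and signature are illustrative, not the paper's):

```python
def passage_utility(accuracy: int, entailment: float) -> float:
    """Combine binary answer accuracy a(y) and continuous entailment e(y)
    into a single passage-utility score in [0, 1] by simple averaging."""
    assert accuracy in (0, 1) and 0.0 <= entailment <= 1.0
    return (accuracy + entailment) / 2
```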
A lightweight neural model (Siamese BERT encoder with MLP head) is trained to predict υM for (x,p) pairs using a contrastive pairwise ranking objective, further regularized with a binary cross-entropy term on accuracy. This allows scalable and efficient training without reliance on expensive LLM fine-tuning.
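A compact PyTorch sketch of how such a scorer and objective might look; the shared-encoder layout with [CLS] pooling, the concatenation-based MLP head, the margin value, and the loss weighting are all assumptions rather than the paper's exact specification:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class UtilityScorer(nn.Module):
    """Siamese layout: question and passage share one BERT encoder; their
    [CLS] embeddings are concatenated and scored by a small MLP head."""
    def __init__(self, name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        h = self.encoder.config.hidden_size
        self.head = nn.Sequential(nn.Linear(2 * h, h), nn.ReLU(), nn.Linear(h, 1))

    def embed(self, enc: dict) -> torch.Tensor:
        return self.encoder(**enc).last_hidden_state[:, 0]  # [CLS] vector

    def forward(self, q_enc: dict, p_enc: dict) -> torch.Tensor:
        z = torch.cat([self.embed(q_enc), self.embed(p_enc)], dim=-1)
        return self.head(z).squeeze(-1)  # unnormalised utility score

def utility_loss(s_pos, s_neg, acc_pos, acc_neg, margin=0.1, lam=1.0):
    """Margin-based pairwise ranking on a (higher-utility, lower-utility)
    passage pair, plus BCE tying each score's sigmoid to the binary
    accuracy label (acc_* are float tensors of 0s and 1s)."""
    rank = torch.clamp(margin - (s_pos - s_neg), min=0.0).mean()
    bce = nn.functional.binary_cross_entropy_with_logits
    return rank + lam * (bce(s_pos, acc_pos) + bce(s_neg, acc_neg))
```

In this reading, the ranking term enforces the relative ordering of passage pairs while the BCE term anchors each score to the correctness label, matching the paper's description of a contrastive objective regularized by accuracy supervision.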
Answer Uncertainty Aggregation
Overall answer uncertainty for a set of retrieved passages $R$ is derived from the maximal utility score among them:
$u_M(x, R) = \max_{p \in R} \upsilon_M(x, p)$
The intuition is that a single high-utility passage suffices for confident answering, while uniformly low-utility passages signal high uncertainty.
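As a sketch, the aggregation is a single reduction over per-passage scores; whether the system reports this maximum directly or its complement as "uncertainty" is an implementation detail we treat as an assumption:

```python
def aggregate_utility(utilities: list[float]) -> float:
    """u_M(x, R): the answer is as trustworthy as the best passage found.
    A low return value across the whole set signals high uncertainty."""
    return max(utilities)
```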
Experimental Setup
- Backbone QA Models: Gemma2-9B, Llama-3.1-8B, Mistral-7B-v0.3, and variants.
- Datasets: Natural Questions, TriviaQA, WebQuestions, SQuAD, PopQA, RefuNQ.
- Retriever: Contriever-MSMARCO for external Wikipedia evidence.
- Evaluators: Qwen2-72B-Instruct (LLM) for answer accuracy; an ALBERT-xlarge NLI model for entailment.
- Baselines: Sampling-based uncertainty metrics (perplexity, regular entropy, semantic entropy, p(true)), information-theoretic measures, and passage re-ranking.
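For reference, the simplest of these baselines fits in a few lines; this is the textbook perplexity of a generated answer computed from its token log-probabilities, not necessarily the paper's exact variant:

```python
import math

def sequence_perplexity(token_logprobs: list[float]) -> float:
    """Perplexity of a generated answer: exp of the mean negative
    token log-probability. Higher values suggest lower model confidence."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)
```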
Empirical Results
Uncertainty Estimation
Across six diverse QA benchmarks, Passage Utility consistently matches or outperforms all comparison methods on AUROC for incorrect-answer detection, particularly on adversarial or rare-entity questions (RefuNQ) and on questions requiring complex reasoning. Notably, it surpasses p(true) and sampling-based entropy metrics in both detection quality and computational efficiency.
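The evaluation protocol behind these AUROC numbers can be sketched as follows, assuming per-question utility scores and binary correctness labels (scikit-learn's `roc_auc_score` is used for illustration):

```python
from sklearn.metrics import roc_auc_score

def error_detection_auroc(utilities: list[float], correct: list[int]) -> float:
    """AUROC for spotting incorrect answers: 'incorrect' is the positive
    class, and lower utility should translate to a higher error score."""
    errors = [1 - c for c in correct]   # 1 = the model answered wrongly
    scores = [-u for u in utilities]    # invert: low utility -> high alarm
    return roc_auc_score(errors, scores)
```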
Selective Answering
When QA models are permitted to refuse answering high-uncertainty questions, selective answering using Passage Utility yields higher average accuracy than information-theoretic or LLM-based uncertainty estimates. This demonstrates practical benefits for system calibration and abstention policy design.
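A threshold-based abstention policy of this kind might look like the following sketch; the threshold itself would presumably be tuned on held-out data (our assumption, not a detail from the paper):

```python
def selective_qa(utilities: list[float], correct: list[int], threshold: float):
    """Abstain whenever aggregated utility falls below the threshold;
    report accuracy on the answered subset plus coverage (answer rate)."""
    answered = [c for u, c in zip(utilities, correct) if u >= threshold]
    if not answered:
        return 0.0, 0.0  # abstained on every question
    accuracy = sum(answered) / len(answered)
    coverage = len(answered) / len(utilities)
    return accuracy, coverage
```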
Passage Re-ranking
Ranking retrieved passages by their predicted utility improves QA accuracy compared to default retriever ordering, especially when context sizes are constrained (top-k passages). This suggests that the utility estimation model robustly identifies helpful evidence and can directly augment retrieval pipelines.
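In pipeline terms, re-ranking reduces to sorting by the learned score before truncating the context, as in this sketch (`scorer` stands in for the trained utility model):

```python
from typing import Callable

def rerank_top_k(question: str, passages: list[str],
                 scorer: Callable[[str, str], float], k: int = 5) -> list[str]:
    """Reorder retrieved passages by predicted utility and keep the top-k
    that fit the QA model's context budget."""
    ranked = sorted(passages, key=lambda p: scorer(question, p), reverse=True)
    return ranked[:k]
```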
Model Size Robustness
The framework maintains superior performance across model scales (2B, 9B, 27B parameters), underlining generality and applicability for both compact and large LLM QA engines.
Ablation Studies and Training Objective
Both the pairwise contrastive ranking over passage pairs and the answer-accuracy supervision are crucial: removing either consistently degrades performance, indicating that utility prediction must integrate both direct correctness and entailment signals.
Generalization
Passage Utility estimators trained on SQuAD and transferred zero-shot to other datasets remain competitive with sampling-based baselines such as p(true), suggesting that utility estimation generalizes across domains to a reasonable extent.
Computational Cost
Unlike sampling-heavy methods, the proposed utility estimator requires only a single BERT-sized forward pass per passage, drastically reducing overhead for production deployments.
Theoretical and Practical Implications
The paper advances uncertainty quantification in retrieval augmented QA beyond model-centric sampling and self-estimation. By grounding uncertainty in passage utility—which directly reflects the model's ability to map evidence to questions—the framework enables scalable, interpretable, and fine-grained error prediction. This is particularly relevant when QA models encounter noisy, incomplete, or out-of-domain evidence, or unanswerable/ambiguous questions; in such cases, the system can withhold an answer or signal low confidence.
Further, passage utility scoring naturally extends to retriever feedback, enabling iterative improvement in evidence selection and, as future work, potential incorporation into end-to-end retriever-QA co-training.
Future Directions
Potential research avenues include:
- Extension to long-form generation and multi-hop reasoning tasks, possibly by aggregating utility scores through more sophisticated schemes.
- Integration of aleatoric uncertainty estimation from ambiguous questions, possibly fusing with auxiliary tasks such as note generation or passage summarization.
- Deployment in active learning and online QA moderation, where human feedback on passage utility and answer support can rapidly improve system reliability.
Conclusion
Uncertainty Quantification via Passage Utility in RAQA offers a resource-efficient, empirically validated approach delivering reliable error prediction, improved answer calibration, and retriever augmentation. Its compatibility with modern QA architectures and robustness across data and model scales make it highly attractive for real-world QA deployment, especially under tight latency and reliability constraints.