- The paper introduces a novel utility metric based on answer accuracy and entailment, demonstrating that high passage utility correlates with correct answers.
- It applies a lightweight Siamese BERT model with a contrastive ranking objective to efficiently predict passage utility without expensive LLM fine-tuning.
- Empirical results across multiple QA benchmarks show improved answer calibration, selective answering, and passage re-ranking performance.
Uncertainty Quantification in Retrieval Augmented Question Answering
Introduction and Motivation
The paper tackles the challenge of quantifying uncertainty in Retrieval Augmented Question Answering (RAQA) systems, which enhance QA model capabilities by incorporating externally retrieved textual evidence. Although these systems improve accuracy and reduce hallucinations, the field has lacked a systematic assessment of how useful retrieved passages actually are and how they affect answer correctness and uncertainty. The authors propose a principled framework centered on the utility of individual passages to estimate the uncertainty of the final QA output, facilitating error detection and enabling more reliable QA deployment.
Methodological Framework
Passage Utility Estimation
The core hypothesis is that the utility of each retrieved passage is a direct proxy for the confidence of the QA model: high-utility passages yield correct answers, while low-utility passages increase the likelihood of error. Utility is quantitatively defined by two main criteria:
- Answer Accuracy (a(y)): A binary indicator of correctness, judged by a dedicated LLM-based evaluator.
- Entailment (e(y)): A continuous score reflecting the degree to which the passage supports the generated answer, obtained with an NLI model.
Combined passage utility is calculated as:
$\upsilon_M(x, p) = \frac{a(y) + e(y)}{2}$
and ranges from 0 (useless) to 1 (maximally useful).
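To make the scoring rule concrete, here is a minimal Python sketch of the combined utility, assuming $a(y) \in \{0, 1\}$ and $e(y) \in [0, 1]$ as defined above (the function name and signature are illustrative, not the paper's):

```python
def passage_utility(accuracy: int, entailment: float) -> float:
    """Combine binary answer accuracy a(y) and continuous entailment e(y)
    into a single passage-utility score in [0, 1] by simple averaging."""
    assert accuracy in (0, 1) and 0.0 <= entailment <= 1.0
    return (accuracy + entailment) / 2
```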
A lightweight neural model (Siamese BERT encoder with MLP head) is trained to predict υM for (x,p) pairs using a contrastive pairwise ranking objective, further regularized with a binary cross-entropy term on accuracy. This allows scalable and efficient training without reliance on expensive LLM fine-tuning.
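A compact PyTorch sketch of how such a scorer and objective might look; the shared-encoder layout with [CLS] pooling, the concatenation-based MLP head, the margin value, and the loss weighting are all assumptions rather than the paper's exact specification:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class UtilityScorer(nn.Module):
    """Siamese layout: question and passage share one BERT encoder; their
    [CLS] embeddings are concatenated and scored by a small MLP head."""
    def __init__(self, name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        h = self.encoder.config.hidden_size
        self.head = nn.Sequential(nn.Linear(2 * h, h), nn.ReLU(), nn.Linear(h, 1))

    def embed(self, enc: dict) -> torch.Tensor:
        return self.encoder(**enc).last_hidden_state[:, 0]  # [CLS] vector

    def forward(self, q_enc: dict, p_enc: dict) -> torch.Tensor:
        z = torch.cat([self.embed(q_enc), self.embed(p_enc)], dim=-1)
        return self.head(z).squeeze(-1)  # unnormalised utility score

def utility_loss(s_pos, s_neg, acc_pos, acc_neg, margin=0.1, lam=1.0):
    """Margin-based pairwise ranking on a (higher-utility, lower-utility)
    passage pair, plus BCE tying each score's sigmoid to the binary
    accuracy label (acc_* are float tensors of 0s and 1s)."""
    rank = torch.clamp(margin - (s_pos - s_neg), min=0.0).mean()
    bce = nn.functional.binary_cross_entropy_with_logits
    return rank + lam * (bce(s_pos, acc_pos) + bce(s_neg, acc_neg))
```

In this reading, the ranking term enforces the relative ordering of passage pairs while the BCE term anchors each score to the correctness label, matching the paper's description of a contrastive objective regularized by accuracy supervision.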
Answer Uncertainty Aggregation
Overall answer uncertainty for a set of retrieved passages $R$ is derived from the maximal utility score among them:
$u_M(x, R) = \max_{p \in R} \upsilon_M(x, p)$
The intuition is that a single high-utility passage suffices for confident answering, while uniformly low-utility passages signal high uncertainty.
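As a sketch, the aggregation is a single reduction over per-passage scores; whether the system reports this maximum directly or its complement as "uncertainty" is an implementation detail we treat as an assumption:

```python
def aggregate_utility(utilities: list[float]) -> float:
    """u_M(x, R): the answer is as trustworthy as the best passage found.
    A low return value across the whole set signals high uncertainty."""
    return max(utilities)
```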
Experimental Setup
- Backbone QA Models: Gemma2-9B, Llama-3.1-8B, Mistral-7B-v0.3, and variants.
- Datasets: Natural Questions, TriviaQA, WebQuestions, SQuAD, PopQA, RefuNQ.
- Retriever: Contriever-MSMARCO for external Wikipedia evidence.
- Evaluators: Qwen2-72B-Instruct (LLM) for answer accuracy; an ALBERT-xlarge NLI model for entailment.
- Baselines: Sampling-based uncertainty metrics (perplexity, regular entropy, semantic entropy, p(true)), information-theoretic measures, and passage re-ranking.
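For reference, the simplest of these baselines fits in a few lines; this is the textbook perplexity of a generated answer computed from its token log-probabilities, not necessarily the paper's exact variant:

```python
import math

def sequence_perplexity(token_logprobs: list[float]) -> float:
    """Perplexity of a generated answer: exp of the mean negative
    token log-probability. Higher values suggest lower model confidence."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)
```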
Empirical Results
Uncertainty Estimation
Across six diverse QA benchmarks, Passage Utility consistently matches or outperforms all comparison methods on AUROC for incorrect-answer detection, particularly on adversarial or rare-entity questions (RefuNQ) and on questions requiring complex reasoning. Notably, it surpasses p(true) and sampling-based entropy metrics in both detection quality and computational efficiency.
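The evaluation protocol behind these AUROC numbers can be sketched as follows, assuming per-question utility scores and binary correctness labels (scikit-learn's `roc_auc_score` is used for illustration):

```python
from sklearn.metrics import roc_auc_score

def error_detection_auroc(utilities: list[float], correct: list[int]) -> float:
    """AUROC for spotting incorrect answers: 'incorrect' is the positive
    class, and lower utility should translate to a higher error score."""
    errors = [1 - c for c in correct]   # 1 = the model answered wrongly
    scores = [-u for u in utilities]    # invert: low utility -> high alarm
    return roc_auc_score(errors, scores)
```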
Selective Answering
When QA models are permitted to refuse answering high-uncertainty questions, selective answering using Passage Utility yields higher average accuracy than information-theoretic or LLM-based uncertainty estimates. This demonstrates practical benefits for system calibration and abstention policy design.
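A threshold-based abstention policy of this kind might look like the following sketch; the threshold itself would presumably be tuned on held-out data (our assumption, not a detail from the paper):

```python
def selective_qa(utilities: list[float], correct: list[int], threshold: float):
    """Abstain whenever aggregated utility falls below the threshold;
    report accuracy on the answered subset plus coverage (answer rate)."""
    answered = [c for u, c in zip(utilities, correct) if u >= threshold]
    if not answered:
        return 0.0, 0.0  # abstained on every question
    accuracy = sum(answered) / len(answered)
    coverage = len(answered) / len(utilities)
    return accuracy, coverage
```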
Passage Re-ranking
Ranking retrieved passages by their predicted utility improves QA accuracy compared to default retriever ordering, especially when context sizes are constrained (top-k passages). This suggests that the utility estimation model robustly identifies helpful evidence and can directly augment retrieval pipelines.
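In pipeline terms, re-ranking reduces to sorting by the learned score before truncating the context, as in this sketch (`scorer` stands in for the trained utility model):

```python
from typing import Callable

def rerank_top_k(question: str, passages: list[str],
                 scorer: Callable[[str, str], float], k: int = 5) -> list[str]:
    """Reorder retrieved passages by predicted utility and keep the top-k
    that fit the QA model's context budget."""
    ranked = sorted(passages, key=lambda p: scorer(question, p), reverse=True)
    return ranked[:k]
```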
Model Size Robustness
The framework maintains superior performance across model scales (2B, 9B, 27B parameters), underlining generality and applicability for both compact and large LLM QA engines.
Ablation Studies and Training Objective
Both the pairwise contrastive ranking over passage pairs and the answer-accuracy supervision are crucial: removing either consistently degrades performance, indicating that utility prediction must integrate both direct correctness and entailment signals.
Generalization
Passage Utility estimators trained on SQuAD and transferred zero-shot to other datasets remain competitive with sampling-based baselines such as p(true), suggesting that utility estimation generalizes across domains to a reasonable extent.
Computational Cost
Unlike sampling-heavy methods, the proposed utility estimator requires only a single BERT-sized forward pass per passage, drastically reducing overhead for production deployments.
Theoretical and Practical Implications
The paper advances uncertainty quantification in retrieval augmented QA beyond model-centric sampling and self-estimation. By grounding uncertainty in passage utility—which directly reflects the model's ability to map evidence to questions—the framework enables scalable, interpretable, and fine-grained error prediction. This is particularly relevant when QA models encounter noisy, incomplete, or out-of-domain evidence, or unanswerable/ambiguous questions; in such cases, the system can withhold an answer or signal low confidence.
Further, passage utility scoring naturally extends to retriever feedback, enabling iterative improvement in evidence selection and, as future work, potential incorporation into end-to-end retriever-QA co-training.
Future Directions
Potential research avenues include:
- Extension to long-form generation and multi-hop reasoning tasks, possibly by aggregating utility scores through more sophisticated schemes.
- Integration of aleatoric uncertainty estimation from ambiguous questions, possibly fusing with auxiliary tasks such as note generation or passage summarization.
- Deployment in active learning and online QA moderation, where human feedback on passage utility and answer support can rapidly improve system reliability.
Conclusion
Uncertainty Quantification via Passage Utility in RAQA offers a resource-efficient, empirically validated approach delivering reliable error prediction, improved answer calibration, and retriever augmentation. Its compatibility with modern QA architectures and robustness across data and model scales make it highly attractive for real-world QA deployment, especially under tight latency and reliability constraints.