
Uncertainty-aware Language Modeling for Selective Question Answering (2311.15451v1)

Published 26 Nov 2023 in cs.CL and cs.LG

Abstract: We present an automatic LLM conversion approach that produces uncertainty-aware LLMs capable of estimating uncertainty with every prediction. Our approach is model- and data-agnostic, is computationally-efficient, and does not rely on external models or systems. We evaluate converted models on the selective question answering setting -- to answer as many questions as possible while maintaining a given accuracy, forgoing providing predictions when necessary. As part of our results, we test BERT and Llama 2 model variants on the SQuAD extractive QA task and the TruthfulQA generative QA task. We show that using the uncertainty estimates provided by our approach to selectively answer questions leads to significantly higher accuracy over directly using model probabilities.

Authors (12)
  1. Qi Yang
  2. Shreya Ravikumar
  3. Fynn Schmitt-Ulms
  4. Satvik Lolla
  5. Ege Demir
  6. Iaroslav Elistratov
  7. Alex Lavaee
  8. Sadhana Lolla
  9. Elaheh Ahmadi
  10. Daniela Rus
  11. Alexander Amini
  12. Alejandro Perez

Summary

The paper introduces a framework that makes LLMs for question answering (QA) uncertainty-aware by attaching a measure of uncertainty to every prediction. Developed by Themis AI Inc, the framework is model- and data-agnostic: it can be applied to a variety of models and datasets without being tied to their specific architectures or the nature of the data.

Question answering is a critical task for many LLM applications, where the goal is not just to generate any answer but to provide accurate and reliable responses. Traditional LLMs can struggle with this, often failing to gauge their confidence appropriately, which can lead to incorrect or misleading answers. The paper attributes these failures to several factors, including out-of-domain data, prompt ambiguity, inconsistent training information, and hallucinations (incorrectly synthesized information).

The researchers present a technique that improves the capability of LLMs in selective QA tasks, which require the model to maintain a high level of accuracy while answering as many questions as possible. Rather than attempting to respond to every query, an LLM with selective prediction can abstain from answering when its confidence is low, thus improving the overall output reliability.
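
To make the abstention mechanism concrete, here is a minimal sketch of the selective prediction decision rule: answer when the uncertainty score attached to a prediction falls below a threshold, abstain otherwise. The callable `predict_with_uncertainty` and the threshold value are hypothetical stand-ins for any uncertainty-aware model, not code from the paper.

```python
def selective_answer(question, predict_with_uncertainty, tau=0.2):
    """Answer only when the model is sufficiently certain.

    `predict_with_uncertainty` is a hypothetical callable returning an
    (answer, uncertainty) pair; `tau` is an illustrative threshold.
    """
    answer, uncertainty = predict_with_uncertainty(question)
    if uncertainty < tau:
        return answer   # confident enough to respond
    return None         # abstain rather than risk a wrong answer
```

Lowering `tau` trades coverage (the fraction of questions answered) for accuracy on the questions that are answered.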

The key to this approach is converting existing LLMs into uncertainty-aware variants that can detect different types of uncertainty associated with their predictions. The paper considers two main kinds: aleatoric uncertainty, which stems from inherent noise in the data, and epistemic uncertainty, which reflects the limits of the model's knowledge, essentially what the model does not know.
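
As one concrete way to surface epistemic uncertainty, the sketch below uses Monte Carlo dropout, running several stochastic forward passes and measuring their disagreement. This is a standard technique offered for intuition and is an assumption on our part; the paper's own conversion procedure may differ.

```python
import torch

def mc_dropout_predict(model, inputs, n_samples=20):
    """Estimate epistemic uncertainty with Monte Carlo dropout.

    Assumes `model` is a torch.nn.Module containing dropout layers and
    no batch norm (train() would otherwise also alter batch-norm stats).
    Illustrative stand-in for the paper's conversion, not its code.
    """
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(inputs), dim=-1) for _ in range(n_samples)]
        )
    mean_probs = probs.mean(dim=0)  # averaged predictive distribution
    epistemic = probs.var(dim=0)    # disagreement across samples
    return mean_probs, epistemic
```

High variance across the sampled passes flags inputs the model has not learned well, which is exactly the situation in which a selective QA system should abstain.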

In an empirical evaluation featuring both extractive and generative QA models, the framework demonstrated improved accuracy across a range of confidence levels. In particular, the results show that conventional measures such as softmax probabilities are unreliable confidence indicators compared to the uncertainty estimates: high softmax probabilities often correlated with lower accuracies, whereas the uncertainty-aware models achieved more consistent results.
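
This comparison can be reproduced in sketch form as an accuracy-coverage sweep: rank questions by a confidence signal, answer only the most confident fraction at each coverage level, and record the resulting accuracy. The helper below is an illustrative evaluation utility, not the authors' code.

```python
import numpy as np

def accuracy_coverage_curve(scores, correct):
    """Compute accuracy at every coverage level when answering the
    highest-`scores` examples first. `correct` marks whether each
    prediction was right. To rank by an uncertainty score, pass its
    negation so that lower uncertainty means higher confidence.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(correct, dtype=float)[order]
    n = len(hits)
    coverage = np.arange(1, n + 1) / n
    accuracy = np.cumsum(hits) / np.arange(1, n + 1)
    return coverage, accuracy
```

Under the paper's findings, a curve built from the uncertainty estimates should retain higher accuracy at low coverage than one built from raw softmax probabilities.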

Moreover, the paper describes an algorithmic method that automatically converts LLMs so that they compute these uncertainty metrics efficiently, without adding significant computational overhead or requiring external models or systems. This is particularly valuable for developers who want to enhance existing models without extensive restructuring or additional resources.
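
For intuition about what such a post-hoc conversion can look like, the sketch below re-enables only the dropout modules of an already trained PyTorch model at inference time, so it can be sampled as in the Monte Carlo dropout sketch above without retraining or auxiliary models. This is a common low-overhead technique and an assumption on our part, not the paper's published algorithm.

```python
import torch.nn as nn

def enable_inference_dropout(model):
    """Switch only the Dropout modules of a pretrained model back to
    training mode, leaving everything else (embeddings, LayerNorm, and
    so on) in eval mode. Illustrative conversion sketch only.
    """
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()
    return model
```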

In conclusion, the paper represents a notable stride in improving the reliability and efficacy of LLMs in QA tasks. By addressing the critical issue of model confidence and introducing an easily integrable solution that quantifies uncertainty, the researchers provide a pathway towards models that can better discern when to answer a question and when to pass, leading to more trustworthy AI-based systems.
