Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations (2404.10960v1)
Abstract: A major barrier to the practical deployment of LLMs is their lack of reliability. Three situations where this is particularly apparent are correctness, hallucinations on unanswerable questions, and safety. In all three cases, models should ideally abstain from responding, much as humans refrain from answering questions they are uncertain about. Inspired by analogous approaches in classification, this study explores the feasibility and efficacy of abstaining when uncertain in the context of LLM question answering. We investigate two kinds of uncertainty: statistical uncertainty metrics and a distinct verbalized measure, termed In-Dialogue Uncertainty (InDU). Using these uncertainty measures with models trained with and without Reinforcement Learning from Human Feedback (RLHF), we show that in all three situations, abstention based on the right kind of uncertainty measure can boost the reliability of LLMs. By sacrificing only a few highly uncertain samples, we can improve correctness by 2% to 8%, avoid 50% of hallucinations by correctly identifying unanswerable questions, and increase safety by 70% to 99%, with almost no additional computational overhead.
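To make the abstention mechanism concrete, below is a minimal Python sketch of threshold-based abstention. It assumes a generic `sample_fn` that wraps the LLM and returns one stochastically decoded answer per call; the entropy score is a generic statistical uncertainty proxy, and the hedge-word count is only an illustrative stand-in for the paper's In-Dialogue Uncertainty, whose exact definition and thresholds are not reproduced here.

```python
import math
from collections import Counter

# Hedge words used as a crude stand-in for the paper's verbalized
# In-Dialogue Uncertainty (InDU); the paper's actual measure may differ.
HEDGE_WORDS = {"maybe", "possibly", "perhaps", "might", "unsure", "not sure"}

def predictive_entropy(answers):
    """Shannon entropy of the empirical answer distribution.

    If repeated sampling yields many different answers to the same
    question, entropy is high and the model is treated as uncertain.
    """
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def verbalized_uncertainty(answer):
    """Count hedge words in a single response (illustrative InDU proxy)."""
    text = answer.lower()
    return sum(text.count(w) for w in HEDGE_WORDS)

def answer_or_abstain(sample_fn, question, n_samples=10,
                      entropy_threshold=1.0, hedge_threshold=1):
    """Sample several answers and abstain if either uncertainty signal is high.

    `sample_fn(question)` is a hypothetical callable wrapping the LLM,
    expected to return one sampled answer string per call.
    """
    answers = [sample_fn(question) for _ in range(n_samples)]
    majority, _ = Counter(answers).most_common(1)[0]
    if (predictive_entropy(answers) > entropy_threshold
            or verbalized_uncertainty(majority) >= hedge_threshold):
        return None  # abstain instead of risking a wrong or unsafe answer
    return majority
```

In practice the thresholds would be tuned on held-out data so that only the most uncertain few percent of samples are sacrificed, matching the correctness, hallucination, and safety trade-offs described in the abstract.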
Authors: Christian Tomani, Kamalika Chaudhuri, Ivan Evtimov, Daniel Cremers, Mark Ibrahim