
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs (2306.13063v2)

Published 22 Jun 2023 in cs.CL

Abstract: Empowering LLMs to accurately express confidence in their answers is essential for trustworthy decision-making. Previous confidence elicitation methods, which primarily rely on white-box access to internal model information or model fine-tuning, have become less suitable for LLMs, especially closed-source commercial APIs. This leads to a growing need to explore the untapped area of black-box approaches for LLM uncertainty estimation. To better break down the problem, we define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency. We then benchmark these methods on two key tasks-confidence calibration and failure prediction-across five types of datasets (e.g., commonsense and arithmetic reasoning) and five widely-used LLMs including GPT-4 and LLaMA 2 Chat. Our analysis uncovers several key insights: 1) LLMs, when verbalizing their confidence, tend to be overconfident, potentially imitating human patterns of expressing confidence. 2) As model capability scales up, both calibration and failure prediction performance improve. 3) Employing our proposed strategies, such as human-inspired prompts, consistency among multiple responses, and better aggregation strategies can help mitigate this overconfidence from various perspectives. 4) Comparisons with white-box methods indicate that while white-box methods perform better, the gap is narrow, e.g., 0.522 to 0.605 in AUROC. Despite these advancements, none of these techniques consistently outperform others, and all investigated methods struggle in challenging tasks, such as those requiring professional knowledge, indicating significant scope for improvement. We believe this study can serve as a strong baseline and provide insights for eliciting confidence in black-box LLMs.

Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

The paper investigates whether LLMs can express the uncertainty in their responses, with the goal of making AI-driven decision-making more trustworthy. Traditional confidence elicitation methods, which rely on white-box access to model internals or on fine-tuning, are increasingly impractical, especially for closed-source commercial APIs such as GPT-4. The paper therefore explores a black-box framework composed of three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for leveraging consistency among those responses. These methods are benchmarked on confidence calibration and failure prediction across five dataset types and five widely used LLMs, including GPT-4 and LLaMA 2 Chat.
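
The sketch below illustrates how these three components might fit together in a black-box setting. It is a minimal sketch, assuming a generic `ask_llm(prompt, temperature)` wrapper around any chat API; the prompt wording, the answer parsing, and the equal weighting of consistency and verbalized confidence are illustrative assumptions, not the paper's exact configuration.

```python
import re
from collections import Counter

def elicit_confidence(ask_llm, question, n_samples=5):
    """Black-box confidence elicitation: verbalized-confidence prompting,
    multi-sample generation, and consistency-based aggregation.
    `ask_llm(prompt, temperature)` is a placeholder for any chat API call."""
    prompt = (
        f"Question: {question}\n"
        "Give your answer on the first line, then state how confident you are "
        "as a percentage between 0% and 100%."
    )
    answers, confidences = [], []
    for _ in range(n_samples):                             # sampling component
        reply = ask_llm(prompt, temperature=0.7)
        answer = (reply.splitlines() or [""])[0].strip()   # crude answer parse
        m = re.search(r"(\d{1,3})\s*%", reply)             # verbalized confidence
        answers.append(answer)
        confidences.append(int(m.group(1)) / 100 if m else 0.5)

    # Aggregation component: combine agreement (consistency) among samples
    # with the average verbalized confidence of the majority answer.
    top_answer, votes = Counter(answers).most_common(1)[0]
    consistency = votes / n_samples
    matched = [c for a, c in zip(answers, confidences) if a == top_answer]
    avg_verbalized = sum(matched) / len(matched)
    return top_answer, 0.5 * consistency + 0.5 * avg_verbalized
```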

Key Findings

  1. Overconfidence in LLMs: When verbalizing their confidence, LLMs tend to be overconfident, a pattern that may imitate how humans express certainty.
  2. Scalability with Model Capability: As model capability scales up, both calibration and failure-prediction performance improve, though they remain far from ideal (see the evaluation sketch after this list).
  3. Mitigating Overconfidence: Utilizing strategies like human-inspired prompts, assessing consistency among multiple generated responses, and advanced aggregation techniques can temper overconfidence and improve confidence calibration in tasks like commonsense and arithmetic reasoning.
  4. Comparison to White-Box Methods: Although white-box methods calibrate confidence more accurately than black-box ones, the AUROC gap between them is narrow (e.g., 0.522 for black-box versus 0.605 for white-box), suggesting that further development of black-box approaches is warranted.
  5. Challenges in Specific Tasks: The paper reveals that all existing techniques face significant challenges when tasked with problems necessitating specialized knowledge, such as professional law or ethics, suggesting ample room for advancement.
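
The two evaluation tasks rest on standard metrics, and a minimal sketch of how they might be computed is shown below: expected calibration error (ECE) for confidence calibration and AUROC for failure prediction. It assumes per-question confidence scores and correctness labels are already available; the equal-width binning is the usual convention, not something specific to the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence and compare mean confidence with
    empirical accuracy in each bin, weighted by the bin's share of samples."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece

def failure_prediction_auroc(confidences, correct):
    """AUROC of using confidence to separate correct (1) from wrong (0) answers."""
    return roc_auc_score(correct, confidences)
```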

Implications and Future Directions

The paper serves as a baseline for subsequent work on black-box confidence elicitation. Practically, the findings can guide developers in building more reliable decision-making into AI applications, especially where access to internal model parameters is limited. Theoretically, the insights suggest pathways for improving confidence elicitation, for instance by blending white-box and black-box methods to balance performance and feasibility.

Looking ahead, future advances might involve hybrid methods that incorporate limited white-box signals (such as output logits) to strengthen black-box estimates, or novel neural architectures that inherently support better confidence estimation. Other promising directions include optimizing the trade-off between computational cost and confidence accuracy in multi-query sampling strategies and refining aggregation techniques to capture semantic relationships among sampled responses.
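
As a rough illustration of such a hybrid, the sketch below blends a black-box verbalized confidence with a length-normalized sequence probability derived from per-token log probabilities, which some APIs expose. The equal blending weight is a placeholder assumption, not a method or result from the paper.

```python
import math

def hybrid_confidence(verbalized_conf, answer_token_logprobs, weight=0.5):
    """Blend a verbalized confidence (0-1) with a length-normalized sequence
    probability (the geometric mean of per-token probabilities).
    `weight` is an illustrative placeholder, not a value from the paper."""
    avg_logprob = sum(answer_token_logprobs) / len(answer_token_logprobs)
    seq_prob = math.exp(avg_logprob)  # geometric mean of token probabilities
    return weight * verbalized_conf + (1 - weight) * seq_prob

# Example: verbalized confidence of 0.9 and token log probabilities from an API.
print(hybrid_confidence(0.9, [-0.05, -0.2, -0.1]))
```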

Overall, this paper illuminates the intricacies of constructing effective confidence elicitation systems in LLMs, furthering the collective understanding necessary to build more robust, trustworthy AI systems.

Authors (7)
  1. Miao Xiong
  2. Zhiyuan Hu
  3. Xinyang Lu
  4. Yifei Li
  5. Jie Fu
  6. Junxian He
  7. Bryan Hooi
Citations (254)