
Calibrating Large Language Models with Sample Consistency (2402.13904v1)

Published 21 Feb 2024 in cs.CL

Abstract: Accurately gauging the confidence level of large language models' (LLMs) predictions is pivotal for their reliable application. However, LLMs are often uncalibrated inherently and elude conventional calibration techniques due to their proprietary nature and massive scale. In this work, we explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency. We perform an extensive evaluation across various open and closed-source models on nine reasoning datasets. Results show that consistency-based calibration methods outperform existing post-hoc approaches. Meanwhile, we find that factors such as intermediate explanations, model scaling, and larger sample sizes enhance calibration, while instruction-tuning makes calibration more difficult. Moreover, confidence scores obtained from consistency have the potential to enhance model performance. Finally, we offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
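The core idea is to sample several generations for the same question and score confidence by how much the extracted final answers agree. Below is a minimal sketch of that idea, assuming the answers have already been sampled and reduced to comparable strings; the function name `consistency_confidence` and the specific metrics (majority agreement, normalized answer-entropy, top-two gap) are illustrative stand-ins, not the paper's exact formulations.

```python
# Sketch: derive confidence from the distribution of sampled final answers.
# Names and metric choices here are hypothetical illustrations of the general
# consistency-based approach, not the paper's implementation.
from collections import Counter
import math


def consistency_confidence(answers: list[str]) -> dict[str, float]:
    """Return consistency-style confidence scores for a list of sampled answers."""
    counts = Counter(answers)
    n = len(answers)
    ranked = counts.most_common()

    # Agreement: fraction of samples matching the most frequent answer.
    agreement = ranked[0][1] / n

    # Normalized entropy of the answer distribution, mapped so that
    # 1.0 means all samples agree and 0.0 means maximal disagreement.
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(n) if n > 1 else 1.0
    entropy_conf = 1.0 - entropy / max_entropy

    # Gap between the top two answers' frequencies (1.0 of the mass if only one answer appears).
    second = ranked[1][1] / n if len(ranked) > 1 else 0.0
    top_gap = agreement - second

    return {"agreement": agreement, "entropy": entropy_conf, "top_gap": top_gap}


# Example: 10 sampled answers to the same math word problem.
samples = ["42", "42", "42", "41", "42", "42", "40", "42", "42", "41"]
print(consistency_confidence(samples))
# {'agreement': 0.7, 'entropy': ~0.65, 'top_gap': 0.5}
```

In this setup the majority answer would be reported as the prediction and one of the consistency scores as its confidence; the paper's evaluation compares such sample-based scores against post-hoc calibration baselines across nine reasoning datasets.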

Authors (9)
  1. Qing Lyu (35 papers)
  2. Kumar Shridhar (25 papers)
  3. Chaitanya Malaviya (24 papers)
  4. Li Zhang (690 papers)
  5. Yanai Elazar (44 papers)
  6. Niket Tandon (40 papers)
  7. Marianna Apidianaki (29 papers)
  8. Mrinmaya Sachan (124 papers)
  9. Chris Callison-Burch (102 papers)
Citations (14)