Fact-and-Reflection (FaR) Improves Confidence Calibration of Large Language Models (2402.17124v2)
Abstract: For an LLM to be trustworthy, its confidence level should be well calibrated with its actual performance. While it is now well understood that LLM performance is greatly affected by prompts, confidence calibration under different prompting strategies has yet to be thoroughly explored. In this paper, we examine how different prompting strategies influence LLM confidence calibration and how it can be improved. We conduct extensive experiments on six prompting methods in the question-answering setting and observe that, while these methods help improve expected LLM calibration, they also cause LLMs to be over-confident on some instances. Inspired by human cognition, we propose Fact-and-Reflection (FaR) prompting, which improves LLM calibration in two steps. First, FaR elicits the known "facts" relevant to the input prompt from the LLM. Then it asks the model to "reflect" over them to generate the final answer. Experiments show that FaR prompting achieves significantly better calibration; it lowers the Expected Calibration Error by 23.5% on our multi-purpose QA tasks. Notably, FaR prompting even elicits the capability of verbally expressing concerns in less confident scenarios, which helps trigger retrieval augmentation for solving these harder instances.
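To make the two-step recipe and the reported metric concrete, below is a minimal sketch in Python. The `complete(prompt)` helper, the prompt wording, and the confidence-output format are illustrative assumptions rather than the paper's exact templates; the ECE helper implements the standard equal-width-bin definition of Expected Calibration Error.

```python
# Minimal sketch of two-step Fact-and-Reflection (FaR) prompting with a
# generic LLM wrapper. The `complete` function and prompt wording are
# illustrative assumptions, not the paper's exact templates.

from typing import Callable, List, Tuple

def far_answer(question: str, complete: Callable[[str], str]) -> str:
    """Elicit relevant facts first, then reflect over them to answer."""
    # Step 1: Fact elicitation -- ask the model for knowledge it already
    # has that is relevant to the question.
    fact_prompt = (
        f"Question: {question}\n"
        "List the facts you know that are relevant to answering this question."
    )
    facts = complete(fact_prompt)

    # Step 2: Reflection -- ask the model to reason over those facts and
    # give a final answer together with a verbalized confidence in [0, 1].
    reflect_prompt = (
        f"Question: {question}\n"
        f"Relevant facts:\n{facts}\n"
        "Reflect on these facts, then give your final answer and a confidence "
        "between 0 and 1 in the form 'Answer: ... Confidence: ...'."
    )
    return complete(reflect_prompt)

def expected_calibration_error(
    preds: List[Tuple[float, bool]], n_bins: int = 10
) -> float:
    """ECE over (confidence, is_correct) pairs: the size-weighted gap
    between average confidence and accuracy within equal-width bins."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / len(preds)) * abs(avg_conf - accuracy)
    return ece
```

In this sketch, the fact-elicitation step grounds the second prompt in the model's own retrieved knowledge, so the reflection step answers (and states its confidence) conditioned on explicit evidence rather than on the question alone.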