Calibrating the Confidence of Large Language Models by Eliciting Fidelity (2404.02655v2)
Abstract: LLMs optimized with techniques such as RLHF have achieved good alignment in being helpful and harmless. However, post-alignment, these LLMs often exhibit overconfidence: the confidence they express does not accurately calibrate with their correctness rate. In this paper, we decompose LLM confidence into the \textit{Uncertainty} about the question and the \textit{Fidelity} to the answer generated by the LLM. We then propose a plug-and-play method to estimate the confidence of LLMs. Experiments with six RLHF-LMs on four MCQA datasets show that our method achieves good calibration performance. Moreover, we propose two novel metrics, IPR and CE, for evaluating model calibration, and we provide a detailed discussion of \textit{Truly Well-Calibrated Confidence}. Our method can serve as a strong baseline, and we hope this work offers some insights into model confidence calibration.
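The abstract only sketches the decomposition at a high level, so the snippet below illustrates one plausible reading of it: estimate \textit{Uncertainty} from the spread of sampled answers and \textit{Fidelity} from how consistently the model sticks to its chosen answer under a follow-up probe. The helper `ask_model`, the probe wording, and the way the two terms are combined are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of an Uncertainty/Fidelity-style confidence estimate for MCQA.
# `ask_model` is a hypothetical callable that returns one sampled option letter
# (e.g. "B") per call; the combination rule in step 3 is an assumption.
import math
from collections import Counter
from typing import Callable, List

def estimate_confidence(
    ask_model: Callable[[str], str],
    question: str,
    options: List[str],        # e.g. ["A", "B", "C", "D"]
    n_samples: int = 20,
) -> dict:
    # 1) Uncertainty about the question: sample answers and measure how spread
    #    out the answer distribution is (normalized entropy in [0, 1]).
    samples = [ask_model(question) for _ in range(n_samples)]
    counts = Counter(samples)
    probs = [c / n_samples for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    uncertainty = entropy / math.log(len(options))  # 0 = certain, 1 = maximally uncertain

    # 2) Fidelity to the generated answer: re-query with the answer shown and
    #    count how often the model keeps it (a simple consistency probe).
    answer = counts.most_common(1)[0][0]
    probe = f"{question}\nYou previously answered {answer}. Answer again."
    fidelity = sum(ask_model(probe) == answer for _ in range(n_samples)) / n_samples

    # 3) One illustrative way to combine the two terms into a confidence score:
    confidence = fidelity * (1.0 - uncertainty)
    return {"answer": answer, "uncertainty": uncertainty,
            "fidelity": fidelity, "confidence": confidence}
```

In this sketch the confidence is deliberately driven down either by disagreement across samples (high uncertainty) or by the model abandoning its own answer under probing (low fidelity); the paper's actual estimator should be consulted for the precise definitions.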
Authors: Mozhi Zhang, Mianqiu Huang, Rundong Shi, Linsen Guo, Chong Peng, Peng Yan, Yaqian Zhou, Xipeng Qiu