Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A (2402.13213v3)
Abstract: We study 15 LLMs fine-tuned for chat and find that their maximum softmax probabilities (MSPs) are consistently miscalibrated on multiple-choice Q&A. However, those MSPs might still encode useful uncertainty information. Specifically, we hypothesized that wrong answers would be associated with smaller MSPs than correct answers. Via rigorous statistical testing, we show that this hypothesis holds for models that perform well on the underlying Q&A task. We also find a strong directional correlation between Q&A accuracy and MSP-based correctness prediction, while finding no correlation between Q&A accuracy and calibration error. This suggests that within the current fine-tuning paradigm, we can expect correctness prediction, but not calibration, to improve as LLM capabilities progress. To demonstrate the utility of correctness prediction, we show that when models have the option to abstain, performance can be improved by selectively abstaining based on the MSP of the initial model response, using only a small amount of labeled data to choose the MSP threshold.
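The following is a minimal sketch (not the paper's exact procedure) of the selective-abstention idea described in the abstract: use a small labeled set to pick an MSP threshold, then abstain on new questions whose initial-response MSP falls below it. The function names, the simple grid over observed MSP values, and the `penalty` scoring rule for wrong answers are illustrative assumptions.

```python
import numpy as np

def choose_msp_threshold(msps, correct, penalty=1.0):
    """Pick an MSP threshold from a small labeled set (illustrative sketch).

    msps:    maximum softmax probabilities of the model's initial answers
    correct: 1 if the corresponding answer was correct, 0 otherwise
    penalty: assumed points lost per wrong answer; abstaining scores 0
    """
    msps = np.asarray(msps, dtype=float)
    correct = np.asarray(correct, dtype=float)
    best_threshold, best_score = 0.0, -np.inf
    # Candidate thresholds: 0 plus every MSP seen in the labeled set.
    for t in np.unique(np.concatenate(([0.0], msps))):
        answered = msps >= t  # answer only when the model is confident enough
        score = np.sum(answered * (correct - penalty * (1.0 - correct)))
        if score > best_score:
            best_threshold, best_score = t, score
    return best_threshold

def answer_or_abstain(msp, threshold):
    """Abstain when the initial response's MSP is below the tuned threshold."""
    return "answer" if msp >= threshold else "abstain"
```

For example, with `msps = [0.9, 0.55, 0.4]`, `correct = [1, 1, 0]`, and `penalty=1.0`, the sketch selects a threshold of 0.55, so the low-confidence wrong answer is dropped while both correct answers are kept.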