Thermometer: Towards Universal Calibration for Large Language Models (2403.08819v2)
Abstract: We consider the problem of calibration in large language models (LLMs). Recent studies have found that common interventions, such as instruction tuning, often result in poorly calibrated LLMs. Although calibration is well explored in traditional applications, calibrating LLMs is uniquely challenging. These challenges stem as much from the severe computational requirements of LLMs as from their versatility, which allows them to be applied to diverse tasks. To address these challenges, we propose THERMOMETER, a calibration approach tailored to LLMs. THERMOMETER learns an auxiliary model from data drawn from multiple tasks and uses it to calibrate an LLM. It is computationally efficient, preserves the accuracy of the LLM, and produces better-calibrated responses on new tasks. Extensive empirical evaluations across diverse benchmarks demonstrate the effectiveness of the proposed method.
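The abstract describes an auxiliary model that calibrates an LLM while preserving its accuracy. One standard way to achieve this is temperature scaling with a learned, input-dependent temperature: an auxiliary model maps features of the input to a positive temperature that rescales the LLM's logits. The sketch below illustrates that general idea only, not the paper's actual method; the linear auxiliary model, the softplus link, and all names (`predict_temperature`, `W`, `b`) are assumptions made for this illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_temperature(features, W, b):
    # Hypothetical auxiliary model: a single linear layer followed by a
    # softplus, so the predicted per-example temperature is always positive.
    return np.log1p(np.exp(features @ W + b)) + 1e-3

def calibrate_logits(logits, features, W, b):
    # Divide each example's logits by its predicted temperature. Because the
    # temperature is positive, this rescales confidence without changing the
    # argmax, so the model's accuracy is preserved.
    T = predict_temperature(features, W, b)
    return softmax(logits / T[:, None])

# Toy example: 4 examples, 5 classes, 8-dimensional input features.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))
features = rng.normal(size=(4, 8))
W = 0.1 * rng.normal(size=(8,))  # auxiliary-model weights (illustrative)
b = 0.0
probs = calibrate_logits(logits, features, W, b)
```

In practice the auxiliary model's parameters would be fit on held-out multi-task data (e.g. by minimizing negative log-likelihood), which is what allows calibration to transfer to new tasks without retraining the LLM itself.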