
Calibrating Large Language Models Using Their Generations Only (2403.05973v1)

Published 9 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: As LLMs are increasingly deployed in user-facing applications, building trust and maintaining safety by accurately quantifying a model's confidence in its prediction becomes even more important. However, finding effective ways to calibrate LLMs - especially when the only interface to the models is their generated text - remains a challenge. We propose APRICOT (auxiliary prediction of confidence targets): A method to set confidence targets and train an additional model that predicts an LLM's confidence based on its textual input and output alone. This approach has several advantages: It is conceptually simple, does not require access to the target model beyond its output, does not interfere with the language generation, and has a multitude of potential usages, for instance by verbalizing the predicted confidence or adjusting the given answer based on the confidence. We show how our approach performs competitively in terms of calibration error for white-box and black-box LLMs on closed-book question-answering to detect incorrect LLM answers.

Calibrating LLMs Through Auxiliary Models Predicting Confidence

Introduction to APRICOT

In the field of language modeling, ensuring that LLMs provide not just fluent responses but reliable and trustworthy ones is paramount, especially as these models find more applications in user-facing services. A significant challenge in this context is the calibration of LLMs: how can one quantify and improve a model's confidence in its own predictions when interaction with the model is limited to its generated text? The paper introduces APRICOT (auxiliary prediction of confidence targets), a method that tackles this problem by training an auxiliary model to predict the confidence of an LLM's answers solely from the textual input and output.

Key Contributions

The paper positions APRICOT as a conceptually simple approach to calibrating LLMs that does not require access to the model beyond its outputs. This is particularly useful given the increasing prevalence of black-box LLMs offered as services, where internal model details and token probabilities are inaccessible. The auxiliary model trained by APRICOT provides information about the LLM's confidence in its answers without interfering with the language generation process, making it applicable across a range of deployment scenarios. The authors empirically show that APRICOT achieves competitive calibration error for both white-box and black-box LLMs on closed-book question answering, and that its predicted confidences are effective for detecting incorrect answers.
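To make the black-box workflow concrete, here is a minimal sketch of how a deployed system might combine the target LLM with such an auxiliary confidence predictor. The `query_llm` and `aux_confidence` callables are hypothetical placeholders standing in for the LLM's text interface and the trained auxiliary model; they are not components released with the paper.

```python
# Illustrative sketch only: the auxiliary model sees nothing but the question
# text and the LLM's generated answer, mirroring the black-box setting.
def answer_with_confidence(question, query_llm, aux_confidence, threshold=0.5):
    """query_llm: callable str -> str; aux_confidence: callable (str, str) -> float in [0, 1]."""
    answer = query_llm(question)                   # target LLM, accessed via text only
    confidence = aux_confidence(question, answer)  # auxiliary model's predicted confidence
    if confidence < threshold:
        # One possible downstream use: verbalize low confidence or defer the answer.
        answer = f"{answer} (low confidence: {confidence:.2f})"
    return answer, confidence
```

Because the auxiliary model runs after generation, it never interferes with decoding and can sit behind any LLM endpoint that exposes only text.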

Methodological Overview

APRICOT obtains calibration targets without requiring information about the LLM's internals or question metadata. It uses only the text given to and produced by the LLM: calibration targets are derived by clustering similar questions based on their embeddings, and the auxiliary model is trained to predict these targets from the question and answer text. Setting confidence targets this way requires no access to the LLM's internal probabilities, which makes the approach practical for the many models that are reachable only through a text-generation interface.
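A minimal sketch of this target-setting step follows, under the assumption that each question's calibration target is the LLM's empirical accuracy within a cluster of semantically similar questions. The embedding model, the HDBSCAN clusterer (available in scikit-learn >= 1.3), and the fallback for noise points are illustrative choices, not necessarily the paper's exact configuration.

```python
# Illustrative sketch: derive calibration targets from question clusters.
import numpy as np
from sentence_transformers import SentenceTransformer  # embedding model is an assumption
from sklearn.cluster import HDBSCAN                     # requires scikit-learn >= 1.3

def clustering_calibration_targets(questions, correctness, min_cluster_size=5):
    """questions: list[str]; correctness: 0/1 flags for whether the LLM answered correctly."""
    embedder = SentenceTransformer("all-mpnet-base-v2")
    embeddings = embedder.encode(questions, normalize_embeddings=True)
    labels = HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(embeddings)

    correctness = np.asarray(correctness, dtype=float)
    targets = np.empty(len(questions))
    for cluster_id in np.unique(labels):
        mask = labels == cluster_id
        if cluster_id == -1:
            # Points HDBSCAN marks as noise fall back to the overall accuracy.
            targets[mask] = correctness.mean()
        else:
            # Target = share of correct answers among semantically similar questions.
            targets[mask] = correctness[mask].mean()
    return targets
```

An auxiliary model (for example, a fine-tuned text encoder) can then be trained to regress these targets from the concatenated question and answer text.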

Experimentation and Results

The experiments conducted to validate APRICOT's approach are thorough in their methodology and analysis. The authors used datasets such as TriviaQA and CoQA for testing, with both white-box (Vicuna v1.5) and black-box (GPT-3.5) LLMs. APRICOT demonstrated competitive performance in terms of calibration error while also significantly outperforming baselines in detecting incorrect model answers across different scenarios and configurations. Notably, APRICOT effectively calibrated LLMs using both fine-grained targets obtained through clustering and a binary approach that focused on answer correctness.
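For readers who want the headline quantities in concrete form, the sketch below shows one common way to compute them: expected calibration error (ECE) over equal-width confidence bins, and AUROC for using low predicted confidence to flag incorrect answers. The 10-bin scheme is a standard convention and an assumption here, not necessarily the paper's exact setup.

```python
# Illustrative evaluation sketch: ECE over binned confidences and AUROC for
# flagging incorrect answers from low predicted confidence.
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(confidences, correctness, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correctness = np.asarray(correctness, dtype=float)
    # Assign each prediction to one of n_bins equal-width confidence bins.
    bin_ids = np.clip((confidences * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between mean confidence and empirical accuracy in this bin,
            # weighted by the bin's share of all samples.
            ece += mask.mean() * abs(confidences[mask].mean() - correctness[mask].mean())
    return ece

def incorrect_answer_auroc(confidences, correctness):
    # Score 1 - confidence so that higher values indicate "likely incorrect".
    return roc_auc_score(1 - np.asarray(correctness), 1 - np.asarray(confidences))
```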

Practical Implications and Future Directions

This work underscores the importance of calibrated LLM confidence for improving user trust and safety in AI applications. APRICOT offers a practical solution to a problem that is otherwise difficult to address when only generated text is available, providing a pathway to more reliable and interpretable systems without invasive access to, or modification of, the underlying models. Looking forward, the techniques presented here could extend to domains beyond text generation, offering a general method for enhancing model reliability.

Conclusion

In summary, APRICOT offers a compelling approach to the calibration of LLMs through an auxiliary model that requires no internal model access. By leveraging textual inputs and outputs for confidence prediction, APRICOT paves the way for more trustworthy and safe applications of LLMs in real-world scenarios. The method's simplicity, effectiveness, and versatility stand to significantly impact the future development and deployment of LLMs across various industries and applications.

Authors (5)
  1. Dennis Ulmer (17 papers)
  2. Martin Gubri (12 papers)
  3. Hwaran Lee (31 papers)
  4. Sangdoo Yun (71 papers)
  5. Seong Joon Oh (60 papers)
Citations (10)