Linguistic Calibration of Long-Form Generations (2404.00474v2)

Published 30 Mar 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: Language models (LMs) may lead their users to make suboptimal downstream decisions when they confidently hallucinate. This issue can be mitigated by having the LM verbally convey the probability that its claims are correct, but existing models cannot produce long-form text with calibrated confidence statements. Through the lens of decision-making, we define linguistic calibration for long-form generations: an LM is linguistically calibrated if its generations enable its users to make calibrated probabilistic predictions. This definition enables a training framework where a supervised finetuning step bootstraps an LM to emit long-form generations with confidence statements such as "I estimate a 30% chance of..." or "I am certain that...", followed by a reinforcement learning step which rewards generations that enable a user to provide calibrated answers to related questions. We linguistically calibrate Llama 2 7B and find in automated and human evaluations of long-form generations that it is significantly more calibrated than strong finetuned factuality baselines with comparable accuracy. These findings generalize under significant domain shifts to scientific and biomedical questions and to an entirely held-out person biography generation task. Our results demonstrate that long-form generations may be calibrated end-to-end by constructing an objective in the space of the predictions that users make in downstream decision-making.

Overview of the Linguistic Calibration of LLMs

The paper "Linguistic Calibration of Long-Form Generations" addresses the pivotal problem of language models (LMs) leading users to suboptimal decisions due to confident hallucinations. This issue arises when an LM presents information with apparent certainty that is, in fact, incorrect. The paper introduces the concept of "linguistic calibration," which targets aligning the confidence expressed in LM outputs with the actual probability of correctness, especially in long-form text that informs downstream decision-making.

Definition and Framework for Linguistic Calibration

The authors propose a formal definition of linguistic calibration centered on enabling users to make probabilistic forecasts, based on LM outputs, that are aligned with the model's true likelihood of being correct. Training is structured so that the LM conveys confidence levels in natural language, using statements such as "I estimate a 30% chance of...", whose stated probabilities match the actual likelihood of correctness.
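Stated loosely (a simplified formalization for illustration, not the paper's exact notation): if a reader who has seen the LM's long-form generation $z$ forms a probabilistic forecast $f(z, q)$ over candidate answers to a related question $q$, the LM is linguistically calibrated when the forecasts it induces are calibrated in the usual sense,

$$
\Pr\bigl(Y = y \;\big|\; f(z, q)_y = p\bigr) = p \quad \text{for every answer } y \text{ and probability } p \in [0, 1],
$$

where $Y$ denotes the correct answer to $q$.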

The training framework designed to achieve this involves a two-step process:

  1. Summary Distillation: A supervised finetuning step that samples many long-form generations and consolidates them into a single summary whose confidence statements reflect how consistently each claim appears across the samples.
  2. Decision-Based Reinforcement Learning (RL): An RL step that rewards generations enabling users (or a simulated reader) to give calibrated answers to related questions in downstream decision tasks. The reward is defined with proper scoring rules from decision theory, so the LM's output is optimized end-to-end for the quality of the forecasts it induces (a minimal reward sketch follows this list).
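To make the RL reward concrete, the sketch below computes a proper-scoring-rule reward (here, the negative Brier score) from a simulated user forecast. This is a minimal illustration under stated assumptions, not the authors' implementation; `simulate_user_forecast` is a hypothetical stand-in for the surrogate reader that turns a generation and a related question into a distribution over candidate answers.

```python
import numpy as np

def brier_reward(user_forecast, correct_index):
    """Reward = negative Brier score of the user's forecast over answer options.

    The Brier score is a strictly proper scoring rule, so the expected reward is
    maximized only when the forecast matches the true answer probabilities --
    this is what pushes the RL step toward calibrated confidence statements.
    """
    forecast = np.asarray(user_forecast, dtype=float)
    target = np.zeros_like(forecast)
    target[correct_index] = 1.0
    return -float(np.sum((forecast - target) ** 2))

# Hypothetical usage (simulate_user_forecast is assumed, not from the paper):
# forecast = simulate_user_forecast(generation, question, options)
# reward = brier_reward(forecast, options.index(true_answer))
```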

Evaluation and Results

The empirical evaluation finetunes Llama 2 7B with this framework and shows significant calibration improvements over strong baselines finetuned for factuality, in both automated and human evaluations, without sacrificing accuracy. The model also transfers zero-shot: it performs well on in-domain and out-of-distribution question-answering datasets, as well as on a held-out person biography generation task.

Notably, the linguistically calibrated model achieves lower (better) forecast expected calibration error (ECE) while maintaining competitive prediction accuracy. This supports the efficacy of the proposed framework in practical settings where LM outputs inform decision-making.
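For readers unfamiliar with the metric, forecast ECE is typically approximated by binning forecast probabilities and comparing average confidence to empirical accuracy within each bin. The sketch below is a generic binned-ECE computation for binary correctness labels, not the paper's evaluation code:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average gap between mean confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        # Include the left edge only for the first bin so every forecast lands in a bin.
        mask = (confidences >= lo if lo == 0.0 else confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += (mask.sum() / len(confidences)) * gap
    return float(ece)

# Example: forecasts of 0.9 that are right only half the time contribute a large gap.
print(expected_calibration_error([0.9, 0.9, 0.2, 0.7], [1, 0, 0, 1]))
```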

Implications and Future Directions

This research has relevant practical implications. By aligning the model's stated confidence levels with its actual correctness probabilities, LMs can foster better trust and reliability in applications ranging from medical and legal decision support systems to everyday queries. The paper opens the possibility of widespread adoption of linguistic calibration in enhancing the interpretability and trustworthiness of LMs, especially for end-users who rely on models for critical information and decisions.

Looking forward, future developments could enhance user-specific calibrations, allowing adjustments tailored to individual user profiles or situational contexts. Moreover, refining the understanding of human interpretations of linguistic confidence could inform more nuanced calibrations, fostering better LM-user interactions.

In summary, this paper advances the field by addressing the interpretability of LMs through linguistic calibration, promoting an integrative approach that aligns LM outputs more closely with reality, and thereby fostering informed and reliable decision-making processes.

Authors (4)
  1. Neil Band (9 papers)
  2. Xuechen Li (35 papers)
  3. Tengyu Ma (117 papers)
  4. Tatsunori Hashimoto (80 papers)
Citations (13)