Uncertainty in Language Models: Assessment through Rank-Calibration (2404.03163v2)

Published 4 Apr 2024 in cs.CL, cs.AI, cs.LG, and stat.ML

Abstract: Language models (LMs) have shown promising performance in natural language generation. However, as LMs often generate incorrect or hallucinated responses, it is crucial to correctly quantify their uncertainty in responding to given inputs. In addition to verbalized confidence elicited via prompting, many uncertainty measures (e.g., semantic entropy and affinity-graph-based measures) have been proposed. However, these measures can differ greatly, and it is unclear how to compare them, partly because they take values over different ranges (e.g., $[0,\infty)$ or $[0,1]$). In this work, we address this issue by developing a novel and practical framework, termed Rank-Calibration, to assess uncertainty and confidence measures for LMs. Our key tenet is that higher uncertainty (or lower confidence) should imply lower generation quality, on average. Rank-calibration quantifies deviations from this ideal relationship in a principled manner, without requiring ad hoc binary thresholding of the correctness score (e.g., ROUGE or METEOR). The broad applicability and the granular interpretability of our methods are demonstrated empirically.

Rank-Calibration: A Framework for Assessing Uncertainty in LLMs

Introduction

Language models (LMs), and large language models (LLMs) in particular, have significantly advanced the field of Natural Language Generation (NLG). Despite their potential, these models often produce incorrect or hallucinated responses, so it is crucial to accurately quantify the uncertainty in their outputs. This work introduces Rank-Calibration, a framework for assessing uncertainty and confidence measures for LMs in NLG tasks. The framework is built on the principle that lower uncertainty (or higher confidence) should, on average, correspond to higher generation quality. Its central metric, the Rank-Calibration Error (RCE), quantifies deviations from this ideal relationship between uncertainty levels and generation quality in a principled way.

Uncertainty Measures for LLMs

Existing uncertainty measures for LMs focus on capturing the dispersion of potential outputs for a given input. Notable among these are semantic entropy, which accounts for linguistic invariances among generated responses, and affinity-graph-based measures that leverage the structural properties of response similarities. The diversity in these measures' output ranges and their conceptual bases necessitates a universal assessment framework that can adapt to their inherent differences.
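To make the flavor of such measures concrete, the sketch below implements a heavily simplified semantic-entropy-style score: sample several responses to the same prompt, group them into semantic-equivalence clusters, and take the entropy of the resulting cluster distribution. The `are_equivalent` helper is a hypothetical stand-in (implementations in the literature typically use a bidirectional-entailment check with an NLI model), so the code is illustrative rather than a faithful reproduction of any published measure.

```python
from math import log

def semantic_entropy(responses, are_equivalent=None):
    """Simplified sketch of a semantic-entropy-style uncertainty measure.

    `responses` are multiple sampled generations for the same prompt.
    `are_equivalent` is a hypothetical pairwise semantic-equivalence check;
    it defaults to exact (case-insensitive) string match for illustration only.
    """
    if are_equivalent is None:
        are_equivalent = lambda a, b: a.strip().lower() == b.strip().lower()

    # Greedily group responses into semantic clusters.
    clusters = []
    for r in responses:
        for cluster in clusters:
            if are_equivalent(r, cluster[0]):
                cluster.append(r)
                break
        else:
            clusters.append([r])

    # Entropy of the empirical distribution over clusters (in nats).
    n = len(responses)
    probs = [len(c) / n for c in clusters]
    return -sum(p * log(p) for p in probs)

# Example: three samples, two of which agree (here, exact match after normalization).
print(semantic_entropy(["Paris", "paris", "Lyon"]))  # ~0.64 nats
```

Note that this score lives in $[0,\infty)$, while verbalized or probability-based confidence scores live in $[0,1]$; this range mismatch is exactly what makes a rank-based assessment framework attractive.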

The Rank-Calibration Framework

The Rank-Calibration framework assesses the quality of uncertainty measures based on the principle that higher-quality generations should, on average, correspond to lower uncertainty levels. This is encapsulated in the Rank-Calibration Error (RCE), which quantifies deviation from the desired monotonic relationship between uncertainty levels and expected generation quality. The framework extends naturally to confidence measures, where the expected relationship is reversed: higher confidence should correspond to higher generation quality.
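As a rough formal sketch (the notation here is assumed for illustration rather than quoted from the paper), let $A$ be the correctness score of a generation, $U$ the uncertainty value assigned to it, and $\mathrm{reg}(u) = \mathbb{E}[A \mid U = u]$ the regression function of correctness on uncertainty. Rank-calibration asks that the relative rank of $\mathrm{reg}(u)$ mirror the reversed relative rank of $u$; the RCE measures the average gap between the two:

```latex
% Sketch only: an approximation of the paper's definition, with U' an independent copy of U.
\[
\text{Rank-calibrated:}\quad
\Pr\big(\mathrm{reg}(U') \le \mathrm{reg}(u)\big) \;=\; \Pr\big(U' \ge u\big)
\quad \text{for all uncertainty levels } u,
\]
\[
\mathrm{RCE} \;=\; \mathbb{E}_{U}\Big[\,\big|\Pr_{U'}\big(\mathrm{reg}(U') \le \mathrm{reg}(U)\big) - \Pr_{U'}\big(U' \ge U\big)\big|\,\Big].
\]
```

A perfectly rank-calibrated measure attains $\mathrm{RCE}=0$; larger values indicate uncertainty levels whose ordering disagrees with the ordering of expected generation quality.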

Empirical RCE and Indication Diagrams

To practically implement the Rank-Calibration framework, the empirical RCE is introduced, utilizing a piecewise constant regression strategy. This involves binning uncertainty values and calculating average correctness within each bin to estimate the ideal monotonic relationship. Additionally, indication diagrams provide visual insights into the performance of uncertainty measures, highlighting regions of over-optimism or pessimism in uncertainty estimations.
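A minimal sketch of this estimator is shown below, assuming a held-out set of (uncertainty, correctness) pairs; the function name `empirical_rce`, the equal-mass binning, and the bin-level rank comparison are illustrative choices rather than the paper's exact implementation.

```python
import numpy as np

def empirical_rce(uncertainty, correctness, n_bins=20):
    """Illustrative empirical Rank-Calibration Error (not the paper's exact estimator).

    Bins examples by uncertainty (equal-mass bins), estimates E[correctness | uncertainty]
    by the per-bin mean correctness (a piecewise-constant regression), and measures how far
    the ranking of those bin means deviates from the *reversed* ranking of bin uncertainties.
    """
    uncertainty = np.asarray(uncertainty, dtype=float)
    correctness = np.asarray(correctness, dtype=float)

    # Equal-mass bins: sort by uncertainty and split into n_bins groups.
    order = np.argsort(uncertainty)
    bins = np.array_split(order, n_bins)

    bin_unc = np.array([uncertainty[b].mean() for b in bins])  # average uncertainty per bin
    bin_acc = np.array([correctness[b].mean() for b in bins])  # average correctness per bin

    # Relative ranks in [0, 1] across bins.
    unc_rank = np.argsort(np.argsort(bin_unc)) / (n_bins - 1)
    acc_rank = np.argsort(np.argsort(bin_acc)) / (n_bins - 1)

    # Ideal rank-calibration: the highest-uncertainty bins have the lowest correctness ranks.
    return float(np.mean(np.abs(acc_rank - (1.0 - unc_rank))))

# Synthetic check: correctness decreases with uncertainty, so the RCE should be small.
rng = np.random.default_rng(0)
u = rng.exponential(size=2000)                      # uncertainty values in [0, inf)
a = 1.0 / (1.0 + u) + 0.05 * rng.normal(size=2000)  # noisy, decreasing correctness
print(round(empirical_rce(u, a), 3))
```

In this sketch, an indication diagram would amount to plotting `bin_acc` against the bins ordered by `bin_unc`, making visible which uncertainty regions are more over-optimistic or pessimistic than rank-calibration would predict.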

Experimental Demonstration

Comprehensive experiments showcase the framework's broad applicability and granular interpretability. Its robustness is further validated across varying LMs, datasets, and correctness measures. Notably, the empirical RCE enables a detailed analysis of uncertainty measures' performance, identifying those that consistently align with the expectation that lower uncertainty correlates with higher generation quality.

Theoretical Insights

The notion of Rank-Calibration extends beyond current calibration concepts in classification tasks, offering a more generalized perspective on measuring uncertainty, especially in NLG tasks. This work demonstrates that good rank-calibration in uncertainty measures can be achieved through post-hoc recalibration, improving alignment with generation quality expectations.
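A minimal sketch of such a post-hoc step is given below, under the assumption that a held-out calibration split of (uncertainty, correctness) pairs is available; the helper `fit_rank_recalibrator` and its binning scheme are illustrative choices, not the paper's exact procedure. The idea is to remap each uncertainty value to one minus the estimated expected correctness of its bin, which makes the remapped measure approximately rank-calibrated on the calibration data even when the original measure is not monotone in generation quality.

```python
import numpy as np

def fit_rank_recalibrator(cal_uncertainty, cal_correctness, n_bins=20):
    """Sketch of a generic post-hoc rank recalibration (assumes >= n_bins calibration points)."""
    cal_u = np.asarray(cal_uncertainty, dtype=float)
    cal_a = np.asarray(cal_correctness, dtype=float)

    # Equal-mass bins of the calibration uncertainties, as in the empirical RCE sketch.
    order = np.argsort(cal_u)
    bins = np.array_split(order, n_bins)
    edges = np.array([cal_u[b].max() for b in bins])     # right edge of each uncertainty bin
    bin_acc = np.array([cal_a[b].mean() for b in bins])  # mean correctness per bin

    def recalibrated(u):
        # Map a new uncertainty value to its bin, then to 1 - estimated correctness:
        # low expected correctness -> high recalibrated uncertainty.
        idx = np.clip(np.searchsorted(edges, np.asarray(u, dtype=float)), 0, n_bins - 1)
        return 1.0 - bin_acc[idx]

    return recalibrated

# Usage: fit on a calibration split, then remap uncertainties before comparison.
rng = np.random.default_rng(2)
u_cal, a_cal = rng.exponential(size=1000), rng.uniform(size=1000)
remap = fit_rank_recalibrator(u_cal, a_cal)
print(remap([0.2, 2.0]))  # recalibrated uncertainties on a common [0, 1] scale
```

Evaluating `empirical_rce` on the remapped values of a held-out test split then indicates how well the improvement generalizes beyond the calibration data.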

Conclusion and Future Directions

The Rank-Calibration framework introduces a novel and effective approach to assessing uncertainty and confidence in LMs. By focusing on the rank-order of uncertainty levels relative to generation quality, this framework provides a more interpretable and adaptable method for evaluating LM outputs. Future research directions include developing inherently rank-calibrated uncertainty measures and integrating rank-calibration into generative pipelines for LMs, aiming to enhance the reliability and usefulness of generated responses in practical applications.

Authors (8)
  1. Xinmeng Huang (23 papers)
  2. Shuo Li (179 papers)
  3. Mengxin Yu (15 papers)
  4. Matteo Sesia (33 papers)
  5. Hamed Hassani (120 papers)
  6. Insup Lee (68 papers)
  7. Osbert Bastani (97 papers)
  8. Edgar Dobriban (75 papers)
Citations (12)