
Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words? (2405.16908v2)

Published 27 May 2024 in cs.CL

Abstract: We posit that LLMs should be capable of expressing their intrinsic uncertainty in natural language. For example, if the LLM is equally likely to output two contradicting answers to the same question, then its generated response should reflect this uncertainty by hedging its answer (e.g., "I'm not sure, but I think..."). We formalize faithful response uncertainty based on the gap between the model's intrinsic confidence in the assertions it makes and the decisiveness by which they are conveyed. This example-level metric reliably indicates whether the model reflects its uncertainty, as it penalizes both excessive and insufficient hedging. We evaluate a variety of aligned LLMs at faithfully communicating uncertainty on several knowledge-intensive question answering tasks. Our results provide strong evidence that modern LLMs are poor at faithfully conveying their uncertainty, and that better alignment is necessary to improve their trustworthiness.

Faithful Expression of Uncertainty in LLMs

The paper investigates the capacity of LLMs to express their intrinsic uncertainty in natural language responses. The authors posit that such expressions, often realized through hedging language, can improve the trustworthiness of LLMs by better aligning the decisiveness of their answers with their underlying confidence. The paper examines how well current LLMs convey this uncertainty and proposes a new metric to quantify the ability.

Problem Statement and Contributions

LLMs are known for their high fluency and persuasiveness, which can sometimes result in confidently delivered but incorrect answers. This issue is particularly problematic in knowledge-intensive question answering (QA) tasks, where users might overly rely on the model’s outputs. The authors argue that one way to mitigate this problem is for LLMs to verbalize their uncertainty directly within their generated responses.

The paper makes the following key contributions:

  • Formalization of Faithful Response Uncertainty: The authors introduce an example-level metric, Faithful Response Uncertainty, to measure the gap between a model’s intrinsic confidence and the decisiveness of its assertions.
  • Implementation of Decisiveness and Confidence Scoring: The paper employs Gemini Ultra to assess decisiveness and confidence, and validates that these measures align with human judgment.
  • Empirical Evaluation: The research evaluates several leading LLMs (including variants from the Gemini family, GPT-3.5, and GPT-4) on knowledge-intensive QA datasets such as Natural Questions and PopQA, assessing their ability to express uncertainty faithfully.

Methodology

Formalization

The metric Faithful Response Uncertainty is defined to measure the gap between the model’s confidence in a generated assertion and the decisiveness with which it is expressed. Decisiveness is derived from potential hedging expressions, and confidence is assessed through consistency across multiple re-sampled responses. The formal definition ensures that models are penalized for both excessive and insufficient hedging.
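A minimal sketch of one way this gap could be written down, assuming the metric averages an absolute decisiveness-confidence difference over the assertions in a response; the notation D, C, and A(y) is introduced here for illustration and is not taken verbatim from the paper:

```latex
% Illustrative formulation only; notation assumed, not the paper's exact definition.
% For a response y, let A(y) be its assertions, D(a) the decisiveness with which
% assertion a is conveyed, and C(a) the model's intrinsic confidence in a.
\[
  \mathrm{Faithfulness}(y) \;=\; 1 \;-\; \frac{1}{|\mathcal{A}(y)|}
  \sum_{a \in \mathcal{A}(y)} \bigl|\, D(a) - C(a) \,\bigr|,
  \qquad D(a),\, C(a) \in [0, 1].
\]
% Both under-hedging (D(a) > C(a)) and over-hedging (D(a) < C(a)) widen the gap
% and lower the score, matching the penalization described above.
```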

Implementation

The research uses:

  • Decisiveness Measurement: Quantified by the probability that an agent will deem an assertion true, based on the generated response.
  • Confidence Measurement: Derived from the consistency of a given assertion with re-sampled answers.

Gemini Ultra serves as the judge model for these assessments, using specific prompts crafted to capture the nuanced nature of decisiveness and confidence.
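As a rough illustration of this pipeline (not the paper's implementation), the sketch below estimates confidence from how often a target assertion is consistent with re-sampled answers and scores decisiveness with a judge model. The helpers `sample_answers`, `judge_consistent`, `judge_decisiveness`, and `extract_assertions` are hypothetical callables standing in for the prompted judge and sampling machinery (Gemini Ultra in the paper):

```python
from statistics import mean

def estimate_confidence(question, assertion, sample_answers, judge_consistent, k=10):
    """Approximate intrinsic confidence as the fraction of re-sampled answers
    that are judged consistent with the target assertion."""
    samples = sample_answers(question, n=k)  # re-sample k answers from the model
    agreements = [judge_consistent(assertion, s) for s in samples]  # 1 if consistent, else 0
    return mean(agreements)

def faithfulness_gap(question, response, extract_assertions,
                     sample_answers, judge_consistent, judge_decisiveness):
    """Average |decisiveness - confidence| over the assertions in a response.
    Lower is better: both over- and under-hedging widen the gap."""
    gaps = []
    for assertion in extract_assertions(response):
        d = judge_decisiveness(response, assertion)  # in [0, 1], based on hedging cues
        c = estimate_confidence(question, assertion, sample_answers, judge_consistent)
        gaps.append(abs(d - c))
    return mean(gaps) if gaps else 0.0
```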

Results & Findings

The evaluation reveals that state-of-the-art LLMs perform poorly in faithfully expressing their uncertainty. Key findings include:

  • Decisive Responses: Most LLMs, when using standard decoding techniques, produced highly decisive answers despite significant intrinsic uncertainty.
  • Inconsistent Hedging: When prompted to express uncertainty, the hedges used by LLMs did not consistently align with their intrinsic uncertainty levels. This misalignment often resulted in both under-hedging (decisive answers despite low confidence) and over-hedging (hedged answers despite high confidence).
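To make the two failure modes concrete under the illustrative formulation from the Methodology sketch (the numbers below are invented for illustration, not results from the paper):

```latex
% Under-hedging: a decisive answer despite low intrinsic confidence.
\[
  D(a) = 0.95,\quad C(a) = 0.30 \;\Rightarrow\; |D(a) - C(a)| = 0.65.
\]
% Over-hedging: a heavily hedged answer despite high intrinsic confidence.
\[
  D(a) = 0.20,\quad C(a) = 0.90 \;\Rightarrow\; |D(a) - C(a)| = 0.70.
\]
% Both cases produce a large decisiveness-confidence gap and are penalized by the metric.
```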

Implications and Future Directions

The findings underscore the necessity for better alignment techniques in LLMs to ensure that the decisiveness of their outputs accurately reflects their internal confidence. This alignment is crucial for enhancing the reliability and trustworthiness of these models, particularly in applications where incorrect or over-confident answers could have significant consequences.

Theoretically, the paper contributes a framework for understanding and evaluating how LLMs express uncertainty in natural language. Practically, it suggests directions for improving model design and training protocols so that models express their uncertainty faithfully.

Conclusion

The paper makes a significant contribution by highlighting a critical shortcoming of current LLMs: their inability to faithfully express uncertainty in their responses. The proposed metric and methodology for evaluating this capability are robust and align well with human judgment. However, the empirical evaluation demonstrates that existing models fall short of this standard, indicating a pressing need for advancement in this area.

Future research could explore new training techniques, model architectures, or alignment algorithms that prioritize and enhance the faithful expression of intrinsic uncertainty in LLMs. This would not only improve the trustworthiness of these models but also expand their applicability in critical and sensitive information domains.

Authors
  1. Gal Yona
  2. Roee Aharoni
  3. Mor Geva