Relying on the Unreliable: The Impact of Language Models' Reluctance to Express Uncertainty (2401.06730v2)

Published 12 Jan 2024 in cs.CL, cs.AI, and cs.HC

Abstract: As natural language becomes the default interface for human-AI interaction, there is a need for LMs to appropriately communicate uncertainties in downstream applications. In this work, we investigate how LMs incorporate confidence in responses via natural language and how downstream users behave in response to LM-articulated uncertainties. We examine publicly deployed models and find that LMs are reluctant to express uncertainties when answering questions even when they produce incorrect responses. LMs can be explicitly prompted to express confidences, but tend to be overconfident, resulting in high error rates (an average of 47%) among confident responses. We test the risks of LM overconfidence by conducting human experiments and show that users rely heavily on LM generations, whether or not they are marked by certainty. Lastly, we investigate the preference-annotated datasets used in post-training alignment and find that humans are biased against texts with uncertainty. Our work highlights new safety harms facing human-LM interactions and proposes design recommendations and mitigating strategies moving forward.

Introduction to LLMs and Epistemic Markers

Large language models (LMs) such as GPT, LLaMA-2, and Claude are at the forefront of human-AI interfaces, facilitating a range of tasks through natural language interaction. A critical aspect of this interface is the models' ability to communicate their confidence, or lack thereof, in their responses, which is particularly consequential in information-seeking scenarios. Epistemic markers, the linguistic devices that convey a speaker's certainty, are one way to communicate these uncertainties clearly. However, research shows that LMs struggle to express uncertainty accurately, which can impair users' decision-making when they rely on AI-generated information.
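
To make this concrete, the following minimal Python sketch (not taken from the paper's code) shows how the presence of epistemic markers in a response might be flagged; the marker lists are toy examples, not the paper's taxonomy.

# Illustrative sketch: flag epistemic markers in a model response.
# The marker lists below are toy examples, not the paper's lexicon.
WEAKENERS = ("maybe", "perhaps", "possibly", "i think", "i'm not sure")
STRENGTHENERS = ("definitely", "certainly", "i'm sure", "without a doubt", "clearly")

def classify_markers(response: str) -> str:
    text = response.lower()
    if any(marker in text for marker in WEAKENERS):
        return "uncertain"   # hedged answer
    if any(marker in text for marker in STRENGTHENERS):
        return "confident"   # explicitly certain answer
    return "plain"           # no epistemic marker at all

print(classify_markers("I'm not sure, but it might be Canberra."))          # uncertain
print(classify_markers("The capital of Australia is definitely Canberra.")) # confident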

Investigating Expression of Uncertainty in LMs

The paper's analysis indicates that LMs skew toward overconfidence even when explicitly prompted to express their confidence. When asked to articulate their confidence in a response using epistemic markers, LMs use strengtheners (expressions of certainty) far more often than weakeners (expressions of uncertainty), even though a substantial share of those confident responses is incorrect (an average error rate of 47%).
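
As a rough illustration of how such an error rate over confidently marked responses could be computed, consider the sketch below; the strengthener list and toy data are assumptions for illustration, not the paper's evaluation setup.

# Illustrative sketch: error rate among responses that carry confident markers.
# The strengthener list and the toy data are assumptions, not the paper's setup.
STRENGTHENERS = ("definitely", "certainly", "i'm sure", "without a doubt")

def is_confident(response: str) -> bool:
    return any(marker in response.lower() for marker in STRENGTHENERS)

def confident_error_rate(examples):
    # examples: list of (response_text, answered_correctly) pairs
    confident = [correct for response, correct in examples if is_confident(response)]
    if not confident:
        return 0.0
    return 1.0 - sum(confident) / len(confident)

data = [
    ("It is definitely Canberra.", True),
    ("I'm sure the answer is Sydney.", False),
    ("Perhaps it is Melbourne.", False),  # hedged, excluded from the confident subset
]
print(confident_error_rate(data))  # 0.5: half of the confident answers are wrong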

User Response to LMs' Confidence Expressions

Understanding how users interpret and rely on epistemic markers from LMs is crucial. Studies have found that users rely heavily on LM responses that carry expressions of high confidence, and just as heavily on plain responses with no epistemic markers at all, which implicitly read as certain. Notably, even small inaccuracies in how LMs use these markers can substantially degrade user performance over time. The tendency of LMs to convey overconfidence therefore risks over-reliance on AI, underscoring the need for linguistic calibration: the confidence a model expresses should track its actual accuracy.
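
One simple way to quantify such reliance is the fraction of trials in which users adopted the LM's suggestion under each marker condition; the sketch below uses assumed trial fields ("condition", "followed_lm") for illustration rather than the paper's actual experiment logs.

# Illustrative sketch: reliance rate per marker condition, i.e. how often users
# adopted the LM's suggestion. The trial fields are assumptions for illustration.
from collections import defaultdict

def reliance_by_condition(trials):
    counts = defaultdict(lambda: [0, 0])  # condition -> [followed, total]
    for trial in trials:
        counts[trial["condition"]][0] += int(trial["followed_lm"])
        counts[trial["condition"]][1] += 1
    return {cond: followed / total for cond, (followed, total) in counts.items()}

trials = [
    {"condition": "strengthener", "followed_lm": True},
    {"condition": "plain", "followed_lm": True},
    {"condition": "weakener", "followed_lm": True},
    {"condition": "weakener", "followed_lm": False},
]
print(reliance_by_condition(trials))  # {'strengthener': 1.0, 'plain': 1.0, 'weakener': 0.5}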

Origins of Overconfidence and Potential Mitigations

An investigation into the origins of this overconfidence points to reinforcement learning from human feedback (RLHF) as a pivotal factor: human annotators show a bias against expressions of uncertainty in the preference-annotated texts used for RLHF alignment. These findings suggest that the LM design process needs corrective action to produce more linguistically calibrated responses, for example by generating expressions of uncertainty more naturally and reserving plain, unhedged statements for cases where confidence is genuinely high.
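
A hypothetical check for this kind of annotation bias might look like the sketch below; the (chosen, rejected) pair format and the weakener list are assumptions for illustration, not the paper's annotation schema.

# Illustrative sketch: do preference annotations disfavor hedged text?
# The pair format and the weakener list are assumptions, not the paper's schema.
WEAKENERS = ("maybe", "perhaps", "possibly", "i think", "i'm not sure")

def is_hedged(text: str) -> bool:
    return any(marker in text.lower() for marker in WEAKENERS)

def hedged_win_rate(pairs):
    # Fraction of (chosen, rejected) pairs where the hedged side was preferred,
    # counting only pairs where exactly one side is hedged.
    wins = total = 0
    for chosen, rejected in pairs:
        if is_hedged(chosen) != is_hedged(rejected):
            total += 1
            wins += is_hedged(chosen)
    return wins / total if total else float("nan")

pairs = [
    ("The answer is 42.", "I think the answer might be 42."),
    ("Paris is the capital of France.", "Maybe it's Paris."),
]
print(hedged_win_rate(pairs))  # 0.0 here; values well below 0.5 suggest a bias against hedges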

Conclusion and Forward Thinking

In conclusion, the research shows that current LM practices for expressing uncertainty fall short of what effective human-AI communication requires, and that this shortfall distorts how much users rely on AI-generated responses. Identifying the RLHF process as one source of this overconfidence opens the door to reconsidering and refining how LMs are trained, ultimately leading to more reliable and safer human-AI interactions.

Authors (4)
  1. Kaitlyn Zhou (11 papers)
  2. Jena D. Hwang (36 papers)
  3. Xiang Ren (194 papers)
  4. Maarten Sap (86 papers)
Citations (36)