An Evaluation of Estimative Uncertainty in Large Language Models (2405.15185v1)
Abstract: Words of estimative probability (WEPs), such as "maybe" or "probably not", are ubiquitous in natural language for communicating estimative uncertainty, as opposed to direct statements of numerical probability. Human estimative uncertainty, and its calibration with numerical estimates, has long been an area of study, including by intelligence agencies such as the CIA. This study compares estimative uncertainty in commonly used LLMs, such as GPT-4 and ERNIE-4, to that of humans and to each other. Here we show that LLMs such as GPT-3.5 and GPT-4 align with human estimates for some, but not all, WEPs presented in English. Divergence is also observed when the LLMs are presented with gendered roles and Chinese contexts. Further analysis shows that an advanced LLM such as GPT-4 can consistently map between statistical and estimative uncertainty, although a significant performance gap remains. The results contribute to a growing body of research on human-LLM alignment.
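To make the elicitation setup concrete, below is a minimal sketch of how one might ask a chat model to translate a WEP into a numerical probability and compare the answer against a human reference value. This is not the authors' protocol: it assumes the OpenAI Python SDK (v1+), and the prompt wording, model name, and the `HUMAN_MEDIANS` values are illustrative placeholders rather than figures from the paper or any published survey.

```python
"""Sketch: elicit numerical probabilities for WEPs from a chat model.

Assumes the OpenAI Python SDK (v1+) and an OPENAI_API_KEY in the environment.
The human reference values below are illustrative placeholders only.
"""
import re
from openai import OpenAI

client = OpenAI()

# Words of estimative probability to probe.
WEPS = ["almost certainly", "probably", "maybe", "probably not", "almost no chance"]

# Placeholder human median estimates on a 0-100 scale (illustrative, not survey data).
HUMAN_MEDIANS = {
    "almost certainly": 95,
    "probably": 75,
    "maybe": 50,
    "probably not": 25,
    "almost no chance": 5,
}

def elicit_probability(wep: str, model: str = "gpt-4") -> float | None:
    """Ask the model to map a WEP to a single number between 0 and 100."""
    prompt = (
        f'If someone says an event will "{wep}" happen, what probability '
        "(as a single number between 0 and 100) do they most likely mean? "
        "Answer with the number only."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce sampling noise for a point estimate
    )
    text = resp.choices[0].message.content or ""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

if __name__ == "__main__":
    for wep in WEPS:
        llm_estimate = elicit_probability(wep)
        print(f"{wep:>18}: LLM={llm_estimate}, human(placeholder)={HUMAN_MEDIANS[wep]}")
```

In practice one would repeat the query across sampled completions, contexts (e.g., gendered roles or Chinese-language prompts), and models, then compare the resulting distributions with human survey data rather than a single point estimate.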