
Overconfidence is Key: Verbalized Uncertainty Evaluation in Large Language and Vision-Language Models (2405.02917v1)

Published 5 May 2024 in cs.CV, cs.CL, and cs.LG

Abstract: Large Language and Vision-Language Models (LLMs/VLMs) have revolutionized the field of AI by their ability to generate human-like text and understand images, but ensuring their reliability is crucial. This paper aims to evaluate the ability of LLMs (GPT-4, GPT-3.5, LLaMA 2, and PaLM 2) and VLMs (GPT-4V and Gemini Pro Vision) to estimate their verbalized uncertainty via prompting. We propose the new Japanese Uncertain Scenes (JUS) dataset, aimed at testing VLM capabilities via difficult queries and object counting, and the Net Calibration Error (NCE) to measure the direction of miscalibration. Results show that both LLMs and VLMs have high calibration error and are overconfident most of the time, indicating a poor capability for uncertainty estimation. Additionally, we develop prompts for regression tasks, and we show that VLMs have poor calibration when producing mean/standard deviation estimates and 95% confidence intervals.
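
To make the NCE idea concrete, here is a minimal sketch, assuming NCE is the usual Expected Calibration Error computed without the absolute value, so that positive values indicate overconfidence and negative values underconfidence. The equal-width binning and the sign convention (confidence minus accuracy) are assumptions for illustration, not taken from the paper.

```python
# Sketch of a signed calibration error in the spirit of the paper's
# Net Calibration Error (NCE). Assumption: NCE is ECE without the
# absolute value, so the sign shows the direction of miscalibration.
import numpy as np

def net_calibration_error(confidences, correct, n_bins=10):
    """confidences: verbalized confidences in [0, 1];
    correct: boolean array, True where the model's answer was right.
    Positive result = overconfident under this sketch's sign convention."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n, nce = len(confidences), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        gap = confidences[in_bin].mean() - correct[in_bin].mean()  # conf - acc
        nce += (in_bin.sum() / n) * gap  # no abs(): direction is kept
    return nce

# A model claiming 90% confidence while answering 60% correctly is
# overconfident: NCE comes out positive (about +0.3).
conf = np.full(100, 0.9)
hits = np.arange(100) < 60
print(net_calibration_error(conf, hits))
```

Under this definition, |NCE| is at most the ECE, with equality only when every bin errs in the same direction; bins that are overconfident and bins that are underconfident partially cancel, which is exactly what lets the sign convey the net direction of miscalibration.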

Authors (2)
  1. Tobias Groot (1 paper)
  2. Matias Valdenegro-Toro (62 papers)
Citations (3)