Self-Consistency of Large Language Models under Ambiguity (2310.13439v1)

Published 20 Oct 2023 in cs.CL and cs.AI

Abstract: LLMs that do not give consistent answers across contexts are problematic when used for tasks with expectations of consistency, e.g., question-answering, explanations, etc. Our work presents an evaluation benchmark for self-consistency in cases of under-specification where two or more answers can be correct. We conduct a series of behavioral experiments on the OpenAI model suite using an ambiguous integer sequence completion task. We find that average consistency ranges from 67% to 82%, far higher than would be predicted if a model's consistency were random, and increases as model capability improves. Furthermore, we show that models tend to maintain self-consistency across a series of robustness checks, including prompting speaker changes and sequence length changes. These results suggest that self-consistency arises as an emergent capability without specifically training for it. Despite this, we find that models are uncalibrated when judging their own consistency, with models displaying both over- and under-confidence. We also propose a nonparametric test for determining from token output distribution whether a model assigns non-trivial probability to alternative answers. Using this test, we find that despite increases in self-consistency, models usually place significant weight on alternative, inconsistent answers. This distribution of probability mass provides evidence that even highly self-consistent models internally compute multiple possible responses.
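To make the abstract's two measurements concrete, here is a minimal Python sketch of the behavioral notion of self-consistency on an ambiguous sequence task. Everything here is illustrative, not the paper's actual benchmark: `sample_completion`, the 0.8 answer probability, and the example sequence [1, 2, 4] are hypothetical stand-ins for real model queries and prompts.

```python
import random
from collections import Counter

def sample_completion(sequence, rng):
    """Hypothetical stand-in for an LLM API call returning the model's
    next-number continuation of an ambiguous integer sequence.
    Simulates a model that usually continues [1, 2, 4] with 8
    ("doubling") but sometimes with 7 ("+1, +2, +3")."""
    return 8 if rng.random() < 0.8 else 7

def self_consistency(sequence, n_samples=50, seed=0):
    """Empirical self-consistency: the fraction of independent queries
    that agree with the modal answer. If a model guessed uniformly
    among k valid continuations, this would be roughly 1/k."""
    rng = random.Random(seed)
    answers = Counter(sample_completion(sequence, rng) for _ in range(n_samples))
    modal_answer, modal_count = answers.most_common(1)[0]
    return modal_answer, modal_count / n_samples, answers

if __name__ == "__main__":
    # "1, 2, 4, ?" is ambiguous: both 8 and 7 are defensible continuations.
    answer, consistency, dist = self_consistency([1, 2, 4])
    print(f"modal answer: {answer}  consistency: {consistency:.2f}")
    # Mass placed on alternative answers -- the quantity the paper's
    # nonparametric test probes at the token-distribution level.
    print(f"empirical mass on alternatives: {1 - consistency:.2f}")
```

In the paper itself, the analogous question is asked of the token output distribution rather than of sampled behavior, which is what lets the proposed nonparametric test detect non-trivial probability on inconsistent answers even from highly self-consistent models.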

Authors (5)
  1. Henning Bartsch (5 papers)
  2. Ole Jorgensen (2 papers)
  3. Domenic Rosati (22 papers)
  4. Jason Hoelscher-Obermaier (10 papers)
  5. Jacob Pfau (10 papers)
Citations (2)

