Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? (2312.03729v1)

Published 27 Nov 2023 in cs.CL and cs.AI

Abstract: Neural language models (LMs) can be used to evaluate the truth of factual statements in two ways: they can be either queried for statement probabilities, or probed for internal representations of truthfulness. Past work has found that these two procedures sometimes disagree, and that probes tend to be more accurate than LM outputs. This has led some researchers to conclude that LMs "lie" or otherwise encode non-cooperative communicative intents. Is this an accurate description of today's LMs, or can query-probe disagreement arise in other ways? We identify three different classes of disagreement, which we term confabulation, deception, and heterogeneity. In many cases, the superiority of probes is simply attributable to better calibration on uncertain answers rather than a greater fraction of correct, high-confidence answers. In some cases, queries and probes perform better on different subsets of inputs, and accuracy can further be improved by ensembling the two. Code is available at github.com/lingo-mit/lm-truthfulness.
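
The abstract contrasts two ways of scoring a statement's truth: querying the LM's output distribution, and probing its hidden states with a learned classifier. It also notes that ensembling the two can improve accuracy. Below is a minimal sketch of that contrast, not the authors' released code (see github.com/lingo-mit/lm-truthfulness for that). It assumes a Hugging Face causal LM (EleutherAI/gpt-j-6B is used purely as an example checkpoint), one particular true/false prompt wording, and an already-fitted logistic probe on the final layer's last-token state; `probe_weights` and `probe_bias` are hypothetical placeholders. A simple averaged ensemble is included to mirror the abstract's point that queries and probes can succeed on different inputs.

```python
# Sketch: query the LM output head vs. probe a hidden state for truthfulness.
# Assumptions (not from the paper's code): prompt wording, last-layer/last-token
# probe location, and a pre-trained logistic probe (probe_weights, probe_bias).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"  # example checkpoint only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def query_probability(statement: str) -> float:
    """Query route: compare P('True') vs P('False') after a judgment prompt."""
    prompt = f"Is the following statement true or false?\n{statement}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    true_id = tokenizer(" True", add_special_tokens=False).input_ids[0]
    false_id = tokenizer(" False", add_special_tokens=False).input_ids[0]
    pair = torch.softmax(logits[[true_id, false_id]], dim=-1)
    return pair[0].item()  # renormalized P(True)

def probe_probability(statement: str,
                      probe_weights: torch.Tensor,
                      probe_bias: torch.Tensor) -> float:
    """Probe route: logistic probe on the final layer's last-token hidden state."""
    inputs = tokenizer(statement, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[-1][0, -1]
    return torch.sigmoid(hidden @ probe_weights + probe_bias).item()

def ensemble_probability(statement: str,
                         probe_weights: torch.Tensor,
                         probe_bias: torch.Tensor,
                         alpha: float = 0.5) -> float:
    """Average the two scores; the abstract reports ensembling can improve accuracy."""
    return (alpha * query_probability(statement)
            + (1 - alpha) * probe_probability(statement, probe_weights, probe_bias))
```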
