
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets (2310.06824v3)

Published 10 Oct 2023 in cs.AI

Abstract: LLMs have impressive capabilities but are prone to outputting falsehoods. Recent work has developed techniques for inferring whether an LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we use high-quality datasets of simple true/false statements to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in an LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements. We also show that simple difference-in-mean probes generalize as well as other probing techniques while identifying directions that are more causally implicated in model outputs.
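As a rough illustration of the difference-in-mean ("mass-mean") probing the abstract describes, here is a minimal sketch. It assumes hidden activations have already been extracted from some layer of the model; the function names, the layer/token choice, and the NumPy framing are assumptions for illustration, not the authors' released code.

```python
import numpy as np

def diff_in_means_probe(acts_true: np.ndarray, acts_false: np.ndarray):
    """Fit a difference-in-mean probe.

    acts_true, acts_false: (n_statements, d_model) arrays of hidden
    activations for true and false statements at a chosen layer/token
    (hypothetical inputs; extraction from the model is not shown).
    """
    mu_true = acts_true.mean(axis=0)
    mu_false = acts_false.mean(axis=0)
    direction = mu_true - mu_false  # candidate "truth direction"
    # Place the decision threshold at the midpoint between the class means.
    bias = -direction @ (mu_true + mu_false) / 2.0
    return direction, bias

def predict_true(acts: np.ndarray, direction: np.ndarray, bias: float):
    """Label a statement true iff its projection onto `direction`
    falls on the true-mean side of the midpoint threshold."""
    return acts @ direction + bias > 0
```

Under this reading, the causal interventions mentioned in the abstract would correspond to adding (or subtracting) a scaled copy of `direction` to the residual stream during the forward pass, nudging the model toward treating false statements as true and vice versa.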

Authors (2)
  1. Samuel Marks (18 papers)
  2. Max Tegmark (133 papers)
Citations (114)
