Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks

Published 30 Jun 2023 in cs.CL and cs.LG | arXiv:2307.00175v1

Abstract: We consider the questions of whether LLMs have beliefs and, if they do, how we might measure them. First, we evaluate two existing approaches, one due to Azaria and Mitchell (2023) and the other to Burns et al. (2022). We provide empirical results showing that these methods fail to generalize in very basic ways. We then argue that, even if LLMs have beliefs, these methods are unlikely to be successful for conceptual reasons. Thus, there is still no lie detector for LLMs. After describing our empirical results, we take a step back and consider whether we should expect LLMs to have something like beliefs in the first place. We consider some recent arguments aiming to show that LLMs cannot have beliefs, and we show that these arguments are misguided. We provide a more productive framing of questions surrounding the status of beliefs in LLMs, highlight the empirical nature of the problem, and conclude by suggesting some concrete paths for future work.
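The two probing approaches the abstract evaluates can be sketched in miniature. The following is an illustrative toy implementation, not the authors' code: a supervised logistic probe trained on hidden-state activations, in the spirit of Azaria and Mitchell (2023), and the Contrast-Consistent Search (CCS) objective of Burns et al. (2022), which asks a probe to assign complementary, confident probabilities to a statement and its negation. The synthetic activations, function names, and hyperparameters are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def ccs_loss(p_pos, p_neg):
    """CCS objective (Burns et al., 2022), sketched:
    - consistency: p(x+) should equal 1 - p(x-);
    - confidence: penalize the degenerate answer p = 0.5 everywhere."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))


def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Supervised logistic probe on activation vectors, in the spirit of
    Azaria and Mitchell (2023): plain gradient descent on cross-entropy."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=acts.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(acts @ w + b)
        grad = p - labels  # gradient of cross-entropy w.r.t. the logits
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b
```

The paper's empirical point is that probes like these can fit one distribution of statements and still fail on basic variations (e.g. negations), so high in-distribution accuracy is not evidence of a lie detector; the confidence term in CCS exists only to rule out the trivially consistent probe that outputs 0.5 on everything.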

References (45)
  1. Alain, G. and Y. Bengio (2016). Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
  2. Azaria, A. and T. Mitchell (2023). The internal state of an LLM knows when it's lying.
  3. Beery, S., G. Van Horn, and P. Perona (2018). Recognition in terra incognita.
  4. Bender, E. M., T. Gebru, A. McMillan-Major, and S. Shmitchell (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623.
  5. Burns, C., H. Ye, D. Klein, and J. Steinhardt (2022). Discovering latent knowledge in language models without supervision.
  6. Christiano, P., A. Cotra, and M. Xu (2021). ARC's first technical report: Eliciting latent knowledge.
  7. Cowie, C. (2014). In defence of instrumentalism about epistemic normativity. Synthese 191(16), 4003–4017.
  8. Diaconis, P. and B. Skyrms (2018). Ten great ideas about chance. Princeton University Press.
  9. Evans, O., O. Cotton-Barratt, et al. (2021). Truthful AI: Developing and governing AI that does not lie. arXiv preprint arXiv:2110.06674.
  10. Everitt, B. (2013). An introduction to latent variable models. Springer Science & Business Media.
  11. States and contingencies: How to understand savage without anyone being hanged. Revue économique 71(2), 365–385.
  12. Gneiting, T. and A. E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102(477), 359–378.
  13. Godfrey-Smith, P. (1991). Signal, decision, action. The Journal of Philosophy 88(12), 709–722.
  14. Godfrey-Smith, P. (1998). Complexity and the Function of Mind in Nature. Cambridge University Press.
  15. Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep learning. MIT Press.
  16. Harding, J. (2023). Operationalising representation in natural language processing. arXiv preprint arXiv:2306.08193.
  17. Hempel, C. G. (1958). The theoretician’s dilemma: A study in the logic of theory construction. Minnesota Studies in the Philosophy of Science 2, 173–226.
  18. Hubinger, E., C. van Merwijk, V. Mikulik, J. Skalse, and S. Garrabrant (2019). Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820.
  19. Jeffrey, R. C. (1990). The logic of decision. University of Chicago Press.
  20. Ji, Z., et al. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys 55(12), 1–38.
  21. Jiang, H. (2023). A latent space theory for emergent abilities in large language models. arXiv preprint arXiv:2304.09960.
  22. Khashabi, D., et al. (2020). UnifiedQA: Crossing format boundaries with a single QA system.
  23. Levinstein, B. (2023). A conceptual guide to transformers.
  24. Lieder, F. and T. L. Griffiths (2020). Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources. Behavioral and Brain Sciences 43, e1.
  25. Lipton, Z. C. (2018). The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 16(3), 31–57.
  26. Maas, A. L., et al. (2011). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150.
  27. Millikan, R. G. (1995). White queen psychology and other essays for Alice. MIT Press.
  28. Papineau, D. (1988). Reality and representation. Mind 97(388).
  29. Quine, W. V. (1969). Natural kinds. In Essays in honor of Carl G. Hempel: A tribute on the occasion of his sixty-fifth birthday, pp.  5–23. Springer.
  30. Quine, W. V. O. (1960). Word and object. MIT Press.
  31. Ramsey, F. P. (2016). Truth and probability. Readings in Formal Epistemology: Sourcebook, 21–45.
  32. Savage, L. J. (1972). The foundations of statistics. Courier Corporation.
  33. Shanahan, M. (2022). Talking about large language models. arXiv preprint arXiv:2212.03551.
  34. Smead, R. (2015). The role of social interaction in the evolution of learning. The British Journal for the Philosophy of Science.
  35. Smead, R. S. (2009). Social interaction and the evolution of learning rules. University of California, Irvine.
  36. Sober, E. (1994). The adaptive advantage of learning and a priori prejudice. Ethology and Sociobiology 15(1), 55–56.
  37. Stephens, C. L. (2001). When is it selectively advantageous to have true beliefs? sandwiching the better safe than sorry argument. Philosophical Studies 105, 161–189.
  38. Stich, S. P. (1990). The fragmentation of reason: Preface to a pragmatic theory of cognitive evaluation. The MIT Press.
  39. Street, S. (2009). Evolution and the normativity of epistemic reasons. Canadian Journal of Philosophy Supplementary Volume 35, 213–248.
  40. Touvron, H., et al. (2023). LLaMA: Open and efficient foundation language models.
  41. Tversky, A. and D. Kahneman (1981). The framing of decisions and the psychology of choice. Science 211(4481), 453–458.
  42. Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017). Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Advances in Neural Information Processing Systems, Volume 30. Curran Associates, Inc.
  43. Wang, B. and A. Komatsuzaki (2021). GPT-J-6B: A 6 billion parameter autoregressive language model. https://github.com/kingoflolz/mesh-transformer-jax.
  44. Xie, S. M., A. Raghunathan, P. Liang, and T. Ma (2021). An explanation of in-context learning as implicit Bayesian inference. arXiv preprint arXiv:2111.02080.
  45. Zhang, S., et al. (2022). OPT: Open pre-trained transformer language models.