Transformers represent belief state geometry in their residual stream (arXiv:2405.15943v2)
Abstract: What computational structure are we building into LLMs when we train them on next-token prediction? Here, we present evidence that this structure is given by the meta-dynamics of belief updating over hidden states of the data-generating process. Leveraging the theory of optimal prediction, we anticipate and then find that belief states are linearly represented in the residual stream of transformers, even in cases where the predicted belief state geometry has highly nontrivial fractal structure. We investigate cases where the belief state geometry is represented in the final residual stream or distributed across the residual streams of multiple layers, providing a framework to explain these observations. Furthermore, we demonstrate that the inferred belief states contain information about the entire future, beyond the local next-token prediction that the transformers are explicitly trained on. Our work provides a general framework connecting the structure of training data to the geometric structure of activations inside transformers.
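To make the abstract's two central ideas concrete, the sketch below illustrates (i) Bayesian belief updating over the hidden states of an HMM data-generating process and (ii) a linear probe testing whether those belief states are linearly decodable from residual-stream activations. This is a minimal illustration under stated assumptions, not the paper's implementation: the transition matrices are hypothetical, and `activations` is a random stand-in for activations that would in practice be cached from a trained transformer (e.g. with a library such as TransformerLens).

```python
import numpy as np

# (i) Bayesian belief updating over the hidden states of an HMM.
# T[x][i, j] = P(emit token x and move to hidden state j | current hidden state i).
# These matrices are hypothetical; they only need sum_x T[x] to be row-stochastic.
T = {
    0: np.array([[0.4, 0.1],
                 [0.1, 0.1]]),
    1: np.array([[0.1, 0.4],
                 [0.3, 0.5]]),
}

def update_belief(eta, x):
    """Update the belief state eta (a distribution over hidden states) on observing x."""
    unnormalized = eta @ T[x]
    return unnormalized / unnormalized.sum()

# Run the belief meta-dynamics along a token sequence, starting from a uniform prior
# (a stand-in for the stationary distribution).
eta = np.array([0.5, 0.5])
beliefs = []
for token in [0, 1, 1, 0, 1]:
    eta = update_belief(eta, token)
    beliefs.append(eta)
beliefs = np.array(beliefs)  # shape (n_positions, n_hidden_states)

# (ii) Linear probe: are belief states an affine function of residual-stream activations?
# `activations` is a random placeholder; real activations would come from the model
# at the same token positions.
rng = np.random.default_rng(0)
activations = rng.normal(size=(len(beliefs), 64))  # (n_positions, d_model)

X = np.hstack([activations, np.ones((len(beliefs), 1))])  # add a bias column
W, *_ = np.linalg.lstsq(X, beliefs, rcond=None)            # least-squares probe
print("probe MSE:", np.mean((X @ W - beliefs) ** 2))
```

With real data one would fit the probe on many token positions and evaluate on held-out positions; the tiny random example above only shows the two computations, not the paper's experimental pipeline.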