Emergence and Function of Abstract Representations in Self-Supervised Transformers (2312.05361v1)
Abstract: Human intelligence relies in part on our brains' ability to create abstract mental models that succinctly capture the hidden blueprint of our reality. Such abstract world models notably allow us to rapidly navigate novel situations by generalizing prior knowledge, a trait deep learning systems have historically struggled to replicate. However, the recent shift from supervised to self-supervised objectives, combined with expressive transformer-based architectures, has yielded powerful foundation models that appear to learn versatile representations that can support a wide range of downstream tasks. This promising development raises the intriguing possibility of such models developing in silico abstract world models. We test this hypothesis by studying the inner workings of small-scale transformers trained to reconstruct partially masked visual scenes generated from a simple blueprint. We show that the network develops intermediate abstract representations, or abstractions, that encode all semantic features of the dataset. These abstractions manifest as low-dimensional manifolds where the embeddings of semantically related tokens transiently converge, thus allowing for the generalization of downstream computations. Using precise manipulation experiments, we demonstrate that abstractions are central to the network's decision-making process. Our research also suggests that these abstractions are compositionally structured, exhibiting features like contextual independence and part-whole relationships that mirror the compositional nature of the dataset. Finally, we introduce a Language-Enhanced Architecture (LEA) designed to encourage the network to articulate its computations. We find that LEA develops an abstraction-centric language that can be easily interpreted, allowing us to more readily access and steer the network's decision-making process.
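To make the training setup concrete, the snippet below is a minimal sketch of the kind of masked-reconstruction objective the abstract describes: a small transformer encoder is asked to fill in masked patches of synthetic scenes. The module names, toy dimensions, random stand-in data, and the MSE loss are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (assumptions noted above): a small transformer trained to
# reconstruct masked patches of blueprint-generated scenes.
import torch
import torch.nn as nn

NUM_PATCHES, PATCH_DIM, EMBED_DIM = 16, 32, 64  # assumed toy sizes

class MaskedSceneTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(PATCH_DIM, EMBED_DIM)           # patch -> token embedding
        self.pos = nn.Parameter(torch.zeros(1, NUM_PATCHES, EMBED_DIM))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, EMBED_DIM))
        layer = nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.decode = nn.Linear(EMBED_DIM, PATCH_DIM)           # reconstruct patch content

    def forward(self, patches, mask):
        # patches: (B, NUM_PATCHES, PATCH_DIM); mask: (B, NUM_PATCHES) bool, True = hidden
        tok = self.embed(patches)
        tok = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tok), tok)
        hidden = self.encoder(tok + self.pos)
        return self.decode(hidden), hidden                      # reconstruction + intermediate embeddings

model = MaskedSceneTransformer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
patches = torch.randn(8, NUM_PATCHES, PATCH_DIM)                # stand-in for blueprint-generated scenes
mask = torch.rand(8, NUM_PATCHES) < 0.5                         # hide roughly half the patches
recon, _ = model(patches, mask)
loss = (recon - patches)[mask].pow(2).mean()                    # reconstruction error on masked patches only
opt.zero_grad(); loss.backward(); opt.step()
```

Along the same lines, here is a hedged sketch of one way to look for the low-dimensional manifolds on which semantically related token embeddings are said to converge: project intermediate-layer embeddings with PCA and compare within-class to across-class spread. The helper name, labeling scheme, and the specific clustering metric are hypothetical choices, not the paper's analysis pipeline.

```python
# Sketch of a PCA-based probe of intermediate embeddings (assumed analysis choices).
import numpy as np
from sklearn.decomposition import PCA

def abstraction_projection(hidden_states, semantic_labels, n_components=3):
    """hidden_states: (num_tokens, embed_dim) embeddings from one intermediate layer.
    semantic_labels: (num_tokens,) integer semantic feature labels (e.g. object identity)."""
    coords = PCA(n_components=n_components).fit_transform(hidden_states)
    classes = np.unique(semantic_labels)
    # Convergence of same-class tokens onto a shared region is one signature of an abstraction:
    # compare spread within each semantic class to the spread of class centroids.
    centroids = np.stack([coords[semantic_labels == c].mean(axis=0) for c in classes])
    within = np.mean([coords[semantic_labels == c].std(axis=0).mean() for c in classes])
    across = centroids.std(axis=0).mean()
    return coords, within / (across + 1e-8)     # low ratio -> tight semantic clusters
```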
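As a usage note, one could run the probe on the `hidden` tensor returned by the sketch above (detached and flattened to tokens) with labels derived from the scene blueprint; a ratio well below one for some intermediate layer would be consistent with the transient convergence of semantically related embeddings described in the abstract.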
Authors: Quentin R. V. Ferry, Joshua Ching, Takashi Kawai