Schema-learning and rebinding as mechanisms of in-context learning and emergence (2307.01201v1)
Abstract: In-context learning (ICL) is one of the most powerful and most unexpected capabilities to emerge in recent transformer-based LLMs. Yet the mechanisms that underlie it are poorly understood. In this paper, we demonstrate that comparable ICL capabilities can be acquired by an alternative sequence prediction learning method using clone-structured causal graphs (CSCGs). Moreover, a key property of CSCGs is that, unlike transformer-based LLMs, they are *interpretable*, which considerably simplifies the task of explaining how ICL works. Specifically, we show that it uses a combination of (a) learning template (schema) circuits for pattern completion, (b) retrieving relevant templates in a context-sensitive manner, and (c) rebinding of novel tokens to appropriate slots in the templates. We go on to marshal evidence for the hypothesis that similar mechanisms underlie ICL in LLMs. For example, we find that, with CSCGs as with LLMs, different capabilities emerge at different levels of overparameterization, suggesting that overparameterization helps in learning more complex template (schema) circuits. By showing how ICL can be achieved with small models and datasets, we open up a path to novel architectures, and take a vital step towards a more general understanding of the mechanics behind this important capability.
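To make mechanisms (a)-(c) concrete, below is a minimal toy sketch in Python/NumPy, not the authors' implementation: a small clone-structured HMM in which a frozen transition matrix `T` plays the role of the template (schema) circuit, an emission matrix `E` records which token is bound to each clone slot, and "rebinding" a novel token is a single EM-style update of `E` with `T` held fixed. The vocabulary, the nonsense word "dax", and all function and variable names are illustrative assumptions, not taken from the paper.

```python
# Toy sketch (illustrative only): schema circuit + rebinding in a clone-structured HMM.
import numpy as np

n_clones = 5
vocab = ["the", "dog", "chased", "cat", "dax"]   # "dax" is the novel token
tok = {w: i for i, w in enumerate(vocab)}
n_tokens = len(vocab)

# Template (schema) circuit: a ring of clone slots encoding the pattern
# "the -> <subject> -> chased -> the -> <object>". The transitions are the template.
T = np.full((n_clones, n_clones), 1e-3)
for i in range(n_clones):
    T[i, (i + 1) % n_clones] += 1.0
T /= T.sum(axis=1, keepdims=True)

# Initial binding of tokens to slots: clones 0..4 emit "the dog chased the cat".
# Two distinct clones (0 and 3) both emit "the" -- that is the "clone" idea.
E = np.full((n_clones, n_tokens), 1e-3)
for clone, word in enumerate(["the", "dog", "chased", "the", "cat"]):
    E[clone, tok[word]] += 1.0
E /= E.sum(axis=1, keepdims=True)

pi = np.full(n_clones, 1.0 / n_clones)  # uniform prior over starting slots


def forward_backward(seq, T, E, pi):
    """Standard HMM smoothing; returns per-step posteriors over clone slots."""
    L, N = len(seq), T.shape[0]
    alpha = np.zeros((L, N))
    beta = np.zeros((L, N))
    alpha[0] = pi * E[:, seq[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, L):
        alpha[t] = (alpha[t - 1] @ T) * E[:, seq[t]]
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(L - 2, -1, -1):
        beta[t] = T @ (E[:, seq[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)


def rebind(seq, novel, T, E, pi):
    """Rebinding: with the template T frozen, re-estimate only the emission
    column of the novel token from its posterior slot occupancy (one EM step)."""
    gamma = forward_backward(seq, T, E, pi)
    counts = gamma[[t for t, s in enumerate(seq) if s == novel]].sum(axis=0)
    E_new = E.copy()
    E_new[:, novel] = counts + 1e-3
    return E_new / E_new.sum(axis=1, keepdims=True)


# In-context "prompt": the novel word "dax" appears where "cat" used to be.
ctx = [tok[w] for w in ["the", "dog", "chased", "the", "dax"]]
E_rebound = rebind(ctx, tok["dax"], T, E, pi)
print("'dax' is now bound to clone slot:", E_rebound[:, tok["dax"]].argmax())
# Expected: slot 4, the slot previously bound to "cat".
```

The point of the toy is that the surrounding context, propagated through the frozen template transitions, determines which slot the never-before-seen token occupies; only the token-to-slot binding is updated, not the schema itself. This is a deliberately simplified stand-in for the retrieval and rebinding machinery the paper describes for full CSCGs.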
Authors: Sivaramakrishnan Swaminathan, Antoine Dedieu, Rajkumar Vasudeva Raju, Murray Shanahan, Dileep George, Miguel Lazaro-Gredilla