Schema-learning and rebinding as mechanisms of in-context learning and emergence (2307.01201v1)

Published 16 Jun 2023 in cs.CL and cs.AI

Abstract: In-context learning (ICL) is one of the most powerful and most unexpected capabilities to emerge in recent transformer-based LLMs. Yet the mechanisms that underlie it are poorly understood. In this paper, we demonstrate that comparable ICL capabilities can be acquired by an alternative sequence prediction learning method using clone-structured causal graphs (CSCGs). Moreover, a key property of CSCGs is that, unlike transformer-based LLMs, they are interpretable, which considerably simplifies the task of explaining how ICL works. Specifically, we show that it uses a combination of (a) learning template (schema) circuits for pattern completion, (b) retrieving relevant templates in a context-sensitive manner, and (c) rebinding of novel tokens to appropriate slots in the templates. We go on to marshall evidence for the hypothesis that similar mechanisms underlie ICL in LLMs. For example, we find that, with CSCGs as with LLMs, different capabilities emerge at different levels of overparameterization, suggesting that overparameterization helps in learning more complex template (schema) circuits. By showing how ICL can be achieved with small models and datasets, we open up a path to novel architectures, and take a vital step towards a more general understanding of the mechanics behind this important capability.

Authors (6)
  1. Sivaramakrishnan Swaminathan (7 papers)
  2. Antoine Dedieu (19 papers)
  3. Rajkumar Vasudeva Raju (6 papers)
  4. Murray Shanahan (46 papers)
  5. Dileep George (29 papers)
  6. Miguel Lazaro-Gredilla (10 papers)
Citations (7)

Summary

Insights into Schema-Learning and Rebinding in In-Context Learning Models

The paper "Schema-learning and rebinding as mechanisms of in-context learning and emergence" explores the inner workings of in-context learning (ICL) as observed in LLMs and proposes an alternative learning model that could provide a clearer understanding of this phenomenon. Specifically, it introduces clone-structured causal graphs (CSCGs) as a viable, interpretable framework for studying and replicating ICL capabilities typically observed in LLMs.

Key Contributions

The paper seeks to elucidate the mechanisms behind ICL, a capability of LLMs that enables them to learn new tasks from a handful of examples provided at inference time. Despite its significance, the mechanics of ICL remain elusive within the mostly opaque architecture of transformers. By demonstrating ICL in CSCGs, the authors provide an approach that leverages model interpretability to illuminate the process.

The primary contribution of the paper is to establish CSCGs as interpretable models in which ICL can be explained through three main processes:

  1. Schema Learning: The model learns template circuits that facilitate pattern completion.
  2. Contextual Template Retrieval: The ability to retrieve relevant templates contingent upon the context.
  3. Rebinding of Tokens: Novel tokens are bound to appropriate slots in existing templates, allowing learned structures to be applied to new inputs.

These mechanisms are posited to parallel the processes occurring within LLMs, potentially reflecting shared underlying dynamics in emergent model capabilities across different architectures.
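The three processes can be sketched with a deliberately simplified toy model. This is plain Python, not the authors' CSCG implementation, and the templates, slot names, and matching scheme below are invented for illustration: templates are token sequences with named slots, retrieval scores each template against the context, and rebinding maps unseen tokens onto slots before pattern completion.

```python
# Toy sketch of schema retrieval, rebinding, and pattern completion.
# NOT the paper's CSCG implementation; templates and slot names are
# invented for illustration.

TEMPLATES = [
    # slots are ("SLOT", name); concrete tokens are plain strings
    ["look", ",", "a", ("SLOT", "obj"), "!", "i", "like", "the", ("SLOT", "obj"), "."],
    ["the", ("SLOT", "agent"), "chased", "the", ("SLOT", "patient"), "."],
]

def score(template, context):
    """Context-sensitive retrieval: count concrete tokens matched in order."""
    matched, i = 0, 0
    for tok in context:
        if i >= len(template):
            break
        if isinstance(template[i], tuple):   # a slot absorbs any token
            i += 1
        elif template[i] == tok:
            matched += 1
            i += 1
    return matched

def complete(context):
    """Retrieve the best template, rebind novel tokens to its slots,
    and return the pattern-completed continuation."""
    best = max(TEMPLATES, key=lambda t: score(t, context))
    binding, i = {}, 0
    for tok in context:                      # rebinding pass
        if i >= len(best):
            break
        if isinstance(best[i], tuple):
            binding[best[i][1]] = tok        # bind the novel token to a slot
            i += 1
        elif best[i] == tok:
            i += 1
    out = []
    for item in best[i:]:                    # pattern completion
        if isinstance(item, tuple):
            out.append(binding.get(item[1], f"<{item[1]}>"))
        else:
            out.append(item)
    return out

# "dax" appears in no template, yet it is rebound to the obj slot and
# reused in the completion (the "dax test" in miniature):
print(complete(["look", ",", "a", "dax", "!"]))
# -> ['i', 'like', 'the', 'dax', '.']
```

The key design point the sketch preserves is the separation of roles: the template supplies reusable structure, while the binding dictionary carries the context-specific content.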

Empirical Results

Experimental validation with CSCGs yields several notable results:

  • Generalization: CSCGs exhibit transitive generalization similar to LLMs, where unseen sequences that align with learned latent structures can still be assigned meaningful probabilities.
  • Emergence and Overparameterization: Experiments on the GINC and LIALT datasets establish the role of overparameterization in the emergence of more sophisticated ICL abilities. As with transformer-based LLMs, CSCGs attain higher performance with increased model capacity, which aids in learning more intricate template circuits.
  • Rebinding and Novel Token Integration: The CSCG architecture offers a concrete account of how novel tokens are integrated into existing templates, a process not yet fully understood in LLMs. This is demonstrated via the dax test on the PreCo dataset, where new words are absorbed and correctly used after a single presentation.

These results not only bolster the understanding of ICL within CSCGs but also suggest potential extensions and adaptations for contemporary models such as transformers.
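The rebinding result above can be illustrated with a minimal sketch. This is an assumed simplification, not the paper's EM-based procedure: the learned transition matrix `T` encodes the schema and stays fixed, while the emission matrix `E` is edited so that a novel token inherits the latent clone states of a familiar one (the vocabulary and matrices here are invented for the example).

```python
import numpy as np

# Sketch of CSCG-style rebinding (assumed simplification): keep the
# transition structure T fixed, edit only emissions E so the novel
# token "dax" takes over the clone states of "cat".

vocab = ["the", "cat", "sat", "dax"]   # "dax" is the novel token
n_states = 4

# Fixed schema: a 0 -> 1 -> 2 -> 0 cycle of latent clone states.
T = np.array([
    [0., 1., 0., 0.],
    [0., 0., 1., 0.],
    [1., 0., 0., 0.],
    [0., 0., 0., 1.],
])

E = np.zeros((n_states, len(vocab)))   # emissions: state -> token
E[0, 0] = E[1, 1] = E[2, 2] = 1.0      # "the cat sat" bound to the cycle

def rebind(E, old_tok, new_tok):
    """Move emission mass from old_tok to new_tok; T (the schema) is untouched."""
    E = E.copy()
    E[:, new_tok] = E[:, old_tok]
    E[:, old_tok] = 0.0
    return E

E2 = rebind(E, vocab.index("cat"), vocab.index("dax"))

# State 1 now emits "dax", and the unchanged schema still predicts that
# the state emitting "sat" follows it:
print(vocab[int(np.argmax(E2[1]))])            # -> dax
next_state = int(np.argmax(T[1]))
print(vocab[int(np.argmax(E2[next_state]))])   # -> sat
```

Because only the emission matrix changes, every sequential prediction the schema supports transfers to the new word after a single exposure, which is the intuition behind the one-shot dax test.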

Theoretical and Practical Implications

The theoretical implications of this paper are significant. The delineation of CSCGs and their interpretable mechanisms sets the stage for broader exploration of the mechanisms driving ICL in neural architectures. By providing a model in which each component of the process (learning, retrieval, and integration) is explicit, the paper gives researchers a scaffold for hypothesizing about similar processes in less transparent models such as transformers.

Practically, the insights from this research could be instrumental in designing new model architectures that prioritize interpretability without sacrificing performance. It could also aid in refining existing architectures to mimic the efficient template learning and utilization demonstrated by CSCGs, leading to more capable and reliable AI systems.

Future Directions

The work opens several avenues for future research. A central question is how LLMs might implement similar schema learning and token rebinding internally, perhaps via attention mechanisms or other context-sensitive machinery inherent to their design. Exploring how these mechanisms scale with increasingly complex data and tasks, and how they might be optimized for efficiency, would also be valuable.

In summary, the paper makes meaningful strides towards understanding ICL by advancing an interpretable model that effectively replicates and explains key capabilities. The proposed CSCG framework not only challenges existing perspectives on how in-context learning might operate in LLMs but also invites adaptations of these mechanisms into broader AI research and applications.