In-Context Language Learning: Architectures and Algorithms (2401.12973v2)

Published 23 Jan 2024 in cs.CL and cs.LG

Abstract: Large-scale neural language models exhibit a remarkable capacity for in-context learning (ICL): they can infer novel functions from datasets provided as input. Most of our current understanding of when and how ICL arises comes from LMs trained on extremely simple learning problems like linear regression and associative recall. There remains a significant gap between these model problems and the "real" ICL exhibited by LMs trained on large text corpora, which involves not just retrieval and function approximation but free-form generation of language and other structured outputs. In this paper, we study ICL through the lens of a new family of model problems we term in-context language learning (ICLL). In ICLL, LMs are presented with a set of strings from a formal language, and must generate additional strings from the same language. We focus on in-context learning of regular languages generated by random finite automata. We evaluate a diverse set of neural sequence models (including several RNNs, Transformers, and state-space model variants) on regular ICLL tasks, aiming to answer three questions: (1) Which model classes are empirically capable of ICLL? (2) What algorithmic solutions do successful models implement to perform ICLL? (3) What architectural changes can improve ICLL in less performant models? We first show that Transformers significantly outperform neural sequence models with recurrent or convolutional representations on ICLL tasks. Next, we provide evidence that their ability to do so relies on specialized "n-gram heads" (higher-order variants of induction heads) that compute input-conditional next-token distributions. Finally, we show that hard-wiring these heads into neural models improves performance not just on ICLL, but on natural language modeling -- improving the perplexity of 340M-parameter models by up to 1.14 points (6.7%) on the SlimPajama dataset.

Introduction

The advent of powerful neural language models has been accompanied by growing interest in in-context learning (ICL), in which models adapt to new functions or distributions based on examples provided in their input. However, understanding and improving ICL in large-scale language models remains a complex challenge. To address this, the authors home in on in-context language learning (ICLL) as a means of investigating models' capacity to reason compositionally about sequences within formal languages, a subset of the broader ICL phenomenon.

ICLL Model Problems

ICLL model problems serve as a structured framework for probing neural networks' abilities to classify and generate strings belonging to an unfamiliar formal language. The authors define ICLL as the task in which models are given strings sampled from a randomly generated language and must infer the underlying distribution. This approach advances the study of ICL by presenting linguistically structured yet compositionally complex problems, reflective of the tasks faced by large-scale language models.
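
As a concrete illustration, the sketch below (a simplified construction of our own, not the paper's data-generation code) builds a random finite automaton with per-state emission probabilities and assembles an ICLL-style prompt from strings it generates. The alphabet, string length, number of examples, and delimiter are illustrative assumptions.

import random

def random_automaton(n_states=4, alphabet="abcd", seed=0):
    """Build a random automaton: a deterministic transition table plus
    per-state probabilities over which symbol is emitted next (a simple
    stand-in for the randomly generated regular languages described above)."""
    rng = random.Random(seed)
    transitions = {
        (s, ch): rng.randrange(n_states)
        for s in range(n_states) for ch in alphabet
    }
    emit_probs = {s: [rng.random() for _ in alphabet] for s in range(n_states)}
    for s in emit_probs:
        z = sum(emit_probs[s])
        emit_probs[s] = [p / z for p in emit_probs[s]]
    return transitions, emit_probs

def sample_string(transitions, emit_probs, alphabet="abcd", length=8, seed=None):
    """Sample one string by walking the automaton from state 0."""
    rng = random.Random(seed)
    state, out = 0, []
    for _ in range(length):
        ch = rng.choices(alphabet, weights=emit_probs[state])[0]
        out.append(ch)
        state = transitions[(state, ch)]
    return "".join(out)

# An ICLL prompt: several strings from the same random language, delimited
# so the model must infer the shared distribution in context.
transitions, emit_probs = random_automaton()
prompt = " | ".join(sample_string(transitions, emit_probs, seed=i) for i in range(5))
print(prompt)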

Methodology

To gauge the proficiency of different neural architectures on ICLL tasks, the authors conducted systematic experiments evaluating sequence models ranging from traditional RNNs and Transformers to newer state-space variants. These models were challenged with tasks derived from regular languages represented by probabilistic finite automata. The study pursued three objectives: assessing which classes of models can perform ICLL effectively, uncovering the algorithmic solutions and circuits implemented by successful models, and exploring whether insights into ICLL could inform architectural improvements.
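
One natural way to score a model on such tasks is to compare its predicted next-symbol distribution against the generating automaton's true conditional distribution at each position. The sketch below uses total variation distance, one of several reasonable metrics; the transitions and emit_probs arguments follow the structure of the earlier snippet, and model_next_dist is a hypothetical interface standing in for a trained model.

def total_variation(p, q):
    """Total variation distance between two next-symbol distributions,
    given as dicts mapping symbols to probabilities."""
    symbols = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in symbols)

def score_on_string(model_next_dist, transitions, emit_probs, string, alphabet="abcd"):
    """Average distance between a model's predictions and the automaton's
    true conditional next-symbol distribution along one in-context string.
    `model_next_dist(prefix)` is an assumed interface returning a dict of
    next-symbol probabilities for the given prefix."""
    state, total = 0, 0.0
    for i, ch in enumerate(string):
        truth = dict(zip(alphabet, emit_probs[state]))
        total += total_variation(model_next_dist(string[:i]), truth)
        state = transitions[(state, ch)]
    return total / len(string)

# Example baseline: always predict the uniform distribution over the alphabet.
uniform = lambda prefix: {ch: 0.25 for ch in "abcd"}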

Results and Insights

The findings were multifaceted. Transformers demonstrated a clear advantage on ICLL tasks over their recurrent and convolutional counterparts. This advantage was ascribed in part to specialized "n-gram heads" that compute next-token distributions conditioned on the preceding few tokens, much as classical n-gram models do. Through analyses of attention patterns, representational probing, and behavioral evaluation, these n-gram heads were pinpointed as a cornerstone of effective ICLL in Transformers.
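
To make the idea concrete, the counting-based sketch below (our own illustration, not a component extracted from the models in the paper) computes, for each position, the empirical next-token distribution over earlier in-context occurrences of the current length-(n-1) suffix; an n-gram head can be understood as approximating this computation with attention over matching contexts.

import numpy as np

def ngram_head_predictions(tokens, vocab_size, n=2):
    """Counting analogue of an in-context n-gram head: for each position,
    estimate the next-token distribution by finding earlier occurrences of
    the current length-(n-1) suffix and counting what followed them."""
    T = len(tokens)
    preds = np.full((T, vocab_size), 1.0 / vocab_size)  # uniform fallback
    for t in range(n - 1, T):
        suffix = tuple(tokens[t - n + 2 : t + 1])  # last n-1 tokens
        counts = np.zeros(vocab_size)
        for s in range(n - 1, t):
            if tuple(tokens[s - n + 2 : s + 1]) == suffix:
                counts[tokens[s + 1]] += 1
        if counts.sum() > 0:
            preds[t] = counts / counts.sum()
    return preds

# Example: with n=2 this reduces to in-context bigram statistics.
print(ngram_head_predictions([0, 1, 0, 2, 0], vocab_size=3, n=2)[-1])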

Architectural Improvements

Drawing on these insights, the authors devised an architectural intervention in which hard-wired n-gram heads were inserted into both Transformer and non-Transformer architectures. This augmentation not only bolstered performance on synthetic ICLL tasks but also reduced perplexity in natural language modeling. The success of these n-gram head insertions supports the idea that language models may benefit from incorporating explicit mechanisms reminiscent of classical language modeling algorithms.
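
The sketch below gives a rough sense of what such a hard-wired head could look like as a PyTorch module; the class name, wiring, and the choice of a bigram-style (n = 2) matching rule are our own simplifications rather than the paper's exact construction. Each position attends uniformly over earlier positions carrying the same token and aggregates the value vectors of the tokens that followed them; in practice the output would be added to a block's residual stream alongside, or in place of, a learned attention head.

import torch
import torch.nn as nn

class HardwiredBigramHead(nn.Module):
    """Illustrative hard-wired bigram head (a simplified sketch, not the
    paper's exact layer). Attention weights are determined by exact token
    matches rather than learned query-key products."""

    def __init__(self, d_model: int):
        super().__init__()
        self.value = nn.Linear(d_model, d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, T) integer tokens; hidden: (B, T, d_model)
        B, T = token_ids.shape
        # match[b, t, s] = 1 if the token at position s equals the token at t
        match = (token_ids.unsqueeze(2) == token_ids.unsqueeze(1)).float()
        causal = torch.tril(torch.ones(T, T, device=hidden.device), diagonal=-1)
        weights = match * causal                      # only strictly earlier positions
        weights = weights / weights.sum(-1, keepdim=True).clamp(min=1.0)
        # Values are taken from the successor position (the token that followed).
        v = self.value(hidden)
        v_next = torch.zeros_like(v)
        v_next[:, :-1] = v[:, 1:]
        out = torch.einsum("bts,bsd->btd", weights, v_next)
        return self.out(out)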

Conclusion

The exploration of ICLL offers a clearer picture of how large-scale language models perform ICL. The identification and hard-wiring of n-gram heads crystallize the idea that important aspects of in-context language learning rest on mechanisms both new and old, challenging and advancing our understanding of neural sequence models' in-context learning capabilities.

Authors (4)
  1. Ekin Akyürek
  2. Bailin Wang
  3. Yoon Kim
  4. Jacob Andreas
Citations (28)