Is attention required for ICL? Exploring the Relationship Between Model Architecture and In-Context Learning Ability (2310.08049v3)

Published 12 Oct 2023 in cs.LG

Abstract: What is the relationship between model architecture and the ability to perform in-context learning? In this empirical study, we take the first steps toward answering this question. We evaluate thirteen model architectures capable of causal language modeling across a suite of synthetic in-context learning tasks. These selected architectures represent a broad range of paradigms, including recurrent and convolution-based neural networks, transformers, state space model-inspired architectures, and other emerging attention alternatives. We discover that all the considered architectures can perform in-context learning under a wider range of conditions than previously documented. Additionally, we observe stark differences in statistical efficiency and consistency by varying the number of in-context examples and task difficulty. We also measure each architecture's predisposition towards in-context learning when presented with the option to memorize rather than leverage in-context examples. Finally, and somewhat surprisingly, we find that several attention alternatives are sometimes competitive with or better in-context learners than transformers. However, no single architecture demonstrates consistency across all tasks, with performance either plateauing or declining when confronted with a significantly larger number of in-context examples than those encountered during gradient-based training.

Exploring the Relationship Between Model Architecture and In-Context Learning Ability

The paper by Ivan Lee, Nan Jiang, and Taylor Berg-Kirkpatrick investigates the relationship between model architecture and the capacity for in-context learning (ICL). The study evaluates ICL across a diverse set of models, including recurrent neural networks (RNNs), convolution-based models, transformers, and state space model-inspired architectures.
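
To make the evaluation setup concrete, the sketch below illustrates one common way such synthetic ICL tasks are constructed: each prompt interleaves input-output pairs drawn from a freshly sampled function, and the model is scored on its prediction for a held-out query, with no gradient updates. The function names, dimensions, and the `model_predict` interface here are hypothetical conveniences, not the authors' actual code.

```python
import numpy as np

def sample_linear_regression_prompt(n_examples, dim, noise_std=0.0, rng=None):
    """Build one synthetic ICL prompt: (x_i, y_i) pairs from a freshly
    sampled linear function, plus a final query point. Illustrative only."""
    rng = rng or np.random.default_rng()
    w = rng.normal(size=dim)                     # task-specific weight vector
    xs = rng.normal(size=(n_examples + 1, dim))  # last row is the query
    ys = xs @ w + noise_std * rng.normal(size=n_examples + 1)
    return xs[:-1], ys[:-1], xs[-1], ys[-1]

def evaluate_icl(model_predict, n_prompts=256, n_examples=32, dim=8):
    """Mean squared error on the query label, given only in-context
    examples and no gradient updates. `model_predict` is a stand-in for
    whatever interface wraps the trained causal model."""
    errs = []
    for _ in range(n_prompts):
        cx, cy, qx, qy = sample_linear_regression_prompt(n_examples, dim)
        pred = model_predict(cx, cy, qx)
        errs.append((pred - qy) ** 2)
    return float(np.mean(errs))
```

Sweeping `n_examples` in such a loop is what exposes the efficiency and consistency differences discussed below.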

Key Observations

  1. Universality of ICL: Remarkably, the paper finds that all evaluated architectures, including non-transformer models, demonstrated the ability to perform ICL on the synthetic tasks. This contrasts with prior assumptions that ICL was predominantly the domain of attention-based models like transformers, and it suggests that the potential for ICL may be a general property of causal sequence models rather than a feature of attention in particular.
  2. Efficiency and Consistency: The paper highlights significant differences in the statistical efficiency and consistency of ICL across the examined architectures. For example, while transformers, especially variants without positional embeddings, showed competitive ICL performance, their consistency degraded when presented with substantially more in-context examples (and hence longer prompts) than were seen during training.
  3. Attention Alternatives: Attention alternatives such as Hyena and Mamba, which draw on state space and long-convolution ideas, showed promising results, often matching or surpassing transformers on tasks such as associative recall and multiclass classification as the number of in-context examples increased. This finding supports further exploration of these models as viable, and under certain conditions superior, alternatives to transformers.
  4. Influence of Training Data: The paper underscores how the distributional properties of the training data, such as burstiness, critically influence ICL. In particular, architectures like Llama2 and Hyena exhibited a propensity for ICL when the training data included bursty examples, suggesting that data characteristics are integral to enabling in-context learning (a sketch of bursty versus non-bursty sequence construction follows this list).
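
The notion of burstiness in item 4 can be made concrete with a small data-construction sketch: a bursty sequence restricts itself to a handful of classes so the same labels recur within one context window, while a non-bursty sequence samples classes independently. The function below is an illustrative assumption about how such data might be generated, not the paper's actual pipeline.

```python
import numpy as np

def make_sequence(n_classes, exemplars_per_class, seq_len, bursty, rng=None):
    """Sample a toy training sequence of (exemplar_id, label) pairs.

    In the bursty regime the sequence is restricted to a small subset of
    classes, so labels repeat within the same context window; in the
    non-bursty regime labels are drawn i.i.d. over all classes.
    Illustrative construction only.
    """
    rng = rng or np.random.default_rng()
    if bursty:
        n_active = min(n_classes, max(2, seq_len // 4))
        active = rng.choice(n_classes, size=n_active, replace=False)
        labels = rng.choice(active, size=seq_len)
    else:
        labels = rng.choice(n_classes, size=seq_len)
    # Pair each label with a randomly chosen exemplar of that class.
    exemplars = rng.integers(exemplars_per_class, size=seq_len)
    return list(zip(exemplars.tolist(), labels.tolist()))

# Example: a bursty sequence over 64 classes with 8 exemplars per class.
# bursty_seq = make_sequence(n_classes=64, exemplars_per_class=8,
#                            seq_len=16, bursty=True)
```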

Theoretical and Practical Implications

On a theoretical level, the results suggest that in-context learning is not merely an artifact of attention mechanisms but a phenomenon that emerges across a wide range of neural architectures, which broadens how researchers conceptualize learning dynamics in sequence models. Practically, the evidence that non-transformer models can perform ICL expands the potential applications of traditional architectures such as RNNs and CNNs, especially in settings where resource constraints make transformers less viable.

Future Directions

The paper opens several avenues for future research. Further investigation is warranted into the specific mechanisms underlying ICL in architectures beyond transformers, such as potential analogs to induction heads in these models. Additionally, exploring the practical applications of these findings in real-world scenarios, such as language modeling with efficient architectures, could yield significant advances in deployment strategies for machine learning models.
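
One way to probe for induction-head-like behavior in non-transformer architectures, in the spirit of prior work on induction heads, is a repeated-sequence test: feed a random token sequence twice and check how accurately the model predicts tokens in the second copy, which are recoverable only by matching the earlier occurrence. The sketch below assumes a causal model whose forward call returns next-token logits; that interface is hypothetical and would need adapting to a specific implementation.

```python
import torch

@torch.no_grad()
def repeated_sequence_score(model, vocab_size, seq_len=64, n_trials=32,
                            device="cpu"):
    """Induction-style probe: feed [s ; s] for a random token sequence s
    and measure next-token accuracy inside the second copy, where the
    correct continuation can be found by matching the earlier occurrence.
    Assumes `model(tokens)` returns logits of shape (batch, length, vocab);
    adapt to the model's real interface."""
    hits, total = 0, 0
    for _ in range(n_trials):
        s = torch.randint(vocab_size, (1, seq_len), device=device)
        tokens = torch.cat([s, s], dim=1)        # sequence repeated twice
        logits = model(tokens)                   # (1, 2*seq_len, vocab)
        preds = logits.argmax(dim=-1)
        # Positions seq_len .. 2*seq_len-2 predict tokens in the second copy.
        target = tokens[0, seq_len + 1:]
        pred = preds[0, seq_len:-1]
        hits += (pred == target).sum().item()
        total += target.numel()
    return hits / total
```

High accuracy on this probe would suggest a copying mechanism analogous to induction heads, whatever the underlying architecture.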

Overall, this paper provides a comprehensive empirical framework for evaluating ICL across model architectures, delivering insights into the universality of ICL mechanisms and challenging the transformer-centric view of in-context learning.

Authors (3)
  1. Ivan Lee (28 papers)
  2. Nan Jiang (210 papers)
  3. Taylor Berg-Kirkpatrick (106 papers)
Citations (12)