Is attention required for ICL? Exploring the Relationship Between Model Architecture and In-Context Learning Ability (2310.08049v3)

Published 12 Oct 2023 in cs.LG

Abstract: What is the relationship between model architecture and the ability to perform in-context learning? In this empirical study, we take the first steps toward answering this question. We evaluate thirteen model architectures capable of causal language modeling across a suite of synthetic in-context learning tasks. These selected architectures represent a broad range of paradigms, including recurrent and convolution-based neural networks, transformers, state space model-inspired architectures, and other emerging attention alternatives. We discover that all the considered architectures can perform in-context learning under a wider range of conditions than previously documented. Additionally, we observe stark differences in statistical efficiency and consistency by varying the number of in-context examples and task difficulty. We also measure each architecture's predisposition towards in-context learning when presented with the option to memorize rather than leverage in-context examples. Finally, and somewhat surprisingly, we find that several attention alternatives are sometimes competitive with or better in-context learners than transformers. However, no single architecture demonstrates consistency across all tasks, with performance either plateauing or declining when confronted with a significantly larger number of in-context examples than those encountered during gradient-based training.

Exploring the Relationship Between Model Architecture and In-Context Learning Ability

The paper by Ivan Lee, Nan Jiang, and Taylor Berg-Kirkpatrick investigates the relationship between model architecture and the capacity for in-context learning (ICL). The study evaluates ICL across a diverse set of models, including recurrent neural networks (RNNs), convolution-based models, transformers, and state space model-inspired architectures.
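
To make the evaluation setup concrete, the sketch below illustrates one common way such synthetic ICL tasks are constructed: each prompt interleaves input-output pairs drawn from a freshly sampled function, and the model is scored on its prediction for a held-out query, with no gradient updates. The function names, dimensions, and the `model_predict` interface here are hypothetical conveniences, not the authors' actual code.

```python
import numpy as np

def sample_linear_regression_prompt(n_examples, dim, noise_std=0.0, rng=None):
    """Build one synthetic ICL prompt: (x_i, y_i) pairs from a freshly
    sampled linear function, plus a final query point. Illustrative only."""
    rng = rng or np.random.default_rng()
    w = rng.normal(size=dim)                     # task-specific weight vector
    xs = rng.normal(size=(n_examples + 1, dim))  # last row is the query
    ys = xs @ w + noise_std * rng.normal(size=n_examples + 1)
    return xs[:-1], ys[:-1], xs[-1], ys[-1]

def evaluate_icl(model_predict, n_prompts=256, n_examples=32, dim=8):
    """Mean squared error on the query label, given only in-context
    examples and no gradient updates. `model_predict` is a stand-in for
    whatever interface wraps the trained causal model."""
    errs = []
    for _ in range(n_prompts):
        cx, cy, qx, qy = sample_linear_regression_prompt(n_examples, dim)
        pred = model_predict(cx, cy, qx)
        errs.append((pred - qy) ** 2)
    return float(np.mean(errs))
```

Sweeping `n_examples` in such a loop is what exposes the efficiency and consistency differences discussed below.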

Key Observations

  1. Universality of ICL: Remarkably, the paper finds that all evaluated architectures, including non-transformer models, demonstrated the ability to perform ICL on the synthetic tasks. This contrasts with prior assumptions that ICL was predominantly the domain of attention-based models like transformers, and it suggests that the potential for ICL may be a general property of causal sequence models rather than a feature of attention in particular.
  2. Efficiency and Consistency: The paper highlights significant differences in the statistical efficiency and consistency of ICL across the examined architectures. For example, while transformers, especially variants without positional embeddings, showed competitive ICL performance, their consistency degraded when presented with substantially more in-context examples (and hence longer prompts) than were seen during training.
  3. Attention Alternatives: Attention alternatives such as Hyena and Mamba, which draw on state space and long-convolution ideas, showed promising results, often matching or surpassing transformers on tasks such as associative recall and multiclass classification as the number of in-context examples increased. This finding supports further exploration of these models as viable, and under certain conditions superior, alternatives to transformers.
  4. Influence of Training Data: The paper underscores how the distributional properties of the training data, such as burstiness, critically influence ICL. In particular, architectures like Llama2 and Hyena exhibited a propensity for ICL when the training data included bursty examples, suggesting that data characteristics are integral to enabling in-context learning (a sketch of bursty versus non-bursty sequence construction follows this list).
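
The notion of burstiness in item 4 can be made concrete with a small data-construction sketch: a bursty sequence restricts itself to a handful of classes so the same labels recur within one context window, while a non-bursty sequence samples classes independently. The function below is an illustrative assumption about how such data might be generated, not the paper's actual pipeline.

```python
import numpy as np

def make_sequence(n_classes, exemplars_per_class, seq_len, bursty, rng=None):
    """Sample a toy training sequence of (exemplar_id, label) pairs.

    In the bursty regime the sequence is restricted to a small subset of
    classes, so labels repeat within the same context window; in the
    non-bursty regime labels are drawn i.i.d. over all classes.
    Illustrative construction only.
    """
    rng = rng or np.random.default_rng()
    if bursty:
        n_active = min(n_classes, max(2, seq_len // 4))
        active = rng.choice(n_classes, size=n_active, replace=False)
        labels = rng.choice(active, size=seq_len)
    else:
        labels = rng.choice(n_classes, size=seq_len)
    # Pair each label with a randomly chosen exemplar of that class.
    exemplars = rng.integers(exemplars_per_class, size=seq_len)
    return list(zip(exemplars.tolist(), labels.tolist()))

# Example: a bursty sequence over 64 classes with 8 exemplars per class.
# bursty_seq = make_sequence(n_classes=64, exemplars_per_class=8,
#                            seq_len=16, bursty=True)
```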

Theoretical and Practical Implications

On a theoretical level, the results suggest that in-context learning is not merely an artifact of attention mechanisms but a phenomenon that emerges across a wide range of neural architectures, which broadens how researchers conceptualize learning dynamics in sequence models. Practically, the evidence that non-transformer models can perform ICL expands the potential applications of traditional architectures such as RNNs and CNNs, especially in settings where resource constraints make transformers less viable.

Future Directions

The paper opens several avenues for future research. Further investigation is warranted into the specific mechanisms underlying ICL in architectures beyond transformers, such as potential analogs to induction heads in these models. Additionally, exploring the practical applications of these findings in real-world scenarios, such as language modeling with efficient architectures, could yield significant advances in deployment strategies for machine learning models.
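
One way to probe for induction-head-like behavior in non-transformer architectures, in the spirit of prior work on induction heads, is a repeated-sequence test: feed a random token sequence twice and check how accurately the model predicts tokens in the second copy, which are recoverable only by matching the earlier occurrence. The sketch below assumes a causal model whose forward call returns next-token logits; that interface is hypothetical and would need adapting to a specific implementation.

```python
import torch

@torch.no_grad()
def repeated_sequence_score(model, vocab_size, seq_len=64, n_trials=32,
                            device="cpu"):
    """Induction-style probe: feed [s ; s] for a random token sequence s
    and measure next-token accuracy inside the second copy, where the
    correct continuation can be found by matching the earlier occurrence.
    Assumes `model(tokens)` returns logits of shape (batch, length, vocab);
    adapt to the model's real interface."""
    hits, total = 0, 0
    for _ in range(n_trials):
        s = torch.randint(vocab_size, (1, seq_len), device=device)
        tokens = torch.cat([s, s], dim=1)        # sequence repeated twice
        logits = model(tokens)                   # (1, 2*seq_len, vocab)
        preds = logits.argmax(dim=-1)
        # Positions seq_len .. 2*seq_len-2 predict tokens in the second copy.
        target = tokens[0, seq_len + 1:]
        pred = preds[0, seq_len:-1]
        hits += (pred == target).sum().item()
        total += target.numel()
    return hits / total
```

High accuracy on this probe would suggest a copying mechanism analogous to induction heads, whatever the underlying architecture.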

Overall, this paper provides a comprehensive empirical framework for evaluating ICL across model architectures, delivering insights into the universality of ICL mechanisms and challenging the transformer-centric view of in-context learning.

Authors (3)
  1. Ivan Lee (28 papers)
  2. Nan Jiang (210 papers)
  3. Taylor Berg-Kirkpatrick (106 papers)
Citations (12)