- The paper shows that increasing the diversity of pre-training modular arithmetic tasks enables large language models to transition from in-distribution memorization to robust out-of-distribution generalization with as few as two transformer blocks.
- It uses structured modular tasks, defined by linear functions, to uncover emergent attention mechanisms, including distinctive 'clock-of-clocks' patterns.
- The paper highlights that the MLP and LayerNorm layers help combine in-context examples, paving the way for automating complex problem-solving in novel scenarios.
Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks
Introduction
The paper by He, Doshi, Das, and Gromov explores the capability of LLMs to generalize from in-distribution (i.d.) to out-of-distribution (o.o.d.) tasks in the context of learning modular arithmetic. Specifically, the research explores the mechanisms of in-context learning (ICL) and skill composition, facilitated by pre-training on a variety of modular arithmetic tasks defined by linear functions of the form z = (a·x + b·y) mod p. This investigation is pivotal in understanding how LLMs acquire and combine simpler skills to solve more complex and previously unseen tasks, shedding light on the emergent behaviors that arise from expansive training.
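To make the task family concrete, the sketch below generates in-context examples for one such linear modular function. The modulus, coefficients, and example format are illustrative choices, not the paper's exact tokenization.

```python
import random

def make_task_examples(a, b, p, num_examples, seed=0):
    """Generate (x, y, z) triples for the task z = (a*x + b*y) mod p.

    A sequence of such triples forms the in-context prompt for one task.
    Illustrative sketch; the paper's token format is not reproduced here.
    """
    rng = random.Random(seed)
    examples = []
    for _ in range(num_examples):
        x, y = rng.randrange(p), rng.randrange(p)
        examples.append((x, y, (a * x + b * y) % p))
    return examples

# A task is identified by its coefficient pair (a, b); e.g. a=3, b=5 with p=29.
print(make_task_examples(a=3, b=5, p=29, num_examples=4))
```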
Methods and Experimentation
The modular arithmetic tasks employed involve learning finite collections of linear modular functions. Pre-training is performed on a subset of these tasks, while the remaining tasks are reserved for o.o.d. testing. This setup allows the empirical analysis of how transformers, particularly GPT-style models, develop the ability to generalize beyond their training data based on the diversity and quantity of pre-training tasks.
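A minimal sketch of this task split, assuming tasks are enumerated as coefficient pairs (a, b) and shuffled with a fixed seed; this illustrates the setup rather than reproducing the authors' exact sampling procedure.

```python
import random

def split_tasks(p, num_pretrain_tasks, seed=0):
    """Enumerate all (a, b) coefficient pairs mod p and hold some out.

    Pre-training tasks are used to build the training set; the held-out pairs
    are only ever presented in-context at evaluation time (o.o.d. tasks).
    """
    rng = random.Random(seed)
    all_tasks = [(a, b) for a in range(p) for b in range(p)]
    rng.shuffle(all_tasks)
    return all_tasks[:num_pretrain_tasks], all_tasks[num_pretrain_tasks:]

pretrain_tasks, ood_tasks = split_tasks(p=29, num_pretrain_tasks=200)
```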
The experiments use models of varying depth, while keeping a consistent setup of 512 embedding dimensions and 4 attention heads. Training routines employ the AdamW optimizer, with careful task selection and balanced batching to avoid overfitting to specific tasks.
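The setup might be instantiated roughly as follows in PyTorch; the generic encoder stack, vocabulary size, learning rate, and weight decay are placeholders rather than the paper's exact architecture and hyperparameters.

```python
import torch
import torch.nn as nn

# Reported setup: 512-dimensional embeddings, 4 attention heads, depth varied
# across experiments (two blocks shown here). Causal masking and positional
# embeddings are omitted for brevity.
p = 29
embed_dim, num_heads, num_blocks = 512, 4, 2

model = nn.Sequential(
    nn.Embedding(p + 2, embed_dim),  # residues plus a few special tokens (assumed vocabulary)
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                   dim_feedforward=4 * embed_dim, batch_first=True),
        num_layers=num_blocks,
    ),
    nn.Linear(embed_dim, p),  # predict the residue z at each position
)

# AdamW optimizer as in the described training routine; values are illustrative.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1.0)
```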
Key Findings
Transition to O.o.D. Generalization
A notable observation is the transition in model performance from in-distribution to out-of-distribution generalization as the diversity of pre-training tasks increases. The results indicate that a minimum of two transformer blocks is necessary for models to exhibit any form of o.o.d. generalization. For deeper models, the generalization phase tends to be transient, necessitating early stopping to capture optimal performance.
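In practice, the transient o.o.d. phase can be captured by tracking held-out-task accuracy during training and keeping the best checkpoint, as in the sketch below; train_step, eval_id, and eval_ood are hypothetical helpers standing in for the actual training and evaluation loops.

```python
import copy

def train_with_tracking(model, train_step, eval_id, eval_ood, num_steps):
    """Track i.d. and o.o.d. accuracy and keep the best o.o.d. checkpoint,
    since the o.o.d.-generalizing phase can be transient for deeper models."""
    best_ood_acc, best_state = 0.0, None
    for step in range(num_steps):
        train_step(model)
        id_acc, ood_acc = eval_id(model), eval_ood(model)
        if ood_acc > best_ood_acc:  # simple early-stopping criterion on held-out tasks
            best_ood_acc = ood_acc
            best_state = copy.deepcopy(model.state_dict())
    return best_ood_acc, best_state
```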
Accurate generalization to new, unseen tasks is quantitatively confirmed by phase diagrams that delineate four distinct phases of model behavior, as a function of the number of pre-training tasks and the number of examples per task:
- In-distribution memorization
- In-distribution generalization
- Out-of-distribution memorization
- Out-of-distribution generalization
These phases trace the progression of model capabilities from simple memorization to inferring and solving new tasks directly from context.
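One simple way to operationalize these phases, assuming accuracy thresholds as a stand-in for the paper's criteria, is to compare accuracy on previously seen versus fresh examples for both pre-training and held-out tasks:

```python
def classify_phase(id_seen_acc, id_unseen_acc, ood_seen_acc, ood_unseen_acc, thr=0.9):
    """Assign one of the four qualitative phases from an accuracy profile.

    *_seen_acc: accuracy on examples already encountered (memorization suffices);
    *_unseen_acc: accuracy on fresh examples of the same tasks (requires
    generalization). 'id' refers to pre-training tasks, 'ood' to held-out tasks.
    The 0.9 threshold is an illustrative choice, not the paper's exact criterion.
    """
    if ood_unseen_acc >= thr:
        return "o.o.d. generalization"
    if ood_seen_acc >= thr:
        return "o.o.d. memorization"
    if id_unseen_acc >= thr:
        return "i.d. generalization"
    if id_seen_acc >= thr:
        return "i.d. memorization"
    return "below all thresholds"
```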
Interpretability and Model Mechanisms
The interpretability analysis reveals the structured internal representations that models develop after training. Layer-wise examination shows that specific attention heads implement critical skills, such as the modular mappings and arithmetic operations fundamental to solving modular functions. These heads exhibit distinctive "clock-of-clocks" patterns, underscoring the mechanism by which the model carries out modular arithmetic operations.
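The underlying 'clock' picture can be illustrated numerically: residues are embedded on the unit circle, where modular addition becomes composition of rotations, and a 'clock-of-clocks' composes such maps across heads. The sketch below is an interpretation aid, not the authors' analysis code.

```python
import numpy as np

def clock_embed(x, p, k=1):
    """Embed residue x mod p on the unit circle at frequency k: the 'clock'
    representation in which modular addition becomes a rotation."""
    angle = 2 * np.pi * k * x / p
    return complex(np.cos(angle), np.sin(angle))

# Composing rotations on the clock implements addition mod p.
p, x, y = 29, 11, 25
angle_sum = np.angle(clock_embed(x, p) * clock_embed(y, p))
recovered = round(angle_sum / (2 * np.pi) * p) % p
print(recovered, (x + y) % p)  # both print 7
```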
The investigation extends to understanding the roles of MLP layers and LayerNorm in facilitating the combination of in-context examples. While explicit signals from these layers remain elusive, evidence suggests that the deeper layers optimize for combining rescaled examples within the learned structured representation space, enabling accurate generalization.
Implications and Future Directions
Practically, this research underscores the potential of LLMs to automate complex problem-solving without pre-specified task formulations. By learning to identify and leverage task representations directly from context, models reduce the need for extensive manual fine-tuning across diverse domains. Theoretically, the findings indicate that the scalability of LLMs to larger datasets and higher-dimensional tasks hinges on emergent structured representations and the nuanced interaction of model components.
Future work should explore the extension of these findings to more generalized AI contexts, where tasks may not be as cleanly describable as modular functions. There is also a need for deeper mechanistic interpretability, especially in deciphering MLP contributions and refining the notion of task diversity necessary for robust o.o.d. generalization.
By advancing the understanding of how complex task-solving capabilities emerge in neural architectures, this paper provides a framework for future explorations in AI development, promising strides towards more autonomous and versatile AI systems.