Abstract: Recent work analyzing in-context learning (ICL) has identified a broad set of strategies that describe model behavior in different experimental conditions. We aim to unify these findings by asking why a model learns these disparate strategies in the first place. Specifically, we start with the observation that when trained to learn a mixture of tasks, as is popular in the literature, the strategies learned by a model for performing ICL can be captured by a family of Bayesian predictors: a memorizing predictor, which assumes a discrete prior on the set of seen tasks, and a generalizing predictor, wherein the prior matches the underlying task distribution. Adopting the lens of rational analysis from cognitive science, where a learner's behavior is explained as an optimal adaptation to data given computational constraints, we develop a hierarchical Bayesian framework that almost perfectly predicts Transformer next token predictions throughout training without assuming access to its weights. Under this framework, pretraining is viewed as a process of updating the posterior probability of different strategies, and its inference-time behavior as a posterior-weighted average over these strategies' predictions. Our framework draws on common assumptions about neural network learning dynamics, which make explicit a tradeoff between loss and complexity among candidate strategies: beyond how well it explains the data, a model's preference towards implementing a strategy is dictated by its complexity. This helps explain well-known ICL phenomena, while offering novel predictions: e.g., we show a superlinear trend in the timescale for transition to memorization as task diversity is increased. Overall, our work advances an explanatory and predictive account of ICL grounded in tradeoffs between strategy loss and complexity.
Rational Emergence of In-Context Learning Strategies in Transformers
The paper "In-Context Learning Strategies Emerge Rationally" (Wurgaft et al., 21 Jun 2025) presents a comprehensive theoretical and empirical analysis of why Transformers, when trained on mixtures of tasks, exhibit distinct in-context learning (ICL) strategies—specifically, transitions between memorization and generalization. The authors unify disparate findings in the ICL literature by introducing a hierarchical Bayesian framework grounded in rational analysis, which models the emergence and dynamics of these strategies as optimal adaptations under computational constraints.
Unifying ICL Strategies: Memorization vs. Generalization
The central empirical observation is that, across a range of controlled synthetic tasks (sequence modeling, linear regression, and classification), Transformer models trained on mixtures of tasks display a sharp transition in behavior as a function of task diversity and training duration. For low task diversity or long training, models behave as memorizing predictors—they act as Bayesian predictors with a discrete prior over the set of seen tasks, effectively memorizing the training set. For high task diversity or early in training, models behave as generalizing predictors—they act as Bayesian predictors with a continuous prior over the true task-generating distribution, enabling generalization to unseen tasks.
This dichotomy recapitulates and unifies prior observations of task diversity thresholds and transient generalization in ICL. The authors formalize these predictors for each experimental setting, providing closed-form solutions (e.g., ridge regression for generalization in linear regression, empirical unigram statistics for generalization in sequence modeling).
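To make the two predictors concrete, the following minimal sketch instantiates them for the linear regression setting, assuming in-context examples (x, y) with y = w·x + Gaussian noise, a standard Gaussian task prior for the generalizing predictor, and a uniform discrete prior over the D pretraining task vectors for the memorizing one. The function names, noise variance, and other hyperparameters are illustrative choices, not the paper's code.

```python
import numpy as np

def generalizing_predictor(X, y, x_query, noise_var=0.25, prior_var=1.0):
    """Bayesian prediction under a continuous task prior w ~ N(0, prior_var * I);
    with Gaussian noise this reduces to ridge regression on the context."""
    d = X.shape[1]
    lam = noise_var / prior_var
    w_post = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return x_query @ w_post

def memorizing_predictor(X, y, x_query, seen_tasks, noise_var=0.25):
    """Bayesian prediction under a uniform discrete prior over the D task
    vectors seen during pretraining (seen_tasks has shape (D, d))."""
    resid = y[None, :] - seen_tasks @ X.T            # (D, n_context) residuals
    log_post = -0.5 * (resid ** 2).sum(axis=1) / noise_var
    log_post -= log_post.max()                       # numerical stability
    post = np.exp(log_post)
    post /= post.sum()
    return post @ (seen_tasks @ x_query)             # posterior-weighted prediction
```

With low task diversity the memorizing predictor concentrates its posterior on one of the few seen task vectors, whereas the generalizing predictor tracks the ridge solution for any task vector, which is what lets it handle unseen tasks.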
Hierarchical Bayesian Framework and Rational Analysis
To explain why and when models prefer one strategy over the other, the authors adopt the lens of rational analysis from cognitive science, positing that model behavior is an optimal adaptation to data and computational constraints. They propose a hierarchical Bayesian model in which the model's inference-time predictions are a posterior-weighted average of the memorizing and generalizing predictors. The posterior weights are determined by a tradeoff between the predictors' empirical loss (fit to data) and their implementation complexity (modeled via Kolmogorov complexity).
The key modeling assumptions are:
Power-law scaling of loss with dataset size: L(N) ≈ L(∞) + A/N^α, reflecting sublinear sample efficiency in neural networks.
Simplicity bias: The prior probability of a predictor is exponentially penalized by its Kolmogorov complexity, p(Q) ∝ 2^{−K(Q)/β}.
The resulting log-posterior odds between memorizing and generalizing predictors is:
η(N, D) = γ·N^{1−α}·ΔL(D) − ΔK(D)/β
where ΔL(D) is the loss difference, ΔK(D) is the complexity difference, and N, D are training steps and task diversity, respectively. The model's predictions are then a sigmoid-weighted interpolation between the two predictors.
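Below is a minimal sketch of how these quantities could combine into a prediction, assuming the reconstruction of the equations above; the sign convention (η > 0 favors memorization) and the default constants α, γ, β are illustrative placeholders rather than fitted values from the paper.

```python
import numpy as np

def log_posterior_odds(N, delta_L, delta_K, alpha=0.5, gamma=1.0, beta=1.0):
    """eta(N, D) = gamma * N^(1-alpha) * delta_L(D) - delta_K(D) / beta.
    delta_L > 0 when the memorizing predictor fits the pretraining data better;
    delta_K > 0 is its extra description-length (complexity) cost."""
    return gamma * N ** (1.0 - alpha) * delta_L - delta_K / beta

def blended_prediction(pred_mem, pred_gen, eta):
    """Posterior-weighted interpolation between the two predictors' outputs,
    with a sigmoid weight on the memorizing predictor."""
    w_mem = 1.0 / (1.0 + np.exp(-eta))
    return w_mem * pred_mem + (1.0 - w_mem) * pred_gen
```

Early in training (small N) or at high task diversity (small ΔL, large ΔK), η is negative and the sigmoid weight sits near zero, so the blended prediction is essentially the generalizing one; as N grows, the loss term eventually dominates and the weight moves toward one.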
Empirical Validation and Quantitative Predictivity
The framework is validated across three task families, with extensive experiments showing that the hierarchical Bayesian model almost perfectly predicts the next-token predictions of trained Transformers throughout training, without access to model weights. The model achieves a mean R² of 0.97 in linear regression, a mean agreement of 0.92 in classification, and a mean Spearman rank correlation of 0.97 in sequence modeling. The memorizing-predictor posterior probabilities inferred by the framework align almost exactly with the empirical relative distances between the Transformer's outputs and the two predictors.
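For concreteness, that empirical comparison can be pictured as a relative-distance readout of the Transformer's outputs against the two predictors; the sketch below uses Euclidean distance over a batch of query predictions, an illustrative choice rather than the paper's exact metric.

```python
import numpy as np

def empirical_memorization_weight(model_out, mem_out, gen_out, eps=1e-12):
    """Relative distance of the model's outputs to the two predictors:
    ~0 when the model matches the generalizing predictor, ~1 when it
    matches the memorizing one. All inputs are arrays over the same queries."""
    d_mem = np.linalg.norm(model_out - mem_out)
    d_gen = np.linalg.norm(model_out - gen_out)
    return d_gen / (d_mem + d_gen + eps)
```

Tracked over training, this kind of quantity is what the framework's sigmoid posterior weight is compared against.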
Notably, the model captures both previously reported phenomena and makes novel predictions:
Sublinear growth: The transition from generalization to memorization with training steps is sublinear and follows a sigmoidal curve in N^{1−α}.
Sharp phase transition: The crossover between strategies is rapid, with small changes in N or D yielding large changes in model behavior.
Superlinear scaling of transience: The time to transition from generalization to memorization grows superlinearly with task diversity, and can diverge if the loss difference between predictors vanishes (see the sketch below).
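Setting η(N, D) = 0 in the reconstructed log-odds gives the crossover step at which memorization overtakes generalization, which makes the superlinear-transience prediction explicit; the sketch below reuses the illustrative constants from the earlier snippet.

```python
def memorization_onset(delta_L, delta_K, alpha=0.5, gamma=1.0, beta=1.0):
    """Training step N* solving gamma * N^(1-alpha) * delta_L = delta_K / beta.
    Since 1/(1-alpha) > 1, N* grows superlinearly as higher task diversity D
    shrinks delta_L(D) and inflates delta_K(D), and it diverges as delta_L -> 0."""
    if delta_L <= 0:
        return float("inf")  # memorization never becomes preferred
    return (delta_K / (beta * gamma * delta_L)) ** (1.0 / (1.0 - alpha))
```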
Loss-Complexity Tradeoff and Scaling Effects
The analysis reveals that the loss-complexity tradeoff is fundamental to ICL dynamics. Early in training or at high task diversity, the simplicity bias favors the generalizing predictor. As training proceeds, the empirical loss term can overwhelm the complexity penalty, leading to a shift toward memorization. Increasing model capacity (e.g., MLP width) reduces the effective complexity penalty, making memorization more likely—a result confirmed empirically and captured by the model.
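Under the same illustrative parametrization, the capacity effect falls directly out of the crossover step: scaling β up (a weaker complexity penalty) pulls N* earlier by a factor of β^{1/(1−α)}. A toy usage of the memorization_onset sketch above, with made-up numbers:

```python
# Illustrative numbers only; larger beta (e.g., wider MLPs) means a weaker
# complexity penalty and hence an earlier switch to memorization.
for beta in (0.5, 1.0, 2.0):
    n_star = memorization_onset(delta_L=0.5, delta_K=50.0, beta=beta)
    print(f"beta={beta}: memorization onset ~ {n_star:.0f} steps")
```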
Theoretical and Practical Implications
The work provides a quantitative, predictive, and explanatory account of ICL phenomena, unifying behavioral, developmental, and mechanistic findings in the literature. It demonstrates that the emergence and transience of generalization in ICL can be understood as rational adaptations to data structure and computational constraints, rather than as idiosyncratic artifacts of architecture or optimization.
Practical implications include:
Predicting generalization failure: The framework can be used to anticipate when models will cease to generalize to new tasks, based on training duration, task diversity, and model capacity.
Designing training curricula: By manipulating task diversity and training schedules, practitioners can control the balance between memorization and generalization in deployed models.
Model selection and scaling: The explicit role of complexity bias suggests that scaling model size or modifying architectural inductive biases can systematically shift the memorization-generalization tradeoff.
Theoretical implications extend to the broader understanding of deep learning. The results support the view that neural networks can be interpreted as approximately Bayesian learners with a simplicity bias and sublinear sample efficiency. The hierarchical Bayesian perspective may generalize to other phenomena in deep learning, such as grokking, double descent, and the emergence of algorithmic reasoning.
Limitations and Future Directions
The analysis is primarily conducted in settings where two predictors suffice to explain model behavior. Extending the framework to settings with more complex or hierarchical strategy spaces (e.g., mixtures of Markov chains) is a natural next step. The current complexity measure, based on code and data compression, may not fully capture the implementation cost in neural architectures; integrating more refined measures of effective parameter usage is warranted.
Connecting the top-down rational analysis to bottom-up mechanistic accounts (e.g., circuit-level interpretability) remains an open challenge. Further, the framework currently focuses on in-distribution generalization; extending it to out-of-distribution and real-world tasks is an important avenue for future work.
Conclusion
This work advances a unified, normative account of in-context learning in Transformers, grounded in a loss-complexity tradeoff formalized by a hierarchical Bayesian model. The framework is both explanatory and predictive, capturing the dynamics of memorization and generalization across tasks, architectures, and training regimes. It provides actionable insights for both the design and interpretation of large-scale neural models, and suggests that rational analysis can serve as a powerful tool for understanding the emergent behavior of deep learning systems.