In-Context Occam's Razor: How Transformers Prefer Simpler Hypotheses on the Fly
This paper presents a systematic investigation into the inductive biases of transformers in in-context learning (ICL) scenarios where tasks are drawn from hierarchically structured hypothesis classes. The central claim is that transformers, when trained on mixtures of tasks of varying complexity, consistently select the simplest hypothesis sufficient to explain the in-context data, rather than defaulting to the most expressive available model. This behavior is justified theoretically via a Bayesian framework and validated empirically across synthetic and real-world settings, including Markov chains, linear regression, probabilistic context-free grammars (PCFGs), and pretrained LLMs such as GPT-4.
Problem Setting and Motivation
Prior work on ICL has largely focused on fixed-complexity tasks, such as order-1 Markov chains or standard linear regression, where the model class is unambiguous. However, real-world applications often present ambiguous contexts that can be explained by both simple and complex hypotheses. The authors address the following key questions:
- Can transformers, when trained on a mixture of tasks with hierarchical complexity (e.g., order-1 and order-3 Markov chains), infer the true underlying complexity of the context at inference time?
- Do transformers default to the most expressive hypothesis, or do they prefer the simplest sufficient explanation, in line with Occam's razor?
Experimental Framework
The authors construct controlled testbeds in which higher-complexity task categories are strict supersets of simpler ones. For example, any order-1 Markov chain can be represented as a special case of an order-3 chain. The main experimental settings are:
- Markov Chains: Transformers are trained on sequences generated by both order-1 and order-k Markov chains. At inference, the model is prompted with sequences from either category, and its output distribution is compared (via KL divergence) to the empirical n-gram statistics of the context (a minimal sketch of this testbed follows the list).
- Linear Regression: The model is trained on a mixture of full-dimensional and lower-dimensional (sparse) linear regression tasks. At inference, the model's predictions are compared to least-squares solutions in both the full and restricted subspaces.
- PCFGs and Boolean Functions: Additional experiments extend the analysis to PCFGs and Boolean function tasks, including evaluation on GPT-4.
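To make the Markov-chain testbed concrete, here is a minimal sketch of the data-generating and evaluation pipeline (our illustration, not the authors' code; the function names, vocabulary size, and context length are assumptions):

```python
import numpy as np

def sample_markov_chain(order, vocab_size, rng):
    """Sample a random order-k transition table: one categorical
    distribution over next tokens for every length-k history."""
    logits = rng.standard_normal((vocab_size ** order, vocab_size))
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def generate_sequence(transitions, order, vocab_size, length, rng):
    """Roll out a sequence, seeding the first `order` tokens at random."""
    seq = list(rng.integers(vocab_size, size=order))
    while len(seq) < length:
        hist = 0
        for tok in seq[-order:]:       # encode the history as a base-V index
            hist = hist * vocab_size + int(tok)
        seq.append(rng.choice(vocab_size, p=transitions[hist]))
    return np.array(seq)

def empirical_next_token_dist(seq, order, vocab_size, history):
    """Empirical next-token distribution following `history` in `seq`,
    with add-1 smoothing so the KL below stays finite."""
    counts = np.ones(vocab_size)
    for t in range(order, len(seq)):
        if tuple(seq[t - order:t]) == tuple(history):
            counts[seq[t]] += 1
    return counts / counts.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
V = 4
chain = sample_markov_chain(order=1, vocab_size=V, rng=rng)  # true process: order 1
ctx = generate_sequence(chain, order=1, vocab_size=V, length=512, rng=rng)
hist3 = ctx[-3:]
p1 = empirical_next_token_dist(ctx, 1, V, hist3[-1:])  # order-1 statistics
p3 = empirical_next_token_dist(ctx, 3, V, hist3)       # order-3 statistics
```

A trained model's next-token distribution `q` is then scored as `kl(q, p1)` versus `kl(q, p3)`; the reported finding is that `q` tracks the order-1 statistics when the context really is order-1.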
Key Empirical Findings
- Complexity Selection: Transformers reliably infer the correct complexity class of the context. When prompted with data generated by a simple process, the model's predictions align with the statistics of the simple class, even though the complex class could also fit the data perfectly.
- No Default to Complexity: When trained only on the complex class, transformers do not fall back to lower-complexity statistics at inference. The Occam's razor-like bias emerges only when the training distribution includes both simple and complex tasks.
- Robustness: The simplicity bias persists across variations in model size, training mixture proportions, and context length. Larger models converge faster, and the bias is robust even when the simple class is underrepresented in the training mix (the two-stage sampler sketched after this list makes the mixture concrete).
- Comparison with LSTMs: LSTMs exhibit a weaker form of this inductive bias, requiring significantly more capacity to match the complexity selection behavior of transformers.
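Concretely, the training distribution behind these findings can be pictured as a two-stage sampler, reusing `sample_markov_chain` and `generate_sequence` from the testbed sketch above (the mixing weight `p_simple` is an illustrative parameter, not a value from the paper):

```python
def sample_training_sequence(p_simple, vocab_size, length, rng):
    """Two-stage sampling: draw a task (order-1 with probability p_simple,
    otherwise order-3), then a sequence from a fresh chain of that order.
    Per the findings above, training on this mixture rather than on
    order-3 alone is what produces the in-context simplicity bias."""
    order = 1 if rng.random() < p_simple else 3
    chain = sample_markov_chain(order, vocab_size, rng)
    return generate_sequence(chain, order, vocab_size, length, rng)
```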
Theoretical Analysis
The authors provide a Bayesian explanation for the observed simplicity bias. When the context is compatible with multiple hypothesis classes, the marginal likelihood under a Bayesian model decomposes into an empirical fit term and a complexity penalty (akin to BIC). For large context lengths, the likelihoods under all compatible models are similar, but the complexity penalty strongly favors the simplest model. Thus, the posterior over model classes concentrates on the minimal sufficient hypothesis, implementing a form of Bayesian Occam's razor in-context.
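Spelled out, this is the classical Laplace/BIC approximation to the log marginal likelihood. Writing $d_M$ for the parameter count of class $M$ and $\hat{\theta}_M$ for its maximum-likelihood fit to the context $x_{1:n}$:

```latex
\log p(x_{1:n} \mid M)
  \approx \underbrace{\log p(x_{1:n} \mid \hat{\theta}_M, M)}_{\text{empirical fit}}
        - \underbrace{\frac{d_M}{2}\log n}_{\text{complexity penalty}},
\qquad
p(M \mid x_{1:n}) \propto p(x_{1:n} \mid M)\, p(M).
```

When a simple class $M_1$ and its superset $M_2$ fit the context equally well, the fit terms cancel in the posterior ratio, which then grows like $n^{(d_{M_2} - d_{M_1})/2}$ in favor of $M_1$.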
The paper also provides a constructive proof that a two-layer attention-only transformer can, in principle, compute the necessary empirical statistics and perform Bayesian model selection over Markov chain orders.
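For reference, exact Bayesian model selection over Markov orders is a short computation when each transition row gets a conjugate Dirichlet prior. The sketch below illustrates the selection rule the construction targets, not the paper's transformer construction itself; the Dirichlet(1) prior per row and the uniform prior over orders are simplifying assumptions:

```python
import numpy as np
from collections import Counter
from scipy.special import gammaln

def log_marginal_likelihood(seq, order, vocab_size):
    """log p(seq | order) under a Dirichlet(1, ..., 1) prior on each
    transition row, integrated out in closed form (Dirichlet-multinomial).
    Each model conditions on its first `order` tokens; for long contexts
    this boundary effect is negligible."""
    counts = Counter()
    for t in range(order, len(seq)):
        counts[(tuple(seq[t - order:t]), int(seq[t]))] += 1
    rows = {}
    for (hist, nxt), c in counts.items():
        rows.setdefault(hist, np.zeros(vocab_size))[nxt] = c
    logp = 0.0
    for row in rows.values():
        # log B(1 + counts) - log B(1); the gammaln(1) terms vanish.
        logp += gammaln(vocab_size) - gammaln(vocab_size + row.sum())
        logp += np.sum(gammaln(1.0 + row))
    return logp

def posterior_over_orders(seq, candidate_orders, vocab_size):
    """Uniform prior over candidate orders; returns the normalized posterior."""
    logs = np.array([log_marginal_likelihood(seq, k, vocab_size)
                     for k in candidate_orders])
    logs -= logs.max()                    # stabilize before exponentiating
    post = np.exp(logs)
    return post / post.sum()
```

On a context drawn from an order-1 chain, `posterior_over_orders(ctx, [1, 3], V)` concentrates on order 1 as the context grows, mirroring the in-context selection behavior the paper reports for trained transformers.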
Numerical Results
- In Markov chain experiments, the KL divergence between the model's output distribution and the empirical n-gram statistics approaches zero for the true order; when the context is generated by a simple process, the divergence to higher-order statistics remains significantly larger.
- In linear regression, the model's predictions align with the lower-dimensional least-squares solution when the context is generated by a sparse regressor, even though the full-dimensional solution interpolates the data equally well (see the sketch after this list).
- On GPT-4, when prompted with in-context examples consistent with both a simple and a more complex Boolean function, the model's predictions on disambiguating queries match the simple function, confirming the presence of the simplicity bias in pretrained LLMs.
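The regression comparison can be reproduced in a few lines. In the sketch below (our illustration; the sparse regressor is assumed to live on the first `s` coordinates), both solutions interpolate the context exactly, but only the restricted one matches the sparse ground truth on a fresh query:

```python
import numpy as np

rng = np.random.default_rng(1)
d, s, n = 16, 4, 8                       # full dim, active dim, context size (n < d)
w_true = np.zeros(d)
w_true[:s] = rng.standard_normal(s)      # ground truth is sparse

X = rng.standard_normal((n, d))
y = X @ w_true                           # noiseless in-context examples

# Restricted least squares over the active subspace recovers w_true exactly
# (n >= s, and X[:, :s] has full column rank almost surely).
w_sparse = np.zeros(d)
sol, *_ = np.linalg.lstsq(X[:, :s], y, rcond=None)
w_sparse[:s] = sol

# Full-dimensional minimum-norm interpolator: fits the context equally
# perfectly (since n < d), but predicts differently off the context.
w_full = np.linalg.pinv(X) @ y

x_query = rng.standard_normal(d)
print("true      :", x_query @ w_true)
print("restricted:", x_query @ w_sparse)  # equals the true target here
print("full      :", x_query @ w_full)    # generically different
```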
Implications and Future Directions
Practical Implications:
- Generalization: The Occam's razor-like inductive bias may underlie the strong generalization capabilities of transformers in real-world ICL scenarios, where task complexity is not known a priori.
- Task Mixture Design: For applications requiring robust complexity selection, it is essential to include a representative mixture of task complexities during pretraining.
- Model Selection in Context: Transformers can serve as in-context model selectors, dynamically adapting their inference strategy to the complexity of the observed data.
Theoretical Implications:
- The results extend the Bayesian view of ICL to hierarchical hypothesis classes, providing a principled explanation for simplicity preference in transformers.
- The constructive analysis of transformer architectures for empirical statistic computation offers a foundation for mechanistic interpretability studies.
Future Directions:
- Mechanistic Interpretability: Elucidating the specific circuit components and training dynamics that give rise to in-context complexity selection.
- Beyond Dimensionality: Extending the analysis to more intricate forms of hierarchical structure, such as variable sparsity patterns or compositional grammars.
- Scaling Laws: Investigating how the simplicity bias scales with model size, data diversity, and context length in large-scale pretraining regimes.
- Alternative Architectures: Exploring whether other sequence models (e.g., RNNs, MLPs) can be induced to exhibit similar inductive biases under different training regimes.
Conclusion
This work provides strong empirical and theoretical evidence that transformers, when trained on mixtures of tasks with hierarchical complexity, implement an in-context Occam's razor by selecting the simplest sufficient hypothesis for the observed data. This inductive bias is robust, emerges naturally from standard training objectives, and is present in both synthetic and real-world settings, including pretrained LLMs. The findings have significant implications for the design, training, and interpretability of transformer-based models in diverse ICL applications.