
In-Context Occam's Razor: How Transformers Prefer Simpler Hypotheses on the Fly (2506.19351v1)

Published 24 Jun 2025 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: In-context learning (ICL) enables transformers to adapt to new tasks through contextual examples without parameter updates. While existing research has typically studied ICL in fixed-complexity environments, practical LLMs encounter tasks spanning diverse complexity levels. This paper investigates how transformers navigate hierarchical task structures where higher-complexity categories can perfectly represent any pattern generated by simpler ones. We design well-controlled testbeds based on Markov chains and linear regression that reveal transformers not only identify the appropriate complexity level for each task but also accurately infer the corresponding parameters--even when the in-context examples are compatible with multiple complexity hypotheses. Notably, when presented with data generated by simpler processes, transformers consistently favor the least complex sufficient explanation. We theoretically explain this behavior through a Bayesian framework, demonstrating that transformers effectively implement an in-context Bayesian Occam's razor by balancing model fit against complexity penalties. We further ablate on the roles of model size, training mixture distribution, inference context length, and architecture. Finally, we validate this Occam's razor-like inductive bias on a pretrained GPT-4 model with Boolean-function tasks as case study, suggesting it may be inherent to transformers trained on diverse task distributions.

In-Context Occam's Razor: How Transformers Prefer Simpler Hypotheses on the Fly

This paper provides a systematic investigation into the inductive biases of transformers in in-context learning (ICL) scenarios where tasks are organized hierarchically by complexity. The central claim is that transformers, when trained on mixtures of tasks with varying complexity, consistently select the simplest hypothesis sufficient to explain the in-context data, rather than defaulting to the most expressive hypothesis class available. This behavior is theoretically justified via a Bayesian framework and empirically validated across synthetic and real-world settings, including Markov chains, linear regression, probabilistic context-free grammars (PCFGs), and large pretrained models such as GPT-4.

Problem Setting and Motivation

ICL enables transformers to adapt to new tasks by conditioning on a prompt of input-output examples, without parameter updates. Prior work has largely focused on fixed-complexity tasks, but real-world applications require models to handle a spectrum of task complexities. The authors address the question: When presented with data compatible with multiple hypothesis classes, do transformers select the simplest sufficient explanation, or do they default to the most complex?

To probe this, the authors construct controlled testbeds where higher-complexity task classes strictly contain lower-complexity ones (e.g., higher-order Markov chains can represent all lower-order chains). This setup introduces ambiguity: for data generated by a simple process, both simple and complex hypotheses can explain the data perfectly.
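To make the nesting concrete, the sketch below (an illustration, not code from the paper) shows how any order-1 transition matrix induces an order-3 kernel that simply ignores the two older symbols, so the order-3 class strictly contains the order-1 class.

```python
import numpy as np

# Illustration of the nesting claim: an order-1 chain over V symbols is a
# special case of an order-3 chain whose kernel ignores x_{t-3} and x_{t-2}.
V = 3
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(V), size=V)   # order-1: row x1 gives p(x_t | x_{t-1} = x1)

def order3_kernel(x3, x2, x1):
    """p(x_t | x_{t-3}=x3, x_{t-2}=x2, x_{t-1}=x1) for the induced order-3 chain."""
    return P[x1]                         # depends only on the most recent symbol

assert np.allclose(order3_kernel(0, 2, 1), P[1])
```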

Experimental Framework

Markov Chains

Transformers are trained on sequences generated by both order-1 (simple) and higher-order (complex) Markov chains. At inference, the model is prompted with sequences from either class. The key evaluation metric is the KL divergence between the model's output distribution and the empirical n-gram statistics of the context (e.g., bigram for order-1, tetragram for order-3).

Findings:

  • The transformer accurately infers the true order of the generating process from the context.
  • When prompted with order-1 data, the model's predictions align with bigram statistics, not higher-order statistics, despite the latter being expressive enough to fit the data.
  • When prompted with higher-order data, the model switches to the appropriate higher-order statistics.
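A minimal sketch of this evaluation metric, assuming a discrete vocabulary and a uniform fallback for unseen prefixes (both assumptions; the paper's exact smoothing is not reproduced here):

```python
import numpy as np
from collections import defaultdict

def empirical_ngram_dist(context, order, vocab_size):
    """Empirical conditional distributions p(x_t | previous `order` symbols),
    estimated by counting (order+1)-grams in the context."""
    counts = defaultdict(lambda: np.zeros(vocab_size))
    for t in range(order, len(context)):
        counts[tuple(context[t - order:t])][context[t]] += 1
    return {prefix: c / c.sum() for prefix, c in counts.items()}

def kl_to_ngram(model_probs, context, order, vocab_size):
    """KL(empirical n-gram || model) for the next-token prediction,
    conditioned on the last `order` symbols of the context."""
    dist = empirical_ngram_dist(context, order, vocab_size)
    # Uniform fallback if the final prefix never occurred earlier in the context.
    p = dist.get(tuple(context[-order:]), np.full(vocab_size, 1.0 / vocab_size))
    q = np.clip(np.asarray(model_probs, dtype=float), 1e-12, 1.0)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# For a context drawn from an order-1 chain, a model with the Occam bias
# should score low KL against order=1 (bigram) statistics, even though it
# was also trained on order-3 sequences.
```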

Linear Regression

A similar approach is used for linear regression, with two task categories: a "simple" category where the regressor lies in a lower-dimensional subspace, and a "complex" category using the full feature space. Both categories can perfectly fit data generated by the simple regressor.

Findings:

  • When prompted with data from the simple regressor, the transformer aligns its predictions with the lower-dimensional least-squares solution, not the full-dimensional one.
  • For complex data, the model uses the full-dimensional solution.
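The comparison can be made concrete with ordinary least squares (a sketch under assumed dimensions; the paper's exact dimensions and noise model are not reproduced here). With fewer in-context examples than ambient dimensions, the full-space minimum-norm solution generally differs from the subspace solution, so the two hypotheses make different predictions on a query point:

```python
import numpy as np

def least_squares_prediction(X, y, x_query, active_dims=None):
    """Least-squares prediction at x_query.
    If active_dims is given, fit only on that subspace (the "simple" class);
    otherwise use the full feature space (the "complex" class)."""
    if active_dims is not None:
        X, x_query = X[:, active_dims], x_query[active_dims]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(x_query @ w)

# Illustrative dimensions (assumed): 20-dim features, true regressor
# supported on the first 5 coordinates, 10 in-context examples.
rng = np.random.default_rng(0)
d, k, n = 20, 5, 10
w_true = np.zeros(d); w_true[:k] = rng.normal(size=k)
X = rng.normal(size=(n, d)); y = X @ w_true
x_q = rng.normal(size=d)

y_simple = least_squares_prediction(X, y, x_q, active_dims=np.arange(k))
y_full = least_squares_prediction(X, y, x_q)
# The paper's finding: the transformer's in-context prediction tracks
# y_simple rather than y_full on contexts generated by the simple regressor.
```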

Probabilistic Context-Free Grammars (PCFGs)

Transformers trained on mixtures of simple and complex PCFGs also exhibit the same inductive bias, inferring the correct grammar type from the context.

Pretrained LLMs (GPT-4)

The authors prompt GPT-4 with Boolean function tasks where both simple and complex functions can explain the context. GPT-4 consistently uses the simpler function when both suffice, and only resorts to the complex function when necessary.
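A hypothetical construction of such a prompt (the specific Boolean functions used in the paper are not reproduced here): pick a simple and a complex function that agree on the in-context examples, then query an input on which they disagree.

```python
# Hypothetical ambiguous Boolean-function task (illustrative functions only).
simple = lambda x: x[0]                      # depends on a single variable
complex_ = lambda x: x[0] ^ (x[1] & x[2])    # depends on three variables

# Context examples on which the two functions agree (here x[1] & x[2] == 0),
# so both hypotheses explain the context perfectly.
context = [((1, 0, 1), 1), ((0, 1, 0), 0), ((1, 0, 0), 1), ((0, 0, 1), 0)]
assert all(simple(x) == complex_(x) == y for x, y in context)

# A query on which they disagree reveals which hypothesis the model applies:
query = (1, 1, 1)   # simple -> 1, complex -> 0
```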

Theoretical Analysis

The authors provide a Bayesian explanation for the observed behavior. For both Markov chains and linear regression, the Bayes-optimal predictive distribution is a mixture over hypothesis classes, weighted by their posterior probabilities given the context. The marginal likelihood for each class decomposes into a data fit term and a complexity penalty (akin to BIC). When the data can be perfectly explained by a simple hypothesis, the complexity penalty dominates, and the posterior concentrates on the simplest sufficient class—realizing a Bayesian Occam's razor.

Key Equations

For Markov chains, the marginal likelihood for an order-s chain over a vocabulary of size V, given a context of length T, is approximated as

$$\log p(X \mid s) \;\approx\; \sum_{t} \log \hat{p}_X\left(x_t \mid x_{t-1}, \ldots, x_{t-s}\right) \;-\; \frac{V^s (V-1)}{2} \log T,$$

where p̂_X denotes the empirical conditional probabilities estimated from the context. The complexity penalty grows rapidly with s, favoring simpler models unless the data fit is substantially better for a complex model.
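The approximation above can be transcribed directly. The sketch below scores candidate orders and returns the one with the higher approximate marginal likelihood; the candidate orders and the absence of smoothing are assumptions for illustration.

```python
import numpy as np
from collections import defaultdict

def log_marginal_approx(seq, order, vocab_size):
    """BIC-style approximation: empirical log-likelihood of the sequence
    under an order-`order` chain, minus (V^s (V-1)/2) * log T."""
    counts = defaultdict(lambda: np.zeros(vocab_size))
    for t in range(order, len(seq)):
        counts[tuple(seq[t - order:t])][seq[t]] += 1
    loglik = sum(
        np.log(counts[tuple(seq[t - order:t])][seq[t]]
               / counts[tuple(seq[t - order:t])].sum())
        for t in range(order, len(seq))
    )
    penalty = 0.5 * vocab_size**order * (vocab_size - 1) * np.log(len(seq))
    return loglik - penalty

def preferred_order(seq, vocab_size, candidate_orders=(1, 3)):
    """Posterior mode over candidate orders (uniform prior): on order-1 data
    both classes fit equally well, so the penalty term decides for order 1."""
    return max(candidate_orders, key=lambda s: log_marginal_approx(seq, s, vocab_size))
```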

Ablation Studies and Additional Results

  • Training on Only Complex Tasks: Transformers trained solely on the most complex class do not generalize to lower-complexity statistics at inference, indicating that the Occam's razor bias emerges only when the training distribution includes multiple complexity levels.
  • Effect of Training Mixture: The inductive bias persists even when simple tasks are a small fraction of the training mix, though learning higher-complexity tasks becomes harder as the mix becomes more imbalanced.
  • Model Scale: Larger transformers converge faster and more reliably to the correct inductive bias.
  • Comparison with LSTMs: LSTMs require significantly more capacity to exhibit similar behavior, and their inductive bias is weaker.

Implications

Practical

  • Robustness in Real-World Applications: The Occam's razor inductive bias suggests that transformers can generalize robustly across tasks of varying complexity, automatically selecting the simplest sufficient model in context. This is highly desirable in settings where task complexity is unknown or variable.
  • Prompt Engineering: When designing prompts for ICL, including examples that are compatible with multiple hypotheses will likely lead the model to prefer simpler explanations, unless evidence for complexity is explicit.
  • Model Selection and Interpretability: The Bayesian framework provides a principled lens for understanding and predicting model behavior in ambiguous settings, which can inform debugging and interpretability efforts.

Theoretical

  • Mechanistic Understanding: The results motivate further investigation into the internal mechanisms by which transformers implement Bayesian model selection and complexity penalties.
  • Optimization Dynamics: The emergence of Occam's razor through gradient-based training on diverse task mixtures raises questions about the interplay between optimization, architecture, and inductive bias.

Future Directions

  • Generalization to Other Hierarchies: Extending the analysis to more complex or structured hierarchies (e.g., variable sparsity, non-nested hypothesis classes).
  • Circuit Analysis: Identifying specific architectural components or attention patterns responsible for complexity selection.
  • Scaling Laws: Quantifying how model size, training distribution, and context length interact to shape the inductive bias.
  • Broader Model Classes: Investigating whether similar biases emerge in other architectures or in multi-modal settings.

Strong Numerical Results and Claims

  • Transformers trained on mixtures of order-1 and order-3 Markov chains achieve near-zero KL divergence to the correct n-gram statistics for each context type, demonstrating precise complexity selection.
  • In linear regression, the mean squared error between the transformer's predictions and the correct least-squares solution (simple or complex) approaches zero, depending on the context.
  • GPT-4, when prompted with ambiguous Boolean function tasks, matches the simple function's predictions with high accuracy, confirming the inductive bias in large-scale pretrained models.

Conclusion

This work rigorously demonstrates that transformers, when trained on diverse task distributions with hierarchical complexity, implement a form of Bayesian Occam's razor in-context. This inductive bias is robust across architectures, task domains, and scales, and is theoretically grounded in Bayesian model selection. The findings have significant implications for the design, analysis, and deployment of transformer-based models in real-world applications where task complexity is variable or ambiguous.

Authors (4)
  1. Puneesh Deora
  2. Bhavya Vasudeva
  3. Tina Behnia
  4. Christos Thrampoulidis