What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains (2508.07208v1)

Published 10 Aug 2025 in cs.LG and cs.AI

Abstract: In-context learning (ICL) is a hallmark capability of transformers, through which trained models learn to adapt to new tasks by leveraging information from the input context. Prior work has shown that ICL emerges in transformers due to the presence of special circuits called induction heads. Given the equivalence between induction heads and conditional k-grams, a recent line of work modeling sequential inputs as Markov processes has revealed the fundamental impact of model depth on its ICL capabilities: while a two-layer transformer can efficiently represent a conditional 1-gram model, its single-layer counterpart cannot solve the task unless it is exponentially large. However, for higher order Markov sources, the best known constructions require at least three layers (each with a single attention head) - leaving open the question: can a two-layer single-head transformer represent any kth-order Markov process? In this paper, we precisely address this and theoretically show that a two-layer transformer with one head per layer can indeed represent any conditional k-gram. Thus, our result provides the tightest known characterization of the interplay between transformer depth and Markov order for ICL. Building on this, we further analyze the learning dynamics of our two-layer construction, focusing on a simplified variant for first-order Markov chains, illustrating how effective in-context representations emerge during training. Together, these results deepen our current understanding of transformer-based ICL and illustrate how even shallow architectures can surprisingly exhibit strong ICL capabilities on structured sequence modeling tasks.

Summary

  • The paper demonstrates that a two-layer transformer with a single attention head per layer can represent any k-th order Markov chain via an induction head mechanism.
  • It uses a first layer to capture the preceding context and a second layer to predict the next token via cosine-similarity attention.
  • Experimental results validate that shallow transformer architectures handle these structured sequential tasks efficiently and generalize to unseen sequences.

Two-Layer Transformers: Induction Heads on Markov Chains

Overview

The paper "What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains" focuses on the in-context learning capabilities of transformers, using induction heads to efficiently represent higher-order Markov processes with only two transformer layers. The authors demonstrate that this architecture can handle conditional kk-gram models, which are equivalent to these higher-order Markov chains.

Representation Power

The key insight is that a two-layer transformer with a single attention head per layer is sufficient to represent any k-th order Markov process. The first layer attends to the preceding tokens to capture the local context, while the second layer attends over the full sequence to form the prediction, achieving the conditional k-gram representation.
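
For reference, here is a minimal Python sketch of that conditional k-gram estimator, computed by counting context matches within the input sequence itself; it is an illustration of the target quantity, not code from the paper:

```python
def conditional_kgram(seq, k):
    """Estimate P(next token | last k tokens) from the sequence itself.

    Counts every earlier position whose preceding k tokens match the final
    k tokens of `seq`, and tallies which token followed each match.
    Returns a dict mapping token -> empirical conditional probability.
    """
    seq = list(seq)
    context = tuple(seq[-k:])              # the current length-k context
    counts = {}
    for t in range(k, len(seq)):           # positions with a full k-token history
        if tuple(seq[t - k:t]) == context:
            counts[seq[t]] = counts.get(seq[t], 0) + 1
    total = sum(counts.values())
    if total == 0:                         # context never seen before in the sequence
        return {}
    return {tok: c / total for tok, c in counts.items()}

# Example: binary sequence with k = 2; the current context is (0, 1)
print(conditional_kgram([0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1], k=2))  # roughly {0: 0.67, 1: 0.33}
```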

Architecture Details

  • First layer: attends to the preceding context, focusing on positions n-1, n-2, ..., n-k to compute a context vector v_n for each position n.
  • Second layer: functions as an induction head, computing attention scores proportional to the cosine similarity between the current context vector and the context vectors v_t of earlier positions, so that the prediction is drawn from the tokens that followed matching contexts, yielding a k-th order induction head.

Figure 1: The conditional k-gram model.
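
This mechanism can be illustrated with a small NumPy sketch that hard-codes the two layers' roles; the one-hot context embeddings and the sharpness parameter beta are simplifying assumptions for illustration, not the paper's exact parameterization:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def two_layer_induction_head(seq, k, vocab_size, beta=30.0):
    """Illustrative two-layer induction head with hand-set weights (no training).

    Layer 1: for each position t, build a context vector v_t by concatenating
             the one-hot embeddings of the k preceding tokens.
    Layer 2: attend from the query (the next position) to every earlier position
             by cosine similarity of context vectors, and average the one-hot
             encodings of the tokens that followed each context.
    """
    seq = np.asarray(seq)
    n = len(seq)
    onehot = np.eye(vocab_size)[seq]                         # (n, vocab_size)

    def context_vec(t):
        # concatenated one-hot embeddings of tokens t-k .. t-1
        return onehot[t - k:t].reshape(-1)

    v_query = context_vec(n)                                 # context of the token to predict
    keys = np.stack([context_vec(t) for t in range(k, n)])   # contexts at earlier positions
    values = onehot[k:n]                                     # token that followed each context

    cos = keys @ v_query / (np.linalg.norm(keys, axis=1) * np.linalg.norm(v_query) + 1e-9)
    attn = softmax(beta * cos)                               # attention over past positions
    return attn @ values                                     # predicted next-token distribution

# Same binary example as above with k = 2: output is roughly [2/3, 1/3]
print(two_layer_induction_head([0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1], k=2, vocab_size=2))
```

With a large beta, the attention concentrates on positions whose preceding k tokens exactly match the current context, so the output approaches the counting estimator sketched above.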

Learning Dynamics

The analysis of learning dynamics examines how effective in-context representations emerge during training. For first-order Markov chains, a simplified variant of the two-layer construction trained with gradient descent learns an induction head, thereby approximating the conditional empirical distribution of the input sequence.
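
As a rough illustration of this kind of setup, the following hypothetical PyTorch sketch trains a small two-layer, single-head, attention-only model with gradient descent on binary first-order Markov sequences; the architecture, hyperparameters, and sampling scheme are assumptions made for illustration and do not reproduce the paper's exact simplified variant:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def sample_first_order(batch, length, p01=0.2, p10=0.3):
    """Binary first-order Markov sequences with P(1|0) = p01 and P(0|1) = p10."""
    x = torch.zeros(batch, length, dtype=torch.long)
    x[:, 0] = torch.randint(0, 2, (batch,))
    for t in range(1, length):
        flip_prob = torch.where(x[:, t - 1] == 0,
                                torch.full((batch,), p01),
                                torch.full((batch,), p10))
        flip = torch.rand(batch) < flip_prob
        x[:, t] = torch.where(flip, 1 - x[:, t - 1], x[:, t - 1])
    return x

class TwoLayerAttention(nn.Module):
    """Two-layer, single-head, attention-only model (an illustrative sketch)."""
    def __init__(self, vocab=2, d=16, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)
        self.pos = nn.Embedding(max_len, d)
        self.attn1 = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.attn2 = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.out = nn.Linear(d, vocab)

    def forward(self, x):
        n = x.size(1)
        h = self.tok(x) + self.pos(torch.arange(n, device=x.device))
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1)
        a1, _ = self.attn1(h, h, h, attn_mask=causal)   # layer 1: gather info from preceding tokens
        h = h + a1
        a2, _ = self.attn2(h, h, h, attn_mask=causal)   # layer 2: induction-head-style lookup
        return self.out(h + a2)                         # next-token logits at every position

model = TwoLayerAttention()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(1000):
    seqs = sample_first_order(batch=64, length=64)
    logits = model(seqs[:, :-1])                        # predict token t+1 from the prefix up to t
    loss = loss_fn(logits.reshape(-1, 2), seqs[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 200 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```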

Experimental Results

The experiments train transformers on binary sequences generated by higher-order Markov processes. The learned attention maps mimic the conditional k-gram estimator's patterns, validating the model's ability to generalize to unseen sequences.
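
Binary training data of this kind can be generated from a randomly drawn k-th order transition table, along the lines of the following sketch (the paper's exact sampling setup may differ):

```python
import numpy as np

def sample_kth_order_markov(length, k, seed=None):
    """Generate one binary sequence from a random k-th order Markov chain.

    A transition table assigns P(next = 1 | last k tokens) to each of the
    2**k possible contexts; the sequence is rolled out from a random prefix.
    """
    rng = np.random.default_rng(seed)
    p_one = rng.uniform(0.1, 0.9, size=2 ** k)      # one Bernoulli parameter per context
    seq = list(rng.integers(0, 2, size=k))          # random length-k prefix
    for _ in range(length - k):
        ctx = int("".join(str(int(b)) for b in seq[-k:]), 2)  # last k bits as a table index
        seq.append(int(rng.random() < p_one[ctx]))
    return np.array(seq), p_one

seq, p_one = sample_kth_order_markov(length=512, k=3, seed=0)
print(seq[:32], p_one)
```

The attention maps of a model trained on such sequences can then be compared against the conditional k-gram counts computed on the same sequences.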

Analysis and Implications

This work demonstrates that shallow transformer architectures can efficiently perform complex in-context learning tasks traditionally thought to require more depth. This has significant implications for computational efficiency, indicating potential for real-world applications where efficiency and interpretability are critical.

Application and Future Directions

The findings suggest transformers have untapped potential in learning structured sequence tasks, opening avenues for theoretical exploration of model generalization and for practical applications in areas that rely on sequence modeling, such as natural language processing and bioinformatics.

Conclusion

Through theoretical and empirical validation, the paper challenges the traditional understanding of transformers, showcasing that even architectures with minimal complexity can achieve sophisticated in-context learning capabilities when properly configured and trained.
