Markov-Chain Transformers

Updated 4 July 2026

Markov-Chain Transformers are a family of models that replace full-history dependence with finite-memory or state-based systems, leveraging Markov structures for prediction.
They implement various techniques such as finite-order decoders, bounded-context CRFs, and explicit state bottlenecks to streamline sequence generation and improve interpretability.
Empirical analyses demonstrate that careful design of masked attention and state abstractions yields performance comparable to traditional autoregressive methods in machine translation and recommendation tasks.

Markov-Chain Transformers designate a family of transformer models, training procedures, and analytical frameworks in which sequence prediction is constrained, factorized, or interpreted through Markov structure. In the literature, the term covers at least five distinct but related ideas: finite-order decoder models whose next token depends only on a bounded output history; transformers whose self-attention is exactly re-expressed as a context-conditioned Markov chain; transformers trained to estimate unknown transition kernels in context; architectures that force prediction to pass through an explicit intermediate “state” such as Chain-of-Thought; and interpretability frameworks that extract coarse Markovian state dynamics from hidden activations (Du et al., 2024, Ildiz et al., 2024, Lepage et al., 5 Aug 2025, Viteri et al., 2024, X, 20 May 2026).

1. Major meanings of the term

The phrase is not standardized. Some papers use it for models with an explicit finite-order Markov property on the decoder side, some for bounded-order CRF-style generation, some for latent-state or bottleneck factorizations, and some for analytical equivalences between self-attention and Markov processes. What unifies these uses is the attempt to replace unrestricted full-history dependence with a state variable, bounded context, or state-transition abstraction that is sufficient for prediction.

Usage in the literature	Core Markov object	Representative paper
Finite-order decoder	$P(y_n \mid X, y_{<n}) = P(y_n \mid X, y_{n-k:n-1})$	MAT (Du et al., 2024)
Bounded-context CRF decoder	Order-$M$ local potentials over target spans	Markov Transformer (Deng et al., 2020)
Self-attention as Markov dynamics	Context-conditioned Markov chain over tokens	CCMC (Ildiz et al., 2024)
In-context transition estimation	Online estimation of a transition matrix from context	ICL estimation (Lepage et al., 5 Aug 2025)
Explicit natural-language state	$q \rightarrow \mathrm{CoT} \rightarrow \mathrm{ans}$ bottleneck	Markovian CoT (Viteri et al., 2024)
Hidden-state abstraction	Belief-like state transitions in activations	MCT (X, 20 May 2026)

A broad synthesis is that “Markov-Chain Transformer” names a research program rather than a single architecture. The common question is whether transformer computation can be understood as operating on a sufficient state—explicit, implicit, learned, or extracted—rather than on unrestricted raw history.

2. Architectures that enforce finite-order Markov dependence

The clearest architectural use appears in machine translation. The Markov Autoregressive Transformer (MAT) imposes a $k$-th order Markov property on the decoder,

$P(y_n \mid X, y_1,\dots,y_{n-1}) = P(y_n \mid X, y_{n-k},\dots,y_{n-1}),$

while keeping the encoder unchanged. A naive local attention mask is not enough, because information leaks across layers: a state inside the window has itself attended to tokens before the window, so stacking layers widens the effective history. MAT therefore combines a $k$-th order attention mask with Transparent Attention, in which keys and values of previous tokens are static word embeddings rather than contextualized states. This yields a strict finite-order decoder. On WMT14 En→De, MAT with $k=5$ reaches 27.5 BLEU versus 27.8 for the full autoregressive transformer; on WMT14 De→En it reaches 31.0 versus 31.3; and on WMT17 En→Zh and Zh→En it reaches 33.9 and 23.3, respectively. The same study reports that orders larger than 4 are already “on par” with conventional autoregressive transformers, and that a higher order does not specifically help longer sentences (Du et al., 2024).
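The leakage argument above can be checked with a small receptive-field calculation. This is a minimal sketch, not MAT's implementation: the helper names `markov_mask` and `receptive_field` are hypothetical, and the window convention (each position sees itself and the `k - 1` positions before it) is an assumption made for illustration.

```python
def markov_mask(n, k):
    """mask[i][j] is True when position i may attend to position j.

    Assumed convention: position i sees itself and the k - 1 positions
    before it, i.e. a causal band of width k.
    """
    return [[0 <= i - j < k for j in range(n)] for i in range(n)]


def receptive_field(mask, layers):
    """Input positions that can influence each position after `layers`
    masked layers, when keys and values are the contextualized states
    produced by the previous layer."""
    n = len(mask)
    reach = [{i} for i in range(n)]  # layer 0: each embedding sees only itself
    for _ in range(layers):
        reach = [set().union(*(reach[j] for j in range(n) if mask[i][j]))
                 for i in range(n)]
    return reach


mask = markov_mask(n=10, k=3)
one_layer = receptive_field(mask, layers=1)      # stays inside the window
three_layers = receptive_field(mask, layers=3)   # window has widened
```

With one layer the last position depends on three inputs, but after three layers it depends on seven. If keys and values are instead always the layer-0 embeddings, as in the Transparent Attention described above, the reachable set stays at the one-layer window regardless of depth.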

A related but distinct line is the “Markov transformer” for cascaded text generation. Here the decoder is trained with hard barriers every $M+1$ tokens so that self-attention cannot cross barriers, and the same network parameterizes a family of bounded-order CRFs $P^{(m)}$ for $m=0,\dots,M$ . Decoding proceeds by a cascade of increasingly higher-order models with max-marginal pruning and TreeMM, giving $M$ 0 parallel time for the dynamic-programming core. On WMT14 En→De with knowledge distillation, the paper reports 26.90 BLEU at $M$ 1 speedup and 26.52 BLEU at $M$ 2 speedup relative to an autoregressive transformer at 27.41 BLEU (Deng et al., 2020).

A third decoder-oriented formulation treats an $n$-gram-structured transformer as a finite-state Markov chain over context windows and improves greedy decoding via rollout. Exact most-likely sequence generation is intractable, but rollout approximates dynamic programming by evaluating candidate next states using future continuations under a base policy. On small synthetic chains, the reported recovery of the greedy optimality gap is typically in the $M$ 4– $M$ 5 range, and the method also improves sequence probability for a GPT-based model when restricted to top-10 candidate next tokens (Li et al., 2024).
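The rollout idea can be illustrated on a toy chain. The sketch below is an assumption-laden illustration rather than the paper's algorithm as published: the 3-state matrix `P` is made up, the base policy is plain greedy selection, and the brute-force `exact` search is only there to show the gap on a tiny example.

```python
import math
from itertools import product

# Hypothetical 3-state chain; P[i][j] is the probability of moving from i to j.
P = [[0.0, 0.55, 0.45],
     [0.4, 0.3, 0.3],
     [0.05, 0.05, 0.9]]


def log_prob(path):
    """Log-probability of a state path, conditioned on its first state."""
    return sum(math.log(P[a][b]) if P[a][b] > 0 else -math.inf
               for a, b in zip(path, path[1:]))


def greedy(start, steps):
    """Base policy: always move to the most likely next state."""
    path = [start]
    for _ in range(steps):
        row = P[path[-1]]
        path.append(max(range(len(row)), key=row.__getitem__))
    return path


def rollout(start, steps):
    """One-step lookahead; each candidate is scored by its greedy completion."""
    path = [start]
    for remaining in range(steps, 0, -1):
        candidates = [path + greedy(nxt, remaining - 1)
                      for nxt in range(len(P)) if P[path[-1]][nxt] > 0]
        best = max(candidates, key=log_prob)
        path.append(best[len(path)])  # commit only the first step of the winner
    return path


def exact(start, steps):
    """Brute-force optimum, feasible only for tiny chains."""
    paths = ([start, *rest] for rest in product(range(len(P)), repeat=steps))
    return max(paths, key=log_prob)


greedy_path, rollout_path, best_path = greedy(0, 4), rollout(0, 4), exact(0, 4)
```

On this chain greedy decoding is lured by the locally likely move into state 1, while rollout notices that entering the sticky state 2 gives a more probable overall sequence.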

3. In-context estimation, induction heads, and staged learning on Markov data

A central theoretical use of Markov-chain tasks is to isolate how transformers learn in-context statistical estimators. In “The Evolution of Statistical Induction Heads,” each training sequence is generated by a fresh Markov chain whose rows are drawn from a Dirichlet prior. The Bayes-optimal predictor is the posterior-mean bigram estimator,

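$\hat{P}(x_{t+1} = j \mid x_{1:t}) = \frac{N_{x_t j} + \alpha_j}{\sum_{j'} \left(N_{x_t j'} + \alpha_{j'}\right)}, \qquad N_{ij} = \#\{s < t : x_s = i,\ x_{s+1} = j\},\ \alpha \text{ the Dirichlet parameter},$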

and trained transformers empirically pass through three phases: uniform predictions, a unigram phase that ignores adjacency, and a rapid transition to the bigram solution. The learned mechanism is a pair of statistical induction heads: the first layer represents the previous token, and the second layer aggregates tokens that followed the same previous token in the context (Edelman et al., 2024).
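The bigram solution that the transformer converges to can be written as a few lines of counting. This is a minimal sketch under stated assumptions: the prior is taken to be a symmetric Dirichlet with parameter `alpha` (a hypothetical argument name, defaulting to the flat prior), and `bigram_posterior_mean` is an illustrative helper, not code from the paper.

```python
from collections import Counter


def bigram_posterior_mean(context, num_states, alpha=1.0):
    """Posterior-mean next-state distribution under a symmetric
    Dirichlet(alpha) prior on every row of the transition matrix."""
    last = context[-1]
    # Count how often each state followed an earlier occurrence of `last`.
    followers = Counter(b for a, b in zip(context, context[1:]) if a == last)
    total = sum(followers.values()) + alpha * num_states
    return [(followers[j] + alpha) / total for j in range(num_states)]


probs = bigram_posterior_mean([0, 1, 0, 2, 0, 1, 0], num_states=3)
```

The two steps inside the function mirror the two-head mechanism described above: matching positions whose previous token equals the current one, then aggregating the tokens found there.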

Subsequent work extended this picture to higher-order sources. “Transformers on Markov Data: Constant Depth Suffices” reports that a 2-layer, 1-head transformer can learn order- $M$ 7 sources and a 3-layer, 1-head transformer can learn order- $M$ 8 sources when trained sufficiently long. The same paper proves that a 3-layer, 1-head transformer with relative position encodings and layer normalization can represent the in-context conditional empirical distribution for $k$-th order Markov sources, and that an attention-only transformer needs $O(\log_2 k)$ layers to do the same by composing induction heads (Rajaraman et al., 2024).
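For concreteness, the in-context conditional empirical distribution referred to above can be computed by direct suffix matching. The sketch below is an illustration only: `conditional_empirical` and its arguments are hypothetical names, and returning `None` for an unseen suffix is a choice made here rather than anything specified in the cited work.

```python
def conditional_empirical(context, k, num_states):
    """Empirical distribution of the symbol that followed earlier occurrences
    of the last k symbols; None if that k-symbol suffix was never seen before."""
    suffix = tuple(context[-k:])
    counts = [0] * num_states
    for t in range(k, len(context)):
        # context[t - k:t] is the k-symbol window immediately preceding x_t.
        if tuple(context[t - k:t]) == suffix:
            counts[context[t]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else None


dist = conditional_empirical([0, 1, 1, 0, 1, 0, 0, 1], k=2, num_states=2)
unseen = conditional_empirical([0, 0, 0, 1, 1], k=2, num_states=2)
```

Setting `k=1` recovers unsmoothed bigram counting; a higher-order induction head has to perform the same match over the previous `k` symbols rather than one.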

The depth requirement was tightened further in “What One Cannot, Two Can,” which shows that a 2-layer transformer with one head per layer can represent any conditional $k$-gram. The construction relies on relative positional encodings, ReLU MLPs, and LayerNorm to build the context encodings needed for a $k$-th order induction head. In that sense, the paper gives a depth-optimal shallow realization of arbitrary finite-order Markov next-token models by induction-head-like circuitry (Ekbote et al., 10 Aug 2025).

A complementary empirical meta-learning study trains decoder-only transformers on sequences from random transition matrices and shows a threshold phenomenon in model size and training-set diversity. For fixed $q \rightarrow \mathrm{CoT} \rightarrow \mathrm{ans}$ 3, small $q \rightarrow \mathrm{CoT} \rightarrow \mathrm{ans}$ 4 and large $q \rightarrow \mathrm{CoT} \rightarrow \mathrm{ans}$ 5 lead to memorization, large $q \rightarrow \mathrm{CoT} \rightarrow \mathrm{ans}$ 6 and small $q \rightarrow \mathrm{CoT} \rightarrow \mathrm{ans}$ 7 lead to underfitting, and sufficiently large $q \rightarrow \mathrm{CoT} \rightarrow \mathrm{ans}$ 8 with sufficiently large $q \rightarrow \mathrm{CoT} \rightarrow \mathrm{ans}$ 9 yields validation loss close to a Dirichlet-smoothed empirical estimator and attention patterns with induction-head behavior. Orthogonal per-sequence embeddings further improve robustness when the number of states or the Dirichlet parameter changes at test time (Lepage et al., 5 Aug 2025).

A more dynamical account appears in “Incremental Learning of Sparse Attention Patterns in Transformers.” There, a transformer trained on a high-order Markov chain task acquires sparse attention patterns in stages: an initial competitive phase in which heads all learn the statistically dominant pattern, followed by a cooperative phase in which heads specialize in distinct patterns. The paper models this via simplified differential equations, proves stage-wise convergence results, and argues that early stopping biases the model toward simpler, lower-order hypothesis classes (Yüksel et al., 22 Feb 2026).

4. Explicit Markovian bottlenecks and latent-state factorizations

One strand of work makes the Markov state explicit in natural language. “Markovian Transformers for Informative Language Modeling” introduces a Markovian Moore Machine view in which next-observation prediction is forced to factor only through an intermediate Chain-of-Thought channel. In the question-answer specialization, the factorization is

$k$ 0

with the crucial restriction that the evaluator $k$ 1 is trained and evaluated using only the CoT, never the original question. The training objective is “informativeness,” defined as improvement in answer log-likelihood relative to a baseline CoT. In practice the evaluator is frozen, the policy is trained with PPO, and perturbation tests show that as training progresses, truncation or corruption of the CoT hurts answer likelihood more strongly. Cross-model evaluation with an untrained Llama-2-7B Instruct evaluator shows that the learned CoTs remain useful outside the generating model, supporting the claim that the CoT has become a causal Markov bottleneck rather than epiphenomenal text (Viteri et al., 2024).

A more general probabilistic reinterpretation appears in “What can we learn from signals and systems in a transformer?” That paper introduces a latent state process $k$ 2 and interprets transformer signals as surrogates of conditional measures,

$k$ 3

with next-token probabilities written as

$k$ 4

In the HMM specialization, the posterior sequence $k$ 5 is characterized as a fixed point of a nonlinear map $k$ 6, and transformer layers are interpreted as approximate fixed-point updates toward that posterior. This reframes a Markov-Chain Transformer as an inference architecture whose hidden states approximate belief states over a latent Markov process (Chang et al., 27 Aug 2025).
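To make "belief state" concrete, the sketch below implements the classical recursive HMM forward filter. It is not the fixed-point map from the cited paper; it only computes the posterior that such a map would target. The matrices `A` and `B` and the observation sequence are made-up values for illustration.

```python
def belief_update(belief, observation, A, B):
    """One step of the standard HMM forward filter.

    belief[i] : P(state_t = i | observations up to t)
    A[i][j]   : P(state_{t+1} = j | state_t = i)
    B[j][o]   : P(observation = o | state = j)
    """
    n = len(belief)
    predicted = [sum(belief[i] * A[i][j] for i in range(n)) for j in range(n)]
    unnorm = [predicted[j] * B[j][observation] for j in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def next_token_probs(belief, A, B):
    """Predictive distribution over the next observation given the belief."""
    n = len(belief)
    predicted = [sum(belief[i] * A[i][j] for i in range(n)) for j in range(n)]
    return [sum(predicted[j] * B[j][o] for j in range(n))
            for o in range(len(B[0]))]


# Hypothetical 2-state, 2-symbol HMM.
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.8, 0.2], [0.3, 0.7]]
belief = [0.5, 0.5]
for obs in [0, 0, 1]:
    belief = belief_update(belief, obs, A, B)
prediction = next_token_probs(belief, A, B)
```

The belief vector is a sufficient statistic of the history for prediction, which is the sense in which hidden states that approximate it make the model Markovian in a latent state.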

5. Exact Markov reinterpretations and internal state dynamics

The most explicit equivalence between self-attention and a Markovian generative process is given in “From Self-Attention to Markov Models.” For a 1-layer self-attention model with suitable output head, the next-token distribution is exactly a context-conditioned Markov chain (CCMC). Given prompt $k$ 7, with empirical frequency vector $k$ 8, and base transition matrix $k$ 9, the next-token law is

$P(y_n \mid X, y_1,\dots,y_{n-1}) = P(y_n \mid X, y_{n-k},\dots,y_{n-1}),$ 0

With positional encodings, this becomes a position-weighted variant in which occurrences are scaled by position-dependent factors. The paper also proves identifiability of the underlying base matrix from IID prompt-output pairs under a connectivity condition on co-occurrence graphs, and analyzes single-trajectory generation, where the CCMC can become non-mixing and exhibit a winner-takes-all collapse onto a limited subset of tokens, offered as a mathematical explanation for repetitive text generation (Ildiz et al., 2024).
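The winner-takes-all behaviour can be explored with a toy simulation. The sketch below rests on an explicit assumption about the form of the law: the next-token distribution is taken to be the base-chain row of the last token, reweighted by each token's frequency in the context and renormalized. This is one reading of the description above, not a formula copied from the paper, and the function names and base matrix are hypothetical.

```python
from collections import Counter
import random


def ccmc_next_token_probs(prompt, base):
    """Assumed CCMC form: base-chain row of the last token, reweighted by how
    often each token occurs in the prompt, then renormalized."""
    counts = Counter(prompt)
    row = base[prompt[-1]]
    weights = [counts[j] * row[j] for j in range(len(row))]
    z = sum(weights)
    return [w / z for w in weights]


def generate(prompt, base, steps, seed=0):
    """Single-trajectory generation: every sampled token is appended to the
    context, so tokens that are already frequent get reinforced."""
    rng = random.Random(seed)
    seq = list(prompt)
    for _ in range(steps):
        probs = ccmc_next_token_probs(seq, base)
        seq.append(rng.choices(range(len(probs)), weights=probs)[0])
    return seq


# Hypothetical uniform base chain over 3 tokens.
uniform = [[1 / 3] * 3 for _ in range(3)]
probs = ccmc_next_token_probs([0, 1, 1, 2], uniform)
trajectory = generate([0, 0, 1], uniform, steps=200)
```

Under this assumed form a token that never appears in the context has probability zero, so generation from a prompt can only reuse prompt tokens, and the frequency reweighting acts as a rich-get-richer mechanism along a single trajectory.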

A distinct interpretability program extracts coarse state-transition structure directly from hidden activations. “Markovian Circuit Tracing for Transformer State Dynamics” trains tiny causal transformers on synthetic HMM families and studies whether their residual streams contain belief-like predictive states. Across 18 runs, the models achieve a mean excess loss over Bayes of 0.0138. K-means state abstractions recover coarse transition signal, which is strongest in persistent and easily separable regimes. The strongest causal test is state forcing: patching a recovered-state centroid into the second post-residual layer reduces KL to the exact HMM counterfactual target from 0.1957 in the unpatched model to 0.0532 on average, outperforming wrong-state, mean-activation, random-activation, and shuffled-label controls. The result is not full hidden-state recovery, but it supports the claim that transformer activations can implement causally meaningful Markovian state abstractions (X, 20 May 2026).

A more geometric reinterpretation appears in “The Brownian motion in the transformer model.” Under layer normalization, tokens are treated as points on a hypersphere, and the scaled dot-product attention matrix is interpreted as a row-stochastic Markov transition matrix for a random walk on that sphere. In that view, stacked self-attention approximates diffusion or Brownian motion on a learned manifold, and the paper uses this perspective to motivate an iterative K-FAC second-order optimizer for MHSA (Chen, 2021).
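The row-stochastic reading of attention is easy to verify numerically. The following is a minimal sketch, assuming unit-norm tokens and no learned query or key projections (both simplifications made here for illustration); `attention_matrix` and `step` are hypothetical helper names.

```python
import math


def attention_matrix(Q, K):
    """Row-wise softmax of the scaled dot products Q K^T / sqrt(d)."""
    d = len(Q[0])
    rows = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)                      # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows


def step(dist, T):
    """Push a distribution over tokens through one random-walk step."""
    return [sum(dist[i] * T[i][j] for i in range(len(T)))
            for j in range(len(T[0]))]


# Four unit-norm tokens, i.e. points on a circle.
tokens = [(math.cos(t), math.sin(t)) for t in (0.0, 0.5, 2.0, 3.0)]
T = attention_matrix(tokens, tokens)
dist = [1.0, 0.0, 0.0, 0.0]
for _ in range(20):
    dist = step(dist, T)
```

Every row of `T` is a probability vector with strictly positive entries, so repeated application behaves like a random walk whose mass spreads over the tokens, which is the discrete picture behind the diffusion interpretation.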

6. Applications, empirical regimes, and limitations

The most developed application to a non-language domain is next-item recommendation. “Markovian Pre-Trained Transformer for Next-Item Recommendation” argues that advanced sequential recommenders are empirically “Markovian” in the sense that the latest interaction dominates next-item prediction while older interactions primarily encode non-sequential user identity. MPT is a 4-layer, 2-head, hidden-size-256 Llama-style decoder pre-trained on synthetic Markov chains whose rows are drawn from a Dirichlet prior and whose states are represented by random orthogonal vectors per trajectory. The next-state pre-training objective is standard autoregressive cross-entropy, and the implied Bayes estimator is the Dirichlet posterior mean

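$\hat{P}(x_{t+1} = j \mid x_{1:t}) = \frac{N_{x_t j} + \alpha_j}{\sum_{j'} \left(N_{x_t j'} + \alpha_{j'}\right)}, \qquad N_{ij} = \#\{s < t : x_s = i,\ x_{s+1} = j\},\ \alpha \text{ the Dirichlet parameter}.$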

The reported system is then adapted to five public datasets from three platforms by training only a lightweight adaptor, and the paper attributes its transferability to two capabilities acquired during Markovian pre-training: estimating transition probabilities from context and attending strongly to the last state (Xu et al., 13 Jan 2026).
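The synthetic pre-training data described above can be sketched as follows. This is an illustration under stated assumptions, not MPT's data pipeline: the symmetric Dirichlet parameter `alpha`, the embedding dimension `dim`, and all function names are hypothetical, and orthonormal vectors are produced by Gram-Schmidt on Gaussian draws (which requires `dim >= num_states`).

```python
import math
import random


def dirichlet_row(rng, n, alpha):
    """Sample one transition row from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def random_orthonormal(rng, n, dim):
    """n orthonormal vectors in R^dim via Gram-Schmidt on Gaussian draws."""
    basis = []
    while len(basis) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-8:
            basis.append([x / norm for x in v])
    return basis


def sample_trajectory(num_states, length, dim, alpha=1.0, seed=0):
    """Fresh chain and fresh state vectors for every trajectory."""
    rng = random.Random(seed)
    P = [dirichlet_row(rng, num_states, alpha) for _ in range(num_states)]
    emb = random_orthonormal(rng, num_states, dim)
    states = [rng.randrange(num_states)]
    for _ in range(length - 1):
        states.append(rng.choices(range(num_states), weights=P[states[-1]])[0])
    return P, states, emb


P, states, emb = sample_trajectory(num_states=5, length=50, dim=8)
inputs = [emb[s] for s in states]  # model inputs: one vector per visited state
```

Because the state vectors are redrawn for each trajectory, a model trained on such data cannot memorize state identities and has to estimate transitions from the context itself.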

At the same time, work on synthetic Markov data shows that Markov alignment is not automatic. “Attention with Markov” proves that for a binary first-order source, a single-layer decoder-only transformer can exactly represent the true transition kernel and achieve the entropy rate, but with weight tying and switching factor $p + q > 1$, where $p$ and $q$ are the chain's two switching probabilities, the loss landscape contains a bad local minimum corresponding to constant prediction of the stationary marginal. Removing weight tying turns that point into a saddle. For higher-order chains, unrestricted causal attention fails even for deeper models, whereas a restricted mask window enables learning of the true transition structure; too large a window again drives the model toward marginal prediction. This suggests that, on Markovian data, optimization geometry and attention horizon can matter as much as representational capacity (Makkuva et al., 2024).
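The cost of settling on the stationary marginal can be quantified for a binary chain. The sketch below is a worked calculation under assumed notation: `p` and `q` denote the two switching probabilities, losses are in nats, and `binary_chain_losses` is a hypothetical helper.

```python
import math


def h(x):
    """Binary entropy in nats."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log(x) - (1 - x) * math.log(1 - x)


def binary_chain_losses(p, q):
    """p = P(0 -> 1), q = P(1 -> 0) (symbols assumed for this sketch)."""
    pi1 = p / (p + q)                        # stationary probability of state 1
    pi0 = 1 - pi1
    entropy_rate = pi0 * h(p) + pi1 * h(q)   # loss of the true kernel
    marginal_loss = h(pi1)                   # loss of always predicting the marginal
    return entropy_rate, marginal_loss


rate, marginal = binary_chain_losses(p=0.2, q=0.3)
iid_rate, iid_marginal = binary_chain_losses(p=0.4, q=0.6)
```

The gap between the two losses is what a model stuck at marginal prediction leaves on the table. It vanishes only when `p + q = 1`, in which case the next state is independent of the current one and the chain is effectively IID.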

Taken together, these results delimit both the promise and the boundaries of the Markov-chain viewpoint. It is most successful when the relevant predictive state is low-order, compressible, or explicitly recoverable: finite-order translation decoders, in-context $n$-gram estimation, HMM-like belief tracking, recommendation settings dominated by the latest interaction, and explicit CoT bottlenecks. It becomes more fragile when the state is hard to estimate, when rewards are spiky, when context is too long or too unconstrained, or when the model can settle into shortcut solutions such as stationary-marginal prediction, repetition-inducing CCMC dynamics, or non-faithful explanations (Viteri et al., 2024, Ildiz et al., 2024, Makkuva et al., 2024).

In this sense, Markov-Chain Transformers are best understood not as a single architecture but as a set of formalizations for making transformer computation state-based, finite-memory, or state-interpretable. Their significance lies in providing precise objects—transition kernels, sufficient statistics, conditional measures, belief states, or causal bottlenecks—against which transformer behavior can be proved, measured, or engineered.