Transformer Next-Token Prediction
- Transformer-based next-token prediction architectures are neural network models that generate the following token via autoregressive, decoder-only attention, extending classical Wiener predictors.
- They compute nonlinear, context-dependent combinations of past tokens, admit interpretation as approximate probabilistic filters, and obey information-theoretic capacity bounds in which injective self-attention layers govern context memorization.
- Each layer contributes an approximately fixed multiplicative reduction in prediction error, a regularity that guides depth design; the paradigm also connects to formal language theory and extends to multimodal and domain-specialized settings.
Transformer-based next-token-prediction architectures are a class of neural network models that compute the conditional distribution over the following token in a discrete sequence, given a context of past tokens. Modern LLMs predominantly implement this framework via autoregressive, decoder-only transformers. Over the last several years, both the mathematical underpinnings and empirical expressivity limits of these architectures have been closely examined, leading to a range of insights connecting signal-processing theory, probabilistic filtering, formal language theory, information capacity, and practical training motifs.
1. Nonlinear Prediction and Generalized Filtering
The transformer as a next-token predictor generalizes the classical Wiener linear predictor to nonlinear, high-dimensional, and discrete settings. In the Wiener framework, the optimal estimator for the next value of a Gaussian process is a linear combination of past observations. A transformer extends this by using nonlinear compositions of all previous tokens through the architecture's attention mechanism, with each "attention head" parametrizing the adaptive and context-dependent weights for the combination (Chang et al., 27 Aug 2025).
Formally, for tokens $x_1, \dots, x_t$ drawn from a finite alphabet and embedded as indicator vectors $e_{x_s}$, the next-token distribution can be written as
$$
\mathbb{P}(x_{t+1} = \cdot \mid x_{1:t}) \;=\; \sum_{s=1}^{t} W_s^{(t)}\, e_{x_s},
$$
where $(W_s^{(t)})_{s \le t}$ is a sequence of $\sigma$-algebra-adapted weight processes concretely realized by the transformer attention mechanism.
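To make the adaptive-weights reading concrete, here is a minimal NumPy sketch (not the cited construction): untrained random query/key and un-embedding matrices stand in for learned parameters, a single softmax attention head produces the context-dependent weights over the indicator embeddings, and the weighted combination is un-embedded into a next-token distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

V, T, d = 5, 8, 16              # vocabulary size, context length, key/query dimension
tokens = rng.integers(0, V, T)  # toy context x_1 .. x_T

E = np.eye(V)                                               # indicator (one-hot) embeddings e_{x_s}
Wq, Wk = rng.normal(size=(V, d)), rng.normal(size=(V, d))   # illustrative query/key maps
U = rng.normal(size=(V, V))                                 # illustrative un-embedding matrix

def next_token_distribution(tokens):
    """P(x_{t+1} | x_{1:t}) as a softmax over an adaptive combination of past indicator vectors."""
    X = E[tokens]                       # (T, V) stacked indicator vectors
    q = X[-1] @ Wq                      # query formed from the most recent token
    k = X @ Wk                          # keys formed from all context tokens
    scores = k @ q / np.sqrt(d)         # attention scores over the context
    w = np.exp(scores - scores.max()); w /= w.sum()   # adaptive, context-dependent weights W_s
    mixed = w @ X                       # weighted combination of past indicator vectors
    logits = mixed @ U                  # un-embed to vocabulary logits
    p = np.exp(logits - logits.max()); p /= p.sum()
    return p

print(next_token_distribution(tokens))  # a probability vector over the V tokens
```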
From a probabilistic modeling perspective, internal activations of the transformer serve as surrogates for filtering distributions over latent states, analogous to the posterior measures in a hidden Markov model (HMM) filter. Specifically, the top-layer activation at each position approximates the filtering distribution $\mathbb{P}(z_t \mid x_{1:t})$ over a latent state $z_t$, and the softmax head corresponds to an un-embedding into output token probabilities via a learned matrix (Chang et al., 27 Aug 2025). Thus, the transformer's layerwise updates can be interpreted as fixed-point iterations approximating nonlinear filtering, with closed-form characterizations in the HMM regime.
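For reference, a minimal HMM forward filter shows the object the top-layer activations are said to approximate; the transition matrix A, emission matrix B, and prior pi below are toy values chosen purely for illustration.

```python
import numpy as np

def hmm_filter(obs, A, B, pi):
    """Recursive filtering p(z_t | x_{1:t}) for an HMM with transition A, emission B, prior pi."""
    belief = pi * B[:, obs[0]]
    belief /= belief.sum()
    for x in obs[1:]:
        belief = (A.T @ belief) * B[:, x]   # predict with A, then correct with the emission likelihood
        belief /= belief.sum()              # normalize (the nonlinear step the analogy emphasizes)
    return belief

# Toy example: 2 latent states, 3 observable tokens
A = np.array([[0.9, 0.1], [0.2, 0.8]])            # p(z_{t+1} | z_t)
B = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])  # p(x_t | z_t)
pi = np.array([0.5, 0.5])

belief = hmm_filter([0, 0, 2, 1], A, B, pi)
print(belief)                   # filtering distribution p(z_t | x_{1:t})
print(B.T @ (A.T @ belief))     # induced next-token distribution p(x_{t+1} | x_{1:t})
```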
This signals-and-systems view maps transformer layer operations onto iterative filtering updates, with residual and normalization operations analogized to stabilization and convergence-improving techniques found in classical iterative algorithms.
2. Expressivity, Capacity, and Injectivity
The next-token prediction capacity of a transformer refers to the maximal number of distinct context-to-next-token-distribution mappings that can be interpolated by a $P$-parameter model. Theoretical upper and lower bounds coincide up to constants, establishing that for a vocabulary of size $V$ the interpolable number of mappings scales as
$$
n \;\asymp\; \frac{P}{V},
$$
with one-layer transformer constructions achieving this bound (Madden et al., 2024).
The proofs expose a fundamental property of self-attention: even in degenerate cases (e.g., embedding dimension one, a single head), self-attention layers are injective with respect to context. This property enables parameter-efficient memorization and interpolation of arbitrary context→distribution mappings up to the information-theoretic bound. Empirically, cross-entropy minimization approaches the entropy of the ground-truth mapping as soon as the width (i.e., the number of feed-forward sublayer neurons) matches the number of distinct contexts to be memorized.
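The injectivity claim can be illustrated (though of course not proved) numerically: with scalar token and positional embeddings (embedding dimension one) and a single head, distinct contexts generically map to distinct attention outputs. The setup below is an assumption-laden toy check, not the construction used in the cited paper.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

V, T = 4, 3                        # vocabulary size, context length
tok = rng.normal(size=V)           # scalar token embeddings (embedding dimension one)
pos = rng.normal(size=T)           # scalar positional offsets

def attn_output(ctx):
    """Single-head softmax attention with d = 1: a scalar summary of the whole context."""
    x = tok[list(ctx)] + pos                         # token + position, all scalars
    scores = x[-1] * x                               # scalar query (last token) times scalar keys
    w = np.exp(scores - scores.max()); w /= w.sum()  # attention weights over the context
    return float(w @ x)                              # context-dependent weighted average

outputs = [round(attn_output(c), 12) for c in product(range(V), repeat=T)]
print(f"{len(set(outputs))} distinct outputs for {V**T} distinct contexts")  # generically equal
```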
Architecturally, increasing the embedding size or the number of heads alone does not improve pure context-memorization capacity; only the width of the feed-forward sublayer is critical. For next-token interpolation power, deeper or more complex networks provide no additional benefit once the optimal FFN width is attained (Madden et al., 2024).
3. Layerwise Mechanisms and Learning Laws
An empirical law observed in large-scale pre-trained transformer architectures is the equal contribution of each layer to reducing the next-token prediction residual (PR, defined via the best linear regression from hidden states to the next token) (He et al., 2024). Quantitatively, denoting the residual at layer $\ell$ by $\mathrm{PR}_\ell$,
$$
\mathrm{PR}_{\ell} \;\approx\; \rho\,\mathrm{PR}_{\ell-1}, \qquad 0 < \rho < 1,
$$
where $\rho$ is a constant multiplicative factor per layer (a constant downward slope in log-space). This "law of equi-learning" is universal across families of transformers, RWKV, and Mamba models, across scaling dimensions and diverse datasets.
Each block (attention, MLP, normalization) thus provides, on average, a fixed multiplicative reduction in prediction error, guiding both architectural choices (trading depth against the per-layer improvement factor $\rho$) and learning-rate schedules (balancing per-layer gradient magnitudes to maintain this progression). Notably, this law fails to emerge under non-autoregressive objectives such as masked language modeling or span corruption.
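A sketch of how the prediction residual could be measured in practice is given below, assuming per-layer hidden states have already been extracted; the random arrays are stand-ins (so no law appears here), and on real pre-trained models the fitted log-space slope is the quantity the cited law concerns. The one-hot regression target is one reasonable instantiation of the quoted definition.

```python
import numpy as np

def prediction_residuals(hidden_states, next_tokens, vocab_size):
    """Per-layer prediction residual: error of the best linear map (least squares)
    from hidden states to one-hot next tokens, following the definition quoted above."""
    Y = np.eye(vocab_size)[next_tokens]              # (N, V) one-hot targets
    residuals = []
    for H in hidden_states:                          # H: (N, d) activations at one layer
        W, *_ = np.linalg.lstsq(H, Y, rcond=None)    # best linear regression H @ W ≈ Y
        residuals.append(np.linalg.norm(H @ W - Y) / np.linalg.norm(Y))
    return np.array(residuals)

# Toy stand-in: random "hidden states" for L+1 layers, N positions, width d
rng = np.random.default_rng(0)
L, N, d, V = 6, 512, 32, 10
hidden_states = rng.normal(size=(L + 1, N, d))
next_tokens = rng.integers(0, V, N)

pr = prediction_residuals(hidden_states, next_tokens, V)
slope = np.polyfit(np.arange(len(pr)), np.log(pr), 1)[0]   # per-layer log-space slope, i.e. log(rho)
print(pr, slope)
```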
4. Theoretical Universality and Algorithmic Implementations
The approximation capacity of transformer next-token prediction models extends far beyond shallow in-context mimicking. For any context-dependent function $f$, mappings of the form $x_{t+1} = f(x_t)$ are theoretically attainable (i.e., the next token can be predicted exactly in-context) by transformer architectures of sufficient width and depth, provided $f$ is linear or periodic and polynomial resource scaling is allowed (Sander et al., 2024). The construction leverages a correspondence between transformer layers and iterative solution steps for the normal equations (via "causal kernel descent"), closely connected to the Kaczmarz algorithm in Hilbert spaces.
Explicit layerwise constructions demonstrate that, for linear $f$, a sufficiently deep transformer (one layer per descent iteration) can exactly solve kernel ridge regression over the context, and periodic mappings are handled by an initial Fourier-feature lifting.
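As a standalone illustration of the Kaczmarz connection (not the transformer construction itself), the following NumPy sketch runs Kaczmarz-style sweeps over the causal pairs of a linearly generated context and uses the resulting estimate to predict the next value; in the cited correspondence, each additional iteration plays the role of an additional layer.

```python
import numpy as np

rng = np.random.default_rng(0)

d, T = 4, 64
W_true = rng.normal(size=(d, d))
W_true /= np.abs(np.linalg.eigvals(W_true)).max()       # keep the generated orbit bounded

# Context generated by the unknown linear map f: x_{t+1} = W_true @ x_t
xs = [rng.normal(size=d)]
for _ in range(T):
    xs.append(W_true @ xs[-1])
xs = np.stack(xs)

# Kaczmarz-style sweeps over the causal pairs (x_t, x_{t+1})
W_hat = np.zeros((d, d))
for _ in range(3):                                      # more sweeps ~ more layers/iterations
    for t in range(T):
        x_t, x_next = xs[t], xs[t + 1]
        residual = x_next - W_hat @ x_t                 # error on the equation W x_t = x_{t+1}
        W_hat += np.outer(residual, x_t) / (x_t @ x_t)  # row-wise Kaczmarz projection step

target = W_true @ xs[-1]                                # the true next value
pred = W_hat @ xs[-1]                                   # in-context next-token prediction
print(np.linalg.norm(pred - target) / np.linalg.norm(target))  # small relative error
```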
These results formalize the universality of the transformer next-token paradigm for sequential function approximation and in-context learning of algorithms previously thought to require explicit recurrence.
5. Connections to Formal Language Theory
Beyond their signal-processing and probabilistic interpretations, transformer-based next-token predictors have been rigorously connected to formal language theory—specifically, the class of left context-sensitive grammars (left-CSGs) (Rhee, 15 Apr 2025). Under this view, the autoregressive next-token prediction process is equivalent to a stochastic approximation of left-CSG derivations.
Each generation step is mapped to a left-CSG production rule $\alpha A \to \alpha\beta$, where $\alpha$ is the left context (a fixed-length sequence of previous tokens), $A$ a nonterminal representing the internal state, and $\beta$ the rewritten right-hand side. The attention mechanism conducts a soft, graded context check, generalizing the hard context constraints of classical grammars. Via Penttonen's equivalence theorem, transformers are universal generators for context-sensitive languages, providing a formal foundation for their observed expressive power in human-like sequence modeling.
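A toy illustration of the soft, graded context check (an assumption-laden sketch, not taken from the cited work): a hard left-CSG fires a production only on an exact left-context match, while a softmax over match scores spreads probability mass across partially matching rules.

```python
import numpy as np

# Toy "left context-sensitive" production table: (left_context, nonterminal) -> right-hand side.
productions = {
    (("the", "cat"), "S"): "sat",
    (("the", "dog"), "S"): "barked",
    (("a", "dog"),   "S"): "ran",
}

def soft_context_check(left_context, nonterminal, temperature=1.0):
    """Score each rule by overlap with the observed left context, then softmax (graded match)."""
    keys = [k for k in productions if k[1] == nonterminal]
    scores = np.array([sum(a == b for a, b in zip(k[0], left_context)) for k in keys], float)
    weights = np.exp(scores / temperature)
    weights /= weights.sum()
    return {productions[k]: w for k, w in zip(keys, weights)}

print(soft_context_check(("the", "cat"), "S"))   # "sat" gets the largest weight, others nonzero
print(soft_context_check(("the", "cow"), "S"))   # partial matches share the probability mass
```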
6. Architectural Innovations and Specialized Variants
Multiple studies have proposed augmentations to standard transformer-based next-token architectures to address specific practical limitations:
- Penultimate-token prediction and “generate-then-refine”: Combining standard autoregressive models with bidirectionally-informed penultimate-token predictors yields improved next-token accuracy via lightweight self-assessment steps at inference (Schneider, 2024).
- Semantic planning (Semformer): Augmenting the input with planning tokens supervised to predict latent semantic codes (via an auxiliary autoencoder) enables near-perfect lookahead planning and reduces shortcut fitting intrinsic to naive teacher forcing (Yin et al., 2024).
- Latent state prediction (NextLat): Adding a transition loss that supervises the prediction of future latent states injects a recurrent inductive bias, guiding the transformer toward belief-state representations and improving world modeling and lookahead consistency without sacrificing standard next-token performance (Teoh et al., 8 Nov 2025); a minimal sketch of such a transition loss follows this list.
- Encoder-only next-token prediction (ENTP): Removing causal masking inside blocks and enforcing causality only via externally supplied prefixes allows a transformer encoder to compute higher-order combinatorial functions not accessible to fixed-depth decoder-only models, at the cost of increased computational complexity (Ewer et al., 2024).
- Multimodal and domain-adaptive extensions: Architectures such as Emu3 demonstrate pure next-token prediction applied to discretized multimodal streams (text, images, video) with state-of-the-art generation and perception across modalities (Wang et al., 2024); in robotics, event streams, and protein modeling, domain-specialized next-token predictors facilitate direct in-context imitation (Fu et al., 2024, Karpukhin et al., 2 Aug 2025, Pourmirzaei et al., 26 May 2025).
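As one example of how such auxiliary objectives compose with the standard loss, the sketch below combines next-token cross-entropy with a NextLat-style transition loss that asks each state to predict its (detached) successor. The function name, the transition head, and the weighting beta are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def nextlat_style_loss(hidden, logits, targets, transition_head, beta=0.1):
    """Schematic combined objective: next-token cross-entropy plus a transition loss that
    asks the state at position t to predict the (detached) state at position t+1.
    hidden: (B, T, d) final-layer states, logits: (B, T, V), targets: (B, T) token ids."""
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    pred_next_state = transition_head(hidden[:, :-1])            # predicted successor states
    transition = F.mse_loss(pred_next_state, hidden[:, 1:].detach())
    return ce + beta * transition

# Usage sketch with toy tensors (shapes only)
B, T, d, V = 2, 8, 16, 50
hidden = torch.randn(B, T, d)
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))
transition_head = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.GELU(), torch.nn.Linear(d, d))
print(nextlat_style_loss(hidden, logits, targets, transition_head))
```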
7. Implications, Constraints, and Design Principles
The next-token paradigm's foundations in adaptive signal weighting, probabilistic filtering, formal language universality, and explicit information-theoretic limits supply a blueprint for transformer architecture design:
- Layer norms and residual pathways stabilize fixed-point iterations in the surrogate filtering dynamics (Chang et al., 27 Aug 2025).
- Feedforward sublayer width is the main determinant of context-memorization capacity (Madden et al., 2024).
- Prediction error shrinks by an approximately constant multiplicative factor per layer under consistent training and normalization (He et al., 2024).
- Explicit modularity enables hybrid, task-specialized designs where analytic filtering, explicit planning, or multi-modal conditioning are interleaved with learned attention mechanisms (Chang et al., 27 Aug 2025, Teoh et al., 8 Nov 2025, Wang et al., 2024).
- Limitations include the lack of incentivized compact state summarization (addressed by NextLat), tradeoffs in computation vs. expressivity (decoder vs. encoder-only architectures), and the challenges of learning long-horizon dependencies or handling ambiguous context with only causal structure.
A plausible implication is that future transformer-based next-token prediction architectures will increasingly blend analytic inference steps, non-causal context integration, and modular hybridization with specialized auxiliary losses to optimize for both expressivity and sample efficiency. The architectural and mathematical principles established above provide the technical scaffolding for continued progress in both theory and application of next-token-predictive neural sequence models.