Next-Token Generators: Advances & Applications

Updated 4 July 2026

Next-token generators are autoregressive models that iteratively predict tokens conditioned on their preceding context, applicable across diverse modalities.
They are trained with teacher forcing and run autoregressively at inference; the gap between the two regimes raises challenges such as error accumulation, while both regimes preserve causal conditioning.
Advances in representation and objective design extend their use to multimodal, hybrid, and continuous settings, enhancing both versatility and inference efficiency.

Next-token generators are autoregressive models that represent a sequence distribution by repeatedly predicting the next element conditioned on the realized prefix. In its canonical discrete form, this is the chain-rule factorization

$p(x)=\prod_{t=1}^{T} p(x_t\mid x_{<t}),$

optimized by next-token negative log-likelihood and executed by left-to-right decoding. Recent work extends the same principle far beyond text: discrete image latents, multimodal token streams, semantic item identifiers, jet constituents, source-space MEG, and continuous-valued audio tokens have all been cast as next-token generation problems. Across these settings, the defining property is causal conditioning rather than any commitment to language or to discrete vocabularies (Kilian et al., 2024, Wang et al., 2024, Yang et al., 14 Jul 2025).
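A minimal sketch of the chain-rule factorization, assuming a hypothetical toy model whose conditional depends only on the previous token. The table `COND` and the helper names are illustrative and do not come from any cited work.

```python
import math

# Hypothetical toy conditional model: p(x_t | x_{<t}) depends only on the
# previous token. Each row sums to 1.
START = "<s>"
COND = {
    "<s>": {"a": 0.6, "b": 0.3, "<eos>": 0.1},
    "a": {"a": 0.2, "b": 0.5, "<eos>": 0.3},
    "b": {"a": 0.4, "b": 0.1, "<eos>": 0.5},
}


def next_token_probs(prefix):
    """Return p(. | prefix) for the toy model (only the last token matters)."""
    last = prefix[-1] if prefix else START
    return COND[last]


def sequence_log_prob(tokens):
    """Chain rule: log p(x) = sum over t of log p(x_t | x_{<t})."""
    total = 0.0
    for t, tok in enumerate(tokens):
        total += math.log(next_token_probs(tokens[:t])[tok])
    return total


def negative_log_likelihood(tokens):
    """Per-token NLL, the quantity that next-token training minimizes."""
    return -sequence_log_prob(tokens) / len(tokens)


seq = ["a", "b", "<eos>"]
log_p = sequence_log_prob(seq)  # log(0.6 * 0.5 * 0.5)
nll = negative_log_likelihood(seq)
```

A real generator replaces the lookup table with a neural network, but the sequence probability is composed from per-step conditionals in the same way.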

1. Autoregressive factorization and the generator viewpoint

A next-token generator specifies a conditional distribution for each position and composes those conditionals into a full sequence model. In latent-image synthesis, the formulation is written as

$p(z) = \prod_{i=1}^{n} p(z_i \mid z_{<i}),$

with training objective

$\mathcal{L}_{NT} = \mathbb{E}_i \left[-\log p(z_i \mid z_{<i}; \theta)\right].$

Inference starts from an empty sequence, implemented practically with a start token, and then samples tokens autoregressively until the latent grid is complete (Kilian et al., 2024). In recommendation, the same pattern appears over semantic item identifiers rather than words: a user history is flattened into a long token sequence, and the target item identifier is generated token by token (Chiu et al., 25 Jan 2026, Zheng et al., 6 Apr 2025).
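The sampling loop described above can be sketched as follows. The codebook size, the grid shape, the start-token id, and the stand-in `next_token_weights` rule are all hypothetical; a trained model would supply the conditional distribution instead.

```python
import random

CODEBOOK_SIZE = 8       # hypothetical tiny codebook
GRID_H, GRID_W = 2, 3   # hypothetical latent grid shape
START = -1              # start-token id, kept outside the codebook range


def next_token_weights(prefix):
    """Stand-in for a trained model: unnormalized weights over codebook ids.

    Toy rule: favor the id that follows the previous token.
    """
    prev = prefix[-1]
    favored = 0 if prev == START else (prev + 1) % CODEBOOK_SIZE
    return [5.0 if k == favored else 1.0 for k in range(CODEBOOK_SIZE)]


def sample_latent_grid(rng):
    """Sample tokens one at a time until the latent grid is complete."""
    prefix = [START]
    while len(prefix) - 1 < GRID_H * GRID_W:
        weights = next_token_weights(prefix)
        token = rng.choices(range(CODEBOOK_SIZE), weights=weights, k=1)[0]
        prefix.append(token)
    tokens = prefix[1:]  # drop the start token
    # Reshape the flat sequence into rows (row-major order).
    return [tokens[r * GRID_W:(r + 1) * GRID_W] for r in range(GRID_H)]


grid = sample_latent_grid(random.Random(0))
```

In a latent-image pipeline the completed grid would then be passed to the autoencoder's decoder; that step is omitted here.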

A central operational distinction is between teacher-forced training and autoregressive inference. Under teacher forcing, the model is trained on ground-truth prefixes; during inference, it consumes its own outputs. This distinction is not merely pedagogical. It underlies both standard concerns about error accumulation and stronger claims that teacher forcing can fail to learn the correct predictor on tasks requiring lookahead (Bachmann et al., 2024). The generator viewpoint therefore includes both the probabilistic factorization and the rollout regime by which a learned conditional law becomes a sampler.
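The gap between the two regimes can be illustrated with a deliberately simple toy, assuming a hypothetical predictor that is wrong for exactly one input value. Under teacher forcing each mistake stays local; in a rollout the first mistake shifts every later prefix.

```python
MOD = 5


def true_next(prev):
    """Ground-truth process: count upward modulo 5."""
    return (prev + 1) % MOD


def model_next(prev):
    """Hypothetical imperfect predictor: wrong only when prev == 3."""
    return 0 if prev == 3 else (prev + 1) % MOD


def ground_truth(start, n):
    seq, prev = [], start
    for _ in range(n):
        prev = true_next(prev)
        seq.append(prev)
    return seq


def teacher_forced_predictions(start, truth):
    """Each prediction is conditioned on the ground-truth previous token."""
    prefixes = [start] + truth[:-1]
    return [model_next(p) for p in prefixes]


def rollout_predictions(start, n):
    """Each prediction is conditioned on the model's own previous output."""
    seq, prev = [], start
    for _ in range(n):
        prev = model_next(prev)
        seq.append(prev)
    return seq


def n_correct(pred, truth):
    return sum(p == t for p, t in zip(pred, truth))


truth = ground_truth(0, 10)
tf_correct = n_correct(teacher_forced_predictions(0, truth), truth)
rollout_correct = n_correct(rollout_predictions(0, 10), truth)
```

With these settings the teacher-forced predictor matches 8 of 10 targets, while the free-running rollout matches only 3 of 10 because it never recovers after its first error.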

The same causal factorization can be generalized beyond discrete symbols. In continuous audio generation, AudioNTP models

$p(a \mid w)=p(x_1,\dots,x_n\mid w)=\prod_{i=1}^{n} p(x_i\mid x_{<i},w),$

where $x_i\in\mathbb{R}^h$ are continuous latent tokens and each conditional is modeled by a diffusion head rather than a softmax over token IDs. This makes explicit that next-token generation is a statement about sequential conditionalization, not about categorical vocabularies (Yang et al., 14 Jul 2025).

2. Representation design: discrete tokens, hybrid inputs, and flattened streams

The practical behavior of a next-token generator is strongly determined by how data are represented. In multimodal models such as Emu3, text tokens, image tokens, and video tokens are placed into a shared discrete sequence. Images and videos are tokenized with SBER-MoVQGAN; a $512\times512$ image or a $4\times512\times512$ video clip becomes 4096 discrete tokens from a codebook of size 32,768, and the resulting sequence is embedded in a document-like format with modality markers such as [SOV], [SOT], [EOV], [EOL], and [EOF] (Wang et al., 2024). Medical referring image segmentation adopts a closely related sequence construction: image tokens, text tokens, and mask tokens are concatenated into a unified autoregressive stream so that mask prediction becomes next-token generation over multimodal tokens (Chen et al., 7 Nov 2025).
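The token counts quoted above imply specific compression factors. The arithmetic below is a sketch that assumes the 4096 tokens form a square spatial grid and that the four frames of a clip share a single latent grid; both assumptions are inferred from the numbers rather than stated in the text.

```python
import math

IMAGE_SIDE = 512        # pixels per side, from the text
CLIP_FRAMES = 4         # frames per video clip, from the text
TOKENS = 4096           # discrete tokens per image or clip, from the text
CODEBOOK_SIZE = 32768   # codebook entries, from the text

# Assumption: tokens are laid out on a square grid.
grid_side = math.isqrt(TOKENS)                 # 64
spatial_factor = IMAGE_SIDE // grid_side       # 8x downsampling per side

# Assumption: all frames of a clip map onto one latent grid.
temporal_factor = CLIP_FRAMES                  # 4x temporal compression

bits_per_token = math.log2(CODEBOOK_SIZE)      # 15 bits
pixels_per_token_image = IMAGE_SIDE ** 2 // TOKENS                # 64
pixels_per_token_video = CLIP_FRAMES * IMAGE_SIDE ** 2 // TOKENS  # 256
```

Under these assumptions each image token summarizes an 8 by 8 pixel patch, and each video token summarizes the same patch across four frames.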

Jet foundation models expose a different representational issue. In the original OmniJet-$\alpha$ setup, jet constituents were tokenized with a VQ-VAE into discrete token-IDs $t_i$, and the same token-IDs were used as inputs to both the generator and the downstream classifier. The enhanced formulation decouples these roles: continuous feature vectors $\vec{c}_i$, or pseudo-continuous vectors decoded from the tokens, are used as model inputs, while token-IDs remain only as next-token targets. The paper characterizes this as a hybrid continuous-input / discrete-target setup and argues that it avoids classification penalties induced by tokenization artifacts while preserving autoregressive generation (Birk et al., 3 Dec 2025).
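A minimal sketch of the hybrid continuous-input / discrete-target pairing. Nearest-neighbor lookup in a fixed hypothetical codebook stands in for the VQ-VAE tokenizer, and the two-dimensional features are illustrative only.

```python
# Hypothetical 2-D codebook; a real tokenizer would learn these vectors.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]


def nearest_code(vec, codebook):
    """Return the index of the codebook vector closest to vec (squared L2)."""
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(vec, code))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def hybrid_training_pairs(constituents, codebook):
    """Pair continuous prefixes (inputs) with discrete next-token targets."""
    token_ids = [nearest_code(c, codebook) for c in constituents]
    pairs = []
    for i in range(len(constituents) - 1):
        inputs = constituents[:i + 1]   # continuous vectors c_1..c_i
        target = token_ids[i + 1]       # token-ID of the next constituent
        pairs.append((inputs, target))
    return token_ids, pairs


constituents = [(0.9, 0.1), (0.2, 0.8), (0.1, 0.1)]
token_ids, pairs = hybrid_training_pairs(constituents, CODEBOOK)
```

The model never sees the quantized values as inputs, so quantization error affects only the targets, which is the decoupling the paragraph describes.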

Generative recommendation adopts structured identifiers rather than raw item labels. In one line of work, items are mapped to fixed-length semantic ID sequences by RQ-VAE or PQ-based schemes, and next-item prediction becomes sequence generation over those IDs (Chiu et al., 25 Jan 2026). In UTGRec, a universal item tokenizer built from Qwen2-VL and tree-structured codebooks produces multi-code identifiers from multimodal item content, again shifting the generator from item labels to code sequences (Zheng et al., 6 Apr 2025). The same principle appears in source-space MEG, where BrainTokMix compresses multichannel signals into RVQ indices, serializes the resulting 3D grid into a single sequence, and trains a decoder-only Transformer on the flattened stream (Csaky, 28 Jan 2026).
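Serializing a 3D index grid into one stream can be sketched as below. The axis order (time, spatial group, residual level) and the row-major traversal are assumptions chosen for illustration; the text does not specify the ordering used.

```python
def flatten_grid(grid):
    """Serialize a nested [time][group][level] grid in row-major order."""
    return [v for plane in grid for row in plane for v in row]


def flat_index(t, s, q, shape):
    """Position in the stream of the grid entry at (t, s, q)."""
    _, n_s, n_q = shape
    return (t * n_s + s) * n_q + q


def unflatten(stream, shape):
    """Inverse of flatten_grid for a known shape."""
    n_t, n_s, n_q = shape
    if len(stream) != n_t * n_s * n_q:
        raise ValueError("stream length does not match shape")
    return [
        [[stream[flat_index(t, s, q, shape)] for q in range(n_q)]
         for s in range(n_s)]
        for t in range(n_t)
    ]


shape = (2, 2, 3)  # hypothetical: 2 time steps, 2 groups, 3 residual levels
grid = [[[t * 100 + s * 10 + q for q in range(3)] for s in range(2)]
        for t in range(2)]
stream = flatten_grid(grid)
```

Because the mapping is invertible, generated streams can be reshaped back into the grid before decoding to signals.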

These representation choices show that next-token generators are not tied to a single ontology of “token.” Tokens may be linguistic units, latent visual codes, semantic IDs, residual vector-quantization indices, or continuous latent vectors. A plausible implication is that the boundary between sequence modeling and modality modeling is increasingly set by tokenization design rather than by the generator architecture itself.

3. Objective design beyond vanilla next-token likelihood

Although the basic next-token objective is cross-entropy on the immediate successor token, recent work broadens this objective along several axes. In jet modeling, joint pre-training combines causal next-token prediction with masked particle modeling. The model uses two heads and two backbone forward passes: one causal pass for NTP and one bidirectional pass for MPM, with the total loss given by the sum of the two objectives. This construction is intended to preserve generative fidelity while injecting the contextual information needed for downstream classification (Birk et al., 3 Dec 2025).
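The two passes differ in their attention mask and in which positions they predict. The sketch below shows that difference on a short token list; `MASK_ID`, the choice of masked positions, and the helper names are hypothetical, and the per-objective losses are represented by plain numbers.

```python
MASK_ID = -1  # hypothetical id for the mask placeholder


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def ntp_example(tokens):
    """Causal pass: inputs drop the last token, targets are shifted by one."""
    return tokens[:-1], tokens[1:]


def mpm_example(tokens, masked_positions):
    """Bidirectional pass: hide chosen positions and predict them back."""
    inputs = [MASK_ID if i in masked_positions else t
              for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(masked_positions)}
    return inputs, targets


def joint_loss(ntp_loss, mpm_loss):
    """Total loss as the plain sum of the two objectives."""
    return ntp_loss + mpm_loss


tokens = [7, 3, 9, 4]
ntp_inputs, ntp_targets = ntp_example(tokens)
mpm_inputs, mpm_targets = mpm_example(tokens, {1, 3})
```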

Audio generation introduces a different extension. AudioMNTP randomly drops tokens from the input sequence and then trains the causal LLM to predict future continuous tokens from a visible subsequence, adding a target positional embedding so that the model knows which future position it is reconstructing. The paper presents this as masked next-token prediction inside a strictly causal framework and treats standard NTP as the special case obtained when the visible set contains all prior tokens (Yang et al., 14 Jul 2025). Diffusion Forcing generalizes the idea further by assigning each token its own diffusion noise level and training a causal denoiser over partially noised sequences. In this view, next-token generation and full-sequence diffusion are endpoints of a broader family of sequence denoising problems (Chen et al., 2024).
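One way to picture masked next-token prediction is as a change in how training examples are built. The sketch below is an assumption-laden illustration, not the AudioMNTP implementation: tokens before the target are kept independently with probability `keep_prob`, and the target position is passed alongside the visible tokens, playing the role of the target positional embedding.

```python
import random


def make_masked_ntp_example(tokens, target_pos, keep_prob, rng):
    """Build one example: a visible subset of the prefix plus the target.

    Each earlier token is kept with probability keep_prob. Positions are
    stored with the tokens so the model knows where each one came from.
    """
    visible = [(i, tokens[i]) for i in range(target_pos)
               if rng.random() < keep_prob]
    return {
        "visible": visible,
        "target_pos": target_pos,   # which future position to reconstruct
        "target": tokens[target_pos],
    }


tokens = [0.1, 0.4, -0.3, 0.8, 0.5]  # stand-ins for continuous latent tokens

# keep_prob = 1.0 keeps every prior token: the standard NTP special case.
standard = make_masked_ntp_example(tokens, 4, 1.0, random.Random(0))

# keep_prob < 1.0 drops some prior tokens while preserving causal order.
masked = make_masked_ntp_example(tokens, 4, 0.5, random.Random(0))
```

The visible set only ever contains positions before the target, so the construction stays strictly causal regardless of how many tokens are dropped.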

Segmentation and recommendation provide more task-specific extensions. NTP-MRISeg augments standard autoregression with Next-k Token Prediction, Token-level Contrastive Learning, and memory-based Hard Error Token optimization. The stated purpose is to reduce exposure bias, mitigate long-tail token distributions, and sharpen fine-grained lesion boundaries (Chen et al., 7 Nov 2025). Token-weighted multi-target learning for generative recommenders replaces uniform token importance with Front-Greater Weighting and Frequency Weighting, then combines both with standard likelihood through adaptive multi-target optimization and curriculum learning. The paper’s underlying claim is that semantic ID tokens carry unequal information, so next-token generators should weight them accordingly (Chiu et al., 25 Jan 2026).
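The idea of unequal token importance can be sketched with a weighted negative log-likelihood. The geometric decay used for front-greater weights here is a hypothetical choice for illustration; the text does not give the paper's actual weighting formula.

```python
def front_greater_weights(n_tokens, decay=0.5):
    """Hypothetical front-greater scheme: earlier ID tokens weigh more."""
    return [decay ** i for i in range(n_tokens)]


def weighted_nll(token_nlls, weights):
    """Weighted average of per-token negative log-likelihoods."""
    if len(token_nlls) != len(weights):
        raise ValueError("one weight is required per token")
    return sum(w * l for w, l in zip(weights, token_nlls)) / sum(weights)


# Per-token NLLs for one semantic ID of length 4 (illustrative values).
token_nlls = [2.0, 1.0, 1.0, 0.0]

uniform = weighted_nll(token_nlls, [1.0] * 4)
front_weighted = weighted_nll(token_nlls, front_greater_weights(4))
```

With these values the uniform loss is 1.0, while the front-weighted loss is larger because the poorly predicted first token dominates the average.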

Distillation work on multimodal LLMs identifies another limitation of plain next-token matching: static next-token KD under ground-truth prefixes does not capture token interactions. Align-TI therefore supplements vanilla KD with Instruction-aware Vision Alignment and Transition Probability Alignment, the latter aligning sequential token-to-token transition probabilities rather than only the current next-token distribution (Chen et al., 10 Feb 2026). Taken together, these developments suggest that “next-token generation” now names a family of causal objectives rather than a single one-step loss.

4. Inference, efficiency, and generator access

A recurrent empirical claim is that next-token generators are unusually efficient at inference. In a FLOPs-controlled comparison among diffusion, masked-token prediction, and next-token prediction for image synthesis, next-token prediction is reported as “by far the most efficient” at inference because, under KV caching, each decoding step processes only the newly generated token, so generating a full sequence costs roughly one forward pass over that sequence, whereas iterative denoising methods multiply the forward cost by the number of steps. The same study finds that token prediction methods, led by next-token prediction, significantly outperform diffusion on prompt following; on image quality, next-token prediction is initially stronger, but diffusion eventually catches up and can surpass it on FID at higher compute, especially with stronger autoencoders (Kilian et al., 2024).
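The inference-cost argument can be made concrete with a rough count of token-position evaluations. This is a simplified proxy for FLOPs: it ignores the growth of attention cost with context length, and the step count of 50 is a hypothetical value rather than a figure from the study.

```python
def ntp_token_evals(n_tokens, kv_cache=True):
    """Token positions processed while decoding n_tokens autoregressively."""
    if kv_cache:
        # Each step runs the network on one new token; the past is cached.
        return n_tokens
    # Without a cache, step t reprocesses the whole prefix of length t.
    return sum(range(1, n_tokens + 1))


def iterative_token_evals(n_tokens, n_steps):
    """Token positions processed when every step touches the full sequence."""
    return n_tokens * n_steps


N_TOKENS = 4096   # tokens per image, matching the tokenizer figure above
N_STEPS = 50      # hypothetical number of denoising or unmasking steps

cached = ntp_token_evals(N_TOKENS)
uncached = ntp_token_evals(N_TOKENS, kv_cache=False)
iterative = iterative_token_evals(N_TOKENS, N_STEPS)
ratio = iterative // cached
```

Under this proxy the iterative method costs as many sequence-length passes as it has steps, and the comparison also shows how much of the autoregressive advantage depends on the cache.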

This efficiency has motivated a substantial literature on decoding acceleration. LogitSpec replaces a learned draft model in speculative decoding with retrieval guided by the observation that the last-token logit can often speculate the next-next token. It retrieves continuations using both the sampled next token and the speculated next-next-token candidates, is training-free and plug-and-play, and reports a decoding speedup with 3.28 mean accepted tokens per decoding step. Its runtime breakdown attributes only 1.17% of decoding time to retrieval on one benchmark, indicating that the dominant cost remains the model forward pass (Liu et al., 2 Jul 2025). A smaller-scale refinement strategy trains a second decoder-only model to predict the second-to-last token and uses that model to rerank the top-$k$ next-token candidates in a generate-then-refine pipeline; second-to-last prediction is reported as more than 15% more accurate than standard next-token prediction, while the downstream gain in next-token accuracy is smaller but consistent and significant (Schneider, 2024).
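The notion of accepted tokens per decoding step comes from the draft-and-verify loop shared by speculative decoding methods. The sketch below shows generic greedy verification with a hypothetical toy target model; it is not LogitSpec's retrieval procedure. In practice the target model scores all draft positions in one forward pass, whereas this version calls it sequentially for readability.

```python
def target_next(context):
    """Hypothetical target model: greedy next token is (last + 1) mod 10."""
    return (context[-1] + 1) % 10


def verify_draft(prefix, draft, next_fn):
    """Greedy verification of a drafted continuation.

    Draft tokens are accepted while each equals the target model's greedy
    choice. The step always ends with one token from the target itself:
    a correction after a mismatch, or a bonus token if the draft survives.
    """
    context = list(prefix)
    emitted = []
    for tok in draft:
        if next_fn(context) != tok:
            break
        emitted.append(tok)
        context.append(tok)
    emitted.append(next_fn(context))
    return emitted


prefix = [1, 2]
partly_right = verify_draft(prefix, [3, 4, 9], target_next)
all_right = verify_draft(prefix, [3, 4, 5], target_next)
no_draft = verify_draft(prefix, [], target_next)
```

Each verification step therefore emits at least one token and at most one more than the draft length, which is why the mean number of accepted tokens per step is a natural efficiency metric.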

Inference behavior also depends on what the learner is allowed to query. In autoregressive post-training, root-start rollouts, sampled-token log probabilities, top-$k$ reports, and full next-token distributions along sampled trajectories collapse to one canonical experiment, limited by the on-policy probability of reaching informative prefixes. Weak prefix control breaks that barrier, and once control is available, richer observations such as conditional sampling or logits can outperform top-1 access. The paper frames this as an exponential gap created purely by the generator interface in KL-regularized outcome-reward post-training (Rege, 6 Apr 2026). The result shifts part of the discussion from model architecture to access model: not all “next-token generator” APIs expose the same algorithmic leverage.

5. Cross-domain instantiations

Next-token generators now function as a cross-domain sequence modeling template.

Domain	Representation	Generator form
Images and video	Discrete vision tokens	Single decoder-only Transformer (Wang et al., 2024)
Jets	Continuous inputs, discrete token targets	Hybrid autoregressive jet model (Birk et al., 3 Dec 2025)
Audio	Continuous-valued latent tokens	Causal LM with diffusion head (Yang et al., 14 Jul 2025)
Recommendation	Semantic ID code sequences	Next-item autoregressive generator (Chiu et al., 25 Jan 2026)
Source-space MEG	Flattened RVQ brain-token stream	Decoder-only next-brain-token model (Csaky, 28 Jan 2026)
Medical segmentation	Unified image-text-mask token sequence	Autoregressive mask generator (Chen et al., 7 Nov 2025)

In multimodal generation, Emu3 is the clearest statement of the unification thesis. It trains a single decoder-only Transformer from scratch on mixed sequences of text, images, and videos, with standard next-token cross-entropy and no separate image encoder, diffusion decoder, or dedicated video model. The model is used for text continuation, image generation, video generation, multimodal understanding, and future video prediction, and the paper explicitly argues that once modalities are tokenized well enough, a single next-token generator can replace diffusion or compositional multimodal stacks (Wang et al., 2024).

In scientific and high-bandwidth domains, the same template is adapted rather than abandoned. The jet work emphasizes that next-token pre-training is simulation-free and transferable across datasets, provided the representation and objective are redesigned for downstream tagging (Birk et al., 3 Dec 2025). The MEG work shows that a Qwen2.5-VL-style decoder can be trained from scratch on flattened brain-token streams and generate minutes of MEG from up to a minute of context, with explicit long-horizon stability and prompt-specificity evaluations (Csaky, 28 Jan 2026). AudioNTP and AudioMNTP show that decoder-only next-token generation can remain causal and streamable even when each token is continuous and sampled through a diffusion process (Yang et al., 14 Jul 2025). In recommendation and segmentation, next-token generation serves less as unconditional synthesis than as structured prediction over learned codes, but the operational logic remains autoregressive token generation (Chiu et al., 25 Jan 2026, Chen et al., 7 Nov 2025).

6. Theory, criticisms, and alternatives

Theoretical work offers sharply different interpretations of what next-token generators learn. One formal-language account interprets Transformers not merely as next-token predictors but as stochastic generators of left context-sensitive languages, with each next-token step acting as a dynamic probabilistic approximation to a left context-sensitive production rule (Rhee, 15 Apr 2025). A complementary mechanistic analysis of a single self-attention layer trained for next-token prediction describes a two-stage process of hard retrieval and soft composition, where gradient descent recovers strongly connected components of a token-priority graph and attention retrieves tokens from the highest-priority SCC available in context (Li et al., 2024). At a more expansive level, auto-regressive next-token predictors trained on chain-of-thought data are argued to be universal learners: even linear next-token predictors can approximate functions computed by Turing machines, with “length complexity” measuring the number of intermediate tokens required by the computation trace (Malach, 2023). A separate complexity-theoretic result shows that optimizing next-token prediction over an RNN yields $k$-token indistinguishability against bounded next-$k$-token distinguishers, offering one account of why locally optimized models can still exhibit long-range coherence (Cao et al., 8 Dec 2025).

Critical and delimiting work qualifies these claims. One paper argues that the slogan “LLMs learn the next-token conditional distribution” is only conditionally correct because it conflates the full world-conditioned process, the marginal text-only law, and the model-induced distribution learned from finite corpora. In that account, next-token prediction is useful only when the observed prefix is an approximately sufficient statistic for the hidden circumstances that determine continuation, formalized by a sufficiency criterion of the form

$p(x_t \mid x_{<t}, c) \approx p(x_t \mid x_{<t}),$

where $c$ denotes the hidden circumstances.

RAG and tool use are then interpreted as conditional sufficiency devices that reduce residual dependence on omitted state (Corielli, 22 May 2026). Another critique separates teacher-forced training from autoregressive inference and argues that, on lookahead tasks, teacher forcing itself can fail through “Clever Hans cheating” and “Indecipherable token” effects, motivating teacherless or multi-token objectives and target reversal as partial remedies (Bachmann et al., 2024).

The broader research program now includes explicit alternatives. A recent survey groups them into Multi-Token Prediction, Plan-then-Generate, Latent Reasoning, Continuous Generation Approaches, and Non-Transformer Architectures, framing poor long-term planning, error accumulation, and sequential inefficiency as persistent weaknesses of pure NTP (Wyatt et al., 29 Sep 2025). The contemporary field is therefore characterized less by consensus than by a structured tension: some works argue that next-token prediction, given sufficiently strong tokenization and scale, is an adequate unifying principle for multimodal intelligence (Wang et al., 2024), while others treat it as a locally powerful but globally incomplete objective that benefits from masking, multi-token targets, latent planning, retrieval, tools, or non-autoregressive refinement (Wyatt et al., 29 Sep 2025, Chen et al., 2024).

In that sense, next-token generators are best understood not as a single settled model class but as a general autoregressive paradigm. Its modern forms range from plain categorical language modeling to hybrid continuous/discrete generators, retrieval-augmented speculative decoders, multimodal unified token streams, and causal diffusion-based sequence models. The paradigm’s continued centrality comes from the breadth of this template; its continuing controversy comes from the limits of what any one-step local prediction objective can guarantee.