Autoregressive Transformers with Continuous Tokens

Updated 5 April 2026

Autoregressive Transformers with Continuous Tokens are sequence models that operate directly on high-dimensional, continuous token representations, bypassing discrete bottlenecks.
They leverage continuous latent spaces from methods like VAEs and diffusion autoencoders, adapting Transformer architectures with specialized embeddings, attention mechanisms, and loss functions.
This paradigm enhances generative performance in visual, audio, language, and motion domains while addressing challenges in scalability and stability inherent to discrete tokenization.

Autoregressive Transformers with Continuous Tokens are a paradigm for generative modeling in which sequence models avoid the discretization bottleneck and operate directly on vector-valued, high-dimensional continuous token representations. The approach arose in response to the information loss and scalability limits of vector quantization and discrete tokenization. It uses continuous latent spaces produced by learned tokenizers (e.g., VAEs or diffusion autoencoders) and adapts the Transformer architecture, traditionally designed for categorical token prediction, to model the conditional distributions and generation dynamics of continuous-valued sequences. Applications span visual, audio, language, motion, and hybrid domains, where the cited works report improved fidelity, a richer spectrum of editability, and favorable scaling relative to both discrete and diffusion-dominated baselines.

1. Continuous Tokenization and Sequence Construction

The core departure from discrete-token transformers is the tokenizer, which maps data into a sequence of continuous vectors. In vision, for example, a VAE or diffusion autoencoder encodes each 256×256 RGB image into a 32×32 grid with 4 or more floating-point channels per location. Patch aggregation (such as 2×2) is commonly employed to reduce sequence length while maintaining spatial information: grouping 2×2 neighborhoods of a 32×32×4 grid yields T = 256 tokens, each a vector cₜ∈ℝ¹⁶ (Fan et al., 2024). More generally, each token is a vector in ℝ^d, with d typically 4–16. For text, continuous embedding spaces or learned latent codes support related constructions; for audio and motion, analogous latent-space encoders produce frame- or window-wise continuous tokens (Li et al., 12 Feb 2026, Yang et al., 14 Jul 2025).
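The token-count arithmetic can be checked with a small sketch. It assumes a 32×32 latent grid with 4 channels and 2×2 patch aggregation; the function name `patchify` and the raster ordering of blocks are illustrative choices, not taken from any cited implementation.

```python
def patchify(grid, patch=2):
    """Group an H x W x C latent grid into (H/patch) * (W/patch) tokens.

    Each token concatenates the channels of one patch x patch block,
    so its dimension is patch * patch * C.
    """
    height, width = len(grid), len(grid[0])
    assert height % patch == 0 and width % patch == 0
    tokens = []
    for top in range(0, height, patch):        # raster order over blocks
        for left in range(0, width, patch):
            token = []
            for dy in range(patch):
                for dx in range(patch):
                    token.extend(grid[top + dy][left + dx])
            tokens.append(token)
    return tokens


# Hypothetical 32 x 32 latent grid with 4 channels per location.
latent = [[[float(r * 32 + c)] * 4 for c in range(32)] for r in range(32)]
tokens = patchify(latent, patch=2)
print(len(tokens), len(tokens[0]))  # 256 16
```

With these sizes, 1024 grid locations collapse into 256 tokens of dimension 16, which is where the sequence length T = 256 comes from.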

These continuous codes are projected via a learnable linear transformation into the model’s hidden space:

$x_t = W_e\,c_t + b_e,\quad W_e \in \mathbb{R}^{d_{\text{model}} \times d},\;b_e \in \mathbb{R}^{d_{\text{model}}}$

The resulting sequence of vectors serves as inputs for the autoregressive Transformer, preserving finer structure and information compared to discrete tokens.
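As a minimal sketch of this projection, the following applies x_t = W_e c_t + b_e to a short sequence using plain Python lists. The toy sizes, the Gaussian initialization, and the name `embed_tokens` are assumptions for illustration only.

```python
import random


def embed_tokens(tokens, weight, bias):
    """Apply x_t = W_e c_t + b_e to every continuous token c_t."""
    embedded = []
    for c_t in tokens:
        x_t = [
            sum(w_ij * c_j for w_ij, c_j in zip(row, c_t)) + b_i
            for row, b_i in zip(weight, bias)
        ]
        embedded.append(x_t)
    return embedded


rng = random.Random(0)
d, d_model, T = 16, 8, 4   # toy sizes; d_model is far larger in practice
W_e = [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(d_model)]
b_e = [0.0] * d_model
codes = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]

hidden = embed_tokens(codes, W_e, b_e)
print(len(hidden), len(hidden[0]))  # 4 8
```

Unlike a discrete embedding table, there is no lookup by index here: the same affine map is applied to every token vector.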

2. Transformer Architecture Adaptations

The base model is typically a decoder-only Transformer but with important modifications to accommodate continuous tokens:

Input Embedding: All continuous tokens are linearly projected (sometimes with normalization) into the Transformer’s hidden space.
Attention Mechanisms: Variants include causal (raster order), flexible bidirectional (random order or masked), block-wise, or frequency/progressive ordering. The random order configuration uses bidirectional attention, allowing unmasked (observed) tokens to mutually attend and masked (to-be-predicted) tokens to attend only to the unmasked set at every iteration (Fan et al., 2024, Yu et al., 7 Mar 2025).
Cross-Modal Conditioners: For text-conditional tasks, cross-attention to language embeddings, often from frozen LLMs (e.g., T5-XXL in Fluid), is standard.
Efficiency and Locality: Innovations such as linear attention, depthwise convolution, gating mechanisms (KV gates), and block-wise context are incorporated for scalability and spatial locality, notably in LINA (Wang et al., 30 Jan 2026).
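The causal and random-order attention patterns listed above can be written as boolean masks. This is a sketch that follows the description given here literally (every query attends only to observed keys); actual implementations may differ, for example by adding always-visible conditioning tokens, and the function names are hypothetical.

```python
def causal_mask(num_tokens):
    """mask[i][j] is True when query i may attend to key j (j <= i)."""
    return [[j <= i for j in range(num_tokens)] for i in range(num_tokens)]


def random_order_mask(is_observed):
    """Mask for one random-order step.

    Every query, observed or masked, may attend only to observed keys.
    Observed tokens therefore attend to each other, while masked tokens
    see the observed set but not one another.
    Note: if no token is observed yet, every row is all False, so a real
    model needs some always-visible context (an assumption, e.g. class
    or text conditioning tokens).
    """
    n = len(is_observed)
    return [[bool(is_observed[j]) for j in range(n)] for _ in range(n)]


print(causal_mask(3))
print(random_order_mask([True, False, True, False])[1])
```

In this convention `True` means "may attend"; some libraries use the opposite convention, so the mask may need to be inverted before use.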

The architecture remains structurally similar to language transformers but with modality-specific heads (e.g., per-token diffusion or flow-matching MLPs) for continuous value prediction.
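To illustrate how a per-token head can turn a Transformer output into a continuous token, the sketch below draws a sample by Euler-integrating a velocity field from noise to data. It assumes the linear path x_s = (1 − s)·ε + s·x, which is one common flow-matching convention and not necessarily the one used by the cited models. The `velocity` callable stands in for the learned head, and `toy_velocity` is a hand-written oracle for a single-point target, used only so the example can run.

```python
import random


def sample_token(velocity, context, dim, num_steps, rng):
    """Draw one continuous token by Euler-integrating dx/ds = velocity.

    `velocity(x, s, context)` is a hypothetical stand-in for the small
    per-token head, conditioned on the Transformer output `context`.
    """
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from noise at s = 0
    step = 1.0 / num_steps
    for k in range(num_steps):
        s = k * step
        v = velocity(x, s, context)
        x = [xi + step * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, s, context):
    """Exact velocity when the target distribution is the single point `context`."""
    return [(ci - xi) / (1.0 - s) for ci, xi in zip(context, x)]


token = sample_token(toy_velocity, context=[0.5, -1.0], dim=2,
                     num_steps=8, rng=random.Random(0))
print(token)
```

With the oracle velocity the sampler lands on the target point up to rounding; a trained head would instead produce samples from a learned conditional distribution.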

3. Training Objectives for Continuous Token Prediction

For continuous tokens, traditional categorical prediction is replaced by objectives suited for vector-valued targets:

Denoising Objectives: Diffusion-based token prediction is prevalent. At training, each continuous token is noised according to a schedule (e.g., √α(t)xₜ + √(1−α(t))ε, where ε is standard Gaussian noise and α(t) sets the noise level), and the model predicts the added noise, minimizing the L₂ loss:

$\mathcal{L} = \mathbb{E}_{c,\,t,\epsilon} \left\| \epsilon_\theta(x_t^\text{noised}, t, \text{context}) - \epsilon \right\|_2^2$

(Fan et al., 2024, Yu et al., 7 Mar 2025, Yang et al., 14 Jul 2025)

Flow-Matching: The model predicts the instantaneous “velocity” of each token’s denoising trajectory; this objective is used particularly in audio, motion, and high-fidelity image generation (Li et al., 12 Feb 2026, Team et al., 14 Aug 2025).
Proper Scoring Rules: Energy score minimization (EAR), Hyvärinen score (diffusion), and log-likelihood (GIVT/flow-based) enable likelihood-free or implicit sampling with theoretical guarantees of distributional consistency (Shao et al., 12 May 2025).
Hybrid Losses: For multimodal or hybrid architectures (e.g., UniFluid, AGDC), both cross-entropy (for discrete sub-sequences) and per-token continuous losses are combined, often with a balancing coefficient.
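The denoising and flow-matching objectives above can be sketched as single-sample loss estimates. Several details are assumptions made for illustration: the noise level is passed directly as `alpha` rather than as a timestep, the flow-matching path is the linear interpolation x_s = (1 − s)·ε + s·x with target velocity x − ε, and the predictors are placeholders standing in for a network.

```python
import math
import random


def noise_token(x, alpha, eps):
    """Return sqrt(alpha) * x + sqrt(1 - alpha) * eps, elementwise."""
    a, b = math.sqrt(alpha), math.sqrt(1.0 - alpha)
    return [a * xi + b * ei for xi, ei in zip(x, eps)]


def squared_error(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))


def noise_prediction_loss(predict_noise, x, alpha, context, rng):
    """One-sample estimate of ||eps_theta(x_noised, alpha, context) - eps||^2."""
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    x_noised = noise_token(x, alpha, eps)
    return squared_error(predict_noise(x_noised, alpha, context), eps)


def flow_matching_loss(predict_velocity, x, s, context, rng):
    """One-sample estimate of ||v_theta(x_s, s, context) - (x - eps)||^2.

    Assumes the linear path x_s = (1 - s) * eps + s * x.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    x_s = [(1.0 - s) * ei + s * xi for xi, ei in zip(x, eps)]
    target = [xi - ei for xi, ei in zip(x, eps)]
    return squared_error(predict_velocity(x_s, s, context), target)


def zeros(x_in, level, context):
    """Placeholder predictor that always outputs zeros."""
    return [0.0] * len(x_in)


rng = random.Random(0)
clean = [0.3, -1.2, 0.8, 0.1]
print(noise_prediction_loss(zeros, clean, alpha=0.5, context=None, rng=rng))
print(flow_matching_loss(zeros, clean, s=0.5, context=None, rng=rng))
```

In a hybrid model, a loss of this kind would be added to the cross-entropy on the discrete sub-sequence, scaled by the balancing coefficient mentioned above.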

Regularization (e.g., weight decay, EMA) is employed to stabilize dense, continuous-space training. For domain-specific settings, auxiliary heads (exit or length regularizers) enable precise sequence termination, especially in variable-length generation (Shin et al., 9 Jan 2026).
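The EMA mentioned above keeps a slowly moving average of the weights alongside the trained ones. A minimal sketch, assuming parameters are stored as a dict of float lists and using an illustrative decay value:

```python
def ema_update(ema_params, params, decay=0.999):
    """Update in place: ema <- decay * ema + (1 - decay) * params."""
    for name, values in params.items():
        ema_params[name] = [
            decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params[name], values)
        ]
    return ema_params


ema = {"w": [0.0, 0.0]}
weights = {"w": [1.0, 2.0]}
ema_update(ema, weights, decay=0.9)
print(ema["w"])  # close to [0.1, 0.2]
```

The averaged copy is typically the one used for sampling, while the raw weights continue to receive gradient updates.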

4. Generation Strategies and Ordering

Autoregressive factorization over continuous tokens can follow several orderings:

Raster/Causal Order: Standard left-to-right generation, where each token depends only on its predecessors. Limitation: previously generated tokens are irrevocable, leading to issues in spatial coherence for images.
Random/BERT-style Order: At each iteration, a random permutation determines the prediction order; masked tokens are predicted based on all available context, providing bidirectional context within each generation step. This bidirectionality allows revision and improved global consistency, especially for vision and multimodal generation (Fan et al., 2024, Fan et al., 17 Mar 2025).
Blockwise/Hierarchical: Tokens are grouped into blocks or predicted at coarse-to-fine (e.g., low- to high-frequency) granularity. FAR (Yu et al., 7 Mar 2025) employs frequency bands as the AR axis, and blockwise schemes improve computational parallelism without sacrificing causality (Zhang et al., 1 Jul 2025).
Fusion with Discrete Tokens: Hybrid approaches condition continuous generation on pre-selected discrete “modes” (e.g., object categories), as in D2C, DisCon, and AGDC, which ensures global structure is determined before fine-grained detail (Wang et al., 21 Mar 2025, Zheng et al., 2 Jul 2025, Shin et al., 9 Jan 2026).
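The random-order strategy above can be sketched as a loop that fills positions in groups, where each group is predicted from the tokens observed so far. The uniform group size is a simplifying assumption (published systems may use non-uniform schedules), and `sample_token_at` is a hypothetical stand-in for the Transformer plus its per-token head.

```python
import random


def generate_random_order(num_tokens, num_steps, sample_token_at, rng):
    """Fill `num_tokens` positions in a random order over `num_steps` steps.

    `sample_token_at(position, context)` receives a snapshot of the
    sequence in which not-yet-generated positions are None.
    """
    order = list(range(num_tokens))
    rng.shuffle(order)                       # random generation order
    tokens = [None] * num_tokens
    per_step = -(-num_tokens // num_steps)   # ceiling division
    for start in range(0, num_tokens, per_step):
        context = list(tokens)               # observed set is frozen per step
        for position in order[start:start + per_step]:
            tokens[position] = sample_token_at(position, context)
    return tokens, order


def count_observed(position, context):
    """Toy sampler: returns how many tokens were observed, as a 1-d vector."""
    return [float(sum(t is not None for t in context))]


tokens, order = generate_random_order(16, 4, count_observed, random.Random(0))
print([tokens[p][0] for p in order])
```

Tokens within one group are predicted in parallel from the same context, which is what distinguishes this scheme from strict one-token-at-a-time raster decoding.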

5. Empirical Performance and Comparative Analysis

Autoregressive Transformers with continuous tokens substantially close the gap to, and in some scenarios surpass, diffusion models and discrete AR systems in several metrics:

Model	Modality	Params	Inference FID↓ (ImageNet 256)	GenEval (COCO)	Highlights
Fluid (rand, cont.)	Image	10.5B	6.16	0.69	State-of-the-art among AR T2I (Fan et al., 2024)
NextStep-1	Image	14B	≈6.2 (COCO 30k)	0.73	Unified, strong editing (Team et al., 14 Aug 2025)
LINA-H	Image	1.4B	2.18	0.66	Linear attention, efficient (Wang et al., 30 Jan 2026)
D2C-L (q-former)	Image	633M	3.14	–	Outperforms discrete/cont. alone (Wang et al., 21 Mar 2025)
DisCon-L	Image	558M	1.38 (gFID)	–	Conditional continuous AR (Zheng et al., 2 Jul 2025)
EAR-H	Image	937M	1.97	–	Single-pass, scoring rule AR (Shao et al., 12 May 2025)
LLaMo-3B	Motion	3B	FID 22.5 (text→motion)	–	Streaming, no quantization artifact (Li et al., 12 Feb 2026)
ARDiT (B=4, cont.)	Audio	–	–	–	170ms/step, near-perfect speech editing (Liu et al., 2024)

Larger models and random-order generation further improve results, with validation loss scaling roughly as a power law with model size for continuous-token AR (Fan et al., 2024). For hybrid and fusion models, conditioning on discrete modes dramatically improves stability and sample quality (Zheng et al., 2 Jul 2025, Wang et al., 21 Mar 2025). In multimodal settings, unified models such as UniFluid and AGDC can simultaneously handle text, vision, and vector data, maintaining or improving performance on both discrete and continuous subspaces (Fan et al., 17 Mar 2025, Shin et al., 9 Jan 2026).
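A power-law relation between validation loss and model size can be checked with a log-log linear fit. The sketch below assumes the simple form loss = a·N^(−b) with no irreducible-loss term. The data points are synthetic, generated from that formula purely for illustration; they are not measurements from any cited paper.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope   # (a, b)


# Synthetic points from loss = 5 * N^-0.1 (illustrative, not real results).
sizes = [1e8, 1e9, 1e10]
losses = [5.0 * n ** -0.1 for n in sizes]
a, b = fit_power_law(sizes, losses)
print(round(a, 3), round(b, 3))
```

On real training runs the points would not lie exactly on a line in log-log space, and the residuals indicate how well the power-law description holds.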

6. Theoretical and Practical Challenges

Modeling continuous tokens poses density-estimation and out-of-distribution (OOD) challenges: the support is unbounded, and a naive AR chain over continuous space can result in mode collapse or artifact generation (Zheng et al., 2 Jul 2025, Wang et al., 21 Mar 2025). Several architectural and objective innovations address these issues:

Diffusion/Flow Heads: Implicitly parameterize token distributions to avoid explicit pdf estimation, in contrast to explicit normalizing-flow or Gaussian heads (Fan et al., 2024, Li et al., 12 Feb 2026, Team et al., 14 Aug 2025).
Conditional Hybridization: Conditioning on discrete, high-level structure stabilizes continuous AR and enables efficient, accurate sample reconstruction (Zheng et al., 2 Jul 2025).
Attention and Memory Management: Linear attention, locality augmentation, and gating ensure tractable scaling for high-resolution generations (Wang et al., 30 Jan 2026).
Training Schemes: Random-order generation, blockwise AR, and curriculum learning all play roles in stabilizing learning and optimizing convergence (Fan et al., 2024, Yu et al., 7 Mar 2025, Zhang et al., 1 Jul 2025).
Objective Formulation: Proper scoring rule objectives (EAR, Hyvärinen, log-likelihood) offer viable likelihood-free or amortized training (Shao et al., 12 May 2025, Zhang et al., 1 Jul 2025).
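As an example of a likelihood-free objective, the energy score can be estimated from two independent model samples and one target, with no density evaluation. The sketch below uses the standard form from the scoring-rule literature, written as a loss to minimize; how EAR parameterizes or weights it is not specified here, so the two-sample estimator and the default β = 1 are assumptions.

```python
import math


def distance(u, v):
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def energy_score_loss(sample_a, sample_b, target, beta=1.0):
    """Two-sample estimate of E||X - y||^beta - 0.5 * E||X - X'||^beta.

    `sample_a` and `sample_b` are independent draws from the model for
    the same context; `target` is the ground-truth token y.
    """
    fit = 0.5 * (distance(sample_a, target) ** beta
                 + distance(sample_b, target) ** beta)
    spread = 0.5 * distance(sample_a, sample_b) ** beta
    return fit - spread


target = [3.0, 4.0]
print(energy_score_loss([0.0, 0.0], [0.0, 0.0], target))  # 5.0
print(energy_score_loss([3.0, 4.0], [3.0, 4.0], target))  # 0.0
```

The first term rewards samples that land near the target, while the second rewards diversity between samples, which discourages the model from collapsing to a single point estimate.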

7. Extensions, Limitations, and Future Directions

Autoregressive Transformers with continuous tokens are being extended to new domains:

Video and 3D: Fluid, LINA, and related frameworks explicitly propose extending to spatiotemporal continuous tokens and higher-dimensional shapes, though challenges in sequence length and context modeling remain (Fan et al., 2024, Wang et al., 30 Jan 2026).
Hybrid Discrete-Continuous and Multimodal: Hybrid models (DisCon, D2C, AGDC) showcase improved sample quality, stability, and inference efficiency by leveraging both discrete and continuous token spaces (Wang et al., 21 Mar 2025, Zheng et al., 2 Jul 2025, Shin et al., 9 Jan 2026).
Language, Motion, and Audio: Continuous latent representations have been successfully exploited in language modeling (TarFlowLM, Token Maturation), motion-language integration (LLaMo), and audio (ARDiT, AudioNTP), with benefits for editability, streaming, and fidelity (Li et al., 12 Feb 2026, Yang et al., 14 Jul 2025, Liu et al., 2024, Zhang et al., 1 Jul 2025, Naparstek, 8 Jan 2026).
Scaling and Compute Efficiency: Linear attention, blockwise AR, gating, and learned permutation schedules are critical to scaling continuous-token AR to extreme sequence lengths and large models (Wang et al., 30 Jan 2026, Zhang et al., 1 Jul 2025, Fan et al., 2024).
Sampling and Inference: Speedups via distillation (e.g., distilling multi-step flow-matching teachers into single-step students), blockwise prediction, and hybrid decoding improve practical applicability for real-time use cases (Liu et al., 2024, Li et al., 12 Feb 2026).

Persisting challenges include stability of unbounded density modeling, efficient long-sequence generation, and further acceleration of diffusion/flow-based sampling in AR contexts. There is active research into learned orderings, blockwise generation, and integration with strong pre-trained LLM or visual backbones (Fan et al., 2024, Yu et al., 7 Mar 2025, Fan et al., 17 Mar 2025).

References:

"Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens" (Fan et al., 2024)
"NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale" (Team et al., 14 Aug 2025)
"LINA: Linear Autoregressive Image Generative Models with Continuous Tokens" (Wang et al., 30 Jan 2026)
"Continuous Visual Autoregressive Generation via Score Maximization" (Shao et al., 12 May 2025)
"Frequency Autoregressive Image Generation with Continuous Tokens" (Yu et al., 7 Mar 2025)
"Rethinking Discrete Tokens: Treating Them as Conditions for Continuous Autoregressive Image Synthesis" (Zheng et al., 2 Jul 2025)
"D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens" (Wang et al., 21 Mar 2025)
"Autoregressive Diffusion Transformer for Text-to-Speech Synthesis" (Liu et al., 2024)
"Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction" (Yang et al., 14 Jul 2025)
"LLaMo: Scaling Pretrained LLMs for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens" (Li et al., 12 Feb 2026)
"Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics" (Naparstek, 8 Jan 2026)
"Unified Autoregressive Visual Generation and Understanding with Continuous Tokens" (Fan et al., 17 Mar 2025)
"AGDC: Autoregressive Generation of Variable-Length Sequences with Joint Discrete and Continuous Spaces" (Shin et al., 9 Jan 2026)
"Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows" (Zhang et al., 1 Jul 2025)
"Mixture of Tokens: Continuous MoE through Cross-Example Aggregation" (Antoniak et al., 2023)