
Autoregressive Transformers with Continuous Tokens

Updated 5 April 2026
  • Autoregressive Transformers with Continuous Tokens are sequence models that operate directly on high-dimensional, continuous token representations, bypassing discrete bottlenecks.
  • They leverage continuous latent spaces from methods like VAEs and diffusion autoencoders, adapting Transformer architectures with specialized embeddings, attention mechanisms, and loss functions.
  • This paradigm enhances generative performance in visual, audio, language, and motion domains while addressing challenges in scalability and stability inherent to discrete tokenization.

Autoregressive Transformers with Continuous Tokens are a paradigm for generative modeling in which sequence models avoid discretization bottlenecks and operate directly on vector-valued, high-dimensional continuous token representations. The approach arose in response to the information loss and scalability limits of vector quantization and discrete tokenization. It uses continuous latent spaces produced by learned tokenizers (e.g., VAEs or diffusion autoencoders) and adapts the Transformer architecture, traditionally designed for categorical token prediction, to model the conditional distributions and generation dynamics of continuous-valued sequences. Applications span visual, audio, language, motion, and hybrid domains, with reported gains in fidelity, editability, and scaling behavior relative to both discrete-token and diffusion-based baselines.

1. Continuous Tokenization and Sequence Construction

The core departure from discrete-token transformers is the tokenizer, which maps data into a sequence of continuous vectors. In vision, for example, a VAE or diffusion autoencoder encodes each 256×256 RGB image into a 32×32 grid with 4 or more floating-point channels per location. Patch aggregation (such as 2×2) is commonly employed to reduce sequence length while maintaining spatial information: grouping each 2×2 block of a 4-channel grid yields T = 256 tokens, each a vector cₜ ∈ ℝ¹⁶; more generally cₜ ∈ ℝᵈ with d typically 4–16 (Fan et al., 2024). For text, continuous embedding spaces or learned latent codes support analogous constructions; for audio and motion, latent-space encoders produce frame- or window-wise continuous tokens (Li et al., 12 Feb 2026, Yang et al., 14 Jul 2025).
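The token-count arithmetic in this example can be checked with a minimal sketch. It assumes a 32×32 latent grid with 4 channels and 2×2 patch aggregation; the `patchify` helper and the placeholder latent values are illustrative and not taken from any cited implementation.

```python
def patchify(grid, patch=2):
    """Group each patch x patch block of an H x W x C latent grid into one token.

    grid is a nested list indexed as grid[row][col][channel].
    Returns tokens in raster order, each of length patch * patch * C.
    """
    height, width = len(grid), len(grid[0])
    assert height % patch == 0 and width % patch == 0
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for dr in range(patch):
                for dc in range(patch):
                    # Concatenate the channels of every location in the block.
                    token.extend(grid[top + dr][left + dc])
            tokens.append(token)
    return tokens


# Hypothetical stand-in for a tokenizer output: 32 x 32 locations, 4 channels.
latent = [[[float(r * 32 + c)] * 4 for c in range(32)] for r in range(32)]
tokens = patchify(latent, patch=2)
print(len(tokens), len(tokens[0]))  # 256 16
```

Without aggregation the same grid would give 1024 tokens of dimension 4, so 2×2 grouping trades a 4× shorter sequence for a 4× wider token.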

These continuous codes are projected via a learnable linear transformation into the model’s hidden space:

$$x_t = W_e\,c_t + b_e,\quad W_e \in \mathbb{R}^{d_{\text{model}} \times d},\; b_e \in \mathbb{R}^{d_{\text{model}}}$$

The resulting sequence of vectors serves as input to the autoregressive Transformer, preserving finer structure and information than discrete tokens.
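As a rough illustration of this projection, the sketch below applies x_t = W_e c_t + b_e to a single token using plain Python lists. The dimensions (d = 16, d_model = 64), the Gaussian initialization, and the helper names `make_embedding` and `embed` are assumptions chosen for the example.

```python
import random


def make_embedding(d, d_model, seed=0):
    """Create an illustrative weight matrix W_e (d_model x d) and zero bias b_e."""
    rng = random.Random(seed)
    scale = d ** -0.5  # assumed initialization scale, not from the source
    weight = [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(d_model)]
    bias = [0.0] * d_model
    return weight, bias


def embed(token, weight, bias):
    """Compute x_t = W_e c_t + b_e for one continuous token c_t."""
    return [sum(w * c for w, c in zip(row, token)) + b
            for row, b in zip(weight, bias)]


d, d_model = 16, 64
W_e, b_e = make_embedding(d, d_model)
c_t = [0.1 * i for i in range(d)]
x_t = embed(c_t, W_e, b_e)
print(len(x_t))  # 64
```

Unlike a discrete embedding table, there is no lookup by index: the same matrix is applied to every token, so nearby latent vectors map to nearby hidden states.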

2. Transformer Architecture Adaptations

The base model is typically a decoder-only Transformer but with important modifications to accommodate continuous tokens:

  • Input Embedding: All continuous tokens are linearly projected (sometimes with normalization) into the Transformer’s hidden space.
  • Attention Mechanisms: Variants include causal (raster order), flexible bidirectional (random order or masked), block-wise, or frequency/progressive ordering. The random order configuration uses bidirectional attention, allowing unmasked (observed) tokens to mutually attend and masked (to-be-predicted) tokens to attend only to the unmasked set at every iteration (Fan et al., 2024, Yu et al., 7 Mar 2025).
  • Cross-Modal Conditioners: For text-conditional tasks, cross-attention to language embeddings, often from frozen LLMs (e.g., T5-XXL in Fluid), is standard.
  • Efficiency and Locality: Innovations such as linear attention, depthwise convolution, gating mechanisms (KV gates), and block-wise context are incorporated for scalability and spatial locality, notably in LINA (Wang et al., 30 Jan 2026).

The architecture remains structurally similar to language transformers but with modality-specific heads (e.g., per-token diffusion or flow-matching MLPs) for continuous value prediction.
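The difference between the causal and random-order attention patterns described above can be sketched as boolean masks, where entry [q][k] is True when query position q may attend to key position k. This is a simplified sketch: the function names are hypothetical, and it assumes at least one position (for example, a conditioning token) is always observed so that no attention row is empty.

```python
def causal_mask(n):
    """Raster order: position q attends to itself and all earlier positions."""
    return [[k <= q for k in range(n)] for q in range(n)]


def random_order_mask(known):
    """Random order: every query attends only to the observed (unmasked) keys.

    known[k] is True when token k has already been generated or given.
    Observed tokens therefore attend to each other, and masked tokens
    attend only to the observed set, as described in the text.
    """
    n = len(known)
    return [[known[k] for k in range(n)] for _ in range(n)]


print(causal_mask(3))
# [[True, False, False], [True, True, False], [True, True, True]]
print(random_order_mask([True, False, True])[1])
# [True, False, True]
```

In the random-order case the mask depends only on which keys are observed, so it must be rebuilt at each generation iteration as more tokens become known.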

3. Training Objectives for Continuous Token Prediction

For continuous tokens, traditional categorical prediction is replaced by objectives suited for vector-valued targets:

  • Denoising Objectives: Diffusion-based token prediction is prevalent. At training, each continuous token is noised by a schedule (e.g., √α(t)xₜ + √(1−α(t))ε), and the model predicts the added noise, minimizing the L₂ loss:

$$\mathcal{L} = \mathbb{E}_{c,\,t,\,\epsilon} \left\| \epsilon_\theta(x_t^{\text{noised}}, t, \text{context}) - \epsilon \right\|_2^2$$

(Fan et al., 2024, Yu et al., 7 Mar 2025, Yang et al., 14 Jul 2025)

  • Flow-Matching: The model predicts the instantaneous “velocity” along the denoising trajectory of each token; this objective is used particularly in audio/motion and high-fidelity image generation (Li et al., 12 Feb 2026, Team et al., 14 Aug 2025).
  • Proper Scoring Rules: Energy score minimization (EAR), Hyvärinen score (diffusion), and log-likelihood (GIVT/flow-based) enable likelihood-free or implicit sampling with theoretical guarantees of distributional consistency (Shao et al., 12 May 2025).
  • Hybrid Losses: For multimodal or hybrid architectures (e.g., UniFluid, AGDC), both cross-entropy (for discrete sub-sequences) and per-token continuous losses are combined, often with a balancing coefficient.
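The denoising and flow-matching targets listed above can be sketched for a single token as follows. The noising rule matches the schedule quoted in the text, with `alpha` standing for α(t). The flow-matching part assumes a linear interpolation path between noise and data, which is one common convention and not necessarily the one used in the cited papers. All function names are illustrative.

```python
import math
import random


def noise_token(clean, alpha, rng):
    """Return (noised, eps) with noised = sqrt(alpha)*clean + sqrt(1-alpha)*eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in clean]
    a, b = math.sqrt(alpha), math.sqrt(1.0 - alpha)
    noised = [a * x + b * e for x, e in zip(clean, eps)]
    return noised, eps


def squared_error(pred, target):
    """Squared L2 distance between a prediction and its target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def flow_matching_pair(clean, s, rng):
    """Assumed linear path x_s = (1-s)*eps + s*clean; target velocity is clean - eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in clean]
    point = [(1.0 - s) * e + s * x for x, e in zip(clean, eps)]
    velocity = [x - e for x, e in zip(clean, eps)]
    return point, velocity


rng = random.Random(0)
clean = [0.5, -1.0, 2.0, 0.0]

# Noise-prediction target: a perfect predictor of eps has zero loss,
# while a predictor that always outputs zero pays ||eps||^2.
noised, eps = noise_token(clean, alpha=0.5, rng=rng)
print(squared_error(eps, eps))                  # 0.0
print(squared_error([0.0] * len(eps), eps) > 0)  # True

# Flow-matching target: the regression target is the velocity, not the noise.
point, velocity = flow_matching_pair(clean, s=0.3, rng=rng)
```

In a full model the prediction would come from a small per-token head conditioned on the Transformer output for that position; here only the construction of inputs and targets is shown.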

Regularization (e.g., weight decay, EMA) is employed to stabilize dense, continuous-space training. For domain-specific settings, auxiliary heads (exit or length regularizers) enable precise sequence termination, especially in variable-length generation (Shin et al., 9 Jan 2026).
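An exponential moving average (EMA) of the weights, as mentioned above, can be sketched in a few lines. The decay value of 0.999 and the flat-list parameter representation are assumptions made for illustration only.

```python
def ema_update(ema_params, params, decay=0.999):
    """Blend the running average toward the current parameters."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# Toy parameters: the EMA copy moves slowly toward the live weights.
ema = [1.0, 1.0]
live = [0.0, 2.0]
for _ in range(3):
    ema = ema_update(ema, live, decay=0.9)
print(ema)
```

The averaged copy is typically the one used for sampling, while the live weights continue to receive gradient updates.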

4. Generation Strategies and Ordering

Autoregressive factorization over continuous tokens can follow several orderings:

  • Raster/Causal Order: Standard left-to-right generation, where each token depends only on its predecessors. Limitation: previously generated tokens are irrevocable, leading to issues in spatial coherence for images.
  • Random/BERT-style Order: At each iteration, a random permutation determines the prediction order; masked tokens are predicted based on all available context, providing bidirectional context within each generation step. This bidirectionality allows revision and improved global consistency, especially for vision and multimodal generation (Fan et al., 2024, Fan et al., 17 Mar 2025).
  • Blockwise/Hierarchical: Tokens are grouped into blocks or predicted at coarse-to-fine (e.g., low- to high-frequency) granularity. FAR (Yu et al., 7 Mar 2025) employs frequency bands as the AR axis, and blockwise schemes improve computational parallelism without sacrificing causality (Zhang et al., 1 Jul 2025).
  • Fusion with Discrete Tokens: Hybrid approaches condition continuous generation on pre-selected discrete “modes” (e.g., object categories), as in D2C, DisCon, and AGDC, which ensures global structure is determined before fine-grained detail (Wang et al., 21 Mar 2025, Zheng et al., 2 Jul 2025, Shin et al., 9 Jan 2026).
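A random-order generation loop of the kind described above can be sketched as follows. The sketch assumes equal-sized groups per iteration (published systems may use other schedules) and uses a placeholder sampler that ignores its context; `random_order_schedule`, `generate`, and `toy_sampler` are hypothetical names.

```python
import random


def random_order_schedule(num_tokens, num_steps, rng):
    """Split a random permutation of positions into num_steps groups."""
    order = list(range(num_tokens))
    rng.shuffle(order)
    base, extra = divmod(num_tokens, num_steps)
    groups, start = [], 0
    for step in range(num_steps):
        size = base + (1 if step < extra else 0)
        groups.append(order[start:start + size])
        start += size
    return groups


def generate(num_tokens, num_steps, sample_tokens, seed=0):
    """Fill all positions, one group per iteration, conditioning on known tokens."""
    rng = random.Random(seed)
    tokens = [None] * num_tokens
    for group in random_order_schedule(num_tokens, num_steps, rng):
        known = {i: t for i, t in enumerate(tokens) if t is not None}
        for pos, value in zip(group, sample_tokens(known, group, rng)):
            tokens[pos] = value
    return tokens


def toy_sampler(known, positions, rng):
    """Placeholder for the model: draws 4-dim Gaussian vectors, ignoring context."""
    return [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in positions]


tokens = generate(num_tokens=16, num_steps=4, sample_tokens=toy_sampler)
print(len(tokens), len(tokens[0]))  # 16 4
```

Setting `num_steps` equal to `num_tokens` recovers one-token-at-a-time generation in a random order, while fewer steps predict several tokens in parallel per iteration.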

5. Empirical Performance and Comparative Analysis

Autoregressive Transformers with continuous tokens substantially close the gap to, and in some scenarios surpass, diffusion models and discrete AR systems in several metrics:

| Model | Modality | Params | Inference | FID↓ (ImageNet 256 unless noted) | GenEval (COCO) | Highlights |
|---|---|---|---|---|---|---|
| Fluid (rand, cont.) | Image | 10.5B | - | 6.16 (MS-COCO 30K, zero-shot) | 0.69 | State-of-the-art among AR T2I (Fan et al., 2024) |
| NextStep-1 | Image | 14B | - | ≈6.2 (COCO 30k) | 0.73 | Unified, strong editing (Team et al., 14 Aug 2025) |
| LINA-H | Image | 1.4B | - | 2.18 | 0.66 | Linear attention, efficient (Wang et al., 30 Jan 2026) |
| D2C-L (q-former) | Image | 633M | - | 3.14 | - | Outperforms discrete/cont. alone (Wang et al., 21 Mar 2025) |
| DisCon-L | Image | 558M | - | 1.38 (gFID) | - | Conditional continuous AR (Zheng et al., 2 Jul 2025) |
| EAR-H | Image | 937M | - | 1.97 | - | Single-pass, scoring rule AR (Shao et al., 12 May 2025) |
| LLaMo-3B | Motion | 3B | - | 22.5 (text→motion) | - | Streaming, no quantization artifact (Li et al., 12 Feb 2026) |
| ARDiT (B=4, cont.) | Audio | - | 170 ms/step | - | - | Near-perfect speech editing (Liu et al., 2024) |

Larger models and random-order generation further improve results, with validation loss scaling roughly as a power law with model size for continuous-token AR (Fan et al., 2024). For hybrid and fusion models, conditioning on discrete modes dramatically improves stability and sample quality (Zheng et al., 2 Jul 2025, Wang et al., 21 Mar 2025). In multimodal settings, unified models such as UniFluid and AGDC can simultaneously handle text, vision, and vector data, maintaining or improving performance on both discrete and continuous subspaces (Fan et al., 17 Mar 2025, Shin et al., 9 Jan 2026).
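To illustrate what a power-law relationship between validation loss and model size means in practice, the sketch below fits loss ≈ a · N^(−b) by least squares in log-log space. The data points are synthetic, generated from a made-up law with a = 5 and b = 0.1; they are not measurements from Fluid or any other paper, and `fit_power_law` is an illustrative helper.

```python
import math


def fit_power_law(sizes, losses):
    """Fit log(loss) = log(a) - b * log(size) by least squares; return (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# SYNTHETIC data from an assumed law loss = 5 * N^-0.1 (not real results).
sizes = [1e8, 3e8, 1e9, 3e9, 1e10]
losses = [5.0 * n ** -0.1 for n in sizes]
a, b = fit_power_law(sizes, losses)
print(round(a, 3), round(b, 3))  # 5.0 0.1
```

On real training runs the points would scatter around the fitted line, and the exponent b summarizes how quickly loss falls as parameters are added.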

6. Theoretical and Practical Challenges

Modeling continuous tokens poses density-estimation and out-of-distribution (OOD) challenges: the support is unbounded, and a naive AR chain over continuous space can result in mode collapse or artifact generation (Zheng et al., 2 Jul 2025, Wang et al., 21 Mar 2025). The architectural and objective innovations described in the preceding sections, such as conditioning continuous generation on discrete modes, target these issues.

7. Extensions, Limitations, and Future Directions

Autoregressive Transformers with continuous tokens are being extended beyond image generation to audio, motion, language, and hybrid discrete-continuous sequences.

Persisting challenges include stability of unbounded density modeling, efficient long-sequence generation, and further acceleration of diffusion/flow-based sampling in AR contexts. There is active research into learned orderings, blockwise generation, and integration with strong pre-trained LLM or visual backbones (Fan et al., 2024, Yu et al., 7 Mar 2025, Fan et al., 17 Mar 2025).


References:

  • "Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens" (Fan et al., 2024)
  • "NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale" (Team et al., 14 Aug 2025)
  • "LINA: Linear Autoregressive Image Generative Models with Continuous Tokens" (Wang et al., 30 Jan 2026)
  • "Continuous Visual Autoregressive Generation via Score Maximization" (Shao et al., 12 May 2025)
  • "Frequency Autoregressive Image Generation with Continuous Tokens" (Yu et al., 7 Mar 2025)
  • "Rethinking Discrete Tokens: Treating Them as Conditions for Continuous Autoregressive Image Synthesis" (Zheng et al., 2 Jul 2025)
  • "D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens" (Wang et al., 21 Mar 2025)
  • "Autoregressive Diffusion Transformer for Text-to-Speech Synthesis" (Liu et al., 2024)
  • "Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction" (Yang et al., 14 Jul 2025)
  • "LLaMo: Scaling Pretrained LLMs for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens" (Li et al., 12 Feb 2026)
  • "Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics" (Naparstek, 8 Jan 2026)
  • "Unified Autoregressive Visual Generation and Understanding with Continuous Tokens" (Fan et al., 17 Mar 2025)
  • "AGDC: Autoregressive Generation of Variable-Length Sequences with Joint Discrete and Continuous Spaces" (Shin et al., 9 Jan 2026)
  • "Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows" (Zhang et al., 1 Jul 2025)
  • "Mixture of Tokens: Continuous MoE through Cross-Example Aggregation" (Antoniak et al., 2023)