Continuous-Token Autoregressive Transformers

Updated 16 October 2025
  • Continuous-token autoregressive Transformers are models that generalize discrete next-token prediction to high-dimensional domains, enhancing fidelity and scalability.
  • They leverage techniques like diffusion loss, flow matching, and direct regression to accurately model fine-grained details in modalities such as vision, audio, and robotics.
  • Practical implementations demonstrate improved quality and efficiency, enabling advanced applications in image generation, time series forecasting, and multimodal learning.

Autoregressive Transformers with continuous tokens are a generalization of the standard Transformer architecture, extending its autoregressive next-token prediction paradigm—historically defined for discrete token spaces—into high-dimensional, continuous domains. This transition supports finer-grained modeling and avoids fundamental limitations of discretization, which is particularly relevant for applications in vision, audio, robotics, and time series data.

1. Fundamental Concepts and Rationale

Traditional autoregressive Transformers operate on sequences of discrete tokens, modeling the joint distribution $p(x_1, \dots, x_n)$ in a factorized manner, typically as $p(x) = \prod_{i=1}^{n} p(x_i \mid x_{<i})$, where each $x_i$ is a categorical variable. However, many modalities—images, audio, time series, actions—are naturally continuous and lose fidelity when quantized. Continuous token modeling overcomes quantization-induced information loss and improves representation capacity. Methods for modeling continuous tokens include per-token diffusion losses, flow matching, and direct regression heads.
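
As a minimal, schematic illustration (not any specific paper's implementation), the sketch below keeps a standard causal Transformer backbone but swaps the categorical output layer for a continuous head trained with a regression loss; diffusion or flow-matching heads would replace only the loss term while the factorization over positions stays the same. All sizes and names (`d_model`, `token_dim`, etc.) are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch: a causal Transformer whose output head predicts the next
# *continuous* token (a real-valued vector) rather than logits over a discrete
# vocabulary. All hyperparameters below are illustrative placeholders.
d_model, token_dim, n_layers, n_heads, seq_len = 256, 16, 4, 4, 32

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
    num_layers=n_layers,
)
in_proj = nn.Linear(token_dim, d_model)   # continuous tokens -> model width
out_head = nn.Linear(d_model, token_dim)  # regression head; a diffusion or
                                          # flow-matching head could sit here instead

tokens = torch.randn(8, seq_len, token_dim)  # e.g. continuous latents from an encoder
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

h = backbone(in_proj(tokens), mask=causal_mask)
pred = out_head(h[:, :-1])                          # predict token i+1 from tokens <= i
loss = nn.functional.mse_loss(pred, tokens[:, 1:])  # replaces categorical cross-entropy
loss.backward()
```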

2. Model Architectures and Continuous Tokenization

The architectural shift centers on three components:

A. Tokenization: continuous latent tokens produced by a learned encoder (or hybrid tokenizers, as in HART) replace discrete codebook indices, removing quantization loss at the representation stage.

B. Autoregressive Modeling: the Transformer backbone predicts the next continuous token conditioned on previously generated ones, using causal, random-order, masked, or patch-based prediction schemes while retaining standard machinery such as causal attention and KV caching.

C. Output Heads and Losses: since cross-entropy over a fixed vocabulary no longer applies, per-token conditionals are modeled with diffusion heads, flow-matching heads, or direct regression objectives.
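
One common instantiation of C is a small per-token denoising head conditioned on the Transformer's hidden state, in the spirit of diffusion-loss approaches. The sketch below is a simplified illustration under assumed shapes and a simple linear noising schedule, not the exact head used in any of the cited papers.

```python
import torch
import torch.nn as nn

# Sketch of a per-token diffusion-style output head: instead of regressing the
# next token directly, a small MLP is trained to denoise a noised copy of the
# ground-truth token, conditioned on the Transformer hidden state h_i.
d_model, token_dim = 256, 16

denoise_head = nn.Sequential(
    nn.Linear(token_dim + d_model + 1, 512),  # noisy token + condition + noise level
    nn.SiLU(),
    nn.Linear(512, token_dim),                # predicts the added noise
)

def diffusion_loss(h, x_next):
    """h: (B, d_model) hidden state; x_next: (B, token_dim) ground-truth next token."""
    b = x_next.shape[0]
    t = torch.rand(b, 1)                          # continuous noise level in [0, 1]
    noise = torch.randn_like(x_next)
    x_noisy = (1 - t) * x_next + t * noise        # simple linear noising for illustration
    eps_hat = denoise_head(torch.cat([x_noisy, h, t], dim=-1))
    return nn.functional.mse_loss(eps_hat, noise) # epsilon-prediction objective

h = torch.randn(8, d_model)          # hidden state from the AR backbone at position i
x_next = torch.randn(8, token_dim)   # continuous ground-truth token at position i+1
loss = diffusion_loss(h, x_next)
loss.backward()
```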

3. Domain-Specific Implementations

Continuous token autoregressive modeling has seen concrete deployment across:

| Domain | Method Highlights | Key Metrics |
|---|---|---|
| Vision | Random-order AR with continuous tokens (Fluid; Fan et al., 17 Oct 2024); frequency-progressive AR (FAR; Yu et al., 7 Mar 2025); hybrid tokenization (HART; Tang et al., 14 Oct 2024); flow matching (NextStep-1; Team et al., 14 Aug 2025) | FID, GenEval, PSNR |
| Audio | AudioNTP and AudioMNTP: continuous-token AR with token-wise diffusion and masked next-token tasks (Yang et al., 14 Jul 2025) | FAD, FD, KL, CLAP, IS |
| Speech | DiTAR: patch-based AR combining an LM with a diffusion transformer (Jia et al., 6 Feb 2025) | Speaker similarity, TTS metrics |
| Video | VideoMAR: AR with continuous tokens, next-frame diffusion loss, KV caching (Yu et al., 17 Jun 2025); GPDiT: AR diffusion transformer with rotation-based time conditioning (Zhang et al., 12 May 2025) | VBench-I2V; throughput, diversity |
| Language | TarFlowLM: AR normalizing flows over sentence-level continuous latent codes (Zhang et al., 1 Jul 2025); SONAR-LLM: AR over continuous sentence embeddings with cross-entropy supervision (Dragunov et al., 7 Aug 2025) | BPC, perplexity, NLG metrics |
| Time series | Minimal adaptations: linear mapping of continuous tokens and expanded positional encoding (Kämäräinen, 12 Mar 2025); see the sketch after this table | Forecasting accuracy |
| Robotics | FreqPolicy: AR in DCT frequency space with continuous tokens, hierarchical generation (Zhong et al., 2 Jun 2025) | Task success, efficiency |
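
As an illustration of the time-series row above, the following sketch shows the minimal-adaptation idea: continuous observations are mapped linearly into token embeddings, positional encodings cover the full context length, and the output head regresses the next values. The patching and layer sizes are assumptions made for the sketch, not the cited paper's exact configuration.

```python
import torch
import torch.nn as nn

# Sketch of "minimal adaptation" for time series: a linear input mapping from
# continuous values to token embeddings, an expanded positional encoding, a
# causal Transformer, and a regression head for the next values.
d_model, patch_len, max_len = 128, 8, 1024

embed = nn.Linear(patch_len, d_model)            # continuous patch -> token embedding
pos = nn.Embedding(max_len, d_model)             # expanded positional encoding
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
head = nn.Linear(d_model, patch_len)             # regress the next patch of values

series = torch.randn(4, 64, patch_len)           # (batch, num_patches, patch_len)
positions = torch.arange(series.shape[1])
x = embed(series) + pos(positions)
mask = nn.Transformer.generate_square_subsequent_mask(series.shape[1])
h = decoder(x, mask=mask)
forecast = head(h[:, -1])                        # one-step-ahead prediction of the next patch
```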

4. Optimization and Efficiency Techniques

Several efficiency-oriented developments have been introduced:

  • Shortcut heads and flow matching: FAR achieves up to 2.3× faster inference than diffusion-based AR (MAR) at comparable FID (Hang et al., 24 Apr 2025); a schematic few-step flow sampler is sketched after this list.
  • Hierarchical frequency progression: FAR for vision (Yu et al., 7 Mar 2025) and FreqPolicy for robotics (Zhong et al., 2 Jun 2025) autoregressively build solutions in the frequency domain, stabilizing low-frequency structure before refining detail.
  • Temporal and spatial curriculum: VideoMAR employs short-to-long training and progressive resolution to manage long video sequences efficiently (Yu et al., 17 Jun 2025).
  • Parallel generation: Within frames, VideoMAR uses bidirectional attention for parallel masked prediction; spatial/temporal extrapolation is afforded by 3D relative position embeddings (Yu et al., 17 Jun 2025).
  • Token mixing and transition tuning: Mixture-of-Tokens (MoT) models aggregate token mixtures across examples for continuous, scalable MoE, with a softmax temperature supporting privacy and interpretable routing (Antoniak et al., 2023).
  • Cache management: Efficient key-value (KV) caching is retained, especially for causal decoding, enhancing throughput and memory use (Tang et al., 14 Oct 2024, Team et al., 14 Aug 2025).
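
To make the shortcut/flow-matching bullet above concrete, the following is a generic few-step sampler for a flow-matching per-token head: the learned velocity field is integrated with a handful of Euler steps instead of a long denoising chain. This is a schematic sketch under assumed names and sizes, not FAR's exact shortcut formulation.

```python
import torch
import torch.nn as nn

# Few-step sampling from a flow-matching per-token head. Training would regress
# this head toward the straight-line velocity (x1 - x0) along the interpolant
# x_t = (1 - t) * x0 + t * x1 (conditional flow matching); sampling then only
# needs a handful of Euler steps, which is where the inference savings come from.
d_model, token_dim = 256, 16

velocity_head = nn.Sequential(            # v(x_t, t | h): predicts the flow velocity
    nn.Linear(token_dim + d_model + 1, 512),
    nn.SiLU(),
    nn.Linear(512, token_dim),
)

@torch.no_grad()
def sample_token(h, num_steps=4):
    """Generate one continuous token conditioned on hidden state h of shape (B, d_model)."""
    x = torch.randn(h.shape[0], token_dim)            # start from noise at t = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((h.shape[0], 1), i * dt)
        v = velocity_head(torch.cat([x, h, t], dim=-1))
        x = x + dt * v                                # Euler step toward the data at t = 1
    return x

next_token = sample_token(torch.randn(8, d_model))    # (8, token_dim)
```
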
5. Scaling Behavior and Comparative Results

  • Scaling in vision: As model size increases, continuous-token models (such as Fluid) maintain or improve FID and GenEval, while discrete-token counterparts saturate or degrade due to quantization bottlenecks (Fan et al., 17 Oct 2024).
  • Trade-offs: Unified models (UniFluid; Fan et al., 17 Mar 2025) show that the loss-balance hyperparameter λ affects both image generation (FID, GenEval) and understanding (CIDEr, QA scores); careful selection of λ enables competitive multitask performance.
  • Comparative quality: DisCon (Zheng et al., 2 Jul 2025) and TokenBridge (Wang et al., 20 Mar 2025) show that continuous tokens—either directly or as post-quantized discrete proxies—yield reconstruction and generation quality on par with or superior to prior discrete AR approaches. DisCon achieves gFID 1.38 on ImageNet 256×256; TokenBridge achieves similar results with nearly 6× speedup in token prediction.

6. Extensions and Open Research Directions

  • Hybrid paradigms: Models like HART (Tang et al., 14 Oct 2024) and DisCon (Zheng et al., 2 Jul 2025) combine discrete- and continuous-token AR, using discrete tokens as high-level priors or conditioning signals to stabilize dense generation.
  • Normalizing flows for flexible language modeling: TarFlowLM (Zhang et al., 1 Jul 2025) demonstrates exact invertible modeling for continuous latent texts, enabling bi-directional context and multi-pass hierarchical editing.
  • Multimodal architectures: NextStep-1 (Team et al., 14 Aug 2025), UniFluid (Fan et al., 17 Mar 2025), and GPDiT (Zhang et al., 12 May 2025) extend continuous-token AR to joint text-image generation, video generation, image editing, and question answering.
  • Efficient continuous token discretization: Post-training quantization, as in TokenBridge, may inspire future architectures that combine the generation efficiency of categorical losses with the expressivity of continuous latent spaces; a toy quantization sketch follows this list.
  • Future refinement: Potential exists for improving residual token modeling (e.g., lighter-weight diffusion or shortcut heads), alternate positional embeddings, and scalable patch-wise prediction.
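
As a toy illustration of the post-training discretization direction (not TokenBridge's actual scheme), the snippet below quantizes each latent channel independently onto a uniform grid, producing discrete indices that a categorical AR head could predict while keeping the de-quantized values close to the original continuous latents. The bin count and value range are arbitrary assumptions.

```python
import torch

# Toy post-training discretization of continuous tokens: each latent dimension
# is quantized independently onto a fixed uniform grid, giving discrete indices
# for a categorical AR head; de-quantization maps indices back to bin centers.
num_bins, lo, hi = 64, -3.0, 3.0
edges = torch.linspace(lo, hi, num_bins + 1)
centers = 0.5 * (edges[:-1] + edges[1:])

def quantize(x):
    """Map continuous latents to per-dimension bin indices."""
    idx = torch.bucketize(x.clamp(lo, hi), edges) - 1
    return idx.clamp(0, num_bins - 1)

def dequantize(idx):
    """Recover approximate continuous values from bin indices."""
    return centers[idx]

latents = torch.randn(8, 32, 16)   # (batch, tokens, channels), e.g. encoder latents
indices = quantize(latents)        # discrete targets for a categorical AR head
recon = dequantize(indices)        # approximation used at generation time
print((latents.clamp(lo, hi) - recon).abs().max())  # at most half a bin width
```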

7. Limitations and Controversies

  • Computational overhead: Early continuous-token AR models such as MAR (Hang et al., 24 Apr 2025) and lookahead attention (Du et al., 2023) incur significant inference cost due to iterative denoising or bidirectional attention. Shortcut heads and efficient discretization have mitigated but not fully resolved these costs.
  • Robustness/stability: Continuous distributions impose density estimation challenges and risk out-of-distribution artifacts. DisCon (Zheng et al., 2 Jul 2025) circumvents this via discrete conditioning, and post-training quantization (TokenBridge) avoids codebook instability.
  • Interpretability: Continuous token outputs can be less interpretable compared to discrete symbol sequences; hybrid or conditional paradigms partially address this.
  • Multitask trade-offs: Simultaneously optimizing for generation and understanding may reduce performance in either task if loss balance is not tuned correctly (Fan et al., 17 Mar 2025).

Autoregressive Transformers with continuous tokens encompass a rapidly expanding family of models that generalize AR generation and understanding across modalities. By moving beyond quantization, these models achieve gains in fidelity, scaling, and flexibility, while ongoing research addresses computational efficiency and stability in high-dimensional continuous spaces. This paradigm shift is evidenced across image, audio, video, language, and robotics domains, with a variety of architectures and optimization strategies now demonstrating state-of-the-art performance.
