Continuous Image Tokens
- Continuous image tokens are real-valued vector embeddings derived from images that capture detailed visual information without quantization error.
- They power transformer-based models for high-fidelity image synthesis, editing, and multimodal integration, often with substantial gains in efficiency.
- Tokenization techniques like SoftVQ-VAE employ differentiable soft quantization to balance detailed representation with efficient autoregressive generation.
Continuous image tokens are real-valued vector representations derived from images, functioning as the atomic units for generative and recognition models based on transformers or similar architectures. Unlike discrete tokens, which emerge from hard quantization methods (e.g., vector quantization) mapping pixels to finite codebook indices, continuous tokens preserve the full representational granularity of the underlying visual signal, enabling smoother interpolation, richer detail preservation, and more effective cross-modal modeling. They have emerged as a central abstraction for state-of-the-art (SOTA) image generation, understanding, and editing systems, supporting both unimodal and unified multimodal frameworks.
1. Definition and Rationale for Continuous Image Tokens
Continuous image tokens refer to vector-valued embeddings (often of fixed dimension, e.g., in ℝ^d) produced by encoding images—typically as patch embeddings via learnable projection, VAE latents, or soft quantization—without hard assignment to a discrete, finite set of categories. In contrast with discrete tokens, continuous tokens:
- Avoid quantization error and information loss
- Support fine-grained and localized editing or manipulation
- Enable fully differentiable training pipelines, crucial for end-to-end optimization
Continuous tokens can be obtained via approaches such as the following (a minimal patch-embedding sketch appears after this list):
- Patch-wise linear projections in Vision Transformers (ViT) (Yi et al., 3 Jun 2025)
- Latent representations from VAEs with Gaussian priors (Chen et al., 14 Dec 2024, Fan et al., 17 Mar 2025)
- Soft vector quantization or “soft assignment” to codebooks (Chen et al., 14 Dec 2024)
- Tokenization schemes that produce multi-channel, spatially organized vector maps subsequently flattened and sequenced for transformer consumption (Team et al., 14 Aug 2025)
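As a concrete illustration of the first route above, a ViT-style patch embedding maps an image to a sequence of continuous tokens with a single learnable projection. The following is a minimal PyTorch sketch; patch size, channel count, and embedding dimension are illustrative choices, not values from any cited paper:

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Patch-wise linear projection: image -> sequence of continuous tokens."""
    def __init__(self, patch=16, in_ch=3, dim=768):
        super().__init__()
        # A stride-`patch` convolution is equivalent to flattening each
        # non-overlapping patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                    # x: (B, 3, H, W)
        z = self.proj(x)                     # (B, dim, H/patch, W/patch)
        return z.flatten(2).transpose(1, 2)  # (B, N, dim): N continuous tokens

tokens = PatchTokenizer()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 256, 768])
```

Each token is an unconstrained real vector: no codebook lookup occurs, so no quantization error is introduced and gradients flow through the projection.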
This design is motivated by the limitations of discrete tokens: bounded representational capacity, extensive vocabulary requirements for adequate reconstruction, and increased modeling rigidity. Continuous tokens address these issues by operating in a high-dimensional, unbounded space, preserving subtle image statistics and allowing for nuanced generative modeling.
2. Transformer-based Architectures and Generation Paradigms
Continuous image tokens are typically processed within transformer-based architectures for both synthesis and understanding tasks. Two major autoregressive paradigms are found in recent literature:
- Next-token prediction: Sequentially modeling tokens in either raster order (left-to-right, top-to-bottom) or in random order, where bidirectional attention (BERT-like) enables better global context modeling (Fan et al., 17 Oct 2024, Team et al., 14 Aug 2025).
- Stage-wise or hierarchical generation: Progressive refinement from coarse, low-resolution continuous token maps to fine, high-resolution ones, often using conditional flow-based or diffusion modeling at each stage (Yuan et al., 18 Dec 2024, Yu et al., 7 Mar 2025).
The synthesis process for transformer-based generators with continuous tokens usually involves the following steps (a toy end-to-end sketch follows the list):
- Projecting an image to a sequence or grid of continuous tokens.
- Decoding or generating these tokens via an autoregressive transformer, conditioned on text or other modalities if required.
- Optionally using a flow-matching or diffusion head to denoise the token representation back to the pixel/image space (Team et al., 14 Aug 2025, Yuan et al., 18 Dec 2024).
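The following toy sketch puts these steps together: a transformer backbone conditions a small denoising head that iteratively refines a noise vector into the next continuous token. The Euler-style loop stands in for the flow-matching or diffusion heads of the cited systems; all module sizes, step counts, and the update rule are illustrative assumptions, and the model is untrained:

```python
import torch
import torch.nn as nn

dim, steps = 16, 8  # illustrative token dimension and denoising step count
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2)
# Head maps (noisy token, context, timestep) -> an update direction.
head = nn.Sequential(nn.Linear(2 * dim + 1, dim), nn.GELU(), nn.Linear(dim, dim))

@torch.no_grad()
def sample_next(prefix):                      # prefix: (B, T, dim)
    h = backbone(prefix)[:, -1]               # context from the last position
    x = torch.randn(prefix.size(0), dim)      # start the new token from noise
    for t in range(steps):                    # toy Euler-style denoising loop
        tau = torch.full((x.size(0), 1), t / steps)
        x = x - head(torch.cat([x, h, tau], dim=-1)) / steps
    return x

seq = torch.randn(1, 1, dim)                  # stand-in for a start token
for _ in range(4):                            # generate 4 continuous tokens
    seq = torch.cat([seq, sample_next(seq).unsqueeze(1)], dim=1)
print(seq.shape)  # torch.Size([1, 5, 16])
```

Note that the per-token sampling loop replaces the softmax-over-vocabulary step of discrete autoregression: each "prediction" is a short iterative refinement in continuous space rather than a categorical draw.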
Self- and cross-attention mechanisms enable independent or content-aware modulation at the token level, supporting localized and content-driven control, a property exploited in style transfer and editing applications (Zeng et al., 2021).
3. Tokenization Techniques and Differentiability
Continuous tokenization can be framed as a soft, differentiable alternative to classic vector quantization schemes:
- SoftVQ-VAE (Chen et al., 14 Dec 2024) constructs a softmax over codebook distances, producing a weighted sum of multiple codewords (soft-categorical posterior). This results in high-capacity, fully differentiable latent spaces and supports much greater compression (e.g., 32 tokens for 256×256 images); a minimal sketch of this soft assignment follows the list.
- Dimension-wise or post-training quantization (Wang et al., 20 Mar 2025) decouples VAE training from discretization, discretizing each channel post hoc and facilitating efficient, channelwise autoregressive modeling.
- Space-to-depth rearrangement and normalization (Team et al., 14 Aug 2025) transforms multi-channel VAE latents into sequential streams of continuous tokens, stabilized with channel normalization and stochastic noise injection to regularize the latent space (sketched after the table below).
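A minimal sketch of the soft assignment at the core of the first bullet: distances to all codewords are turned into softmax weights, and the "quantized" output is the resulting convex combination, so gradients reach both the encoder and the codebook. Temperature and shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def soft_vq(z, codebook, temperature=1.0):
    # z: (B, N, D) continuous encoder outputs; codebook: (K, D)
    d = torch.cdist(z, codebook.expand(z.size(0), -1, -1))  # (B, N, K) distances
    w = F.softmax(-d / temperature, dim=-1)  # soft assignment over codewords
    return w @ codebook                      # (B, N, D) convex combination

z = torch.randn(2, 32, 8, requires_grad=True)
codebook = torch.randn(64, 8)
q = soft_vq(z, codebook)
q.sum().backward()  # gradients flow end to end: no straight-through estimator
```

As the temperature approaches zero the weights approach one-hot vectors and the scheme recovers hard vector quantization, which makes explicit how soft VQ interpolates between continuous and discrete tokenization.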
The differentiability of these schemes is critical for aligning generated latents with semantic pre-trained features and for end-to-end tasks where gradients must propagate through the encoding and decoding processes.
| Tokenization Scheme | Key Properties | Example Paper |
|---|---|---|
| Hard VQ / VQ-VAE | Discrete, non-differentiable | (Guo et al., 20 Mar 2025) |
| Soft VQ (SoftVQ-VAE) | Continuous, differentiable soft assignment | (Chen et al., 14 Dec 2024) |
| Continuous VAE | Fully continuous, KL-regularized | (Fan et al., 17 Mar 2025; Team et al., 14 Aug 2025) |
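For the VAE-latent route in the last table row, a minimal sketch of space-to-depth tokenization: each block×block neighborhood of the latent map is folded into the channel dimension, the result is flattened to a token sequence, standardized, and perturbed with noise. The block size, the exact normalization (here, per-channel standardization across the sequence) and the noise scale are illustrative assumptions, not the recipe of any cited system:

```python
import torch

def latents_to_tokens(lat, block=2, noise_std=0.1):
    B, C, H, W = lat.shape
    # Space-to-depth: fold each block x block neighborhood into the channels.
    lat = lat.reshape(B, C, H // block, block, W // block, block)
    lat = lat.permute(0, 2, 4, 1, 3, 5).reshape(B, -1, C * block * block)
    # Per-channel standardization across the token sequence (an assumption).
    lat = (lat - lat.mean(1, keepdim=True)) / (lat.std(1, keepdim=True) + 1e-6)
    return lat + noise_std * torch.randn_like(lat)  # regularizing noise

tokens = latents_to_tokens(torch.randn(1, 16, 32, 32))
print(tokens.shape)  # torch.Size([1, 256, 64])
```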
4. Hybrid and Bridged Representations
Recent frameworks leverage hybrid strategies to exploit the structure of both continuous and discrete tokens:
- Discrete-to-Continuous: First predict discrete tokens (for structure) and condition continuous token prediction (for details) on these, simplifying complex density estimation (Zheng et al., 2 Jul 2025, Wang et al., 21 Mar 2025). The discrete tokens act as stable global priors, while the continuous tokens refine local detail.
- Bridging Methods: TokenBridge (Wang et al., 20 Mar 2025) first trains a VAE on continuous latents and subsequently discretizes these dimension by dimension, allowing a standard categorical loss with minimal quality loss (a per-channel sketch follows below). D2C (Wang et al., 21 Mar 2025) generates discrete tokens followed by continuous ones, enabling high fidelity and efficient modeling.
These strategies reflect a growing consensus that balancing global guidance (via discrete priors) with detailed refinement (via continuous tokens) yields SOTA fidelity, robustness, and modeling efficiency.
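A minimal sketch of the dimension-wise bridging idea: each channel of an already-trained continuous latent is independently snapped to a uniform grid after the fact, yielding discrete indices for categorical modeling alongside a close continuous reconstruction. The bin count and clipping range are illustrative assumptions, not TokenBridge's actual settings:

```python
import torch

def channelwise_quantize(z, num_bins=256, lo=-3.0, hi=3.0):
    # z: (B, N, C) continuous latents; quantize every channel independently.
    idx = ((z.clamp(lo, hi) - lo) / (hi - lo) * (num_bins - 1)).round().long()
    centers = torch.linspace(lo, hi, num_bins)  # shared uniform bin centers
    return idx, centers[idx]  # discrete indices + dequantized continuous values

z = torch.randn(1, 32, 16)
idx, z_hat = channelwise_quantize(z)
print((z.clamp(-3, 3) - z_hat).abs().max())  # error bounded by half a bin width
```

Because quantization happens per dimension rather than over joint codebook entries, the effective vocabulary stays small (num_bins per channel) while the reconstruction error remains bounded by half a bin width.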
5. Performance, Efficiency, and Comparative Evaluation
Multiple benchmarks have established that continuous image tokens yield substantial advantages in generative modeling:
- Significantly improved image fidelity and visual quality (lower FID, higher Inception Score) compared to discrete-token autoregressive methods, as evidenced by Fluid’s zero-shot FID of 6.16 on MS-COCO (Fan et al., 17 Oct 2024) and NextStep-1’s SOTA GenEval scores (Team et al., 14 Aug 2025).
- High compression without significant loss of fidelity, as demonstrated by SoftVQ-VAE's use of as few as 32 tokens for 256×256 images while achieving an FID of 1.78 (Chen et al., 14 Dec 2024).
- Major inference and training throughput improvements: up to 18× (256² images) and 55× (512² images) speedups over conventional discrete models (Chen et al., 14 Dec 2024).
- Enhanced sample efficiency, reduction in required model size and computational cost via multistage or hierarchical generation (Yuan et al., 18 Dec 2024).
- Unification of text and image tokens within a single transformer, supporting efficient multimodal (text+image) tasks and robust cross-representational learning (Team et al., 14 Aug 2025, Fan et al., 17 Mar 2025, Zheng et al., 2 Jul 2025).
Recent models have also demonstrated robust image editing, interpolation, and inversion capabilities enabled by the continuous, traversable latent space (Team et al., 14 Aug 2025).
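The traversability claim has a direct operational meaning: any convex combination of two token sequences is itself a valid point in token space, whereas discrete codebook indices admit no analogous smooth path. A minimal sketch (shapes illustrative):

```python
import torch

def lerp_tokens(z0, z1, alpha):
    # Linear interpolation is well defined in a continuous token space.
    return (1 - alpha) * z0 + alpha * z1

z0, z1 = torch.randn(1, 32, 16), torch.randn(1, 32, 16)
frames = [lerp_tokens(z0, z1, a) for a in torch.linspace(0, 1, 5)]
print(len(frames), frames[0].shape)  # 5 torch.Size([1, 32, 16])
```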
6. Challenges, Limitations, and Open Directions
While continuous tokens improve detail and smoothness, they pose unique challenges:
- Density estimation: The unbounded, high-dimensional continuous token space complicates probabilistic modeling, increasing risk of out-of-distribution artifacts (Zheng et al., 2 Jul 2025).
- Optimization: Direct continuous next-token prediction can be unstable. Conditioning on discrete priors or “bridging” with post-hoc discretization mitigates these issues and improves optimization stability (Wang et al., 21 Mar 2025, Wang et al., 20 Mar 2025, Zheng et al., 2 Jul 2025).
- Hybrid modeling trade-offs: Hybrid or two-stage models yield improved results but increase pipeline complexity and require careful architectural and loss-balancing choices, especially for unified multimodal models (Fan et al., 17 Mar 2025, Zheng et al., 2 Jul 2025).
- Generalization and transfer: In cross-domain settings, continuity of tokens supports large spatial pattern learning, but such patterns can hinder domain transfer; manipulating continuity may enable better few-shot or domain-robust learning (Yi et al., 3 Jun 2025).
- Efficient scaling: The quadratic complexity of transformers with respect to token count has motivated aggressive compression and multistage or frequency-based sequential generation (Yuan et al., 18 Dec 2024, Yu et al., 7 Mar 2025); a back-of-envelope cost comparison follows this list.
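As a back-of-envelope illustration of why token count dominates cost: self-attention scales as O(N²·d) in the sequence length N, so compressing a 256-token patch grid (16×16 patches of a 256×256 image) down to 32 tokens, as in SoftVQ-VAE, cuts attention FLOPs by roughly

$$\frac{\mathrm{cost}(N{=}256)}{\mathrm{cost}(N{=}32)} = \left(\frac{256}{32}\right)^{2} = 64.$$

This attention-only ratio is an upper bound consistent with, though larger than, the reported 18×–55× end-to-end speedups, since feed-forward and decoding costs fall only linearly in N.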
Areas for advancement include adaptive or hierarchical continuous tokenization, aligning continuous latents to powerful semantic pre-trained representations, exploring improved bridging/fusion modules, and extending frameworks robustly to higher dimensions (video, 3D), as in VideoMAR (Yu et al., 17 Jun 2025).
7. Applications and Broader Implications
Continuous image tokens have broad utility:
- High-fidelity image synthesis and editing, with regionally controlled style transfer, localized manipulation, and smooth interpolation (Zeng et al., 2021, Team et al., 14 Aug 2025).
- Unified generative modeling, enabling autoregressive models to leverage a single backbone for text and image synthesis/understanding; critical for foundation models (Fan et al., 17 Mar 2025, Team et al., 14 Aug 2025).
- Efficient large-scale multimodal systems and vision-LLMs, particularly where memory and compute constraints make aggressive latent compression indispensable (Chen et al., 14 Dec 2024, Pippi et al., 6 Mar 2025).
- Cross-domain adaptation and generalization, via manipulation of token continuity to favor transferable in-patch features over domain-specific large-scale spatial patterns (Yi et al., 3 Jun 2025).
- Video generation and multimodal fusion, as continuous tokens naturally support 3D and temporal extension with efficient decoding (Yu et al., 17 Jun 2025).
The move from discrete to continuous token representations marks a substantial shift in both the design of generative models and the practical scalability, generality, and controllability of vision and multimodal systems, with leading approaches demonstrating SOTA performance across fidelity, efficiency, and multimodal integration benchmarks.