Continuous Visual Tokens Overview
- Continuous visual tokens are high-dimensional, real-valued latent vectors that capture both local and global image features without discrete quantization.
- They are integrated into autoregressive and multimodal transformer architectures to facilitate high-fidelity visual generation and unified vision–language reasoning.
- Key modeling strategies include Gaussian likelihoods, diffusion-based objectives, and hybrid quantization to enhance reconstruction fidelity and training efficiency.
Continuous visual tokens are real-valued, high-dimensional latent vectors that represent local or global regions of an image, replacing or augmenting traditional discrete token representations for autoregressive visual modeling. Unlike discrete tokens, which quantize each image region to the nearest codebook entry, continuous tokens encode more information per patch, preserve fine visual details, and enable direct end-to-end differentiability. They have emerged as a central representation for high-fidelity visual generation, unified vision-language modeling, vision-language reasoning, and cross-domain few-shot transfer, with recent work establishing their advantages and trade-offs against discrete approaches.
1. Mathematical Definition and Extraction
Continuous visual tokens are typically extracted by encoding an input image with a neural encoder (often the encoder of a variational autoencoder, VAE), yielding a grid or sequence of real-valued vectors $Z = \{z_{i,j} \in \mathbb{R}^{d} : 1 \le i \le h,\ 1 \le j \le w\}$, where $h \times w$ is the spatially downsampled resolution and $d$ the latent channel dimension; for example, a $16 \times 16$ grid with $d = 16$ for 256×256 images (Wang et al., 20 Mar 2025). Each $z_{i,j}$ is a continuous visual token that maintains local appearance information with minimal information loss, owing to the absence of codebook quantization. The continuous latent features can be extracted by Stable Diffusion VAEs, pixel-shuffle groupings, or specialized patch-based encoders (Fan et al., 17 Oct 2024, Team et al., 14 Aug 2025, Huang et al., 8 Oct 2025).
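As a concrete illustration, the following minimal PyTorch sketch encodes an image into such a grid of continuous tokens using a toy convolutional encoder in place of a pretrained VAE encoder; all module names, channel widths, and the 16×16 output grid are illustrative assumptions rather than the tokenizer of any cited work.

```python
# Minimal sketch (assumed toy encoder, not a pretrained VAE): map an image to a
# grid of continuous visual tokens and flatten it into a token sequence.
import torch
import torch.nn as nn

class ToyContinuousTokenizer(nn.Module):
    def __init__(self, latent_dim: int = 16, downsample: int = 16):
        super().__init__()
        # Stack of stride-2 convolutions; log2(downsample) stages halve the resolution each time.
        stages, ch = [], 3
        for _ in range(downsample.bit_length() - 1):
            stages += [nn.Conv2d(ch, 64, 3, stride=2, padding=1), nn.SiLU()]
            ch = 64
        self.backbone = nn.Sequential(*stages)
        self.to_latent = nn.Conv2d(ch, latent_dim, 1)  # real-valued output, no codebook lookup

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        z = self.to_latent(self.backbone(images))      # (B, d, h, w) latent grid
        return z.flatten(2).transpose(1, 2)            # (B, h*w, d) continuous token sequence

tokens = ToyContinuousTokenizer()(torch.randn(2, 3, 256, 256))
print(tokens.shape)  # torch.Size([2, 256, 16]): a 16×16 grid of 16-dimensional tokens
```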
Methods for extracting continuous tokens vary:
- Patchification and direct projection in ViT-based models yield one continuous token per non-overlapping image patch (Chung et al., 24 May 2025); a minimal patchification sketch follows this list.
- Compact designs distill expert knowledge into a small number (20) of continuous visual tokens per image for vision–language reasoning (Qin et al., 24 Nov 2025).
- Some frameworks explicitly encode and sequence continuous tokens for both generation and understanding under a unified transformer (Fan et al., 17 Mar 2025, Huang et al., 8 Oct 2025).
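For the ViT-style route in the first bullet above, patchification reduces to a strided convolution that linearly projects each non-overlapping patch into one continuous token; the patch size and embedding width below are assumed purely for illustration.

```python
# Minimal sketch (assumed sizes): ViT-style patchification via a strided convolution.
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768
patchify = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

images = torch.randn(2, 3, 256, 256)
tokens = patchify(images).flatten(2).transpose(1, 2)  # (B, 256, 768): one continuous token per patch
```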
2. Architectural Integration into Generative and Multimodal Models
Continuous visual tokens integrate into autoregressive (AR), masked, or bidirectional transformers in several paradigms:
- Autoregressive Generation: The visual token sequence is generated or predicted token by token, in raster or random order, with each token contextually conditioned on prior tokens and, optionally, text prompt embeddings (Fan et al., 17 Oct 2024, Team et al., 14 Aug 2025, Fan et al., 17 Mar 2025); a minimal model sketch follows this list.
- Multimodal LLMs: Visual tokens are either prepended, interleaved, or "pointed to" within a joint sequence model for unified reasoning, generation, and understanding. In this setting, visual tokens share the same embedding space as text tokens or are aligned via adaptive projection (Huang et al., 8 Oct 2025, Chung et al., 24 May 2025). Some approaches allow for "point-and-copy" operations, letting the model explicitly refer back to specific continuous visual tokens to preserve visual grounding during multimodal reasoning (Chung et al., 24 May 2025).
- Unified Generation-Understanding Frameworks: Here, both text and image tokens (the latter continuous) are predicted by the same transformer, each handled by a modality-specific output head (softmax for text, regression or diffusion head for image tokens) (Fan et al., 17 Mar 2025, Team et al., 14 Aug 2025).
- Reasoning in Visual Space: Small sets of continuous visual tokens serve as latent carriers for structured visual knowledge (e.g., segmentation, depth, edge cues) which are autoregressively predicted and used in chain-of-thought processes (Qin et al., 24 Nov 2025).
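The autoregressive paradigm above can be sketched as a causal transformer whose output head regresses the next continuous token instead of emitting a softmax over a codebook. The MSE-style regression head below is a stand-in for the Gaussian or diffusion heads discussed in the next section, and all layer sizes are illustrative assumptions.

```python
# Minimal sketch (assumed sizes): causal transformer over continuous visual tokens
# with a regression output head instead of a categorical softmax.
import torch
import torch.nn as nn

class ContinuousARModel(nn.Module):
    def __init__(self, token_dim: int = 16, width: int = 256, layers: int = 4):
        super().__init__()
        self.in_proj = nn.Linear(token_dim, width)
        block = nn.TransformerEncoderLayer(width, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.out_head = nn.Linear(width, token_dim)  # regression head: real-valued next-token prediction

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Causal mask so position t only attends to positions <= t.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        hidden = self.backbone(self.in_proj(tokens), mask=mask)
        return self.out_head(hidden)                 # (B, T, token_dim) predictions

model = ContinuousARModel()
prefix = torch.randn(2, 255, 16)                     # teacher-forced token prefix
pred_next = model(prefix)                            # continuous predictions for the following tokens
```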
3. Training Objectives, Modeling Strategies, and Quantization
Continuous token models require objectives distinct from standard cross-entropy on discrete vocabularies:
- Generative Likelihoods: Typically, likelihoods over continuous tokens are modeled with Gaussian or diffusion-based objectives. The loss may be the mean squared error between predicted and target latent vectors, or a diffusion-based denoising regression (e.g., optimizing a "diffusion head" to predict the noise added to the latent at a random interpolation between its clean and fully noised states); a loss sketch follows this list (Fan et al., 17 Oct 2024, Team et al., 14 Aug 2025, Huang et al., 8 Oct 2025, Fan et al., 17 Mar 2025).
- Hybrid Quantization: Some methods bridge continuous and discrete paradigms by discretizing the continuous latent dimensions post-hoc (e.g., via dimension-wise quantization or codebook resampling), enabling more efficient AR modeling with cross-entropy loss while closely matching continuous-token reconstruction fidelity (Wang et al., 20 Mar 2025, Zhang et al., 10 Mar 2025, Chen et al., 3 Nov 2025).
- Continuous-to-Discrete Decoupling: TokenBridge, for instance, learns continuous tokens via a VAE and subsequently applies a non-parametric, dimension-wise quantizer without codebook learning. The AR distribution over the quantized indices is efficiently modeled as an autoregressive chain over the latent vector’s dimensions (Wang et al., 20 Mar 2025).
- Conditioned AR: Some hybrid approaches treat discrete tokens as conditioning signals, enabling the AR model to first establish high-level structure (via discrete codes) and then use continuous tokens for high-fidelity, detail refinement (Zheng et al., 2 Jul 2025).
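As a hedged illustration of the diffusion-head objective in the first bullet, the sketch below corrupts a target latent toward Gaussian noise at a random interpolation time and trains a small head, conditioned on the AR model's hidden state, to predict that noise with an MSE loss. The head architecture and the linear interpolation schedule are simplifying assumptions, not any cited paper's exact recipe.

```python
# Minimal sketch (assumed head and schedule): diffusion-style denoising regression
# on continuous tokens, conditioned on autoregressive hidden states.
import torch
import torch.nn as nn
import torch.nn.functional as F

token_dim, cond_dim = 16, 256
noise_head = nn.Sequential(
    nn.Linear(token_dim + cond_dim + 1, 512), nn.SiLU(), nn.Linear(512, token_dim)
)

def diffusion_head_loss(target_tokens: torch.Tensor, cond_states: torch.Tensor) -> torch.Tensor:
    """target_tokens: (B, d) clean latents; cond_states: (B, c) AR hidden states."""
    t = torch.rand(target_tokens.shape[0], 1)            # random interpolation time in [0, 1]
    noise = torch.randn_like(target_tokens)
    noisy = (1 - t) * target_tokens + t * noise          # interpolate between clean and noised states
    pred = noise_head(torch.cat([noisy, cond_states, t], dim=-1))
    return F.mse_loss(pred, noise)                       # denoising regression target

loss = diffusion_head_loss(torch.randn(8, token_dim), torch.randn(8, cond_dim))
loss.backward()
```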
4. Empirical Advantages, Limitations, and Comparative Results
Continuous visual tokens afford multiple empirical advantages:
- High Reconstruction Fidelity: Continuous encodings consistently achieve lower reconstruction error (e.g., rFID, PSNR, SSIM) than discrete quantization. On ImageNet 256×256, the rFID of 1.11 obtained with continuous latents is preserved even after discretization with a sufficient number of quantization bins ($B = 64$) (Wang et al., 20 Mar 2025). NextStep-1 and UniFluid report PSNR above 30, surpassing VQ-based benchmarks (Team et al., 14 Aug 2025, Fan et al., 17 Mar 2025).
- Preservation of Fine Visual Detail: Fine-grained textures, sharp edges, and semantic accuracy in generated images are better preserved relative to VQ/VQGAN discrete approaches (Fan et al., 17 Oct 2024, Fan et al., 17 Mar 2025, Wang et al., 20 Mar 2025).
- Stable and Simplified Training: The removal of hard quantization units and codebook learning prevents codebook collapse and unlocks smooth gradient flow through the entire model. Diffusion or flow-matching heads facilitate end-to-end optimization (Fan et al., 17 Oct 2024, Team et al., 14 Aug 2025, Huang et al., 8 Oct 2025).
- Improved Scaling and Transfer Learning: AR transformers trained with continuous tokens demonstrate more robust scaling with parameter count, and better transfer to downstream tasks such as compositional image editing, text-based retrieval, and unified captioning/question answering (Fan et al., 17 Oct 2024, Fan et al., 17 Mar 2025, Zhang et al., 10 Mar 2025, Qin et al., 24 Nov 2025).
- Efficient Representation: Adaptive quantization (as in CDD-VT) enables images of different complexity to be tokenized with an appropriately variable number of primitives, combining the benefits of both continuous and discrete regimes (Chen et al., 3 Nov 2025).
- Vision–Language Reasoning: Supporting hybrid text+visual streams with interleaved or pointed continuous tokens enhances grounded reasoning, spatial understanding, and interpretability in VLMs (Qin et al., 24 Nov 2025, Chung et al., 24 May 2025).
Limitations and trade-offs include:
- Distribution Modeling Challenge: Predicting high-fidelity continuous latents, especially in the absence of discrete structure, demands complex density models. Issues of out-of-distribution synthesis and sampling overhead emerge (Zheng et al., 2 Jul 2025, Fan et al., 17 Oct 2024).
- Sequential Prediction Cost: Some approaches mitigate the combinatorial explosion of the quantized token space (e.g., $B^{C}$ possible combinations under channel-wise quantization with $B$ bins and $C$ channels) via autoregressive channel orderings, but overall generation remains costlier than a single discrete softmax (Wang et al., 20 Mar 2025); a worked example follows this list.
- VAE Bottleneck: Reconstruction and generation quality are inherited from the underlying VAE tokenizer, which itself may be suboptimal (Wang et al., 20 Mar 2025, Team et al., 14 Aug 2025).
- Latent Overhead: Large latent vectors per patch (e.g., 64- or 1024-dimensional tokens for high-resolution or dense reasoning) increase compute and storage requirements during training and inference (Qin et al., 24 Nov 2025, Team et al., 14 Aug 2025).
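To make the channel-wise trade-off above concrete with assumed example values ($B = 64$ bins, as quoted earlier, and $C = 16$ latent channels), a joint categorical over whole tokens would be intractable, whereas a per-channel autoregressive factorization is not:

$$
\underbrace{B^{C}}_{\text{joint vocabulary}} = 64^{16} = 2^{96} \approx 7.9 \times 10^{28}
\qquad \text{vs.} \qquad
\underbrace{C \cdot B}_{\text{per-channel logits}} = 16 \cdot 64 = 1024,
$$

at the cost of $C$ sequential channel predictions per token.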
5. Hybrid and Unified Tokenization Strategies
Recent research has advanced hybrid tokenization to combine the architectural and modeling advantages of both token types:
- Post-hoc Quantization (TokenBridge, V2Flow): Continuous latents are discretized post-training, ensuring high-fidelity reconstructions while enabling categorical AR modeling (Wang et al., 20 Mar 2025, Zhang et al., 10 Mar 2025); a simplified dimension-wise quantizer is sketched after this list.
- Continuous–Discrete Duality (CDD-VT): Tokenization is adaptively adjusted per sample based on information complexity, emulating discrete or continuous regimes as needed (Chen et al., 3 Nov 2025).
- Semanticization (TokLIP): Discrete low-level VQ tokens are mapped to high-level continuous semantic embeddings, so that the same underlying codes are used for both comprehension (in the continuous CLIP-aligned space) and generation (in discrete autoregressive models) (Lin et al., 8 May 2025).
- Unified Reasoning and Generation: Models such as Ming-UniVision and UniFluid process both text and image tokens through one transformer, jointly optimizing continuous (image) and discrete (text) prediction, enabling multi-task learning and seamless in-context vision–language procedures (Huang et al., 8 Oct 2025, Fan et al., 17 Mar 2025).
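The post-hoc route in the first bullet of this list can be approximated by a non-parametric, dimension-wise quantizer. The uniform-bin scheme below (clamping range, bin count, and bin-center dequantization) is a simplified assumption, not the exact procedure of the cited methods.

```python
# Minimal sketch (assumed uniform bins): non-parametric, dimension-wise post-hoc
# quantization of continuous tokens, with no learned codebook.
import torch

def quantize_per_dimension(z: torch.Tensor, num_bins: int = 64, bound: float = 3.0):
    """z: (..., d) continuous tokens. Returns per-dimension bin indices and dequantized values."""
    step = 2 * bound / num_bins
    idx = ((z.clamp(-bound, bound) + bound) / step).floor().clamp(max=num_bins - 1).long()
    dequant = -bound + (idx.float() + 0.5) * step        # reconstruct from bin centers
    return idx, dequant

z = torch.randn(2, 256, 16)                              # e.g., a 16×16 grid of 16-dim tokens
indices, z_hat = quantize_per_dimension(z)
print(indices.shape, indices.max().item(), (z - z_hat).abs().mean().item())
```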
6. Implications, Open Challenges, and Future Directions
Continuous visual tokens mark a significant shift in visual representation:
- Unified Multimodal Modeling: They enable tight integration of vision and language in shared-model architectures, lowering friction between understanding, generation, captioning, and editing (Fan et al., 17 Mar 2025, Huang et al., 8 Oct 2025, Qin et al., 24 Nov 2025).
- Fine-grained Visual Reasoning: The capacity for “visual thinking” in continuous latent space bolsters spatial, geometric, and perceptual reasoning abilities in VLMs, as shown in chain-of-visual-thought setups (Qin et al., 24 Nov 2025).
- Adaptive and Interpretable Visual Processing: Methods enabling adaptive tokenization or expert-aligned visual tokens provide better alignment to semantic content and task demands (Chen et al., 3 Nov 2025, Qin et al., 24 Nov 2025).
- Sampling Complexity and Robustness: Ongoing challenges remain in efficiently modeling the high-dimensional density of continuous latents, minimizing sampling steps, and controlling robustness against out-of-distribution artifacts (Zheng et al., 2 Jul 2025, Fan et al., 17 Oct 2024).
Directions for further research include:
- Strengthening the representational ability and efficiency of continuous VAEs and decoders (Wang et al., 20 Mar 2025, Team et al., 14 Aug 2025).
- Extending post-training quantization and adaptive tokenization beyond images to video, 3D, or high-dimensional modalities (Wang et al., 20 Mar 2025, Chen et al., 3 Nov 2025).
- Investigating advanced density estimation, hybrid sampling (e.g., rectified-flow, normalizing flows), and learned dependency orderings for channel-wise prediction (Zheng et al., 2 Jul 2025, Wang et al., 20 Mar 2025, Zhang et al., 10 Mar 2025).
- Formalizing and optimizing continuity metrics for cross-domain generalization and robust visual transfer learning (Yi et al., 3 Jun 2025).
Continuous visual tokens, through careful architectural, quantization, and modeling design, have become a foundational component for state-of-the-art high-fidelity visual generation, multimodal reasoning, unified understanding-generation pipelines, and adaptive vision modeling (Wang et al., 20 Mar 2025, Fan et al., 17 Oct 2024, Team et al., 14 Aug 2025, Qin et al., 24 Nov 2025).