
Continuous Latent Tokens in Generative Models

Updated 19 December 2025
  • Continuous latent tokens are real-valued vector representations that generalize discrete tokens, offering enhanced expressivity and flexibility across domains.
  • They leverage learned encoders (e.g., VAEs) paired with autoregressive or diffusion-based predictors to transform high-dimensional inputs into effective latent representations.
  • Empirical benefits include superior generation fidelity, reduced computational overhead, and improved performance in modalities such as audio, images, and motion.

Continuous latent tokens generalize the concept of sequence modeling via discrete symbolic tokens to settings where the underlying structure is best represented by vectors in a continuous space. Unlike discrete tokens, which are drawn from a finite codebook (e.g., words or quantized codes), continuous latent tokens are real-valued vectors, typically low-to-moderate dimensional, used as the atomic units for autoregressive or diffusion-based modeling in modalities such as audio, images, motion, or even abstract reasoning. Continuous tokenization supports fidelity, efficiency, and representation flexibility beyond that achievable with discrete systems, enabling modern models to bridge generation, reasoning, and understanding tasks in a variety of continuous or hybrid discrete-continuous domains.

1. Formal Definition and Construction

Continuous latent tokens are $d$-dimensional real-valued vectors $x_t \in \mathbb{R}^d$, assembled into sequences $x = (x_1, \ldots, x_n)$ that parameterize the generative or reasoning process. The mapping from raw modality (e.g., waveform, image, or logic trace) to tokens is usually effected by a learned encoder, typically a variational autoencoder (VAE). For instance, in audio generation, a 10 s, 16 kHz waveform is transformed into a Mel-spectrogram $m$ and then encoded by a VAE encoder $V_E$ into latents $v \in \mathbb{R}^{T' \times F' \times C}$, which are patchified and flattened to yield tokens $x_t \in \mathbb{R}^d$ with $d = C \cdot p^2$ and $n$ determined by the patch grid (Yang et al., 14 Jul 2025). For vision, an analogous block-wise embedding pipeline maps images to a grid or flattened sequence of continuous tokens, with downsampling and patchification parameters dictating the number and dimensionality (Wang et al., 21 Mar 2025, Team et al., 14 Aug 2025).
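
As a concrete illustration of the patchify-and-flatten step, the minimal PyTorch sketch below reshapes a VAE latent grid into $n$ tokens of dimension $d = C \cdot p^2$. It assumes a channels-first latent layout and toy sizes; the function name `patchify_latents` and the example shapes are illustrative, not taken from the cited papers.

```python
import torch

def patchify_latents(v: torch.Tensor, p: int) -> torch.Tensor:
    """Turn a VAE latent grid v of shape (C, T', F') into a sequence of
    continuous tokens of dimension d = C * p**2.

    Assumes T' and F' are divisible by the patch size p (channels-first
    layout chosen for convenience); an illustrative sketch only.
    """
    C, Tp, Fp = v.shape
    assert Tp % p == 0 and Fp % p == 0, "latent grid must be divisible by p"
    # (C, T'/p, p, F'/p, p) -> (T'/p, F'/p, C, p, p) -> (n, C*p*p)
    v = v.reshape(C, Tp // p, p, Fp // p, p)
    v = v.permute(1, 3, 0, 2, 4).contiguous()
    tokens = v.reshape(-1, C * p * p)          # n = (T'/p) * (F'/p)
    return tokens

# Example: a 16-channel latent grid of 64x8 time-frequency cells, patch size 2
v = torch.randn(16, 64, 8)
x = patchify_latents(v, p=2)                   # x.shape == (128, 64), d = 16*2*2
```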

Generation operates directly in this latent space: the generative model (causal transformer, masked autoencoder, or bidirectional diffusion) autoregressively or iteratively predicts the next or missing token(s), typically using diffusion-based loss functions. After generation, tokens are mapped back to the data domain by a decoder (VAE decoder, HiFi-GAN vocoder, or image decoder).

2. Training Objectives and Modeling Approaches

Most contemporary continuous-token models eschew the cross-entropy loss used in discrete language modeling, favoring continuous-valued diffusion or flow-matching objectives at the per-token level. In the diffusion setup, the model predicts the conditional score or noise vector for each token given the context and optionally the diffusion timestep:

$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{x_t,\,u,\,\epsilon}\,\big\| \epsilon - \epsilon_\theta\big(x_t^{(u)}, z_t, u\big) \big\|^2,$$

where $x_t^{(u)}$ is the noised version of the ground-truth token at time $u$ in the forward diffusion schedule, $z_t$ is the transformer context, and $\epsilon_\theta$ is a small MLP "diffusion head" (Yang et al., 14 Jul 2025, Fan et al., 17 Mar 2025, Team et al., 14 Aug 2025).
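
A minimal PyTorch sketch of this per-token objective is shown below, pairing a toy linear noise schedule with a small MLP head conditioned on the context vector and the diffusion time. The module `DiffusionHead`, the hidden sizes, and the schedule are illustrative assumptions, not the exact designs of the cited papers.

```python
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Small MLP that predicts the noise added to one continuous token,
    conditioned on the transformer context z and the diffusion time u."""
    def __init__(self, d_token: int, d_ctx: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_token + d_ctx + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, d_token),
        )

    def forward(self, x_noised, z, u):
        # u enters as a scalar feature per token (toy time embedding)
        return self.net(torch.cat([x_noised, z, u], dim=-1))

def per_token_diffusion_loss(head, x_clean, z):
    """L_diff = E || eps - eps_theta(x^(u), z, u) ||^2 with a toy
    schedule alpha(u) = 1 - u (illustrative, not from the papers)."""
    u = torch.rand(x_clean.shape[0], 1)        # diffusion time in (0, 1)
    eps = torch.randn_like(x_clean)
    alpha = 1.0 - u
    x_noised = alpha.sqrt() * x_clean + (1 - alpha).sqrt() * eps
    eps_pred = head(x_noised, z, u)
    return ((eps - eps_pred) ** 2).mean()

# Toy usage: 8 tokens of dimension 64, context vectors of dimension 512
head = DiffusionHead(d_token=64, d_ctx=512)
loss = per_token_diffusion_loss(head, torch.randn(8, 64), torch.randn(8, 512))
loss.backward()
```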

Alternative formulations include:

  • Flow matching: Predicting the instantaneous velocity that maps latent noise to the clean token along a linear or nonlinear ODE trajectory (Team et al., 14 Aug 2025); a minimal per-token sketch follows this list.
  • Rectified flow: Per-token invertible flows matching trajectories between base noise and data, optimized by vector-field MSE losses (Wu et al., 26 Aug 2025).
  • Masked autoencoder (MAE) reconstruction + diffusion: Joint diffusion modeling and masked-prediction objectives, often in 2-stage pipelines that serve as discrete-to-continuous bridges (Wang et al., 21 Mar 2025).
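
For contrast with the diffusion objective above, here is a minimal sketch of the per-token flow-matching (rectified-flow-style) variant: it linearly interpolates between base noise and the clean token and regresses the constant velocity $x_1 - x_0$. The `velocity_head` module and its sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical per-token velocity head; same interface idea as the diffusion
# head above (token + context + time in, velocity out).
d_token, d_ctx = 64, 512
velocity_head = nn.Sequential(
    nn.Linear(d_token + d_ctx + 1, 256), nn.SiLU(),
    nn.Linear(256, d_token),
)

def per_token_flow_matching_loss(head, x1, z):
    """Linear interpolation x_t = (1 - t) x0 + t x1 between base noise x0
    and the clean token x1, with constant target velocity v* = x1 - x0."""
    x0 = torch.randn_like(x1)                  # base noise sample
    t = torch.rand(x1.shape[0], 1)
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = head(torch.cat([x_t, z, t], dim=-1))
    return ((v_pred - v_target) ** 2).mean()

loss = per_token_flow_matching_loss(velocity_head,
                                    torch.randn(8, d_token),
                                    torch.randn(8, d_ctx))
```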

Architecturally, continuous tokens are handled through parameter sharing, transformer blocks that ingest the tokens via learned input embeddings, and diffusion or flow-matching heads attached at each output position.

3. Domain-Specific Implementations

Continuous latent tokens have been instantiated across multiple modalities:

  • Audio: VAE-based tokenizers with causal Transformer decoders trained using diffusion loss for next-token prediction (Yang et al., 14 Jul 2025). Both audio reconstruction quality (FAD, KL) and subjective naturalness are improved versus discrete systems. Streaming and real-time factors are near-optimal due to reduced token counts and regularized architectures (Wu et al., 26 Aug 2025).
  • Images: Patch-based VAE encodings result in continuous latent tokens on 16×16 or 32×32 grids. Models employ AR transformers with flow matching or hybrid discrete–continuous stacks (e.g., D2C's two-stage AR pipeline) (Wang et al., 21 Mar 2025, Team et al., 14 Aug 2025). State-of-the-art FID/IS is achieved, outperforming discrete AR and classical diffusion at lower computational cost.
  • Motion: For human motion interpolation, continuous intermediate tokens are constructed via context-guided attention over temporally sparse keyframes, forming a latent trajectory manifold that exceeds classic interpolation methods in precision and realism (Mo et al., 2023).
  • Visuomotor control: Frequency-domain continuous tokens summarize actions per DCT band, with autoregressive frequency sweeps and diffusion-based generation yielding accurate, temporally smooth, and computationally efficient policies (Zhong et al., 2 Jun 2025).
  • Latent reasoning and chain-of-thought: In language and reasoning, continuous tokens supplant discrete CoT steps by feeding continuous summary vectors back into the transformer, enabling parallel or iterative reasoning, breadth-first search–like effects, and more expressive latent representations (Hao et al., 9 Dec 2024, Wu et al., 23 Jun 2025); see the sketch after this list.
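
A minimal sketch of the latent chain-of-thought mechanism from the last bullet: rather than decoding a discrete token at each reasoning step, the model's final hidden state is fed back as the next input embedding. `TinyCausalBackbone` is a toy stand-in, not the architecture of the cited papers.

```python
import torch
import torch.nn as nn

class TinyCausalBackbone(nn.Module):
    """Toy causal transformer backbone that consumes continuous embeddings
    directly rather than discrete token ids (illustrative only)."""
    def __init__(self, d_model: int = 128, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, embeds: torch.Tensor) -> torch.Tensor:
        T = embeds.shape[1]
        causal_mask = torch.triu(
            torch.full((T, T), float("-inf"), device=embeds.device), diagonal=1)
        return self.encoder(embeds, mask=causal_mask)

@torch.no_grad()
def latent_chain_of_thought(model, prompt_embeds, n_latent_steps: int = 4):
    """Roll out continuous 'thought' tokens: at each step the final hidden
    state is appended to the sequence as the next input embedding."""
    seq = prompt_embeds                          # (B, T, d_model)
    for _ in range(n_latent_steps):
        hidden = model(seq)                      # (B, T, d_model)
        next_tok = hidden[:, -1:, :]             # continuous latent thought
        seq = torch.cat([seq, next_tok], dim=1)
    return seq

model = TinyCausalBackbone()
prompt = torch.randn(1, 5, 128)                  # 5 already-embedded prompt tokens
rollout = latent_chain_of_thought(model, prompt) # shape (1, 9, 128)
```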

4. Hybrid Discrete–Continuous Modeling and Expressivity

Numerous works show that blending discrete and continuous tokens, or inferring continuous latent channels in discrete-oriented models, enhances both expressivity and sample quality:

  • Coevolutionary diffusion: Models such as CCDD employ joint diffusion in discrete and continuous spaces, allowing each modality to guide the other during training and sampling. This approach leverages the sharp, corrective properties of discrete tokens and the rich, smooth manifold of continuous representations. The joint system outperforms purely discrete or purely continuous baselines on language modeling perplexity and generative NLL (Zhou et al., 3 Oct 2025).
  • Latent Discrete Diffusion Models (LDDMs): These pair masked discrete diffusion with a continuous latent channel subject to a correlated DDPM process. The continuous path provides global, cross-token regularization, resolving ambiguities and improving few-step sample quality, especially in low-step or few-token unmasking regimes (Shariatian et al., 20 Oct 2025).
  • Theoretical analysis: It has been formally established that continuous latent token mechanisms can represent superpositions of reasoning paths and simulate looped transformers, giving rise to strictly greater expressivity than discrete-only approaches for certain classes of reasoning or planning tasks (Butt et al., 23 Sep 2025, Hao et al., 9 Dec 2024).

5. Empirical Benefits, Applications, and Best Practices

Empirical results consistently show superior generation fidelity, reduced computational overhead, and improved performance across audio, image, motion, and reasoning benchmarks relative to discrete-token baselines (see Sections 3 and 4).

Recommended practices include using VAE-style encoders with channel and per-patch normalization, diffusion or flow loss per token, continual input normalization, and appropriate balancing of per-modality losses. Masked next-token prediction, masking schedules, and judicious integration of discrete context (for hybrid systems) also contribute to stability and performance.
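
As one concrete illustration of these practices, below is a minimal sketch of per-channel latent normalization, assuming channel statistics precomputed offline over a sample of VAE latents; the function names and shapes are illustrative, not drawn from the cited papers.

```python
import torch

def channel_stats(latents: torch.Tensor):
    """Per-channel mean/std over a batch of VAE latents of shape
    (N, C, H, W); computed once, offline, over a training sample."""
    mean = latents.mean(dim=(0, 2, 3), keepdim=True)
    std = latents.std(dim=(0, 2, 3), keepdim=True).clamp_min(1e-6)
    return mean, std

def normalize_latents(latents, mean, std):
    return (latents - mean) / std

def denormalize_latents(latents, mean, std):
    return latents * std + mean

# Toy usage: 256 latents with 16 channels on an 8x8 grid
lat = torch.randn(256, 16, 8, 8) * 3.0 + 1.0
mu, sigma = channel_stats(lat)
lat_norm = normalize_latents(lat, mu, sigma)     # roughly zero-mean, unit-std
```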

6. Architectural Considerations and Extensions

Implementing continuous latent tokens entails specialized architecture decisions:

  • Input embeddings, transformer stack, and output heads must accept and process arbitrary sequences of $d$-dimensional real-valued tokens, sometimes interspersed with discrete tokens.
  • Self-attention, feedforward, and normalization layers are unmodified, but heads (diffusion, flow, denoising) are adapted for per-token continuous prediction (Yang et al., 14 Jul 2025, Fan et al., 17 Mar 2025).
  • Autoregressive factorization is preserved, typically via unified or mixed discrete-continuous sequences with causal or random ordering.
  • Multihead fusion, cross-attention, and query-based adapters (Q-Former, lightweight cross-attention) inject context or combine discrete and continuous tokens (Wang et al., 21 Mar 2025).
  • Specialized sampling routines: reverse diffusion (for diffusion heads operating in continuous space) or ODE integration (for flow-matching heads), applied per token or per block; see the sampling sketch after this list.
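
The sketch below illustrates the last point for a flow-matching head: Euler integration of the learned ODE from base noise to a clean continuous token at a single output position. The toy `velocity_head`, the shapes, and the step count are illustrative assumptions consistent with the flow-matching sketch in Section 2.

```python
import torch
import torch.nn as nn

# Toy velocity head (token + context + time in, velocity out); in a real
# model this is conditioned on the transformer context for that position.
d_token, d_ctx = 64, 512
velocity_head = nn.Sequential(
    nn.Linear(d_token + d_ctx + 1, 256), nn.SiLU(),
    nn.Linear(256, d_token),
)

@torch.no_grad()
def sample_token(head, z, n_steps: int = 20):
    """Euler integration of the learned ODE from base noise (t = 0) to a
    clean continuous token (t = 1) for one output position with context z."""
    x = torch.randn(z.shape[0], d_token)        # start from Gaussian noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((z.shape[0], 1), i * dt)
        v = head(torch.cat([x, z, t], dim=-1))
        x = x + dt * v                          # Euler step along the flow
    return x

z = torch.randn(1, d_ctx)                        # context for one position
x_next = sample_token(velocity_head, z)          # next continuous token
```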

Extensions span modality-agnostic latent tokens (Mull-Tokens), adaptive frequency-domain decomposition (FreqPolicy), block-wise structure for long-form latent reasoning (LaDiR), and flexible parallelization (PCCoT via Jacobi iteration) (Ray et al., 11 Dec 2025, Zhong et al., 2 Jun 2025, Kang et al., 6 Oct 2025, Wu et al., 23 Jun 2025).

7. Limitations and Open Directions

While continuous latent tokens unlock fundamental improvements in expressivity, efficiency, and generativity, they introduce practical challenges:

  • Training stability and decoding robustness require careful loss weighting, normalization, and diffusion schedule management, especially for long autoregressive chains.
  • For hybrid discrete–continuous systems, coupling and conditional dependence must be managed to avoid drift or divergence between modalities (Zhou et al., 3 Oct 2025, Shariatian et al., 20 Oct 2025).
  • Interpretability of latent tokens is more opaque than discrete traces, though recent work (block-based VAE decoding, visual token distillation, latent-to-text mapping) is advancing this aspect (Kang et al., 6 Oct 2025, Qin et al., 24 Nov 2025).
  • Scaling up to extremely long sequences or high token dimensions may strain memory or sampling efficiency, though advances in parallelization and token “compression” are mitigating these bottlenecks (Wu et al., 23 Jun 2025, Fan et al., 17 Mar 2025).
  • Theoretical questions around optimal superposition, information content, and universality of continuous-token systems remain fertile ground for future inquiry (Butt et al., 23 Sep 2025, Hao et al., 9 Dec 2024).

Collectively, continuous latent tokens have emerged as a unifying abstraction for high-fidelity, efficient, and flexible modeling across generative and reasoning domains, with demonstrated gains over discrete tokenization both in reconstructing data-rich modalities and in supporting structured, expressive reasoning processes.
