Diffusion Tokenizer (DiTo)

Updated 7 January 2026
  • Diffusion Tokenizer (DiTo) is a neural autoencoder framework that leverages diffusion models to create compact, continuous tokens for visual data.
  • It employs a single diffusion objective for reconstruction, achieving high fidelity and semantic preservation without adversarial losses.
  • Extensions to video and multimodal tasks demonstrate faster inference and robust performance compared to traditional VAE or GAN-based tokenizers.

A Diffusion Tokenizer (DiTo) is a neural autoencoder framework that uses diffusion models as the principal component for learning compact, continuous representations (tokens) of visual signals (images or video) for downstream generative or comprehension tasks. Rather than relying on conventional VAE architectures or heavily supervised objectives, DiTo approaches employ diffusion-based reconstruction, often in a self-supervised learning paradigm, to produce latent spaces that maximize reconstruction fidelity and are especially well suited for use by latent diffusion models. Recent variants extend these principles to video, multimodal scenarios, and foundation encoder alignment.

1. Architectural Design and Core Components

Diffusion Tokenizers are generally structured as autoencoders, consisting of:

  • Encoder: A convolutional or transformer-based network mapping inputs (images or spatiotemporal volumes) into a compressed continuous latent $z$. For images, this typically takes the form $E(x): \mathbb{R}^{3\times H\times W} \rightarrow \mathbb{R}^{C\times H/f\times W/f}$ (e.g., $C=4$, $f=8$) (Chen et al., 30 Jan 2025). Video architectures extend this with 3D convolutions or transformer blocks for temporal handling (Yang et al., 5 Mar 2025, Ge et al., 2024).
  • Decoder: A diffusion UNet conditioned on the latent $z$, reconstructing pixels from noise via a learned denoising process. In classic DiTo, the decoder receives as input a noised sample $x_t$ (for images) or $V_t$ (for video), together with the upsampled or projected latent(s), guiding reconstruction through an iterative (or in some cases distilled single-step) denoising trajectory (Chen et al., 30 Jan 2025, Vallaeys et al., 6 Oct 2025).
  • Latent space: The diffusion framework's reconstruction objective pressures the latent to encode all information necessary for faithful recovery, up to the decoder's capacity, and supports plugging these latents into downstream generative diffusion models. Architectural variants include hybrid UNet-Transformer decoders (Vallaeys et al., 6 Oct 2025), adapter-based injection for foundation-model alignment (Chen et al., 29 Sep 2025), and spatial-temporal transformers for video tokenization (Ge et al., 2024). A minimal architectural sketch follows this list.
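For concreteness, the sketch below pairs a convolutional encoder with a diffusion decoder conditioned on the latent, following the description in this list. It is a minimal PyTorch sketch under assumed layer sizes and module names (`Encoder`, `DiffusionDecoder`); the toy denoiser stands in for the full conditional UNet used in the cited work and is not the published architecture.

```python
# Minimal sketch of a DiTo-style autoencoder (illustrative assumptions only;
# layer widths and the toy denoiser are not the published architecture).
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Maps a 3xHxW image to a continuous latent of shape C x H/f x W/f (here C=4, f=8)."""

    def __init__(self, latent_channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.SiLU(),    # H/2
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.SiLU(),  # H/4
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.SiLU(), # H/8
            nn.Conv2d(256, latent_channels, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)


class DiffusionDecoder(nn.Module):
    """Toy denoiser D_theta(x_t, t, z): a stand-in for the conditional diffusion UNet."""

    def __init__(self, latent_channels: int = 4):
        super().__init__()
        self.z_up = nn.Upsample(scale_factor=8, mode="nearest")  # bring latent back to pixel resolution
        self.net = nn.Sequential(
            nn.Conv2d(3 + latent_channels + 1, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, x_t, t, z):
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *x_t.shape[2:])  # timestep broadcast as an extra channel
        return self.net(torch.cat([x_t, self.z_up(z), t_map], dim=1))


if __name__ == "__main__":
    x = torch.randn(2, 3, 256, 256)
    encoder, decoder = Encoder(), DiffusionDecoder()
    z = encoder(x)                 # (2, 4, 32, 32): compact continuous tokens
    t = torch.rand(2)
    x_t = torch.randn_like(x)      # placeholder noised sample
    v_hat = decoder(x_t, t, z)     # denoising prediction conditioned on z
    print(z.shape, v_hat.shape)
```

The point of the sketch is only the interface: the encoder emits continuous tokens and the decoder denoises pixels conditioned on them; real DiTo decoders inject the latent at multiple resolutions of a full UNet.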

2. Training Objectives and Losses

The distinguishing feature of DiTo-style architectures is the use of a diffusion objective as the sole or primary loss:

  • Single-loss flow matching (ELBO-based): The dominant objective is a flow-matching/continuous ELBO, formulated as:

$$L_{\text{flow}}(x) = \mathbb{E}_{t,\epsilon}\,\big\| D_\theta(x_t, t, z) - v \big\|_2^2$$

where $x_t = \alpha_t x + \sigma_t \epsilon$ and $v = (1-\sigma_{\min})\epsilon - x$ (Chen et al., 30 Jan 2025). For standard DDPM/score-matching variants, the corresponding noise-prediction or reconstruction losses are used instead (Yang et al., 5 Mar 2025, Ge et al., 2024). A code sketch of this objective follows.
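The sketch below computes this single loss, assuming an encoder and decoder with the interfaces sketched in Section 1. The schedule $\alpha_t = 1 - t$, $\sigma_t = \sigma_{\min} + (1-\sigma_{\min})t$ is one parameterization consistent with the velocity target $v$ above and is an assumption, not necessarily the exact schedule used in the cited papers.

```python
# Hedged sketch of the single flow-matching reconstruction loss above.
# The schedule alpha_t = 1 - t, sigma_t = sigma_min + (1 - sigma_min) * t is an
# assumed parameterization consistent with v = (1 - sigma_min) * eps - x.
import torch
import torch.nn.functional as F


def flow_matching_loss(encoder, decoder, x, sigma_min: float = 1e-4):
    z = encoder(x)                               # compact continuous tokens
    t = torch.rand(x.shape[0], device=x.device)  # uniform timesteps in [0, 1]
    eps = torch.randn_like(x)
    alpha = (1.0 - t).view(-1, 1, 1, 1)                            # alpha_t
    sigma = (sigma_min + (1.0 - sigma_min) * t).view(-1, 1, 1, 1)  # sigma_t
    x_t = alpha * x + sigma * eps                # noised sample on the probability path
    v = (1.0 - sigma_min) * eps - x              # flow-matching velocity target
    return F.mse_loss(decoder(x_t, t, z), v)     # E_{t,eps} || D_theta(x_t, t, z) - v ||_2^2
```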

3. Alignment to Foundation Encoders

A major evolution is the three-stage alignment procedure that adapts large pretrained vision transformers (e.g., DINOv2, CLIP) into tokenizers (Chen et al., 29 Sep 2025):

  1. Latent alignment (stage 1): Adapter and decoder are trained (encoder frozen) using a reconstruction loss to establish a semantically structured latent space.
  2. Perceptual alignment (stage 2): Jointly unfreeze encoder, adapter, and decoder, optimizing both reconstruction and semantic-preservation losses to retain high-level semantics while improving low-level detail.
  3. Decoder refinement (stage 3): Only the decoder is fine-tuned for maximal reconstruction fidelity with encoder and adapter frozen.

This procedure yields latents encoding both class-level structure (linear-probe accuracy ≈ 35–41%) and high-fidelity image details (rFID ≈ 0.26) (Chen et al., 29 Sep 2025). Such semantically meaningful latents support accelerated, robust training of downstream diffusion models.
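As an illustration, the freezing schedule implied by these three stages can be expressed as below; the helper and module names are assumptions, and the stage-specific losses (reconstruction, semantic preservation) are omitted as placeholders.

```python
# Illustrative sketch of the three-stage freezing schedule described above.
# Module and helper names are assumptions; the actual losses per stage
# (reconstruction, semantic preservation) are not shown here.
import torch


def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad_(flag)


def configure_stage(stage: int, encoder, adapter, decoder):
    """Freeze/unfreeze modules per stage and return the trainable parameters."""
    if stage == 1:    # latent alignment: foundation encoder stays frozen
        flags = (False, True, True)
    elif stage == 2:  # perceptual alignment: everything is trained jointly
        flags = (True, True, True)
    else:             # decoder refinement: only the decoder is fine-tuned
        flags = (False, False, True)
    for module, flag in zip((encoder, adapter, decoder), flags):
        set_trainable(module, flag)
    return [p for m in (encoder, adapter, decoder) for p in m.parameters() if p.requires_grad]


# Usage: params = configure_stage(1, encoder, adapter, decoder)
#        optimizer = torch.optim.AdamW(params, lr=1e-4)
```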

4. Extensions to Video and Multimodal Tokenization

DiTo strategies extend naturally to video by employing causal 3D-convolutions, 3D UNets, and transformers that process spatiotemporal volumes (Yang et al., 5 Mar 2025, Ge et al., 2024):

  • Conditioned diffusion decoders: Video-specific variants use a conditional diffusion decoder, reconstructing sequential frames from compressed video latents, often with single- or few-step DDIM/DDPM sampling for efficient inference (Yang et al., 5 Mar 2025).
  • Temporal continuity and feature caching: Efficient streaming of long videos leverages feature caches for chunk-wise, memory-efficient processing while maintaining continuity across chunk boundaries (Yang et al., 5 Mar 2025); a simplified sketch follows this list.
  • LLM integration for comprehension/generation: Advanced architectures such as Divot inject continuous video tokens into LLMs using GMM-based representation, supporting both video-to-text and text-to-video generation and enabling instruction-following applications such as video storytelling (Ge et al., 2024).
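The sketch below illustrates the chunk-wise streaming idea only: it caches a few raw boundary frames and assumes, for simplicity, an encoder with no temporal downsampling, whereas the cited tokenizers cache intermediate layer features; chunk and context lengths are arbitrary.

```python
# Simplified sketch of chunk-wise video encoding with a boundary cache.
# Assumptions: the encoder preserves temporal length (no temporal downsampling)
# and caching raw context frames approximates the per-layer feature caches
# used by the cited causal tokenizers.
import torch


def stream_encode(encoder, video, chunk_len: int = 8, context_len: int = 2):
    """Encode a long (B, C, T, H, W) video chunk by chunk, carrying a short
    temporal context across chunk boundaries to keep latents continuous."""
    T = video.shape[2]
    cache = None
    latents = []
    for start in range(0, T, chunk_len):
        chunk = video[:, :, start:start + chunk_len]
        if cache is not None:
            chunk = torch.cat([cache, chunk], dim=2)  # prepend cached context frames
        z = encoder(chunk)
        if cache is not None:
            z = z[:, :, context_len:]                 # drop latents for the cached context
        latents.append(z)
        cache = video[:, :, max(0, start + chunk_len - context_len):start + chunk_len]
    return torch.cat(latents, dim=2)


# Shape check with an identity "encoder":
# video = torch.randn(1, 3, 32, 64, 64)
# assert stream_encode(lambda v: v, video).shape == video.shape
```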

5. Empirical Outcomes and Comparison to Prior Tokenizers

Extensive experiments have established that diffusion-based tokenizers provide advantages over traditional KL-VAE or GAN-regularized autoencoders for both image and video:

  • ImageNet and LAION: DiTo image tokenizers reach low rFID (0.26), maintain semantic class structure, and enable latent diffusion models to attain faster and stronger generation (gFID 1.90 after 80K steps vs. 300K for VA-VAE) (Chen et al., 29 Sep 2025).
  • Video tasks: DiTo-style and related approaches achieve higher PSNR and SSIM and lower LPIPS than prior VAE-based video tokenizers (e.g., CDT-B: PSNR ≈ 36.38 dB; LPIPS ≈ 0.0195) while supporting 60–70% shorter inference times (Yang et al., 5 Mar 2025, Ge et al., 2024).
  • Efficiency and throughput: Teacher–student distillation yields single-step decoders that match iterative diffusion decoders' reconstruction quality with >8x speedup (a generic sketch follows this list). SSDD achieves rFID 0.50 (vs. 0.87 for KL-VAE) with 1.4x higher throughput (Vallaeys et al., 6 Oct 2025).
  • Downstream generative modeling: Preservation of semantic structure and perceptual detail in the latent space enables generative diffusion models to converge faster and synthesize higher-quality samples (DiTo: gFID 2.17, VA-VAE: 3.13 on ImageNet) (Chen et al., 29 Sep 2025).
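A generic sketch of how such a single-step student can be distilled from an iterative flow-matching decoder, reusing the interfaces from the sketches above; the Euler sampler, step count, and plain regression objective are assumptions, not the exact procedure of the cited work.

```python
# Generic teacher-student distillation sketch: a single-step student regresses onto
# the teacher's multi-step reconstruction from the same latent z. The Euler sampler
# and the regression loss are assumptions, not the exact recipe of the cited papers.
import torch
import torch.nn.functional as F


@torch.no_grad()
def teacher_reconstruct(teacher, z, shape, steps: int = 50):
    """Integrate the learned velocity field dx/dt = v_theta(x_t, t, z) from t=1 (noise) to t=0."""
    x = torch.randn(shape, device=z.device)
    ts = torch.linspace(1.0, 0.0, steps + 1, device=z.device)
    for i in range(steps):
        t = torch.full((shape[0],), float(ts[i]), device=z.device)
        dt = ts[i + 1] - ts[i]          # negative step toward t = 0
        x = x + dt * teacher(x, t, z)   # Euler update along the predicted velocity
    return x


def distill_step(student, teacher, encoder, x):
    """One distillation update: the student maps pure noise to the teacher's reconstruction in one call."""
    z = encoder(x)
    target = teacher_reconstruct(teacher, z, x.shape)
    noise = torch.randn_like(x)
    t_one = torch.ones(x.shape[0], device=x.device)
    prediction = student(noise, t_one, z)   # single forward pass
    return F.mse_loss(prediction, target)
```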

6. Advantages, Limitations, and Extensions

Advantages

  • Simplicity and self-supervision: Core DiTo architectures rely on a single theoretically grounded diffusion objective (ELBO), requiring no adversarial or perceptual hacks, pretrained feature losses, or complex multi-stage scheduling (Chen et al., 30 Jan 2025).
  • Semantic structure: Alignment with foundation encoders enables diffusion-friendly latents which are both semantically meaningful and reconstructive (Chen et al., 29 Sep 2025).
  • Modularity and scalability: The approach is compatible with frozen encoders, supports plug-and-play replacements in existing generative models, and is easily extensible to higher resolutions or different modalities (Vallaeys et al., 6 Oct 2025, Ge et al., 2024).
  • Zero-shot generalization: Fully convolutional designs generalize across image resolutions without retraining (Chen et al., 30 Jan 2025).

Limitations

  • Speed–quality tradeoff: Iterative diffusion decoding imposes computational overhead vs. feed-forward VAE or GAN decoders (∼50 steps, though distillation reduces this gap) (Chen et al., 30 Jan 2025, Vallaeys et al., 6 Oct 2025).
  • Compression granularity: Compression ratios are fixed by architectural downsampling; adaptive or VQ-style token allocation remains an open area (Chen et al., 30 Jan 2025).

Prospective Extensions

  • One-step distillation for real-time pipelines (Vallaeys et al., 6 Oct 2025).
  • Content- or region-aware adaptive tokenization.
  • Multi-task or multimodal diffusion tokenizers for unified comprehension and generation tasks (Ge et al., 2024).

7. Summary of Key Methods and Metrics

| Paper / Method | Modality | Loss Terms | Reported Metrics | Sampling Steps | Notable Features |
|---|---|---|---|---|---|
| DiTo (Chen et al., 30 Jan 2025) | Image | Diffusion L2 (v-prediction) | rFID: 7.95 (XL, 256px) | 50 | Single loss, self-supervised, convolutional |
| CDT (Yang et al., 5 Mar 2025) | Video | MSE diffusion, KL, LPIPS | PSNR: 36.38, LPIPS: 0.0195 | 1–3 | Causal 3D conv, feature cache, conditioned reverse process |
| SSDD (Vallaeys et al., 6 Oct 2025) | Image | Flow matching, LPIPS, REPA | rFID: 0.38–0.50, LPIPS: 0.05 | 1 (student) | Single-step distillation, U-ViT, GAN-free |
| Foundation DiTo (Chen et al., 29 Sep 2025) | Image | L1, perceptual, GAN, semantic | rFID: 0.26, gFID: 1.90 | 30–50 | Foundation encoder alignment, semantic preservation |
| Divot (Ge et al., 2024) | Video + LLM | Diffusion L2 (+ GMM for LLM) | FVD: 301.4, CLIP-SIM: 0.294 | 50 | Perceiver resampler, LLM integration, narrative video |

Performance metrics span rFID (reconstruction FID), LPIPS (perceptual similarity), gFID (generation FID in latent-diffusion pipelines), PSNR/SSIM (video), FVD (video generation), and downstream human preference and zero-shot capabilities.


Diffusion Tokenizers (DiTo and derivatives) define a family of scalable, theoretically grounded, and empirically validated autoencoding frameworks that leverage diffusion models for efficient, high-quality visual representation learning and tokenization for generative and comprehension models. Their ascendancy marks a shift toward semantically structured, loss-efficient, and stable pipelines in both image and video generative modeling (Chen et al., 30 Jan 2025, Yang et al., 5 Mar 2025, Vallaeys et al., 6 Oct 2025, Chen et al., 29 Sep 2025, Ge et al., 2024).
