- The paper introduces VFM-VAE, which integrates vision foundation models as tokenizers to preserve semantic fidelity without conventional distillation.
- It leverages multi-scale latent fusion and progressive resolution blocks to enhance image quality and accelerate convergence on benchmarks like ImageNet.
- It presents SE-CKNNA, a novel metric that robustly diagnoses semantic alignment, correlating with improved generative performance.
Overview
The paper "Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models" presents an approach that enhances Latent Diffusion Models (LDMs) by using Vision Foundation Models (VFMs) directly as tokenizers, rather than distilling their features into a separately trained encoder. The resulting design, termed the Vision Foundation Model Variational Autoencoder (VFM-VAE), keeps the VFM encoder frozen to retain its rich semantic features while addressing the usual trade-off between semantic richness and pixel-level fidelity.
Architectural Innovations
Direct VFM Integration
The approach circumvents conventional distillation, which tends to degrade alignment fidelity and introduces brittleness under distribution shift. Instead, a frozen pre-trained VFM encoder is adopted directly within the VAE framework, preserving the semantic structure of its features. Multi-scale features are extracted from several layers of the VFM so that semantic information is captured across different representation depths.
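The PyTorch sketch below illustrates this extraction pattern under stated assumptions: the VFM is a ViT-style encoder exposing a `patch_embed` module and an iterable `blocks` attribute (common in open-source ViT implementations, but not necessarily the paper's exact interface), and the tapped layer indices are purely illustrative.

```python
import torch
import torch.nn as nn

class FrozenVFMEncoder(nn.Module):
    """Wraps a pretrained ViT-style vision foundation model and returns
    features from several intermediate blocks (multi-scale in depth).
    Illustrative sketch; module names are assumptions, not the paper's API."""

    def __init__(self, vfm: nn.Module, tap_layers=(3, 7, 11)):
        super().__init__()
        self.vfm = vfm
        self.tap_layers = set(tap_layers)
        # Freeze the foundation model: no gradient updates during VAE training.
        for p in self.vfm.parameters():
            p.requires_grad = False
        self.vfm.eval()

    @torch.no_grad()
    def forward(self, pixels: torch.Tensor):
        # Assumes a ViT-like interface: patch embedding followed by a stack
        # of transformer blocks producing (B, N, C) token features.
        x = self.vfm.patch_embed(pixels)
        feats = []
        for i, block in enumerate(self.vfm.blocks):
            x = block(x)
            if i in self.tap_layers:
                feats.append(x)            # keep features at selected depths
        return feats                       # list of (B, N, C) tensors
```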
Decoder Enhancements
- Multi-Scale Latent Fusion: The decoder fuses latent components drawn from multiple scales, supporting hierarchical representation learning that captures both global semantic structure and local detail (see the decoder sketch after this list).
- Progressive Resolution Reconstruction Blocks: These blocks refine image detail step by step, increasing resolution at each stage so the decoder can synthesize high-resolution, high-fidelity outputs.
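As a rough illustration of how these two ideas can fit together, the sketch below fuses several depth-wise latent streams and then doubles spatial resolution stage by stage. All module names, channel widths, and stage counts are hypothetical placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ProgressiveDecoder(nn.Module):
    """Illustrative decoder: fuses multi-depth latent streams, then refines
    the image through progressive 2x upsampling blocks (sketch only)."""

    def __init__(self, latent_dim=768, num_latents=3, base_ch=512, num_stages=4):
        super().__init__()
        # Multi-scale latent fusion: project each latent stream, then merge.
        self.proj = nn.ModuleList(
            [nn.Linear(latent_dim, base_ch) for _ in range(num_latents)]
        )
        self.fuse = nn.Linear(num_latents * base_ch, base_ch)
        # Progressive resolution blocks: each stage doubles spatial size.
        stages, ch = [], base_ch
        for _ in range(num_stages):
            stages.append(nn.Sequential(
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(ch, ch // 2, 3, padding=1),
                nn.GroupNorm(8, ch // 2),
                nn.SiLU(),
            ))
            ch //= 2
        self.stages = nn.Sequential(*stages)
        self.to_rgb = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, latents):  # list of (B, N, C) token latents
        fused = torch.cat([p(z) for p, z in zip(self.proj, latents)], dim=-1)
        fused = self.fuse(fused)                    # (B, N, base_ch)
        B, N, C = fused.shape
        side = int(N ** 0.5)                        # assume a square token grid
        x = fused.transpose(1, 2).reshape(B, C, side, side)
        x = self.stages(x)                          # step-wise refinement
        return self.to_rgb(x)
```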
Training Objectives
The training objective combines several loss terms to balance reconstruction fidelity with semantic preservation: a KL divergence term for latent-space regularization, a cosine-similarity term that keeps latents aligned with the frozen VFM features, and perceptual losses for visual quality.
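A minimal sketch of such a multi-term objective follows; the loss weights, the L1 reconstruction term, and the LPIPS-style perceptual hook are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def vae_training_loss(recon, target, mu, logvar, z_feat, vfm_feat,
                      w_kl=1e-6, w_align=0.5, w_perc=1.0, perceptual_fn=None):
    """Illustrative multi-term objective: reconstruction + KL regularization
    + cosine semantic alignment + optional perceptual loss. All weights are
    placeholders, not the paper's values."""
    # Pixel-level reconstruction.
    loss_rec = F.l1_loss(recon, target)
    # KL divergence of the diagonal-Gaussian latent against a standard normal.
    loss_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Semantic alignment: maximize cosine similarity between projected
    # tokenizer latents and frozen VFM features.
    loss_align = 1.0 - F.cosine_similarity(z_feat, vfm_feat, dim=-1).mean()
    # Perceptual term (e.g. an LPIPS network) if one is supplied.
    loss_perc = perceptual_fn(recon, target).mean() if perceptual_fn else recon.new_zeros(())
    return loss_rec + w_kl * loss_kl + w_align * loss_align + w_perc * loss_perc
```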
Representation Dynamics and Diagnosis
The paper introduces SE-CKNNA, a representation-similarity metric designed to measure semantic equivalence between tokenizer latents and VFM features more robustly than earlier alignment metrics. It serves as a diagnostic of alignment throughout training, and the resulting alignment strategy speeds convergence by improving representation quality from shallow to deep layers.
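For intuition, the snippet below computes a simplified nearest-neighbor alignment score in the spirit of CKNNA-style metrics: the overlap of each sample's k nearest neighbors across two representation spaces. It is an illustrative stand-in, not the paper's SE-CKNNA definition.

```python
import torch
import torch.nn.functional as F

def knn_alignment(feats_a: torch.Tensor, feats_b: torch.Tensor, k: int = 10) -> float:
    """Average overlap between each sample's k nearest neighbors computed in
    two (N, D) representation spaces. Simplified illustration only; the
    paper's SE-CKNNA metric may differ in kernel and normalization details."""
    def knn_indices(x):
        x = F.normalize(x, dim=-1)
        sim = x @ x.t()
        sim.fill_diagonal_(float("-inf"))      # exclude trivial self-matches
        return sim.topk(k, dim=-1).indices     # (N, k) neighbor indices
    idx_a, idx_b = knn_indices(feats_a), knn_indices(feats_b)
    # Fraction of shared neighbors per sample, averaged over the batch.
    overlap = [
        len(set(a.tolist()) & set(b.tolist())) / k
        for a, b in zip(idx_a, idx_b)
    ]
    return float(sum(overlap) / len(overlap))
```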
Experimental Results
The effectiveness of VFM-VAE is validated through comprehensive experiments on ImageNet. It improves gFID while requiring substantially fewer training epochs than traditional tokenizers, demonstrating gains in both generative quality and training efficiency. Coupling the tokenizer with modern generative approaches such as REG further improves semantic alignment and convergence.
Key Contributions
- Elimination of Distillation Degradation: By bypassing distillation, VFM-VAE maintains the fidelity of VFM features, supporting robust latent diffusion.
- Improved Convergence and Performance: The integration strategy leads to significant speed-ups in convergence and improved generative model performance across multiple benchmarks.
- Robust Representation Diagnosis: SE-CKNNA is introduced as a tool for evaluating semantic equivalence between latents and VFM features, and it correlates strongly with improvements in generative performance.
Future Implications
The approach set forth by VFM-VAE opens pathways for enhanced generative tasks across vision applications. Further refinement of alignment strategies and broader VFM integration could yield even more efficient, higher-fidelity image synthesis, and the framework's adaptability suggests extensions to other modalities and tasks, such as multimodal fusion and real-time image processing.
Conclusion
VFM-VAE represents a significant stride in integrating foundational visual models into latent diffusion paradigms. Through architectural innovations and robust training strategies, it offers an efficient and scalable solution that enhances both semantic richness and pixel-level output fidelity, fundamentally improving the performance and applicability of latent diffusion models.