REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers (2504.10483v1)

Published 14 Apr 2025 in cs.CV and cs.LG

Abstract: In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training both VAE and diffusion-model using standard diffusion-loss is ineffective, even causing a degradation in final performance. We show that while diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss -- allowing both VAE and diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance; speeding up diffusion model training by over 17x and 45x over REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself; leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state-of-the-art; achieving FID of 1.26 and 1.83 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at https://end2end-diffusion.github.io.

Summary

  • The paper demonstrates that integrating the REPA loss into end-to-end training significantly improves convergence speed and image quality.
  • It introduces techniques such as batch normalization and stop-gradient to update both the VAE and the diffusion model effectively without latent-space collapse.
  • Results on ImageNet reveal up to 45x faster convergence and state-of-the-art FID scores compared to traditional two-stage methods.

This paper introduces REPA-E, a novel end-to-end (E2E) training recipe for Latent Diffusion Models (LDMs) that jointly optimizes both the Variational Autoencoder (VAE) tokenizer and the diffusion transformer (2504.10483). Traditionally, LDMs are trained in two stages: first training the VAE, then freezing it and training the diffusion model. The paper investigates whether joint E2E training can improve performance and efficiency.

The authors first show that a naive E2E approach, where the standard diffusion loss is backpropagated through the VAE, is ineffective. This method leads to a collapse or over-simplification of the VAE's latent space, degrading the final image generation quality, even though it might make the denoising task easier for the diffusion model.
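
To make the failure mode concrete, the following is a minimal PyTorch sketch of this naive baseline, not the authors' code: the encoder and denoiser are toy stand-ins, and the point is simply that the denoising loss's gradient reaches the VAE encoder, which is what allows the latent space to over-simplify.

```python
# Naive end-to-end baseline (illustrative stand-ins only): the standard
# diffusion loss is backpropagated through the VAE encoder, so the encoder
# can reshape the latent space toward whatever is easiest to denoise.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)    # toy stand-in for the VAE encoder
denoiser = nn.Conv2d(4, 4, kernel_size=3, padding=1)  # toy stand-in for the diffusion model
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(denoiser.parameters()), lr=1e-4
)

x = torch.randn(2, 3, 256, 256)        # dummy image batch
z = encoder(x)                         # latents; gradients can flow back into the encoder
t = torch.rand(z.shape[0], 1, 1, 1)    # random timesteps
noise = torch.randn_like(z)
z_t = (1 - t) * z + t * noise          # simple interpolation-style forward process

loss = F.mse_loss(denoiser(z_t), noise)  # standard denoising loss
loss.backward()                          # updates BOTH the denoiser and the encoder
opt.step()                               # -> the latent space can collapse / over-simplify
```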

To overcome this, REPA-E proposes using the Representation Alignment (REPA) loss (Yu et al., 9 Oct 2024) as the primary driver for E2E updates to the VAE. The key insights motivating this are:

  1. Higher REPA alignment scores (measured by CKNNA) correlate with better generation performance (lower FID).
  2. The maximum alignment achievable with standard REPA (where the VAE is fixed) is bottlenecked by the VAE's features.
  3. Backpropagating the REPA loss through the VAE allows the latent space to adapt, improving alignment scores beyond the limits of a fixed VAE.

REPA-E Implementation Details:

  • Batch Normalization: A Batch Normalization layer is inserted between the VAE output and the diffusion-model input. It provides differentiable normalization of the latent features that adapts dynamically as the VAE updates, avoiding the need to recompute dataset-level latent statistics. The affine transformation in BN is disabled.
  • E2E REPA Loss: The REPA loss measures the cosine similarity between patches of intermediate diffusion-model features (after projection) and features from a fixed pretrained perceptual model (e.g., DINOv2). The gradient of this loss is used to update the parameters of the diffusion model ($\theta$), the VAE ($\phi$), and the REPA projection layer ($\omega$).
    • $\mathcal{L}_{\mathrm{REPA}}(\theta, \phi, \omega) = - \mathbb{E}_{\mathbf{x},\epsilon, t} \left[ \frac{1}{N} \sum_{n=1}^N \mathrm{sim}\!\left(\mathbf{y}^{[n]}, h_{\omega}(\mathbf{h}^{[n]}_t)\right) \right]$
  • Diffusion Loss (Stop-Gradient): The standard diffusion loss $\mathcal{L}_{\mathrm{DIFF}}$ is still applied, but only to update the diffusion-model parameters ($\theta$). A stop-gradient operation prevents this loss from updating the VAE parameters ($\phi$).
  • VAE Regularization: To maintain the VAE's reconstruction quality and prevent divergence, the standard VAE training losses are added as regularization ($\mathcal{L}_{\mathrm{REG}}$). These include reconstruction losses (MSE, LPIPS), a GAN loss, and a KL-divergence loss, applied only to the VAE parameters ($\phi$).
  • Overall Loss: The final objective combines these components (a minimal training-step sketch follows this list):
    • $\mathcal{L}(\theta, \phi, \omega) = \mathcal{L}_{\mathrm{DIFF}}(\theta) + \lambda \mathcal{L}_{\mathrm{REPA}}(\theta, \phi, \omega) + \eta \mathcal{L}_{\mathrm{REG}}(\phi)$
    • Specific weights $\lambda$ and $\eta$ are used, with the REPA weight differing for the two modules being updated (e.g., $\lambda_{\text{REPA}_g}=0.5$ for the SiT and $\lambda_{\text{REPA}_v}=1.5$ for the VAE).
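
The following is a minimal PyTorch sketch of one REPA-E training step that ties the pieces above together. It is an illustration of the recipe as described, not the authors' released implementation: ToyVAE, ToyDenoiser, the projection head, and the perceptual encoder are hypothetical stand-ins; the VAE regularizer is reduced to a plain reconstruction term; the separate REPA weights for the generator and the VAE are collapsed into a single $\lambda$; and the two denoiser forward passes are just one simple way to realize the stop-gradient on the diffusion loss.

```python
# One illustrative REPA-E training step (toy stand-ins, not the released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAE(nn.Module):
    """Toy stand-in for the VAE tokenizer (encoder + decoder)."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(3, 4, kernel_size=8, stride=8)
        self.dec = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)
    def encode(self, x): return self.enc(x)
    def decode(self, z): return self.dec(z)

class ToyDenoiser(nn.Module):
    """Toy stand-in for the SiT denoiser; returns (prediction, intermediate patch features)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Conv2d(4, 64, kernel_size=3, padding=1)
        self.head = nn.Conv2d(64, 4, kernel_size=3, padding=1)
    def forward(self, z_t):
        h = F.silu(self.body(z_t))
        feats = h.flatten(2).transpose(1, 2)   # (B, N patches, C) "intermediate" features
        return self.head(h), feats

vae, denoiser = ToyVAE(), ToyDenoiser()
latent_bn = nn.BatchNorm2d(4, affine=False)    # BN between VAE and denoiser, affine disabled
proj = nn.Linear(64, 32)                       # REPA projection head h_omega
percept = nn.Conv2d(3, 32, kernel_size=8, stride=8).requires_grad_(False)  # frozen stand-in for DINOv2

params = list(vae.parameters()) + list(denoiser.parameters()) + list(proj.parameters())
opt = torch.optim.AdamW(params, lr=1e-4)
lam, eta = 0.5, 1.0                            # REPA and VAE-regularization loss weights

x = torch.randn(2, 3, 256, 256)                # dummy image batch
z_raw = vae.encode(x)                          # VAE latents
z = latent_bn(z_raw)                           # differentiable normalization of the latents
t = torch.rand(z.shape[0], 1, 1, 1)
noise = torch.randn_like(z)

opt.zero_grad(set_to_none=True)

# (1) Diffusion loss on DETACHED latents: updates the denoiser only (stop-gradient on the VAE).
pred, _ = denoiser((1 - t) * z.detach() + t * noise)
loss_diff = F.mse_loss(pred, noise)

# (2) REPA loss on ATTACHED latents: aligns projected denoiser features with frozen
#     perceptual features; its gradient updates the denoiser, the projection AND the VAE.
_, feats = denoiser((1 - t) * z + t * noise)
y = percept(x).flatten(2).transpose(1, 2)      # (B, N, 32) target patch features
loss_repa = -F.cosine_similarity(proj(feats), y, dim=-1).mean()

# (3) VAE regularization (reduced to plain reconstruction here): keeps the VAE faithful.
loss_reg = F.mse_loss(vae.decode(z_raw), x)

(loss_diff + lam * loss_repa + eta * loss_reg).backward()
opt.step()
```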

Key Findings and Contributions:

  1. Accelerated Training: REPA-E significantly speeds up convergence. On ImageNet 256x256 with SiT-XL, it reaches competitive FID scores much faster than vanilla training or standard REPA (over 17x faster than REPA and over 45x faster than vanilla training). For instance, it achieves FID 4.07 in 400K steps, outperforming the FID 5.9 that REPA reaches after 4M steps.
  2. Improved Generation Performance: REPA-E sets new state-of-the-art results on ImageNet 256x256, achieving FID 1.83 (without CFG) and 1.26 (with CFG) after 800 epochs.
  3. Adaptive VAE Improvement: E2E training automatically improves the VAE's latent space structure. For VAEs with noisy latents (like SD-VAE), it learns smoother representations. For VAEs with over-smoothed latents (like IN-VAE or VA-VAE), it learns more detailed structures.
  4. Better, Reusable VAEs: VAEs fine-tuned with REPA-E (termed E2E-VAE) show improved reconstruction FID (e.g., 0.28 for the E2E-VAE derived from VA-VAE) and can be used as superior "drop-in" replacements for their original counterparts when training new diffusion models in the traditional two-stage manner (see the usage sketch after this list).
  5. Generalization: The benefits of REPA-E hold across different diffusion model sizes (SiT-B/L/XL), VAE architectures (SD-VAE, IN-VAE, VA-VAE), REPA perceptual encoders (DINOv2, CLIP, I-JEPA), and REPA alignment depths.
  6. Training from Scratch: REPA-E can effectively train both the VAE and diffusion model entirely from scratch in a single stage, still outperforming two-stage methods like standard REPA.
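
To illustrate what a "drop-in" replacement means in practice, here is a brief sketch that reuses the ToyVAE/ToyDenoiser stand-ins from the earlier training-step example: the E2E-tuned VAE is loaded, frozen, and used purely as a tokenizer while a new diffusion model is trained in the ordinary two-stage fashion. The checkpoint path is a hypothetical placeholder; refer to the released code for the actual weights and interfaces.

```python
# Reuses the ToyVAE / ToyDenoiser stand-ins from the previous sketch.
import torch
import torch.nn.functional as F

vae = ToyVAE()
# vae.load_state_dict(torch.load("<e2e_vae_checkpoint>"))  # hypothetical path; see the released code
vae.requires_grad_(False)
vae.eval()                                     # stage 1 is done: the VAE stays frozen

denoiser = ToyDenoiser()                       # a new diffusion model, trained from scratch
opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

x = torch.randn(2, 3, 256, 256)                # dummy image batch
with torch.no_grad():
    z = vae.encode(x)                          # the E2E-VAE is used purely as a tokenizer
t = torch.rand(z.shape[0], 1, 1, 1)
noise = torch.randn_like(z)
pred, _ = denoiser((1 - t) * z + t * noise)
F.mse_loss(pred, noise).backward()             # ordinary stage-2 diffusion training step
opt.step()
```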

In summary, REPA-E provides a practical and effective method for end-to-end training of latent diffusion models by leveraging representation alignment loss to guide the joint optimization of the VAE and the diffusion network. This leads to faster training, better final image quality, and improved VAEs suitable for broader use.
