
Improving the Diffusability of Autoencoders (2502.14831v3)

Published 20 Feb 2025 in cs.CV, cs.AI, and cs.LG

Abstract: Latent diffusion models have emerged as the leading approach for generating high-quality images and videos, utilizing compressed latent representations to reduce the computational burden of the diffusion process. While recent advancements have primarily focused on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively less attention. In this work, we perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces, which are especially pronounced in the autoencoders with a large bottleneck channel size. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality. To mitigate the issue, we propose scale equivariance: a simple regularization strategy that aligns latent and RGB spaces across frequencies by enforcing scale equivariance in the decoder. It requires minimal code changes and only up to 20K autoencoder fine-tuning steps, yet significantly improves generation quality, reducing FID by 19% for image generation on ImageNet-1K $256^2$ and FVD by at least 44% for video generation on Kinetics-700 $17 \times 256^2$. The source code is available at https://github.com/snap-research/diffusability.


Summary

  • The paper introduces scale equivariance regularization to reduce high-frequency imbalances in autoencoder latent spaces used by latent diffusion models.
  • It achieves significant quality improvements, including a 19% drop in FID for images and up to a 49% decrease in FVD for videos with minimal fine-tuning.
  • The method maintains or slightly improves reconstruction quality over standard KL regularization, leading to smoother denoising trajectories in diffusion processes.

This paper addresses the interaction between autoencoders (AEs) and diffusion backbones in Latent Diffusion Models (LDMs), focusing on an underexplored aspect called "diffusability"—how suitable an AE's latent space is for the diffusion process. The authors perform a spectral analysis using the Discrete Cosine Transform (DCT) on the latent spaces of modern AEs (like FluxAE, CosmosTokenizer, CogVideoX-AE, LTX-AE) and find they contain excessive high-frequency components compared to natural RGB images. This issue is more pronounced in AEs with larger bottleneck channel sizes, which are often used to improve reconstruction quality.
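The spectral analysis described above can be sketched in a few lines of NumPy: compute an orthonormal 2-D DCT of a single channel and bin coefficient energy by radial frequency. The function names (`dct_matrix`, `radial_spectrum`) and the integer radial binning are illustrative assumptions, not the paper's actual analysis code; in practice this would be applied per channel and averaged over a dataset of latents or RGB images.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis: C @ v is the 1-D DCT of vector v.
    k = np.arange(n)[:, None].astype(float)
    i = np.arange(n)[None, :].astype(float)
    C = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    C[0] = np.sqrt(1.0 / n)  # DC row gets the smaller normalization
    return C

def radial_spectrum(channel):
    # 2-D DCT energy of one channel, binned by integer radial frequency
    # and normalized to sum to 1.
    h, w = channel.shape
    F = dct_matrix(h) @ channel @ dct_matrix(w).T
    energy = F ** 2
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    r = np.sqrt(yy ** 2 + xx ** 2).astype(int)
    spec = np.bincount(r.ravel(), weights=energy.ravel())
    return spec / spec.sum()
```

Comparing the tails of such spectra for latent channels versus RGB channels is what exposes the excess high-frequency energy the paper reports.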

The core hypothesis is that these unnatural high-frequency components interfere with the inherent coarse-to-fine synthesis process of diffusion models, thereby degrading the final generation quality. The paper also shows that standard KL divergence regularization used in Variational Autoencoders (VAEs) can worsen this spectral imbalance by introducing more high-frequency noise.

To address this, the paper proposes a simple yet effective regularization strategy called Scale Equivariance (SE). The goal is to align the spectral properties of the latent space with the RGB space. This is achieved by enforcing scale equivariance in the AE's decoder during a short fine-tuning phase:

  1. Both the input image $x$ and its corresponding latent representation $z = \mathrm{Enc}(x)$ are downsampled (e.g., using 2x or 4x bilinear downsampling) to get $\tilde{x}$ and $\tilde{z}$.
  2. An additional reconstruction loss term is added to the AE training objective, penalizing the difference between the downsampled image $\tilde{x}$ and the decoder's output from the downsampled latent, $\mathrm{Dec}(\tilde{z})$.
  3. The full loss function is:

    $L(x) = d(x, \mathrm{Dec}(z)) + \alpha \, d(\tilde{x}, \mathrm{Dec}(\tilde{z})) + \beta L_\text{KL}$

    where $d$ is a standard reconstruction loss (e.g., MSE + LPIPS), $\alpha$ controls the strength of the SE regularization (typically 0.25), and $\beta L_\text{KL}$ is the optional VAE KL term (often set to 0 when using SE).
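The objective above can be sketched as a small NumPy function. Average pooling stands in for the bilinear downsampling, plain MSE stands in for the full MSE + LPIPS reconstruction loss, and `decode` is a hypothetical placeholder for the AE decoder; none of these names come from the paper's codebase.

```python
import numpy as np

def downsample(x, s=2):
    # s-fold average pooling over the last two axes (a stand-in for
    # the bilinear downsampling used in the paper).
    h, w = x.shape[-2] // s, x.shape[-1] // s
    x = x[..., : h * s, : w * s]
    return x.reshape(*x.shape[:-2], h, s, w, s).mean(axis=(-3, -1))

def mse(a, b):
    return np.mean((a - b) ** 2)

def se_loss(x, z, decode, d=mse, alpha=0.25, s=2, beta=0.0, kl=0.0):
    # L(x) = d(x, Dec(z)) + alpha * d(x_down, Dec(z_down)) + beta * L_KL
    rec = d(x, decode(z))
    se = d(downsample(x, s), decode(downsample(z, s)))
    return rec + alpha * se + beta * kl
```

Note that only the decoder appears in the extra term: the encoder's latent is downsampled directly, so the regularizer pushes the decoder to treat a downsampled latent as a faithful low-resolution code of the image.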

This method requires minimal code changes and only a brief fine-tuning period for the AE (e.g., 10k-20k steps). Experiments show that SE fine-tuning effectively reduces the high-frequency components in the latent space, making its spectrum more similar to that of RGB images.

The effectiveness of SE regularization is demonstrated by training Diffusion Transformer (DiT) models on top of various AEs (both vanilla and fine-tuned with/without SE) for image (ImageNet-1K $256^2$) and video (Kinetics-700 $17 \times 256^2$) generation. Key results include:

  • Improved Generation Quality: Significant reductions in standard metrics are observed. For ImageNet $256^2$, FID dropped by 19% for DiT-XL/2 using FluxAE+SE compared to vanilla FluxAE. For Kinetics-700, FVD decreased by at least 44% (e.g., CogVideoX-AE+SE showed a 49% FVD drop with DiT-XL/2).
  • Efficiency: The improvements are achieved with only a short AE fine-tuning phase; the diffusion backbone itself is unchanged.
  • Reconstruction Preservation: Unlike strong KL regularization, SE regularization generally maintains or slightly improves AE reconstruction quality across metrics like PSNR, SSIM, and LPIPS, while significantly boosting downstream LDM performance.
  • Robustness: Visualizations confirm that LDMs trained with SE-regularized AEs exhibit smoother denoising trajectories with fewer high-frequency artifacts early on. AEs trained with SE also show better reconstruction when high-frequency components are deliberately removed from their latents.
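The low-pass robustness check mentioned above can be reproduced in miniature: zero out the high-frequency DCT coefficients of a latent channel, invert the transform, and compare how well the decoder reconstructs from the filtered latent. The helper below is an illustrative sketch (the function name, the normalized radial cutoff convention, and the per-channel treatment are assumptions, not the paper's exact procedure).

```python
import numpy as np

def dct_basis(n):
    # Orthonormal DCT-II basis matrix (rows are basis vectors).
    k = np.arange(n)[:, None].astype(float)
    i = np.arange(n)[None, :].astype(float)
    C = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    C[0] = np.sqrt(1.0 / n)
    return C

def lowpass_latent(z, keep=0.5):
    # Zero out 2-D DCT coefficients whose normalized radial frequency
    # exceeds `keep`, then invert the (orthonormal) transform.
    h, w = z.shape
    Ch, Cw = dct_basis(h), dct_basis(w)
    F = Ch @ z @ Cw.T                     # forward 2-D DCT
    yy, xx = np.meshgrid(np.arange(h) / h, np.arange(w) / w, indexing="ij")
    F[np.sqrt(yy ** 2 + xx ** 2) > keep] = 0.0
    return Ch.T @ F @ Cw                  # inverse 2-D DCT
```

Feeding `lowpass_latent(z)` to the decoder and measuring the reconstruction error against the original image is the kind of probe that distinguishes an AE whose latents carry essential information in high frequencies from one (like an SE-regularized AE) that degrades gracefully.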

In conclusion, the paper highlights the importance of latent space "diffusability" for LDMs and identifies spectral mismatch (excessive high frequencies) as a key issue in modern AEs. The proposed scale equivariance regularization offers a practical, efficient, and effective way to improve this spectral alignment, leading to substantial gains in the quality of LDM-generated images and videos.
