- The paper introduces scale equivariance regularization to reduce high-frequency imbalances in autoencoder latent spaces used by latent diffusion models.
- It achieves significant quality improvements, including a 19% drop in FID for images and up to a 49% decrease in FVD for videos, with only minimal AE fine-tuning.
- The method maintains or slightly improves reconstruction quality over standard KL regularization, leading to smoother denoising trajectories in diffusion processes.
This paper addresses the interaction between autoencoders (AEs) and diffusion backbones in Latent Diffusion Models (LDMs), focusing on an underexplored aspect called "diffusability"—how suitable an AE's latent space is for the diffusion process. The authors perform a spectral analysis using the Discrete Cosine Transform (DCT) on the latent spaces of modern AEs (like FluxAE, CosmosTokenizer, CogVideoX-AE, LTX-AE) and find they contain excessive high-frequency components compared to natural RGB images. This issue is more pronounced in AEs with larger bottleneck channel sizes, which are often used to improve reconstruction quality.
The core hypothesis is that these unnatural high-frequency components interfere with the inherent coarse-to-fine synthesis process of diffusion models, thereby degrading the final generation quality. The paper also shows that standard KL divergence regularization used in Variational Autoencoders (VAEs) can worsen this spectral imbalance by introducing more high-frequency noise.
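The spectral probe underlying this analysis can be sketched in a few lines. The snippet below is an illustrative sketch rather than the authors' code: it assumes latents and images are held as `(N, C, H, W)` NumPy arrays and uses SciPy's DCT with a radially binned average magnitude, so the latent and RGB spectra can be compared on a log scale.

```python
# Illustrative sketch of a DCT spectral analysis (not the paper's code).
import numpy as np
from scipy.fft import dctn

def radial_spectrum(batch: np.ndarray, num_bins: int = 64) -> np.ndarray:
    """Average DCT magnitude as a function of (radially binned) spatial frequency."""
    n, c, h, w = batch.shape
    # 2D DCT over the spatial dimensions of every channel.
    coeffs = np.abs(dctn(batch, axes=(-2, -1), norm="ortho"))
    # Radial frequency index for each (u, v) DCT coefficient.
    u = np.arange(h)[:, None] / h
    v = np.arange(w)[None, :] / w
    radius = np.sqrt(u ** 2 + v ** 2)
    bins = np.minimum((radius / radius.max() * num_bins).astype(int), num_bins - 1)
    spectrum = np.zeros(num_bins)
    for b in range(num_bins):
        mask = bins == b
        spectrum[b] = coeffs[..., mask].mean() if mask.any() else 0.0
    return spectrum

# Comparing the two curves exposes the excess high-frequency energy
# of the latent space relative to natural RGB images:
# rgb_spec = radial_spectrum(images)
# latent_spec = radial_spectrum(latents)
```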
To address this, the paper proposes a simple yet effective regularization strategy called Scale Equivariance (SE). The goal is to align the spectral properties of the latent space with the RGB space. This is achieved by enforcing scale equivariance in the AE's decoder during a short fine-tuning phase:
- Both the input image $x$ and its corresponding latent representation $z = \mathrm{Enc}(x)$ are downsampled (e.g., using 2x or 4x bilinear downsampling) to get $\tilde{x}$ and $\tilde{z}$.
- An additional reconstruction loss term is added to the AE training objective, penalizing the difference between the downsampled image $\tilde{x}$ and the decoder's output from the downsampled latent, $\mathrm{Dec}(\tilde{z})$.
- The full loss function is:
$L(x) = d(x, \mathrm{Dec}(z)) + \alpha\, d(\tilde{x}, \mathrm{Dec}(\tilde{z})) + \beta L_\text{KL}$
where $d$ is a standard reconstruction loss (e.g., MSE + LPIPS), $\alpha$ controls the strength of the SE regularization (typically 0.25), and $\beta L_\text{KL}$ is the optional VAE KL term (often set to 0 when using SE).
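A minimal PyTorch sketch of this objective follows. The `encoder`, `decoder`, and `recon_loss` callables, the 2x bilinear scale factor, and the omission of an explicit KL implementation are assumptions for illustration, not the paper's implementation.

```python
# Sketch of the scale-equivariance (SE) fine-tuning loss, under the
# assumptions stated above.
import torch
import torch.nn.functional as F

def se_training_loss(x: torch.Tensor, encoder, decoder, recon_loss,
                     alpha: float = 0.25, scale: float = 0.5) -> torch.Tensor:
    z = encoder(x)                                   # latent of the full-resolution image
    loss = recon_loss(decoder(z), x)                 # standard term d(x, Dec(z))

    # Downsample the image and its latent with the same bilinear operator
    # (scale=0.5 corresponds to 2x downsampling).
    x_ds = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)
    z_ds = F.interpolate(z, scale_factor=scale, mode="bilinear", align_corners=False)

    # SE term: decoding the downsampled latent should reproduce the
    # downsampled image, alpha * d(x_tilde, Dec(z_tilde)).
    loss = loss + alpha * recon_loss(decoder(z_ds), x_ds)

    # When the VAE KL term is kept (beta > 0), beta * L_KL would be added here.
    return loss
```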
This method requires minimal code changes and only a brief fine-tuning period for the AE (e.g., 10k-20k steps). Experiments show that SE fine-tuning effectively reduces the high-frequency components in the latent space, making its spectrum more similar to that of RGB images.
The effectiveness of SE regularization is demonstrated by training Diffusion Transformer (DiT) models on top of various AEs (both vanilla and fine-tuned with/without SE) for image (ImageNet-1K 256²) and video (Kinetics-700 17×256²) generation. Key results include:
- Improved Generation Quality: Significant reductions in standard metrics are observed. For ImageNet 256², FID dropped by 19% for DiT-XL/2 using FluxAE+SE compared to vanilla FluxAE. For Kinetics-700, FVD decreased by at least 44% (e.g., CogVideoX-AE+SE showed a 49% FVD drop with DiT-XL/2).
- Efficiency: The improvements are achieved with only short AE fine-tuning.
- Reconstruction Preservation: Unlike strong KL regularization, SE regularization generally maintains or slightly improves AE reconstruction quality across metrics like PSNR, SSIM, and LPIPS, while significantly boosting downstream LDM performance.
- Robustness: Visualizations confirm that LDMs trained with SE-regularized AEs exhibit smoother denoising trajectories with fewer high-frequency artifacts early on. AEs trained with SE also show better reconstruction when high-frequency components are deliberately removed from their latents.
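The latent low-pass probe mentioned in the last point can be sketched as below; the DCT-based filtering and the `keep_fraction` cutoff are illustrative assumptions, not the authors' exact procedure. An SE-regularized AE should degrade less under this filtering than a vanilla one.

```python
# Sketch of a latent low-pass robustness probe (assumed setup, not the paper's code).
import numpy as np
import torch
from scipy.fft import dctn, idctn

def lowpass_latent(z: torch.Tensor, keep_fraction: float = 0.5) -> torch.Tensor:
    """Keep only the lowest `keep_fraction` of spatial DCT frequencies per channel."""
    coeffs = dctn(z.detach().cpu().numpy(), axes=(-2, -1), norm="ortho")
    h, w = coeffs.shape[-2:]
    mask = np.zeros((h, w))
    mask[: int(h * keep_fraction), : int(w * keep_fraction)] = 1.0  # low-frequency corner
    filtered = idctn(coeffs * mask, axes=(-2, -1), norm="ortho")
    return torch.from_numpy(filtered).to(z)

# x_hat = decoder(lowpass_latent(encoder(x)))
```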
In conclusion, the paper highlights the importance of latent space "diffusability" for LDMs and identifies spectral mismatch (excessive high frequencies) as a key issue in modern AEs. The proposed scale equivariance regularization offers a practical, efficient, and effective way to improve this spectral alignment, leading to substantial gains in the quality of LDM-generated images and videos.