- The paper introduces AF-LDM, a latent diffusion model enhanced with an equivariance loss to maintain shift consistency even under fractional shifts.
- It combines ideal resampling techniques and alias-free modules in both VAE and U-Net architectures to suppress aliasing amplification during iterative denoising.
- The approach integrates Equivariant/Cross-frame attention to ensure global consistency, significantly improving performance in video editing and image-to-image translation.
Latent Diffusion Models (LDMs) have achieved impressive results in image synthesis but suffer from instability: small shifts in the input noise can lead to inconsistent outputs. This lack of shift-equivariance hinders their use in applications requiring temporal or spatial consistency, such as video editing and image-to-image translation. The paper "Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space" (arXiv:2503.09419) investigates this issue and proposes Alias-Free LDM (AF-LDM) to enhance consistency by making the models more shift-equivariant.
The authors identify several reasons for the instability in standard LDMs (like Stable Diffusion, SD):
- Aliasing amplification: Training the VAE encoder-decoder can amplify aliasing even when the architecture is initially designed to be alias-free.
- Accumulated aliasing: The iterative denoising process in the U-Net involves multiple steps, allowing aliasing to accumulate over time.
- Non-equivariant self-attention: Standard self-attention modules, sensitive to global positions, are not inherently shift-equivariant under arbitrary (non-circular) shifts.
To address these challenges, AF-LDM introduces improvements in two main areas:
- Continuous Latent Representation via Equivariance Loss:
- The goal is to ensure that the latent space represents a band-limited continuous signal, making fractional shifts well-defined and predictable.
- The model architecture incorporates anti-aliasing techniques inspired by StyleGAN3 (arXiv:2106.11237) and Alias-Free Convnets (AFC) (arXiv:2303.05253):
- Ideal Downsampling: Replaces strided convolution with convolution + ideal low-pass filter + nearest downsampling. The low-pass filter is implemented in the Fourier domain.
- Ideal Upsampling: Inserts zeros between samples and convolves with an ideal sinc kernel, implemented in practice as zero-padding of the spectrum in the Fourier domain.
- Filtered Nonlinearity: Wraps nonlinear activation functions between ideal upsampling and downsampling to suppress high frequencies introduced by the nonlinearity.
- These alias-free modules are applied to both the VAE and the U-Net; a minimal Fourier-domain sketch of these operations follows below.
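To make these modules concrete, here is a minimal PyTorch sketch of Fourier-domain ideal resampling and a filtered nonlinearity. It illustrates the general technique rather than the paper's reference implementation; the function names, the SiLU activation, and the brick-wall cutoff and normalization conventions are assumptions.

```python
import torch
import torch.nn.functional as F

def ideal_lowpass(x: torch.Tensor, cutoff: float) -> torch.Tensor:
    """Brick-wall low-pass filter applied as a mask in the Fourier domain.
    `cutoff` is the fraction of the frequency band to keep (1.0 = full band)."""
    H, W = x.shape[-2:]
    fy = torch.fft.fftfreq(H, device=x.device).abs()  # cycles/pixel, in [0, 0.5]
    fx = torch.fft.fftfreq(W, device=x.device).abs()
    mask = (fy[:, None] <= 0.5 * cutoff) & (fx[None, :] <= 0.5 * cutoff)
    return torch.fft.ifft2(torch.fft.fft2(x) * mask).real

def ideal_downsample(x: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Band-limit to the target Nyquist rate, then keep every k-th sample."""
    return ideal_lowpass(x, 1.0 / k)[..., ::k, ::k]

def ideal_upsample(x: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Sinc interpolation: zero-pad the centered spectrum to k times the size."""
    H, W = x.shape[-2:]
    X = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    out = torch.zeros(*x.shape[:-2], k * H, k * W, dtype=X.dtype, device=x.device)
    top, left = (k * H - H) // 2, (k * W - W) // 2
    out[..., top:top + H, left:left + W] = X
    return torch.fft.ifft2(torch.fft.ifftshift(out, dim=(-2, -1))).real * (k * k)

def filtered_silu(x: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Apply the nonlinearity at k-times the sampling rate so that the high
    frequencies it introduces are removed before returning to the original rate."""
    return ideal_downsample(F.silu(ideal_upsample(x, k)), k)
```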
- Crucially, an equivariance loss is added to the training objective. This loss directly penalizes deviations from shift-equivariance:
- For a network $f$ that changes spatial resolution by a downsampling factor $k$, the loss is $\mathcal{L}_{eq} = \lVert f(T_\Delta(x)) - T_{\Delta/k}(f(x)) \rVert_2^2$, where $T_\Delta$ is a fractional shift operator implemented as a Fourier shift.
- For the VAE: $\mathcal{L}_{eq}^{VAE}$ is computed separately for the encoder $E$ and decoder $D$, using cropped integer pixel shifts $\Delta = (\Delta_x, \Delta_y)$ on the input image $x$ and the corresponding fractional shifts $\Delta/k$ on the latent $z$. A valid mask $M_\Delta$ excludes padded regions from the loss calculation.
- For the U-Net: $\mathcal{L}_{eq}^{LDM}$ is computed using fractional circular shifts $T_\Delta^{\mathrm{cir}}$ on the noisy latent $z_t$. A circular shift is used for the input latent to avoid large padding effects in the iterative process, but a cropped mask $M_\Delta$ is still applied to the output difference.
- This loss encourages the network to rely only on low-frequency information that is consistent across shifts, preventing aliasing amplification during training (see the sketch after this list).
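A minimal sketch of the fractional shift operator and the resulting loss, assuming PyTorch tensors of shape (B, C, H, W). Note that the Fourier shift theorem inherently produces a circular shift; the VAE variant additionally masks out wrapped/padded regions, which is omitted here.

```python
import torch

def fourier_shift(x: torch.Tensor, dx: float, dy: float) -> torch.Tensor:
    """Fractional (sub-pixel) circular shift of a (B, C, H, W) tensor via the
    Fourier shift theorem: a spatial shift is a phase ramp in frequency."""
    H, W = x.shape[-2:]
    fy = torch.fft.fftfreq(H, device=x.device)  # cycles/pixel along H
    fx = torch.fft.fftfreq(W, device=x.device)  # cycles/pixel along W
    phase = torch.exp(-2j * torch.pi * (fy[:, None] * dy + fx[None, :] * dx))
    return torch.fft.ifft2(torch.fft.fft2(x) * phase).real

def equivariance_loss(f, x: torch.Tensor, dx: float, dy: float, k: float) -> torch.Tensor:
    """|| f(T_shift(x)) - T_{shift/k}(f(x)) ||^2 for a network f that
    downsamples spatial resolution by a factor k (e.g. k = 8 for the SD encoder).
    For the VAE losses, a valid mask over non-wrapped regions would be applied here."""
    out_of_shifted = f(fourier_shift(x, dx, dy))          # f(T_shift(x))
    shifted_output = fourier_shift(f(x), dx / k, dy / k)  # T_{shift/k}(f(x))
    return torch.mean((out_of_shifted - shifted_output) ** 2)
```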
- Equivariant Attention (EA) / Cross-Frame Attention (CFA):
- Standard self-attention fails to be fully equivariant under non-circular shifts because the key and value pools change with the shifted input.
- Equivariant Attention addresses this by fixing the keys and values to a reference frame $x_r$ while the queries come from the shifted frame $x_s$. The operation becomes $\mathrm{EA}(x_r, x_s) = \mathrm{softmax}\big(x_s W_Q (x_r W_K)^\top\big)\, x_r W_V$ (sketched in code after this list).
- This mechanism is equivalent to Cross-Frame Attention (CFA), commonly used in video diffusion models during inference.
- The authors highlight that using CFA is critical not just in inference but also during training (specifically, within the U-Net equivariance loss calculation) to prevent the loss from focusing on fixing the attention mechanism itself rather than suppressing aliasing.
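A minimal single-head sketch of this attention variant; the $1/\sqrt{d}$ scaling and the projection shapes follow standard attention conventions and are assumed here rather than taken from the paper.

```python
import torch

def cross_frame_attention(x_s: torch.Tensor, x_r: torch.Tensor,
                          W_Q: torch.Tensor, W_K: torch.Tensor,
                          W_V: torch.Tensor) -> torch.Tensor:
    """EA(x_r, x_s): queries come from the current (shifted) frame x_s, while
    keys and values are fixed to the reference frame x_r, so the key/value
    pool no longer changes when the input is shifted.
    x_s: (B, N, C), x_r: (B, M, C); W_Q, W_K, W_V: (C, d)."""
    q = x_s @ W_Q                                          # (B, N, d)
    k = x_r @ W_K                                          # (B, M, d)
    v = x_r @ W_V                                          # (B, M, d)
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # (B, N, M)
    return torch.softmax(scores, dim=-1) @ v               # (B, N, d)

# Standard self-attention is recovered by passing the same frame twice:
# cross_frame_attention(x, x, W_Q, W_K, W_V)
```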
Implementation Details:
- AF-LDM is built by modifying existing Stable Diffusion models. AF-VAE is initialized from the SD VAE (kl-f8) and retrained on ImageNet. The AF-LDM U-Net is trained from scratch on FFHQ, while the AF-SD U-Net is initialized from SD v1.5 and retrained on LAION Aesthetic 6.5+.
- Alias-free modules replace standard resampling (bilinear/nearest) and wrap nonlinearities in the VAE and U-Net.
- Training adds the proposed equivariance losses ($\mathcal{L}_{eq}^{VAE}$ and $\mathcal{L}_{eq}^{LDM}$) to the standard reconstruction/KL/GAN loss for the VAE and to the diffusion loss for LDM/SD, as sketched after this list.
- Fourier shifts are used to implement $T_\Delta$. Integer pixel shifts $(\Delta_x, \Delta_y)$ for the VAE input are sampled up to $\pm\frac{3}{8}$ of the image height and width; fractional shifts $(\Delta_x/k, \Delta_y/k)$ are used in the latent spaces.
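As a rough sketch, the combined objectives can be written as follows; the weight values are illustrative placeholders, not numbers from the paper.

```python
import torch

def total_vae_loss(loss_rec: torch.Tensor, loss_kl: torch.Tensor,
                   loss_gan: torch.Tensor, loss_eq: torch.Tensor,
                   w_kl: float = 1e-6, w_gan: float = 0.5,
                   w_eq: float = 1.0) -> torch.Tensor:
    """AF-VAE objective: standard reconstruction/KL/GAN terms plus the
    equivariance loss. All weights here are placeholders."""
    return loss_rec + w_kl * loss_kl + w_gan * loss_gan + w_eq * loss_eq

def total_ldm_loss(loss_diffusion: torch.Tensor, loss_eq: torch.Tensor,
                   w_eq: float = 1.0) -> torch.Tensor:
    """AF-LDM / AF-SD objective: denoising loss plus the equivariance term."""
    return loss_diffusion + w_eq * loss_eq
```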
Practical Applications and Evaluation:
- Shift PSNR (SPSNR): A key metric for evaluating shift-equivariance; higher SPSNR indicates better equivariance. It is computed as the PSNR between the output of a shifted input and the shifted output of the original input: $\mathrm{SPSNR}_f(x) = \mathrm{PSNR}\big(f(T_\Delta(x)),\, T_{\Delta/k}(f(x))\big)$. A code sketch follows this list.
- Ablation Studies: Experiments show that while alias-free architecture helps, equivariance loss is crucial to maintain high SPSNR throughout training, especially for the U-Net across denoising steps (Fig. 4). Equivariant attention (CFA) combined with equivariance loss further improves results.
- Warping-Equivariant Video Editing: AF-SD's enhanced shift-equivariance translates to better robustness under irregular shifts such as optical flow warping (measured by warping PSNR/MSE). A simple video editing method is proposed: invert the video frames to latents (using CFA for consistency), then regenerate from the inverted latents with a new prompt (again using CFA). Without any explicit latent warping, this method achieves more consistent results than standard SD.
- Image-to-image Translation: AF-LDM improves consistency in tasks like 4× super-resolution (using latent I2SB) and normal estimation (using AF-YOSO). Shifted inputs yield significantly more consistent outputs with alias-free models compared to baselines.
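A small sketch of the metric, reusing the `fourier_shift` helper from the equivariance-loss sketch above; `k` is again the network's spatial downsampling factor.

```python
import torch

def shift_psnr(f, x: torch.Tensor, dx: float, dy: float, k: float,
               max_val: float = 1.0) -> torch.Tensor:
    """SPSNR_f(x) = PSNR(f(T_shift(x)), T_{shift/k}(f(x))).
    Requires the `fourier_shift` helper defined in the earlier sketch."""
    a = f(fourier_shift(x, dx, dy))          # output of the shifted input
    b = fourier_shift(f(x), dx / k, dy / k)  # shifted output of the original input
    mse = torch.mean((a - b) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```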
Limitations:
- The dependence on a reference frame for Equivariant Attention (CFA) means handling occlusions or new objects entering the scene is challenging, potentially leading to inconsistencies in these areas, similar to flow-based methods.
In summary, AF-LDM provides a practical approach to building diffusion models with enhanced shift-equivariance. By combining alias-free architectural design with a novel equivariance loss and strategically applying cross-frame attention during training and inference, it achieves significantly more consistent results for various generative and image-to-image tasks compared to standard LDMs. The code and models are available at https://github.com/SingleZombie/AFLDM.