LiteVAE: Efficient VAE for Latent Diffusion
- The paper introduces LiteVAE, a lightweight variational autoencoder that leverages fixed three-level Haar wavelet transforms and compact UNets to generate high-quality latent representations.
- It achieves up to a six-fold reduction in encoder parameters, with lower memory usage and faster training, while matching or surpassing the baseline VAE on reconstruction metrics such as rFID, LPIPS, PSNR, and SSIM.
- The design supports scalable latent diffusion modeling through a two-phase training process, making it well suited to resource-constrained environments and rapid prototyping.
LiteVAE is a lightweight and computationally efficient variational autoencoder design for latent diffusion models (LDMs), combining multi-level discrete wavelet transforms with lightweight UNet architectures. Unlike conventional VAEs deployed in LDM frameworks such as Stable Diffusion, LiteVAE achieves substantial reductions in model parameters, memory footprint, and training time while matching or surpassing reconstruction quality metrics including rFID, LPIPS, PSNR, and SSIM (Sadat et al., 2024).
1. Wavelet-based Encoder Architecture
LiteVAE introduces a fixed three-level 2D discrete Haar wavelet transform (DWT) for initial image decomposition. Given an input image $x \in \mathbb{R}^{H \times W \times 3}$, sequential application of the Haar DWT produces four frequency sub-bands at each level: LL (low-low), LH (low-high), HL (high-low), and HH (high-high). Each level involves filtering and downsampling operations, resulting in sub-band dimensions $\frac{H}{2^{\ell}} \times \frac{W}{2^{\ell}}$ at level $\ell$.
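A single Haar analysis level can be sketched in a few lines. The sketch below uses an averaging/difference convention (orthonormal $1/\sqrt{2}$ filters are an equally common choice); it is illustrative rather than the paper's exact implementation:

```python
import numpy as np

def haar_dwt2(x):
    """One level of a 2D Haar DWT (averaging/difference convention).

    x: (H, W) array with even H and W.
    Returns LL, LH, HL, HH, each of shape (H/2, W/2).
    """
    a, b = x[0::2, :], x[1::2, :]            # even/odd row pairs
    lo, hi = (a + b) / 2.0, (a - b) / 2.0    # row-wise low/high pass
    def split_cols(y):
        c, d = y[:, 0::2], y[:, 1::2]        # even/odd column pairs
        return (c + d) / 2.0, (c - d) / 2.0
    LL, LH = split_cols(lo)
    HL, HH = split_cols(hi)
    return LL, LH, HL, HH
```

Applying `haar_dwt2` recursively to the LL band yields the three-level pyramid used by the encoder.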
At each level $\ell$, the four sub-bands are concatenated along the channel dimension and processed by a lightweight UNet (denoted $f_{\ell}$) with no spatial resampling, generating intermediate feature maps $z_{\ell}$. These are further aggregated by an additional UNet $g$, which takes $z_{1}$, $z_{2}$, and $z_{3}$ concatenated along the channel dimension after resizing to a common resolution. The aggregated output is the latent representation $z$.
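For concreteness, the per-level sub-band shapes for an illustrative 256×256 RGB input (the resolution here is an assumption for the sketch) work out as follows; the per-level feature maps must be brought to a common spatial resolution before the channel-wise concatenation for aggregation:

```python
# Shape bookkeeping for a three-level Haar decomposition of a 256x256 RGB image.
H = W = 256   # illustrative input resolution (assumption)
levels = 3
# 4 sub-bands (LL, LH, HL, HH) x 3 color channels = 12 channels per level
shapes = [(4 * 3, H >> l, W >> l) for l in range(1, levels + 1)]
print(shapes)  # [(12, 128, 128), (12, 64, 64), (12, 32, 32)]
```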
Decoding uses the same fully convolutional, “style”-based decoder architecture as Stable Diffusion, primarily distinguished by the replacement of GroupNorm with Self-Modulated Convolutions (SMC). The inverse DWT reconstructs the image from latent features using iterative upsampling and filtering over the recovered sub-bands.
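The synthesis direction simply inverts the averaging/difference analysis steps. A minimal single-level sketch (again using an averaging/difference Haar convention, not the paper's exact filters):

```python
import numpy as np

def haar_idwt2(LL, LH, HL, HH):
    """Invert one level of an averaging/difference 2D Haar DWT."""
    def merge_cols(lo, hi):
        h, w = lo.shape
        y = np.empty((h, 2 * w), dtype=lo.dtype)
        y[:, 0::2] = lo + hi                 # even columns
        y[:, 1::2] = lo - hi                 # odd columns
        return y
    lo = merge_cols(LL, LH)                  # row-wise low-pass half
    hi = merge_cols(HL, HH)                  # row-wise high-pass half
    h, w = lo.shape
    x = np.empty((2 * h, w), dtype=lo.dtype)
    x[0::2, :] = lo + hi                     # even rows
    x[1::2, :] = lo - hi                     # odd rows
    return x
```

Iterating this upsample-and-merge step over the recovered sub-bands, coarsest level first, reconstructs the full-resolution image.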
2. Model-Size Variants and Norm Innovations
LiteVAE provides a family of size-adjustable encoder variants:
| Architecture | Encoder Params (M) |
|---|---|
| LiteVAE-S | 1.03 |
| LiteVAE-B | 6.75 |
| LiteVAE-M | 32.75 |
| LiteVAE-L | 41.42 |
| SD VAE | 34.16 |
The base variant (LiteVAE-B) achieves a six-fold reduction in encoder parameters compared to the canonical Stable Diffusion VAE. Larger variants (LiteVAE-M, LiteVAE-L) extend capacity beyond the original VAE.
The SMC normalization replaces GroupNorm in the decoder, applying per-channel learned scales to the convolution weights and renormalizing them. This promotes balanced feature-map scaling and delivers measurable improvements in downstream reconstruction quality.
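One plausible reading of "per-channel learned scales" is a StyleGAN2-style modulate/demodulate of the convolution kernel. The sketch below is an assumption about the mechanism (the names `smc_weights` and `s` are illustrative), not the paper's exact formulation:

```python
import numpy as np

def smc_weights(w, s, eps=1e-8):
    """Sketch: modulate a conv kernel by learned per-channel scales, then demodulate.

    w: (C_out, C_in, k, k) convolution kernel
    s: (C_in,) learned per-channel scales (assumed shape)
    """
    w_mod = w * s[None, :, None, None]       # modulate input channels
    # Demodulate: renormalize each output filter to unit norm
    norm = np.sqrt((w_mod ** 2).sum(axis=(1, 2, 3), keepdims=True) + eps)
    return w_mod / norm
```

The demodulation step is what keeps feature-map magnitudes balanced regardless of the learned scales.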
3. Mathematical Objective and Loss Functions
The LiteVAE training objective extends the standard VAE formulation with adversarial and high-frequency reconstruction terms:

$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_{\text{KL}}\,\mathcal{L}_{\text{KL}} + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}}$$
The reconstruction loss $\mathcal{L}_{\text{rec}}$ comprises:
- L1 loss: $\mathcal{L}_{1} = \lVert x - \hat{x} \rVert_1$
- Perceptual loss (LPIPS): $\mathcal{L}_{\text{LPIPS}} = \mathrm{LPIPS}(x, \hat{x})$
- Wavelet-based Charbonnier loss on high-frequency sub-bands: $\mathcal{L}_{\text{wavelet}} = \sqrt{\lVert \mathrm{HF}(x) - \mathrm{HF}(\hat{x}) \rVert_2^2 + \epsilon^2}$, where $\mathrm{HF}(\cdot)$ collects the LH, HL, and HH sub-bands
- Gaussian high-frequency loss: $\mathcal{L}_{\text{gauss}} = \sqrt{\lVert (x - G_\sigma * x) - (\hat{x} - G_\sigma * \hat{x}) \rVert_2^2 + \epsilon^2}$, where $G_\sigma$ is a Gaussian blur kernel
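The high-frequency terms can be sketched with a Charbonnier penalty and a small separable blur as the low-pass filter. Kernel size and weights here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def charbonnier(a, b, eps=1e-3):
    """Smooth L1-like penalty: mean of sqrt(diff^2 + eps^2)."""
    return np.mean(np.sqrt((a - b) ** 2 + eps ** 2))

def gauss_blur3(x):
    """Separable 3x3 binomial (approximate Gaussian) blur with edge padding."""
    k = np.array([1.0, 2.0, 1.0]) / 4.0
    xp = np.pad(x, 1, mode="edge")
    y = k[0] * xp[:-2, :] + k[1] * xp[1:-1, :] + k[2] * xp[2:, :]    # rows
    return k[0] * y[:, :-2] + k[1] * y[:, 1:-1] + k[2] * y[:, 2:]    # cols

def gaussian_hf_loss(x, x_hat):
    """Penalize mismatch of the high-frequency residuals x - blur(x)."""
    return charbonnier(x - gauss_blur3(x), x_hat - gauss_blur3(x_hat))
```

The wavelet Charbonnier term is analogous, with the LH/HL/HH sub-bands playing the role of the blur residual.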
The regularization schedule uses a constant factor for the adversarial loss, eschewing adaptive weighting mechanisms. The adversarial discriminator is implemented as a UNet, in line with methods by Schonfeld et al. (2020), replacing the more common PatchGAN backbone.
4. Training Methodology and Computational Regimen
Training proceeds in two distinct phases: initial pretraining at 128×128 resolution for 100,000 steps, followed by finetuning at 256×256 resolution for 50,000 steps. This regimen matches or surpasses full-resolution training in quality while cutting wall-clock time roughly in half.
Configurations use the Adam optimizer with a batch size of 16 across two GPUs and mixed-precision training; no augmentation beyond standard flips is employed. The loss configuration integrates the wavelet and Gaussian high-frequency components, enhancing reconstruction of fine details.
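The wall-clock saving follows from most optimization happening at a quarter of the pixel count. A minimal bookkeeping sketch, assuming pretrain/finetune resolutions of 128 and 256 (consistent with the evaluation resolutions in the results below) and the simplifying assumption that per-step cost scales with pixel count:

```python
# Two-phase schedule sketch; per-step cost assumed proportional to pixel count.
phases = [
    {"name": "pretrain", "res": 128, "steps": 100_000},
    {"name": "finetune", "res": 256, "steps": 50_000},
]
unit = 256 * 256  # cost of one full-resolution step
cost = sum(p["steps"] * p["res"] ** 2 for p in phases) / unit
full = 150_000    # same total steps, all at full resolution
print(f"relative cost: {cost / full:.2f}")  # ~0.50 under this crude model
```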
5. Quantitative Results and Performance Analysis
LiteVAE-B consistently matches or improves upon the Stable Diffusion VAE across standard datasets (FFHQ, ImageNet) and latent sizes, with lower reconstruction FID (rFID), improved perceptual similarity (LPIPS), and higher PSNR and SSIM. Representative empirical results are presented below:
| Dataset/Latent | Model | rFID | LPIPS | PSNR | SSIM |
|---|---|---|---|---|---|
| FFHQ 128 (16×16×4) | VAE | 0.88 | 0.089 | 28.08 | 0.85 |
| FFHQ 128 (16×16×4) | LiteVAE-B | 0.74 | 0.085 | 28.36 | 0.85 |
| FFHQ 256 (32×32×4) | VAE | 0.47 | 0.109 | 28.16 | 0.81 |
| FFHQ 256 (32×32×4) | LiteVAE-B | 0.41 | 0.117 | 28.33 | 0.82 |
| ImageNet 128 (16×16×4) | VAE | 4.54 | 0.164 | 24.25 | 0.69 |
| ImageNet 128 (16×16×4) | LiteVAE-B | 4.40 | 0.164 | 24.49 | 0.71 |
Efficiency analysis for the base models indicates LiteVAE-B uses only 6.75M parameters and 3.16 GB of GPU memory while achieving 129 images/sec throughput, versus 34.16M parameters, 8.86 GB, and 68 images/sec for the standard VAE: roughly a fivefold parameter reduction, nearly 2× throughput, and 64% lower memory consumption.
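The headline ratios follow directly from the reported figures:

```python
# Reported figures for the base models (from the efficiency analysis above)
litevae = {"params_M": 6.75, "mem_GB": 3.16, "imgs_per_s": 129}
sd_vae  = {"params_M": 34.16, "mem_GB": 8.86, "imgs_per_s": 68}

param_ratio = sd_vae["params_M"] / litevae["params_M"]   # ~5.06x fewer parameters
speedup = litevae["imgs_per_s"] / sd_vae["imgs_per_s"]   # ~1.90x throughput
mem_saving = 1 - litevae["mem_GB"] / sd_vae["mem_GB"]    # ~64% less memory
print(round(param_ratio, 2), round(speedup, 2), round(mem_saving, 2))
```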
6. Comparative Analysis with Conventional VAEs
The base LiteVAE-B variant (6.75M parameters) matches or surpasses conventional VAE performance over all principal metrics. Larger variants outclass VAEs of equivalent or greater complexity, with LiteVAE-L (41.42M params) achieving rFID=0.74 vs. 0.95, LPIPS=0.062 vs. 0.069, PSNR=29.94 vs. 29.25, and SSIM=0.88 vs. 0.86 for 34M-parameter VAEs. In contrast, naive parameter reduction of VAEs to 6.75M (VAE-Small) entails severe quality degradation (rFID increases to 5.27), whereas LiteVAE-B preserves high fidelity (rFID=4.40).
This suggests fixed wavelet decompositions and compact UNets provide a scalable avenue for efficient latent modeling in generative LDM workflows. A plausible implication is greater applicability for resource-constrained deployments and rapid model iteration.
7. Context and Significance
LiteVAE advances the design space of autoencoders underpinning latent diffusion models by integrating deterministic multi-scale frequency analysis with neural feature extraction, accommodating parameter-efficient training and deployment. By leveraging the Haar DWT and eliminating spatial resampling operations in the encoder's feature-extraction UNets, LiteVAE achieves high-quality reconstructions and supports multiple capacity regimes without compromising efficiency (Sadat et al., 2024).
The demonstrated gains in computational and representational efficiency, combined with robust empirical validation, position LiteVAE as a viable alternative to conventional VAE architectures for high-resolution LDMs. Larger LiteVAE variants further extend quality boundaries for systems requiring higher capacity, underscoring the importance of architectural innovation at the interface of signal processing and deep generative modeling.