Optimization difficulty behind accuracy gap of high spatial-compression autoencoders

Determine whether the reconstruction accuracy gap between high spatial-compression autoencoders (e.g., variants derived from the Stable Diffusion VAE with f32–f64 compression) and low spatial-compression autoencoders (e.g., SD-VAE-f8) is primarily caused by optimization difficulty, that is, by training failing to reach good local optima in the parameter space. The gap persists even when total latent size is matched via a space-to-channel transformation and learning capacity is increased by stacking additional encoder/decoder stages.

Background

The paper investigates why increasing the spatial compression ratio of autoencoders degrades reconstruction performance. Through ablations moving from f8 to f64 while keeping total latent size constant via space-to-channel operations and adding encoder/decoder stages to increase capacity, the authors observe that reconstruction accuracy nonetheless worsens at higher compression.
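
The latent-size matching used in these ablations can be illustrated with a minimal NumPy sketch of a space-to-channel transform. This is not the authors' implementation. The function names, the 256x256 input resolution, and the 4-channel f8 latent are assumptions chosen for illustration. The sketch only shows that folding each p x p spatial patch into channels raises the spatial compression ratio while leaving the total number of latent values unchanged, and that the operation is lossless.

```python
import numpy as np


def space_to_channel(x: np.ndarray, p: int) -> np.ndarray:
    """Fold each p x p spatial patch into the channel axis.

    (C, H, W) -> (C * p * p, H / p, W / p). Non-parametric and lossless.
    Hypothetical helper name; not taken from the paper's code.
    """
    c, h, w = x.shape
    if h % p or w % p:
        raise ValueError("spatial dimensions must be divisible by p")
    x = x.reshape(c, h // p, p, w // p, p)
    x = x.transpose(0, 2, 4, 1, 3)  # (C, p, p, H/p, W/p)
    return x.reshape(c * p * p, h // p, w // p)


def channel_to_space(y: np.ndarray, p: int) -> np.ndarray:
    """Inverse of space_to_channel: (C * p * p, H, W) -> (C, H * p, W * p)."""
    cpp, h, w = y.shape
    if cpp % (p * p):
        raise ValueError("channel count must be divisible by p * p")
    c = cpp // (p * p)
    y = y.reshape(c, p, p, h, w)
    y = y.transpose(0, 3, 1, 4, 2)  # (C, H, p, W, p)
    return y.reshape(c, h * p, w * p)


# Assumed setup: a 256x256 image encoded by an f8 autoencoder with 4 latent
# channels gives a latent of shape (4, 32, 32).
latent_f8 = np.arange(4 * 32 * 32, dtype=np.float32).reshape(4, 32, 32)

latent_f32 = space_to_channel(latent_f8, 4)  # f8 -> f32, shape (64, 8, 8)
latent_f64 = space_to_channel(latent_f8, 8)  # f8 -> f64, shape (256, 4, 4)

print(latent_f8.shape, latent_f32.shape, latent_f64.shape)
print(latent_f8.size, latent_f32.size, latent_f64.size)  # all three are equal
```

Because the transform is an exact rearrangement, an f8 autoencoder followed by it is itself an f32 or f64 autoencoder with the same reconstruction quality, which is why the comparison against learned downsample/upsample stages is informative.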

This empirical observation suggests that added downsample/upsample blocks perform worse than a simple non-parametric space-to-channel transform. The authors explicitly conjecture that the underlying issue is optimization difficulty, positing that good local optima exist but training high spatial-compression autoencoders fails to reach them. This conjecture motivates the proposed Residual Autoencoding design intended to ease optimization.

References

Based on this finding, we conjecture the accuracy gap comes from the model learning process: while we have good local optimums in the parameter space, the optimization difficulty hinders high spatial-compression autoencoders from reaching such local optimums.

Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models (2410.10733 - Chen et al., 14 Oct 2024) in Section 3.1 (Motivation)