Optimization difficulty behind accuracy gap of high spatial-compression autoencoders
Determine whether the reconstruction accuracy gap between high spatial-compression autoencoders (e.g., f32–f64 variants derived from the Stable Diffusion VAE) and low spatial-compression autoencoders (e.g., SD-VAE-f8) is caused primarily by optimization difficulty. The hypothesis is that this difficulty prevents high spatial-compression autoencoders from reaching good local optima in parameter space, even when total latent size is matched via a space-to-channel transformation and learning capacity is increased by stacking additional encoder/decoder stages.
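To make "matching total latent size via space-to-channel" concrete, here is a minimal NumPy sketch of one common form of the operation, in which each r x r spatial block is folded into the channel dimension. The function name `space_to_channel`, the channel ordering, and the example shapes (a 4-channel 32x32 latent, as an f8 encoder might produce for a 256x256 image) are illustrative assumptions, not details taken from the source.

```python
import numpy as np

def space_to_channel(x, r):
    """Fold each r x r spatial block into channels.

    (C, H, W) -> (C * r * r, H / r, W / r). The channel ordering used here
    is one possible convention (an assumption for illustration).
    """
    c, h, w = x.shape
    if h % r or w % r:
        raise ValueError("H and W must be divisible by r")
    x = x.reshape(c, h // r, r, w // r, r)   # split H and W into blocks
    x = x.transpose(0, 2, 4, 1, 3)           # (C, r, r, H/r, W/r)
    return x.reshape(c * r * r, h // r, w // r)

# Hypothetical f8-style latent: 4 channels at 32x32.
latent_f8 = np.arange(4 * 32 * 32, dtype=np.float32).reshape(4, 32, 32)

# Folding 4x4 blocks gives an f32-style shape with the same element count.
latent_f32 = space_to_channel(latent_f8, 4)

print(latent_f8.shape, latent_f8.size)
print(latent_f32.shape, latent_f32.size)
```

The transformation only rearranges values, so the two latents carry the same number of elements. Under this framing, an accuracy gap that persists at matched latent size cannot be attributed to the latent holding less information.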
References
Based on this finding, we conjecture the accuracy gap comes from the model learning process: while we have good local optimums in the parameter space, the optimization difficulty hinders high spatial-compression autoencoders from reaching such local optimums.