Model capacity explanation for DC-AE-f64 scaling benefits

Determine whether the observation that larger diffusion transformer models benefit more from DC-AE-f64 than smaller models is explained by DC-AE-f64's larger latent channel count relative to SD-VAE-f8, which may require greater model capacity to exploit fully.

Background

In ImageNet 512×512 experiments with UViT variants, DC-AE-f64p1 underperforms SD-VAE-f8p2 on the smaller UViT-S but outperforms it on the larger UViT-H, suggesting a dependence on backbone capacity. The authors explicitly conjecture the cause: DC-AE-f64 has more latent channels than SD-VAE-f8, potentially requiring larger model capacity to fully leverage the latent representation.
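
To make the conjecture concrete, the sketch below gives a back-of-the-envelope comparison of token count and per-token dimensionality for the two settings at 512×512. The latent channel counts are assumptions for illustration only (e.g., 4 for SD-VAE-f8 and 128 for DC-AE-f64) and should be checked against the respective papers; the point is that DC-AE-f64p1 yields far fewer tokens, each carrying a much higher-dimensional latent, which plausibly demands more backbone capacity.

# Back-of-the-envelope sketch for a 512x512 input.
# Channel counts are illustrative assumptions, not values taken from the paper.

def token_stats(image_size, spatial_ratio, latent_channels, patch_size):
    """Return (num_tokens, per_token_dim) for a latent diffusion transformer."""
    latent_res = image_size // spatial_ratio           # spatial size of the latent
    tokens_per_side = latent_res // patch_size          # tokens after patchification
    num_tokens = tokens_per_side ** 2
    per_token_dim = latent_channels * patch_size ** 2   # values each token must model
    return num_tokens, per_token_dim

# SD-VAE-f8 with patch size 2 (assumed 4 latent channels)
print("SD-VAE-f8p2 :", token_stats(512, 8, 4, 2))      # -> (1024, 16)

# DC-AE-f64 with patch size 1 (assumed 128 latent channels)
print("DC-AE-f64p1 :", token_stats(512, 64, 128, 1))   # -> (64, 128)

Under these assumptions the per-token dimensionality grows by roughly an order of magnitude while the token count shrinks, consistent with the intuition that a small backbone may lack the capacity to model each DC-AE-f64 token well, whereas a large backbone can.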

Validating this conjecture would clarify how latent channel dimensionality interacts with diffusion transformer capacity, informing architecture scaling strategies for latent diffusion models using high spatial-compression autoencoders.

References

"We conjecture it is because DC-AE-f64 has a larger latent channel number than SD-VAE-f8, thus needing more model capacity." (citing Esser et al., 2024)

Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models (Chen et al., 14 Oct 2024, arXiv:2410.10733), Section 4.2 (Latent Diffusion Models), ImageNet 512×512 paragraph.