- The paper introduces MAETok, a masked autoencoder-based tokenizer that simplifies latent space structure in diffusion models.
- The method outperforms traditional VAEs by reducing diffusion losses and improving image synthesis on benchmarks like ImageNet.
- A lower token count (128 tokens) enables 76x faster training and 31x higher inference throughput on 512x512 image generation, demonstrating practical efficiency.
An Overview of "Masked Autoencoders Are Effective Tokenizers for Diffusion Models"
In the academic exploration of image synthesis, the paper "Masked Autoencoders Are Effective Tokenizers for Diffusion Models" contributes a significant theoretical and empirical investigation into the latent space used by diffusion models. The authors argue that masked autoencoders (MAEs) can serve as highly effective tokenizers for these models and present MAETok as their concrete realization. The work departs from the traditional reliance on variational autoencoders (VAEs), arguing instead that a well-structured latent space, one describable with fewer Gaussian mixture modes, is what benefits diffusion models most.
Theoretical and Empirical Foundations
The paper's central thesis is grounded in both theoretical analysis and empirical validation. The authors scrutinize the latent distribution's structural properties and establish a correlation between fewer latent space modes and better generation quality. They use Gaussian Mixture Models (GMM) to assess the latent space structure, demonstrating that a more discriminative latent distribution correlates with lower diffusion model losses. This insight challenges the existing paradigm that variational regularization, as employed in VAEs, is essential for effective latent representation. Through rigorous analysis, the authors contend that fewer GMM modes, hence a less complex latent space, facilitate more effective training and sampling in diffusion models.
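The GMM diagnostic described above can be illustrated with a toy experiment. The sketch below is not the authors' code; `fit_gmm` and all sizes are invented for illustration. It fits diagonal-covariance GMMs by EM to a synthetic two-mode "latent space" and shows how log-likelihood across component counts reveals the modal structure:

```python
import numpy as np

def fit_gmm(x, k, iters=100):
    """Fit a k-component diagonal-covariance GMM by EM; return mean log-likelihood."""
    n, d = x.shape
    # Farthest-point initialization of the means.
    mu = [x[0]]
    for _ in range(k - 1):
        d2 = np.min([np.square(x - c).sum(1) for c in mu], axis=0)
        mu.append(x[np.argmax(d2)])
    mu, var, pi = np.array(mu), np.ones((k, d)), np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: per-component log densities -> responsibilities.
        logp = (-0.5 * (np.square(x[:, None] - mu) / var).sum(-1)
                - 0.5 * np.log(2 * np.pi * var).sum(-1) + np.log(pi))
        m = logp.max(1, keepdims=True)
        r = np.exp(logp - m)
        ll = (m[:, 0] + np.log(r.sum(1))).mean()  # mean log-likelihood
        r /= r.sum(1, keepdims=True)
        # M-step: re-estimate weights, means, variances.
        nk = r.sum(0)
        pi, mu = nk / n, (r.T @ x) / nk[:, None]
        var = (r.T @ np.square(x)) / nk[:, None] - np.square(mu) + 1e-6
    return ll

# Synthetic "latents" with two well-separated modes: a two-component
# fit achieves a clearly higher likelihood than a single Gaussian.
rng = np.random.default_rng(0)
z = np.vstack([rng.normal(-4, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
print(fit_gmm(z, 1), fit_gmm(z, 2))
```

The paper's argument runs in the opposite direction of this diagnostic: a tokenizer whose latents are well described by *few* modes pairs with lower diffusion losses, so the fitted mode count serves as a proxy for how learnable the latent space is.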
Introduction of MAETok
Motivated by their analysis, the authors introduce MAETok, a masked autoencoder tokenizer. Unlike conventional VAEs, MAETok keeps a plain autoencoder architecture but trains it with mask modeling: a large fraction of input tokens is hidden from the encoder, and auxiliary shallow decoders must predict multiple target features from the resulting latents. This objective produces a latent space that is both semantically rich and faithful in reconstruction, and it allows MAETok to reach state-of-the-art generation performance with significantly reduced computational overhead.
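At a data-flow level, the encoder sees only the visible patches together with a set of learnable latent tokens, and the fixed-size latent-token slice is what the diffusion model consumes. The numpy sketch below is a shape-level illustration of that flow, not the paper's implementation: a single softmax mixing step stands in for the transformer encoder, and all sizes and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 256 image patches, 128 learnable latent tokens,
# 64-dim embeddings, 75% of patches masked during training.
n_patches, n_latent, dim, mask_ratio = 256, 128, 64, 0.75

patches = rng.normal(size=(n_patches, dim))       # patch embeddings
latent_tokens = rng.normal(size=(n_latent, dim))  # learnable queries

# Mask a random subset of patches; the encoder sees only the rest.
n_keep = int(n_patches * (1 - mask_ratio))
visible = patches[rng.permutation(n_patches)[:n_keep]]

# Stand-in "encoder": one softmax mixing step over the concatenation
# of visible patches and latent tokens (a real encoder is a transformer).
tokens_in = np.concatenate([visible, latent_tokens])
W = rng.normal(size=(dim, dim)) / np.sqrt(dim)
attn = tokens_in @ W @ tokens_in.T / np.sqrt(dim)
attn = np.exp(attn - attn.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)
mixed = attn @ tokens_in

# The code handed to the diffusion model is the latent-token slice:
# 128 tokens regardless of how many patches were masked.
latents = mixed[n_keep:]
print(latents.shape)  # (128, 64)
```

During training, the shallow auxiliary decoders reconstruct pixels and predict feature targets from these latents; at inference the encoder would typically run without masking, so the diffusion model always sees the same 128-token interface.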
Robust Empirical Validation
The empirical results presented in the paper validate the effectiveness of MAETok. On the ImageNet benchmark, it generates higher-quality images than existing state-of-the-art models, achieving notable improvements in generation FID (gFID) and Inception Score (IS) while using substantially fewer tokens (128 versus the typical 256 or 1024). The reduced token count translates to 76x faster training and 31x higher inference throughput when generating 512x512 images, underscoring the model's efficiency.
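Much of that efficiency follows from sequence length alone, since self-attention cost grows quadratically with token count. A back-of-the-envelope comparison (the model width of 768 is an assumed placeholder; real speedups also depend on MLP layers, resolution, and implementation):

```python
def attn_flops(n_tokens, dim):
    """Rough FLOPs for one self-attention layer (QK^T plus attention-times-V)."""
    return 2 * (n_tokens ** 2) * dim

# Shrinking the sequence from 1024 to 128 tokens cuts the quadratic
# attention term by (1024 / 128)^2 = 64x; per-token MLP cost falls 8x.
ratio = attn_flops(1024, 768) / attn_flops(128, 768)
print(ratio)  # 64.0
```

This is only an attention-term estimate; the paper's reported 76x training and 31x inference figures come from end-to-end measurements.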
Practical and Theoretical Implications
The research presents both immediate practical applications and theoretical advancements. Practically, the MAETok model can substantially reduce the computational resources required for high-resolution image synthesis, making it attractive for real-time applications where speed and efficiency are critical. Theoretical implications include a reassessment of the role of variational constraints in generating effective latent representations, suggesting that future research in diffusion models should prioritize latent space structure over traditional regularization techniques.
Future Directions
The findings in this paper pave the way for future research focused on improving latent space discriminability and structuring for generative models. Additionally, the insights on mask modeling suggest further exploration into self-supervised learning techniques for generative purposes. The community may benefit from investigating alternative architectures and objective functions that align with the principles outlined in this work to further optimize generative model performance.
In conclusion, this paper provides a significant contribution to our understanding of efficient tokenizer design for diffusion models. By shifting focus from variational constraints to the discriminative properties of the latent space, the authors offer a valuable perspective that could influence the direction of future research in high-resolution image synthesis.