Overview of "MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer"
The paper presents MDTv2, a Masked Diffusion Transformer designed to enhance diffusion probabilistic models (DPMs) for image synthesis. Observing that DPMs often struggle with contextual reasoning and with learning the relations among object parts in an image, the authors propose a masked latent modeling approach: a transformer-based diffusion model that explicitly learns these relations, improving both learning efficiency and synthesis performance.
Key Contributions
- Masked Latent Modeling: MDT masks a subset of latent tokens during training to strengthen contextual learning among semantic parts. An asymmetric diffusion transformer then predicts the masked tokens from the unmasked ones (see the first sketch after this list). By forcing the model to reason about contextual relations, MDT addresses the slow training convergence commonly associated with DPMs.
- Macro-Network Structure in MDTv2: Building on the initial MDT, MDTv2 adopts a more efficient macro-network structure with long-shortcuts in the encoder and dense input-shortcuts in the decoder (second sketch below). This structural change significantly accelerates convergence, improving both training efficiency and synthesis quality.
- Training Strategy Enhancements: The paper also introduces improved training strategies, namely the Adan optimizer, a wider masking-ratio range, and timestep-adapted loss weights (third sketch below), which together yield faster training and better image synthesis.
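To make the masked latent modeling idea concrete, here is a minimal PyTorch sketch of MAE-style random masking plus an asymmetric encoder/decoder: the encoder sees only the unmasked latent tokens, learnable mask embeddings fill the gaps, and a lightweight decoder predicts all tokens. The module names, sizes, and exact wiring (e.g. `random_masking`, `AsymmetricMaskedDenoiser`) are illustrative assumptions, not the paper's implementation, and details such as timestep and class conditioning are omitted.

```python
import torch
import torch.nn as nn

def random_masking(tokens, mask_ratio):
    """Keep a random subset of latent tokens (MAE-style).
    Returns the kept tokens, a binary mask (1 = masked), and the
    indices needed to restore the original token order."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)   # random score per token
    ids_shuffle = noise.argsort(dim=1)               # random permutation
    ids_restore = ids_shuffle.argsort(dim=1)         # inverse permutation
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=tokens.device)
    mask.scatter_(1, ids_keep, 0.0)
    return kept, mask, ids_restore

class AsymmetricMaskedDenoiser(nn.Module):
    """Toy stand-in for the asymmetric design: the encoder processes only
    unmasked tokens; learnable mask embeddings are re-inserted before the
    decoder, which predicts every token (masked ones from context)."""
    def __init__(self, dim=64, heads=4, enc_layers=2, dec_layers=1):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        dec = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, enc_layers)
        self.decoder = nn.TransformerEncoder(dec, dec_layers)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, tokens, mask_ratio=0.3):
        B, N, D = tokens.shape
        kept, mask, ids_restore = random_masking(tokens, mask_ratio)
        enc_out = self.encoder(kept)                 # unmasked tokens only
        fill = self.mask_token.expand(B, N - enc_out.shape[1], D)
        full = torch.cat([enc_out, fill], dim=1)     # append mask embeddings
        full = torch.gather(                         # restore original order
            full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))
        return self.decoder(full), mask

# Usage: 8 images, 16 latent tokens each, 64-dim
model = AsymmetricMaskedDenoiser()
pred, mask = model(torch.randn(8, 16, 64), mask_ratio=0.3)
print(pred.shape, mask.sum(dim=1))   # (8, 16, 64); 5 of 16 tokens masked
```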
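The macro-network change can likewise be sketched. Below, encoder-block outputs are stashed and fed to decoder blocks as U-Net-style long shortcuts, and the original input tokens are additionally concatenated into every decoder block (the dense input-shortcuts). The fusion-by-linear-projection and the block pairing are assumptions for illustration; the paper's exact wiring may differ.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Placeholder transformer block."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
    def forward(self, x):
        return self.layer(x)

class MacroShortcutTransformer(nn.Module):
    """Sketch of the macro structure: long shortcuts connect encoder blocks
    to decoder blocks (U-Net style), and every decoder block also receives
    a dense shortcut from the input tokens. Fusing via concatenation plus a
    linear projection is an illustrative choice, not the paper's exact one."""
    def __init__(self, dim=64, depth=3):
        super().__init__()
        self.enc_blocks = nn.ModuleList([Block(dim) for _ in range(depth)])
        self.dec_blocks = nn.ModuleList([Block(dim) for _ in range(depth)])
        # per-decoder-block fusion of [hidden, long shortcut, input tokens]
        self.fuse = nn.ModuleList([nn.Linear(3 * dim, dim) for _ in range(depth)])

    def forward(self, x):
        inp = x                       # keep input for dense input-shortcuts
        skips = []
        for blk in self.enc_blocks:
            x = blk(x)
            skips.append(x)           # stash states for long shortcuts
        for blk, fuse in zip(self.dec_blocks, self.fuse):
            skip = skips.pop()        # pair deepest encoder state first
            x = blk(fuse(torch.cat([x, skip, inp], dim=-1)))
        return x

# Usage: same token layout as above
out = MacroShortcutTransformer()(torch.randn(8, 16, 64))
print(out.shape)                      # torch.Size([8, 16, 64])
```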
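Finally, two of the training-strategy tweaks lend themselves to a short sketch: sampling the masking ratio from a range rather than fixing it, and weighting the diffusion loss by timestep. Both the range endpoints and the linear weighting below are placeholders, not MDTv2's actual values or schedule; Adan itself is a separately released optimizer that can be swapped in for AdamW.

```python
import torch

def sample_mask_ratio(low=0.3, high=0.7):
    """Draw the per-iteration masking ratio from a range (MDTv2 widens this
    range); the endpoints here are illustrative placeholders."""
    return float(torch.empty(()).uniform_(low, high))

def timestep_loss_weight(t, num_steps):
    """Illustrative timestep-adapted weight that down-weights the noisiest
    steps; a stand-in for the paper's actual weighting."""
    return 1.0 - 0.5 * t.float() / num_steps

# Schematic training step:
num_steps, batch = 1000, 8
t = torch.randint(0, num_steps, (batch,))    # sampled diffusion timesteps
per_sample_loss = torch.rand(batch)          # stand-in for the denoising MSE
loss = (timestep_loss_weight(t, num_steps) * per_sample_loss).mean()
ratio = sample_mask_ratio()                  # would drive the masking step
print(round(loss.item(), 4), round(ratio, 2))
```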
Experimental Results
MDTv2 sets a new state of the art (SOTA) for image synthesis on ImageNet with an FID score of 1.58, and it trains more than 10× faster than the previous SOTA DiT. These results highlight how the masked latent modeling scheme and the enhanced network structure together promote faster convergence and higher performance.
Implications and Future Directions
The findings underscore the potential of transformer-based architectures in the field of generative modeling, particularly in overcoming the limitations of traditional DPMs. MDTv2's approach to contextual learning and macro-network design could inspire further research on efficient diffusion model training.
The implications are significant for applications requiring rapid generation times and high synthesis fidelity. Future work might explore the scalability of MDTv2 to larger datasets and different domains, as well as its integration with other AI models for more nuanced tasks.
Overall, MDTv2 represents a substantial advancement in image synthesis, offering insights and methodologies that could shape the development of next-generation generative models.