MaskGIT: Masked Generative Image Transformer
The paper introduces the Masked Generative Image Transformer (MaskGIT), a new paradigm for image synthesis. The model diverges from conventional autoregressive methods by employing a bidirectional transformer decoder. Instead of generating an image token by token in raster-scan order, MaskGIT predicts all image tokens in parallel at each step and refines them iteratively, improving both efficiency and sample quality.
Overview of the MaskGIT Approach
The MaskGIT methodology is built on masked visual token modeling (MVTM). During training, images are encoded into discrete visual tokens and a random subset of those tokens is masked out; the model is trained to predict the original tokens from the bidirectional context, a strategy reminiscent of BERT in natural language processing. At inference, MaskGIT does not decode sequentially. It starts from a fully masked token grid, predicts all tokens in parallel at every iteration, keeps the most confident predictions, and re-masks and re-predicts the rest over a small, constant number of steps. The mask scheduling function, a central contribution of the paper, determines how many tokens remain masked at each iteration; the authors find that a cosine schedule works best, and the choice measurably affects generation quality.
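To make the decoding procedure concrete, the following is a minimal sketch of iterative parallel decoding with a cosine mask schedule, written in the spirit of the paper's description rather than as its actual implementation. The `model` is assumed to be a bidirectional transformer over VQ codebook indices that accepts a special `[MASK]` token id and returns per-position logits; the token count, mask id, and the budget of 8 steps are illustrative assumptions.

```python
import math
import torch

def cosine_schedule(r: float) -> float:
    # gamma(r) = cos(pi * r / 2): 1.0 when r = 0 (everything masked),
    # 0.0 when r = 1 (nothing masked).
    return math.cos(math.pi * r / 2.0)

@torch.no_grad()
def iterative_decode(model, num_tokens=256, steps=8, mask_id=1024, device="cpu"):
    """Sketch of MaskGIT-style parallel iterative decoding (hypothetical interfaces)."""
    # Start from a fully masked 16x16 token grid, flattened to a sequence.
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long, device=device)

    for t in range(steps):
        logits = model(tokens)                       # (1, num_tokens, codebook_size)
        probs = logits.softmax(dim=-1)
        sampled = torch.multinomial(probs[0], 1).T   # (1, num_tokens) sampled codebook ids
        conf = probs[0].gather(-1, sampled.T).T      # confidence of each sampled token

        # Tokens fixed in earlier iterations keep their values and are never re-masked.
        known = tokens != mask_id
        sampled = torch.where(known, tokens, sampled)
        conf = conf.masked_fill(known, float("inf"))

        if t == steps - 1:                           # last pass: accept everything
            return sampled

        # Mask scheduling: how many tokens stay masked for the next iteration.
        num_to_mask = int(math.ceil(cosine_schedule((t + 1) / steps) * num_tokens))

        # Re-mask the least confident predictions; keep the rest as context.
        remask_idx = conf.topk(num_to_mask, largest=False).indices
        tokens = sampled.scatter(1, remask_idx, mask_id)

    return tokens  # codebook indices, mapped back to pixels by the VQ decoder
```

Note that the cosine schedule commits only a few tokens in the earliest iterations and progressively more in later ones, which the paper reports works better than linear or convex alternatives.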
Performance and Efficiency
Empirically, MaskGIT outperforms state-of-the-art transformer-based models such as VQGAN at generating high-quality images on ImageNet. It achieves a substantially lower Fréchet Inception Distance (FID) and a higher Inception Score (IS), indicating improved realism and diversity in the generated images, and it decodes up to 64 times faster than autoregressive models thanks to its parallel prediction scheme. Removing the autoregressive bottleneck not only accelerates generation but also contributes to improved sample quality and diversity.
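The source of the speedup is easy to see with a back-of-the-envelope count of sequential model calls. The numbers below assume the commonly used 16x-downsampling tokenizer (a 256x256 image becomes a 16x16 token grid) and the illustrative 8-iteration budget from the sketch above; actual wall-clock gains also depend on hardware and model size.

```python
# Sequential forward passes needed to produce one 256x256 image.
num_tokens = 16 * 16           # an autoregressive decoder emits these one at a time
maskgit_iterations = 8         # each MaskGIT iteration predicts all tokens in parallel

print(num_tokens)                          # 256 sequential calls (raster-scan decoding)
print(maskgit_iterations)                  # 8 calls (parallel iterative decoding)
print(num_tokens // maskgit_iterations)    # 32x fewer sequential calls at this resolution
```

The gap widens at higher resolutions, where the token grid grows quadratically while the iteration count stays roughly constant.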
Versatility in Image Manipulation
Beyond standard generation, MaskGIT's design extends naturally to a variety of image editing applications, including inpainting, outpainting, and class-conditional editing. These tasks are challenging for models rooted in autoregressive principles, yet they fit naturally into MaskGIT's bidirectional, iterative framework: the model can regenerate an entire image or selectively refine a user-specified region while conditioning on the surrounding context on all sides.
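As an illustration of how editing tasks reduce to the same decoding loop, the sketch below prepares the initial token sequence for inpainting: tokens inside the region to be filled start as `[MASK]`, and everything else is taken from the tokenized input image. The `tokenize` callable stands in for the VQ encoder and is an assumption for this sketch, not the paper's API.

```python
import torch

def prepare_inpainting_tokens(image, region_mask, tokenize, mask_id=1024):
    """Build the starting sequence for inpainting with a MaskGIT-style decoder.

    image:       the input picture to edit
    region_mask: (16, 16) boolean grid, True where new content should be generated
    tokenize:    hypothetical VQ encoder mapping the image to (1, 256) codebook indices
    """
    tokens = tokenize(image)                    # (1, 256) codebook indices
    region = region_mask.reshape(1, -1)         # align the spatial mask with the sequence
    return torch.where(region, torch.full_like(tokens, mask_id), tokens)

# The result would replace the fully masked starting grid in a decoding loop like
# the one sketched earlier. Because the transformer attends bidirectionally, the
# masked region is conditioned on context from all sides, something a raster-scan
# autoregressive decoder can only do for tokens that precede it in the scan order.
# Outpainting works analogously: pad the token grid with [MASK] tokens beyond the
# original image boundary and let the loop fill them in.
```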
Implications and Future Directions
MaskGIT's approach signals considerable potential for advances in image synthesis and manipulation. Its core idea, conditioning every prediction on the full bidirectional context, suggests broader applications in fields where generative capabilities are pivotal. From a practical perspective, industries that rely on realistic graphics, such as gaming, virtual reality, and digital art, stand to benefit directly from this technology.
Looking further ahead, the principles underpinning MaskGIT could inspire similar parallel decoding strategies in other generative tasks and media, possibly extending beyond vision to multi-modal generation. Further work on refining masking schedules and adapting the approach to tasks demanding even higher fidelity may unlock additional applications and improve model generality.
In conclusion, MaskGIT represents a significant step in generative model design, balancing the dual imperatives of efficiency and quality and setting the stage for future research that pushes beyond the conventional boundaries of image synthesis through smart, non-sequential decoding strategies.