
A Pytorch Reproduction of Masked Generative Image Transformer (2310.14400v1)

Published 22 Oct 2023 in cs.CV

Abstract: In this technical report, we present a reproduction of MaskGIT: Masked Generative Image Transformer, using PyTorch. The approach involves leveraging a masked bidirectional transformer architecture, enabling image generation with only few steps (8~16 steps) for 512 x 512 resolution images, i.e., ~64x faster than an auto-regressive approach. Through rigorous experimentation and optimization, we achieved results that closely align with the findings presented in the original paper. We match the reported FID of 7.32 with our replication and obtain 7.59 with similar hyperparameters on ImageNet at resolution 512 x 512. Moreover, we improve over the official implementation with some minor hyperparameter tweaking, achieving FID of 7.26. At the lower resolution of 256 x 256 pixels, our reimplementation scores 6.80, in comparison to the original paper's 6.18. To promote further research on Masked Generative Models and facilitate their reproducibility, we released our code and pre-trained weights openly at https://github.com/valeoai/MaskGIT-pytorch/

An Assessment of the PyTorch Reproduction of the Masked Generative Image Transformer

The paper under review presents a focused reproduction of the Masked Generative Image Transformer (MaskGIT) in the PyTorch framework. Originally introduced in Chang et al.'s work, MaskGIT has emerged as a significant approach for image generation, requiring far fewer sampling steps than traditional auto-regressive models. By achieving a roughly 64-fold speedup, this reproduction marks a distinctive contribution to the domain of generative models, offering useful insights for the continued development of Masked Generative Models (MGMs).

Core Contributions and Methodology

Adopting a VQGAN architecture, the reproduction exploits vector quantization to reduce an image to a finite set of visual tokens. A bidirectional transformer (modeled on BERT) then unmasks tokens in parallel, in contrast to the one-token-at-a-time decoding of auto-regressive models. The PyTorch implementation matches, and in some configurations surpasses, the results reported by Chang et al. For instance, the replication achieves an FID score of 7.59 on ImageNet at 512x512 resolution, with a further improved score of 7.26 after hyperparameter adjustments, reflecting both the rigor and the potential of the implemented optimizations.
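The parallel unmasking loop described above can be sketched as follows. This is a minimal, framework-agnostic NumPy illustration, not the repository's actual API: `logits_fn` is a hypothetical stand-in for the bidirectional transformer, decoding is greedy for simplicity, and `arccos_schedule` assumes the arccos masking schedule the paper reports as best-performing.

```python
import numpy as np

def arccos_schedule(step, total_steps):
    """Fraction of tokens still masked after `step` of `total_steps` (arccos schedule)."""
    ratio = (step + 1) / total_steps
    return np.arccos(ratio) / (np.pi / 2)

def iterative_decode(logits_fn, num_tokens, total_steps=12):
    """Sketch of MaskGIT-style parallel iterative decoding.

    `logits_fn(tokens, mask)` stands in for the bidirectional transformer:
    it returns per-position logits over the codebook (hypothetical signature).
    """
    tokens = np.zeros(num_tokens, dtype=np.int64)
    mask = np.ones(num_tokens, dtype=bool)           # all tokens start masked
    for step in range(total_steps):
        logits = logits_fn(tokens, mask)             # shape: (num_tokens, vocab)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        sampled = probs.argmax(-1)                   # greedy pick for the sketch
        conf = probs[np.arange(num_tokens), sampled]
        conf[~mask] = np.inf                         # already-fixed tokens stay fixed
        tokens[mask] = sampled[mask]                 # commit all current guesses
        # re-mask the lowest-confidence tokens, per the schedule
        n_keep_masked = int(np.floor(arccos_schedule(step, total_steps) * num_tokens))
        new_mask = np.zeros(num_tokens, dtype=bool)
        if n_keep_masked > 0:
            new_mask[np.argsort(conf)[:n_keep_masked]] = True
        mask = new_mask
    return tokens
```

Because the schedule reaches zero masked tokens at the final step, every position is committed after `total_steps` forward passes, rather than one pass per token as in auto-regressive decoding.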

Experimental Setup and Results

The reproduction of MaskGIT in PyTorch utilizes a comprehensive experimental setup, including data augmentation and a meticulous exploration of hyperparameter spaces. The authors provide an in-depth evaluation of model performance across different sampler configurations, focusing specifically on the Gumbel temperature, softmax temperature, CFG (classifier-free guidance), scheduler, and number of inference steps. The experiments highlight key findings, such as optimal results with an arccos scheduler and 15 decoding steps, particularly at a 512x512 resolution.
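Two of the sampler knobs mentioned above are simple enough to state directly: the arccos scheduler maps decoding progress to a masking ratio, and classifier-free guidance linearly extrapolates conditional logits away from unconditional ones. The sketch below is illustrative NumPy, with hypothetical function names; the guidance formula shown is the standard CFG formulation, assumed here to match the paper's usage.

```python
import numpy as np

def arccos_mask_ratio(t):
    """Fraction of tokens left masked at decoding progress t in [0, 1]."""
    return np.arccos(t) / (np.pi / 2)

def cfg_logits(cond_logits, uncond_logits, scale):
    """Classifier-free guidance: amplify the conditional signal by `scale`."""
    return uncond_logits + scale * (cond_logits - uncond_logits)

# Mask ratio across 15 decoding steps (the best-reported setting at 512x512)
steps = 15
ratios = [arccos_mask_ratio((s + 1) / steps) for s in range(steps)]
```

At `scale = 1` guidance is a no-op; larger scales trade diversity for fidelity, which is why the CFG scale appears alongside temperature in the paper's sweep.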

Moreover, the results underscore a noteworthy trade-off between sample fidelity and diversity. At 256x256 resolution, the approach produces favorable outcomes with an FID of 6.80, closely approximating the original paper's score while favoring sample quality. At 512x512, the number of inference steps and the noise injected via the Gumbel trick lead to marked improvements in diversity metrics. This highlights the robustness of the reproduction methodology: configurations are strategically tuned to achieve high fidelity while maintaining computational efficiency.
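The Gumbel trick referenced above can be illustrated in a few lines: per-token log-probabilities are perturbed with Gumbel noise scaled by a temperature before confidences are ranked. This is a hedged NumPy sketch of the standard Gumbel perturbation, with a hypothetical function name; it is not the repository's exact implementation.

```python
import numpy as np

def gumbel_confidence(log_probs, temperature, rng=None):
    """Perturb per-token log-probabilities with scaled Gumbel noise.

    At temperature 0 the most likely tokens are always trusted first;
    raising the temperature randomizes which tokens are kept early,
    trading a little fidelity for diversity.
    """
    rng = np.random.default_rng(rng)
    u = rng.uniform(1e-12, 1.0, size=np.shape(log_probs))
    gumbel = -np.log(-np.log(u))          # standard Gumbel(0, 1) samples
    return np.asarray(log_probs) + temperature * gumbel
```

Ranking tokens by these perturbed confidences (instead of the raw probabilities) is what lets the sampler escape the deterministic keep-the-most-confident ordering.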

Implications and Forward-Looking Perspectives

The significance of this work extends into multiple dimensions. Practically, the open availability of the PyTorch implementation, alongside the release of the training code and pre-trained models, acts as an invaluable resource for researchers seeking to explore MGMs. It broadens accessibility for a community that predominantly interfaces with PyTorch, addressing constraints previously faced due to prevalent JAX implementations, and encourages rigorous experimentation and future improvements on model variations and different generative modalities.

Theoretically, the reproduction sets a precedent for examining the potential of masked generative modeling further. It highlights the proximity of these models to successful deep learning paradigms such as masked autoencoders and discrete diffusion models, inviting further cross-pollination of methodologies. Future developments could extend these models toward novel generative capabilities such as video, text-conditioned imagery, or multi-modal generation.

Conclusion

This paper exemplifies the detailed process of methodically reproducing, and even enhancing, an existing model's efficacy within the machine learning domain. By adhering closely to the foundational features of MaskGIT while exploiting targeted optimizations, it underscores the significant impact MGMs can have on the landscape of generative modeling. As accessibility and replicability remain central tenets of the field, such diligent reproduction efforts will no doubt catalyze further innovation in understanding and optimizing masked generative models.

Authors (2)
  1. Victor Besnier (9 papers)
  2. Mickael Chen (31 papers)
Citations (8)