MaskBit: Embedding-free Image Generation via Bit Tokens
The paper "MaskBit: Embedding-free Image Generation via Bit Tokens" by Weber et al. proposes an approach to class-conditional image generation that combines a modernized vector-quantized generative adversarial network (VQGAN) with a novel bit-token representation. The paper makes two main contributions: a modernization of the VQGAN architecture, and an embedding-free generation method, MaskBit, that operates directly on bit tokens. This research provides insights into architectural and training improvements for VQGANs and explores the benefits of using bit tokens for semantic image generation.
Key Contributions
- Modernized VQGAN Architecture:
- A systematic exploration and enhancement of the VQGAN framework, named VQGAN+, which introduces several improvements to the model and training pipeline.
- Empirical analysis demonstrates a substantial improvement in the reconstruction quality, with the reconstruction FID (rFID) significantly reduced from 7.94 to 1.66 on the ImageNet dataset.
- Embedding-free Image Generation using Bit Tokens:
- Introduction of a binary quantization process, termed Lookup-Free Quantization (LFQ), in which latent embeddings are projected into K dimensions and each dimension is quantized based on its sign, yielding K-bit tokens without a learned codebook.
- The novel MaskBit model, which leverages bit tokens, achieves state-of-the-art performance on the ImageNet benchmark, with a generation FID (gFID) of 1.52 using a compact generator model of 305 million parameters.
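The sign-based quantization step can be sketched in a few lines. This is an illustrative simplification, not the authors' implementation: the function names, the K = 4 example latent, and the bit-to-integer mapping are assumptions for exposition.

```python
def binary_quantize(latent):
    """Quantize a K-dimensional latent vector to K bits via the sign of
    each channel: positive -> 1, non-positive -> 0. No codebook lookup
    or embedding table is involved."""
    return [1 if x > 0 else 0 for x in latent]

def bits_to_token_id(bits):
    """Interpret the K bits as an integer token id in [0, 2^K)."""
    token = 0
    for b in bits:
        token = (token << 1) | b
    return token

latent = [0.7, -1.2, 0.1, -0.3]   # K = 4 illustrative latent channels
bits = binary_quantize(latent)    # [1, 0, 1, 0]
token = bits_to_token_id(bits)    # 10
```

Because the token id is fully determined by the signs of the latent, the tokenizer needs no embedding table: the bits themselves are the representation.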
Detailed Insights into Modernized VQGAN (VQGAN+)
The improvements to the VQGAN architecture encompass changes in model design, loss functions, and training procedures. The authors meticulously analyze each component, providing a transparent and reproducible framework, which includes:
- Removal of self-attention layers in favor of a purely convolutional design, reducing model complexity.
- Symmetric architectures for generators and discriminators, updated learning rate schedules, and increased model capacity in terms of base channels.
- Incorporation of group normalization, replacement of average pooling with Gaussian blur kernels for anti-aliased downsampling, and use of the LeCAM loss to stabilize adversarial training.
- Introduction of an Exponential Moving Average (EMA) of model weights to stabilize training, and an entropy loss that encourages confident, diverse code usage to facilitate generation.
These modifications result in a high-performing VQGAN model with publicly reproducible, high-fidelity reconstructions. The modernization efforts culminate in the VQGAN+ model, which establishes a robust, reproducible baseline for future research in image generation using VQGAN.
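One of the listed changes, replacing average pooling with a Gaussian blur kernel before subsampling, amounts to anti-aliased downsampling. A minimal 1-D sketch under assumed choices (a small binomial kernel standing in for the Gaussian blur, edge padding, stride 2) illustrates the idea; it is not the paper's implementation:

```python
def blur_downsample(signal, kernel=(0.25, 0.5, 0.25), stride=2):
    """Convolve a 1-D signal with a low-pass (blur) kernel, then
    subsample by `stride`. Blurring first suppresses high frequencies
    that plain strided pooling would alias."""
    k = len(kernel)
    pad = k // 2
    # Edge-pad so the output before striding has the input's length.
    padded = [signal[0]] * pad + list(signal) + [signal[-1]] * pad
    blurred = [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(signal))
    ]
    return blurred[::stride]

# A single spike is spread out by the blur before subsampling,
# instead of being kept or dropped wholesale.
print(blur_downsample([0, 0, 4, 0, 0, 0]))  # [0.0, 2.0, 0.0]
```

The same principle applies per-channel in 2-D inside the discriminator's downsampling path.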
Embedding-free Image Generation: MaskBit
MaskBit capitalizes on the semantic richness of bit tokens, which encapsulate high-level structured information. This approach eliminates the need for traditional embedding tables in both the tokenizer and the transformer generator stages. Key facets of MaskBit’s design include:
- Semantic Structuring of Bit Tokens: The bit flipping experiments indicate that close proximity in Hamming space translates to semantically similar images, underscoring the inherent structured representation of bit tokens.
- Masked Bits Modeling: MaskBit adopts a unique method of representing masked tokens. By partitioning bit tokens into groups and masking individual bits within these groups, the model can effectively utilize remaining unmasked bits to infer missing information.
- Empirical Performance: The model is evaluated on ImageNet, with results indicating that MaskBit outperforms existing methods, demonstrating significant gains in gFID while balancing computational efficiency with high-fidelity image generation.
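The grouped masking scheme described above can be sketched as follows. The group layout, the mask placeholder, and the helper names are illustrative assumptions for exposition, not the paper's implementation:

```python
MASK = None  # placeholder standing in for a masked group

def split_into_groups(bits, num_groups):
    """Partition a K-bit token into num_groups equal consecutive groups."""
    size = len(bits) // num_groups
    return [bits[i * size:(i + 1) * size] for i in range(num_groups)]

def mask_groups(groups, mask_flags):
    """Replace each group whose flag is True with the MASK placeholder,
    leaving the other groups' bits visible to the generator."""
    return [MASK if m else g for g, m in zip(groups, mask_flags)]

bits = [1, 0, 1, 1, 0, 0, 1, 0]              # a K = 8 bit token
groups = split_into_groups(bits, 2)          # [[1,0,1,1], [0,0,1,0]]
masked = mask_groups(groups, [True, False])  # first group masked
```

Because masking operates on groups rather than whole tokens, a partially masked token still exposes some of its bits, which the generator can exploit when inferring the missing ones.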
Implications and Future Directions
The successful implementation of MaskBit highlights several implications:
- Efficiency and Scalability: The embedding-free approach suggests pathways for more efficient and scalable image generation models, with fewer parameters dedicated to token embeddings.
- Structured Semantic Representations: Bit tokens' ability to maintain high-level semantic coherence offers potential for applications beyond image generation, potentially influencing advancements in generative models' interpretability and controllability.
- Broader Applications: While this work is focused on class-conditional image generation, the methodology can be extended to more complex tasks such as text-to-image synthesis and generative tasks involving multimodal data.
Future developments may explore:
- Generalization to Larger Datasets: Testing MaskBit on larger datasets with diverse, non-centered images to evaluate its general robustness and applicability.
- Integration with Other Generative Frameworks: Adapting MaskBit's embedding-free approach into other generative paradigms, such as diffusion models, to harness its structured semantic advantages.
- Optimization of Grouping Strategies: Further refining the grouping of bit tokens to optimize performance and reduce computational overhead during training and inference.
Overall, the paper "MaskBit: Embedding-free Image Generation via Bit Tokens" provides a thorough and compelling exploration of embedding-free image generation, presenting notable advancements in model architecture and introducing an effective new approach to image generation that promises significant future impact in artificial intelligence research.