- The paper replicates and open-sources a Lookup-Free Quantizer that nearly matches MAGVIT-v2's performance on ImageNet benchmarks.
- The paper introduces asymmetric token factorization and next sub-token prediction to enhance large-vocabulary handling in auto-regressive models.
- The paper demonstrates state-of-the-art visual reconstruction and improved image generation metrics compared to previous methods.
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation
Introduction
The paper presents Open-MAGVIT2, a family of open-source, auto-regressive image generation models with parameter counts ranging from 300 million to 1.5 billion. The project aims to replicate and open-source Google's MAGVIT-v2 tokenizer, which features an extensive codebook of 2^18 (262,144) codes and delivers state-of-the-art reconstruction performance on ImageNet 256×256. Additionally, the authors address scalability issues in plain auto-regressive models by introducing asymmetric token factorization and "next sub-token prediction" to enhance token interaction and generation quality.
Background
Auto-regressive Visual Generation Models
Auto-regressive models have demonstrated dominance in natural language generation, and recent research has extended their applicability to visual generation. These approaches typically use vector quantization for image tokenization and de-tokenization, followed by auto-regressive transformers that model the resulting discrete token sequences. Despite their promise, the generation quality of auto-regressive models often lags behind diffusion-based methods, largely due to limited tokenizer performance.
MAGVIT-v2 Tokenizer
MAGVIT-v2 introduces a highly efficient visual tokenizer that achieves superior generation quality via a Lookup-Free Quantizer with a remarkably large codebook. However, its closed-source nature limits broader academic and practical advancements.
Contributions
Replication of Visual Tokenizer
The primary contribution of this paper is the replication of the advanced Lookup-Free Quantizer proposed by MAGVIT-v2. The replicated tokenizer achieves a reconstruction performance very close to that of MAGVIT-v2 (1.18 vs. 1.15 rFID on ImageNet 128×128) and outperforms previous methods on the ImageNet benchmark. This effort democratizes access to a state-of-the-art visual tokenizer, facilitating further research and innovation in the field.
Integrating a Super-Large Codebook
The authors also explore integrating a super-large codebook within plain auto-regressive visual generation models. Instead of adhering to MAGVIT-v2's vision-oriented design (i.e., masked generative modeling), the paper leverages the large codebook in vanilla auto-regressive generation. This approach factorizes the vocabulary into two sub-vocabularies of different sizes and introduces "next sub-token prediction" to improve sub-token interaction and generation quality.
Methodology
Visual Tokenizer
The visual tokenizer follows the architecture proposed in MAGVIT-v2 and comprises:
- A CNN-based encoder projecting the input image into a feature map.
- A Lookup-Free Quantizer that maps each feature vector to the closest entry of an implicit super-large codebook, represented as the Cartesian product of per-dimension binary variables rather than an explicit embedding table (a minimal sketch follows this list).
- A decoder that reconstructs the image from quantized features, incorporating Adaptive GroupNorm to integrate quantized vectors with residual block outputs for enhanced reconstruction quality.
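To make the quantization step concrete, here is a minimal PyTorch sketch of lookup-free quantization as described above. It assumes binary (±1) latent dimensions and a straight-through gradient estimator; the class name `LookupFreeQuantizer` and the hyperparameters are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn


class LookupFreeQuantizer(nn.Module):
    """Illustrative sketch: each latent dimension is binarized independently,
    so the implicit codebook is the Cartesian product {-1, +1}^dim with
    2**dim entries and no embedding lookup table."""

    def __init__(self, dim: int = 18):  # 2**18 = 262,144 implicit codes
        super().__init__()
        self.dim = dim
        # Powers of two used to turn a sign pattern into an integer token id.
        self.register_buffer("basis", 2 ** torch.arange(dim))

    def forward(self, z: torch.Tensor):
        # z: (..., dim) continuous features from the CNN encoder.
        q = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
        # Straight-through estimator: quantize on the forward pass,
        # pass gradients through unchanged on the backward pass.
        q = z + (q - z).detach()
        bits = (q > 0).long()                      # map {-1, +1} -> {0, 1}
        indices = (bits * self.basis).sum(dim=-1)  # token ids in [0, 2**dim)
        return q, indices


quantizer = LookupFreeQuantizer(dim=18)
z = torch.randn(2, 16, 16, 18)  # a batch of 16x16 feature maps
q, idx = quantizer(z)
print(q.shape, idx.shape)  # [2, 16, 16, 18] and [2, 16, 16]
```

Because the codebook is the Cartesian product {-1, +1}^18, a token id is simply the integer encoding of the sign pattern, so no nearest-neighbor search or codebook lookup is needed at quantization time.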
Auto-regressive Transformer
The auto-regressive transformer models are designed to handle the large vocabulary efficiently:
- Asymmetric Token Factorization: The super-large vocabulary is factorized into sub-vocabularies of different sizes (hence "asymmetric"); each sub-token is embedded individually, and the embeddings are summed to form the transformer input (see the sketch after this list).
- Next Sub-Token Prediction: To model both intra-token and inter-token dependencies, the auto-regressive transformer sequentially predicts the conditional probability of each sub-token, conditioning later sub-tokens on earlier ones within the same visual token via context-enriched vectors.
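The following PyTorch sketch illustrates both ideas together under stated assumptions: a 2^18 vocabulary split into sub-vocabularies of sizes 2^6 and 2^12 (consistent with the asymmetric split described above, but chosen here for illustration), with greedy decoding. The class name `FactorizedTokenModule` and all details are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class FactorizedTokenModule(nn.Module):
    """Illustrative sketch: each 18-bit token is split into a 6-bit and a
    12-bit sub-token, the sub-tokens are embedded separately and summed
    for the transformer input, and prediction proceeds sub-token by
    sub-token ("next sub-token prediction")."""

    def __init__(self, d_model: int = 512, v1: int = 2**6, v2: int = 2**12):
        super().__init__()
        self.emb1 = nn.Embedding(v1, d_model)  # small sub-vocabulary
        self.emb2 = nn.Embedding(v2, d_model)  # large sub-vocabulary
        self.head1 = nn.Linear(d_model, v1)    # logits for sub-token 1
        self.head2 = nn.Linear(d_model, v2)    # logits for sub-token 2

    def embed(self, sub1: torch.Tensor, sub2: torch.Tensor) -> torch.Tensor:
        # Summed sub-token embeddings form one transformer input per token.
        return self.emb1(sub1) + self.emb2(sub2)

    def predict(self, h: torch.Tensor):
        # h: (batch, seq, d_model) context vectors from the transformer.
        sub1 = self.head1(h).argmax(dim=-1)  # greedy, for illustration only
        # Next sub-token prediction: condition the second sub-token on the
        # first by enriching the context vector with its embedding.
        sub2 = self.head2(h + self.emb1(sub1)).argmax(dim=-1)
        return sub1, sub2


module = FactorizedTokenModule()
h = torch.randn(2, 256, 512)    # contexts for a 16x16 token grid
sub1, sub2 = module.predict(h)
tokens = sub1 * 2**12 + sub2    # recombine into ids in [0, 2**18)
print(tokens.shape, int(tokens.max()) < 2**18)
```

The payoff of this factorization is that the model never materializes a 262,144-way embedding table or softmax; it works with a 64-way and a 4,096-way head instead, while the sequential sub-token prediction preserves the dependency between the two halves of each token.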
Results
Visual Reconstruction
Open-MAGVIT2 achieves state-of-the-art visual reconstruction performance on ImageNet 256×256, outperforming prior tokenizers such as VQGAN and LlamaGen, particularly in reconstructing fine details such as faces and text.
Visual Generation
In the auto-regressive visual generation tasks, Open-MAGVIT2 demonstrates superiority and scalability, yielding better FID and IS metrics than competing models. This performance underscores the potential of leveraging a super-large codebook in auto-regressive models.
Implications and Future Work
The replication and open-sourcing of the Lookup-Free Quantizer carry significant practical and theoretical implications: they democratize access to high-performance visual tokenization and may accelerate progress in scalable, high-quality auto-regressive visual generation. Future research may extend this framework to larger training datasets and models to fully exploit the representational capacity of a super-large codebook, with impact on broader applications such as text-conditional image and video generation.
Acknowledgments
The authors acknowledge contributions and discussions from their colleagues, and they frame their open-source effort as a means to foster more innovative and creative work in auto-regressive visual generation. Their work highlights the importance of accessible high-performance models in advancing research and development in visual generation technology.