- The paper replicates and open-sources a Lookup-Free Quantizer that nearly matches MAGVIT-v2's performance on ImageNet benchmarks.
- The paper introduces asymmetric token factorization and next sub-token prediction to enhance large-vocabulary handling in auto-regressive models.
- The paper demonstrates state-of-the-art visual reconstruction and improved image generation metrics compared to previous methods.
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation
Introduction
The paper presents Open-MAGVIT2, a family of open-source, auto-regressive image generation models with parameter counts ranging from 300 million to 1.5 billion. The project aims to replicate and open-source Google's MAGVIT-v2 tokenizer, which features an extensive codebook of 2^18 (262,144) codes and delivers state-of-the-art reconstruction performance on ImageNet 256×256. Additionally, the authors address scalability issues in plain auto-regressive models by introducing asymmetric token factorization and "next sub-token prediction" to enhance token interaction and generation quality.
Background
Auto-regressive Visual Generation Models
Auto-regressive models have demonstrated dominance in natural language generation, and recent research has extended their applicability to visual generation. These approaches typically use vector quantization for image tokenization and de-tokenization, followed by auto-regressive transformers that model the resulting discrete token sequences. Despite their promise, the generation quality of auto-regressive models often lags behind diffusion-based methods, largely due to limited tokenizer performance.
MAGVIT-v2 Tokenizer
MAGVIT-v2 introduces a highly efficient visual tokenizer that achieves superior generation quality via a Lookup-Free Quantizer with a remarkably large codebook. However, its closed-source nature limits broader academic and practical advancements.
Contributions
Replication of Visual Tokenizer
The primary contribution of this paper is the replication of the advanced Lookup-Free Quantizer proposed by MAGVIT-v2. The replicated tokenizer achieves a reconstruction performance very close to that of MAGVIT-v2 (1.18 vs. 1.15 rFID on ImageNet 128×128) and outperforms previous methods on the ImageNet benchmark. This effort democratizes access to a state-of-the-art visual tokenizer, facilitating further research and innovation in the field.
Integrating a Super-Large Codebook
The authors also explore integrating a super-large codebook within plain auto-regressive visual generation models. Instead of adhering to MAGVIT-v2's vision-oriented design (i.e., masked generative modeling), the paper leverages the large codebook in vanilla auto-regressive generation. This approach factorizes the vocabulary into two sub-vocabularies of different sizes and introduces "next sub-token prediction" to improve sub-token interaction and generation quality.
Methodology
Visual Tokenizer
The visual tokenizer follows the architecture proposed in MAGVIT-v2 and comprises:
- A CNN-based encoder projecting the input image into a feature map.
- A Lookup-Free Quantizer that maps each feature vector to the closest entry of an implicit super-large codebook, represented as the Cartesian product of per-dimension binary variables rather than an explicit embedding table (a minimal sketch follows this list).
- A decoder that reconstructs the image from quantized features, incorporating Adaptive GroupNorm to integrate quantized vectors with residual block outputs for enhanced reconstruction quality.
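To make the quantization step concrete, here is a minimal PyTorch sketch of lookup-free quantization as described above. It assumes binary (±1) latent dimensions and a straight-through gradient estimator; the class name `LookupFreeQuantizer` and the hyperparameters are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn


class LookupFreeQuantizer(nn.Module):
    """Illustrative sketch: each latent dimension is binarized independently,
    so the implicit codebook is the Cartesian product {-1, +1}^dim with
    2**dim entries and no embedding lookup table."""

    def __init__(self, dim: int = 18):  # 2**18 = 262,144 implicit codes
        super().__init__()
        self.dim = dim
        # Powers of two used to turn a sign pattern into an integer token id.
        self.register_buffer("basis", 2 ** torch.arange(dim))

    def forward(self, z: torch.Tensor):
        # z: (..., dim) continuous features from the CNN encoder.
        q = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
        # Straight-through estimator: quantize on the forward pass,
        # pass gradients through unchanged on the backward pass.
        q = z + (q - z).detach()
        bits = (q > 0).long()                      # map {-1, +1} -> {0, 1}
        indices = (bits * self.basis).sum(dim=-1)  # token ids in [0, 2**dim)
        return q, indices


quantizer = LookupFreeQuantizer(dim=18)
z = torch.randn(2, 16, 16, 18)  # a batch of 16x16 feature maps
q, idx = quantizer(z)
print(q.shape, idx.shape)  # [2, 16, 16, 18] and [2, 16, 16]
```

Because the codebook is the Cartesian product {-1, +1}^18, a token id is simply the integer encoding of the sign pattern, so no nearest-neighbor search or codebook lookup is needed at quantization time.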
Auto-regressive Transformer
The auto-regressive transformer models are designed to handle the large vocabulary efficiently:
- Asymmetric Token Factorization: The super-large vocabulary is factorized into sub-vocabularies of different sizes (hence "asymmetric"); each sub-token is embedded individually, and the embeddings are summed to form the transformer input (see the sketch after this list).
- Next Sub-Token Prediction: To model both intra-token and inter-token dependencies, the auto-regressive transformer sequentially predicts the conditional probability of each sub-token, conditioning later sub-tokens on earlier ones within the same visual token via context-enriched vectors.
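The following PyTorch sketch illustrates both ideas together under stated assumptions: a 2^18 vocabulary split into sub-vocabularies of sizes 2^6 and 2^12 (consistent with the asymmetric split described above, but chosen here for illustration), with greedy decoding. The class name `FactorizedTokenModule` and all details are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class FactorizedTokenModule(nn.Module):
    """Illustrative sketch: each 18-bit token is split into a 6-bit and a
    12-bit sub-token, the sub-tokens are embedded separately and summed
    for the transformer input, and prediction proceeds sub-token by
    sub-token ("next sub-token prediction")."""

    def __init__(self, d_model: int = 512, v1: int = 2**6, v2: int = 2**12):
        super().__init__()
        self.emb1 = nn.Embedding(v1, d_model)  # small sub-vocabulary
        self.emb2 = nn.Embedding(v2, d_model)  # large sub-vocabulary
        self.head1 = nn.Linear(d_model, v1)    # logits for sub-token 1
        self.head2 = nn.Linear(d_model, v2)    # logits for sub-token 2

    def embed(self, sub1: torch.Tensor, sub2: torch.Tensor) -> torch.Tensor:
        # Summed sub-token embeddings form one transformer input per token.
        return self.emb1(sub1) + self.emb2(sub2)

    def predict(self, h: torch.Tensor):
        # h: (batch, seq, d_model) context vectors from the transformer.
        sub1 = self.head1(h).argmax(dim=-1)  # greedy, for illustration only
        # Next sub-token prediction: condition the second sub-token on the
        # first by enriching the context vector with its embedding.
        sub2 = self.head2(h + self.emb1(sub1)).argmax(dim=-1)
        return sub1, sub2


module = FactorizedTokenModule()
h = torch.randn(2, 256, 512)    # contexts for a 16x16 token grid
sub1, sub2 = module.predict(h)
tokens = sub1 * 2**12 + sub2    # recombine into ids in [0, 2**18)
print(tokens.shape, int(tokens.max()) < 2**18)
```

The payoff of this factorization is that the model never materializes a 262,144-way embedding table or softmax; it works with a 64-way and a 4,096-way head instead, while the sequential sub-token prediction preserves the dependency between the two halves of each token.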
Results
Visual Reconstruction
Open-MAGVIT2 achieves state-of-the-art visual reconstruction performance on ImageNet 256×256, outperforming prior tokenizers such as VQGAN and LlamaGen, particularly in reconstructing fine details such as faces and text.
Visual Generation
In the auto-regressive visual generation tasks, Open-MAGVIT2 demonstrates superiority and scalability, yielding better FID and IS metrics than competing models. This performance underscores the potential of leveraging a super-large codebook in auto-regressive models.
Implications and Future Work
The replication and open-sourcing of the Lookup-Free Quantizer carry significant practical and theoretical implications: they democratize access to high-performance visual tokenization and may accelerate progress in scalable, high-quality auto-regressive visual generation. Future research may extend this framework to larger training datasets and models to fully exploit the representational capacity of a super-large codebook, with impact on broader applications such as text-conditional image and video generation.
Acknowledgments
The authors acknowledge contributions and discussions from their colleagues, and they frame their open-source effort as a means to foster more innovative and creative work in auto-regressive visual generation. Their work highlights the importance of accessible high-performance models in advancing research and development in visual generation technology.