
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation (2409.04410v3)

Published 6 Sep 2024 in cs.CV and cs.AI

Abstract: The Open-MAGVIT2 project produces an open-source replication of Google's MAGVIT-v2 tokenizer, a tokenizer with a super-large codebook (i.e., $2^{18}$ codes), and achieves state-of-the-art reconstruction performance on ImageNet and UCF benchmarks. We also provide a tokenizer pre-trained on large-scale data, significantly outperforming Cosmos on zero-shot benchmarks (1.93 vs. 0.78 rFID on ImageNet original resolution). Furthermore, we explore its application in plain auto-regressive models to validate scalability properties, producing a family of auto-regressive image generation models ranging from 300M to 1.5B parameters. To assist auto-regressive models in predicting with a super-large vocabulary, we factorize it into two sub-vocabularies of different sizes by asymmetric token factorization, and further introduce ``next sub-token prediction'' to enhance sub-token interaction for better generation quality. We release all models and code to foster innovation and creativity in the field of auto-regressive visual generation.

Summary

  • The paper replicates and open-sources a Lookup-Free Quantizer that nearly matches MAGVIT2’s performance on ImageNet benchmarks.
  • The paper introduces asymmetric token factorization and next sub-token prediction to enhance large-vocabulary handling in auto-regressive models.
  • The paper demonstrates state-of-the-art visual reconstruction and improved image generation metrics compared to previous methods.

Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

Introduction

The paper presents Open-MAGVIT2, a collection of open-source, auto-regressive image generation models with parameter sizes ranging from 300 million to 1.5 billion. The project aims to replicate and open-source Google's MAGVIT-v2 tokenizer, which features an extensive codebook of $2^{18}$ codes, delivering state-of-the-art performance on ImageNet $256 \times 256$. Additionally, the authors address the scalability issues in plain auto-regressive models by introducing asymmetric token factorization and "next sub-token prediction" to enhance token interaction and generation quality.

Background

Auto-regressive Visual Generation Models

Auto-regressive models have demonstrated dominance in natural language generation, and recent research has extended their applicability to visual generation. These approaches typically involve vector quantization for image tokenization and de-tokenization, followed by the application of auto-regressive transformers for discrete image token sequence modeling. Despite their promise, visual generation quality in auto-regressive models often lags behind diffusion-based methods, largely due to limited tokenizer performance.
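The sequence modeling described above follows the standard auto-regressive factorization: given discrete image tokens $x_1, \dots, x_T$ produced by the tokenizer, the joint distribution decomposes as

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p\left(x_t \mid x_{<t}\right)
```

where each conditional is predicted by a causal transformer from the preceding tokens, exactly as in language modeling.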

MAGVIT-v2 Tokenizer

MAGVIT-v2 introduces a highly efficient visual tokenizer that achieves superior generation quality via a Lookup-Free Quantizer with a remarkably large codebook. However, its closed-source nature limits broader academic and practical advancements.

Contributions

Replication of Visual Tokenizer

The primary contribution of this paper is the replication of the advanced Lookup-Free Quantizer proposed by MAGVIT-v2. The replicated tokenizer achieves a reconstruction performance very close to that of MAGVIT-v2 (1.18 vs. 1.15 rFID on ImageNet 128 × 128) and outperforms previous methods on the ImageNet benchmark. This effort democratizes access to a state-of-the-art visual tokenizer, facilitating further research and innovation in the field.

Integrating a Super-Large Codebook

The authors also explore integrating a super-large codebook within plain auto-regressive visual generation models. Instead of adhering to MAGVIT-v2's vision-oriented design (e.g., masked generative methods), the paper leverages the large codebook in vanilla auto-regressive generation. This approach includes factorizing the vocabulary into two sub-vocabularies of different sizes and introducing "next sub-token prediction" to improve sub-token interaction and generation quality.

Methodology

Visual Tokenizer

The visual tokenizer follows the architecture proposed in MAGVIT-v2 and comprises:

  1. A CNN-based encoder projecting the input image into a feature map.
  2. A Lookup-Free Quantizer that maps each feature vector to the closest entry in a super-large codebook, represented as a Cartesian product of single-dimensional variables.
  3. A decoder that reconstructs the image from quantized features, incorporating Adaptive GroupNorm to integrate quantized vectors with residual block outputs for enhanced reconstruction quality.
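The core idea of the Lookup-Free Quantizer in step 2 can be sketched in a few lines: because the codebook is a Cartesian product of single-dimensional variables, quantization reduces to independently binarizing each latent dimension to $\pm 1$, and the token index is simply the resulting bit pattern. This is a minimal illustration (the function name and the convention that $z = 0$ maps to $+1$ are assumptions), not the project's actual implementation; with an 18-dimensional latent, the implicit codebook has $2^{18}$ entries.

```python
import numpy as np

def lfq_quantize(z):
    """Lookup-Free Quantization of a single latent vector.

    Each dimension is independently binarized to +/-1, so the implicit
    codebook is the Cartesian product {-1, +1}^d and no nearest-neighbor
    search over an explicit codebook table is needed.
    """
    q = np.where(z >= 0, 1.0, -1.0)
    # Interpret the sign pattern as a binary integer to get the token index.
    index = 0
    for b in (q > 0).astype(np.int64):
        index = (index << 1) | int(b)
    return q, index
```

For example, a 3-dimensional latent `[0.3, -0.7, 0.1]` quantizes to `[1, -1, 1]`, whose sign pattern `101` gives token index 5.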

Auto-Regressive Transformer

The auto-regressive transformer models are designed to handle the large vocabulary efficiently:

  1. Asymmetric Token Factorization: This assists models in predicting with a super-large vocabulary by factorizing it into multiple sub-vocabularies, each embedded individually, and summing them as transformer inputs.
  2. Next Sub-Token Prediction: To model both intra- and inter-token dependencies, the auto-regressive transformer sequentially predicts each sub-token's conditional probability using context-enriched vectors.
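The factorization in step 1 amounts to splitting each full-vocabulary token index into a pair of smaller sub-token indices, which keeps the embedding and output layers tractable. A minimal sketch follows; the specific 6-bit/12-bit split (sub-vocabularies of $2^{6}$ and $2^{12}$) is an illustrative assumption chosen only to show the asymmetry, not the paper's exact configuration.

```python
# Asymmetric token factorization: split an index from a large vocabulary
# into two sub-tokens over sub-vocabularies of different sizes.
# The 2^6 / 2^12 split below is an illustrative assumption.
V2 = 1 << 12  # size of the second (larger) sub-vocabulary

def factorize(index):
    """Map a full-vocabulary token to a pair of sub-tokens."""
    return index // V2, index % V2

def defactorize(t1, t2):
    """Recover the full-vocabulary token from its two sub-tokens."""
    return t1 * V2 + t2
```

Each sub-token is then embedded separately, the embeddings are summed as the transformer input, and next sub-token prediction models the sub-tokens of a position one after another so that the second sub-token is conditioned on the first.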

Results

Visual Reconstruction

Open-MAGVIT2 achieves state-of-the-art performance in visual reconstruction on ImageNet $256 \times 256$, outperforming other tokenizers such as VQGAN and LlamaGen, particularly in detailed image regeneration and precise facial and text reconstruction.

Visual Generation

In the auto-regressive visual generation tasks, Open-MAGVIT2 demonstrates superiority and scalability, yielding better FID and IS metrics than competing models. This performance underscores the potential of leveraging a super-large codebook in auto-regressive models.

Implications and Future Work

The replication and open-sourcing of the Lookup-Free Quantizer carry significant practical and theoretical implications: they democratize access to high-performance visual tokenization, potentially accelerating progress on scalable, high-quality auto-regressive visual generation models. Future research may extend this framework to larger training datasets and bigger models to fully exploit the representational capacity of a super-large codebook, with impact on broader applications such as text-conditional image and video generation.

Acknowledgments

The authors acknowledge contributions and discussions from their colleagues, and they emphasize their open-source effort as a means to foster more innovative and creative work in the field of auto-regressive visual generation. Their work highlights the importance of accessible high-performance models in advancing research and development in visual generation technology.
