
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation (2409.04410v3)

Published 6 Sep 2024 in cs.CV and cs.AI

Abstract: The Open-MAGVIT2 project produces an open-source replication of Google's MAGVIT-v2 tokenizer, a tokenizer with a super-large codebook (i.e., $2^{18}$ codes), and achieves state-of-the-art reconstruction performance on ImageNet and UCF benchmarks. We also provide a tokenizer pre-trained on large-scale data, significantly outperforming Cosmos on zero-shot benchmarks (1.93 vs. 0.78 rFID on ImageNet original resolution). Furthermore, we explore its application in plain auto-regressive models to validate scalability properties, producing a family of auto-regressive image generation models ranging from 300M to 1.5B parameters. To assist auto-regressive models in predicting with a super-large vocabulary, we factorize it into two sub-vocabularies of different sizes by asymmetric token factorization, and further introduce ``next sub-token prediction'' to enhance sub-token interaction for better generation quality. We release all models and code to foster innovation and creativity in the field of auto-regressive visual generation.

Summary

  • The paper replicates and open-sources a Lookup-Free Quantizer that nearly matches MAGVIT2’s performance on ImageNet benchmarks.
  • The paper introduces asymmetric token factorization and next sub-token prediction to enhance large-vocabulary handling in auto-regressive models.
  • The paper demonstrates state-of-the-art visual reconstruction and improved image generation metrics compared to previous methods.

Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

Introduction

The paper presents Open-MAGVIT2, a collection of open-source, auto-regressive image generation models with parameter sizes ranging from 300 million to 1.5 billion. The project aims to replicate and open-source Google's MAGVIT-v2 tokenizer, which features an extensive codebook of $2^{18}$ codes, delivering state-of-the-art performance on ImageNet $256 \times 256$. Additionally, the authors address the scalability issues in plain auto-regressive models by introducing asymmetric token factorization and "next sub-token prediction" to enhance token interaction and generation quality.

Background

Auto-regressive Visual Generation Models

Auto-regressive models have demonstrated dominance in natural language generation, and recent research has extended their applicability to visual generation. These approaches typically involve vector quantization for image tokenization and de-tokenization, followed by the application of auto-regressive transformers for discrete image token sequence modeling. Despite their promise, visual generation quality in auto-regressive models often lags behind diffusion-based methods, largely due to limited tokenizer performance.
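The sequence modeling described above follows the standard auto-regressive factorization: given discrete image tokens $x_1, \dots, x_T$ produced by the tokenizer, the joint distribution decomposes as

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p\left(x_t \mid x_{<t}\right)
```

where each conditional is predicted by a causal transformer from the preceding tokens, exactly as in language modeling.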

MAGVIT-v2 Tokenizer

MAGVIT-v2 introduces a highly efficient visual tokenizer that achieves superior generation quality via a Lookup-Free Quantizer with a remarkably large codebook. However, its closed-source nature limits broader academic and practical advancements.

Contributions

Replication of Visual Tokenizer

The primary contribution of this paper is the replication of the advanced Lookup-Free Quantizer proposed by MAGVIT-v2. The replicated tokenizer achieves a reconstruction performance very close to that of MAGVIT-v2 (1.18 vs. 1.15 rFID on ImageNet 128 × 128) and outperforms previous methods on the ImageNet benchmark. This effort democratizes access to a state-of-the-art visual tokenizer, facilitating further research and innovation in the field.

Integrating a Super-Large Codebook

The authors also explore integrating a super-large codebook within plain auto-regressive visual generation models. Instead of adhering to MAGVIT-v2's vision-oriented design (e.g., masked generative methods), the paper leverages the large codebook in vanilla auto-regressive generation. This approach includes factorizing the vocabulary into two sub-vocabularies of different sizes and introducing "next sub-token prediction" to improve sub-token interaction and generation quality.

Methodology

Visual Tokenizer

The visual tokenizer follows the architecture proposed in MAGVIT-v2 and comprises:

  1. A CNN-based encoder projecting the input image into a feature map.
  2. A Lookup-Free Quantizer that maps each feature vector to the closest entry in a super-large codebook, represented as a Cartesian product of single-dimensional variables.
  3. A decoder that reconstructs the image from quantized features, incorporating Adaptive GroupNorm to integrate quantized vectors with residual block outputs for enhanced reconstruction quality.
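The core idea of the Lookup-Free Quantizer in step 2 can be sketched in a few lines: because the codebook is a Cartesian product of single-dimensional variables, quantization reduces to independently binarizing each latent dimension to $\pm 1$, and the token index is simply the resulting bit pattern. This is a minimal illustration (the function name and the convention that $z = 0$ maps to $+1$ are assumptions), not the project's actual implementation; with an 18-dimensional latent, the implicit codebook has $2^{18}$ entries.

```python
import numpy as np

def lfq_quantize(z):
    """Lookup-Free Quantization of a single latent vector.

    Each dimension is independently binarized to +/-1, so the implicit
    codebook is the Cartesian product {-1, +1}^d and no nearest-neighbor
    search over an explicit codebook table is needed.
    """
    q = np.where(z >= 0, 1.0, -1.0)
    # Interpret the sign pattern as a binary integer to get the token index.
    index = 0
    for b in (q > 0).astype(np.int64):
        index = (index << 1) | int(b)
    return q, index
```

For example, a 3-dimensional latent `[0.3, -0.7, 0.1]` quantizes to `[1, -1, 1]`, whose sign pattern `101` gives token index 5.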

Auto-Regressive Transformer

The auto-regressive transformer models are designed to handle the large vocabulary efficiently:

  1. Asymmetric Token Factorization: This assists models in predicting with a super-large vocabulary by factorizing it into multiple sub-vocabularies, each embedded individually, and summing them as transformer inputs.
  2. Next Sub-Token Prediction: To model both intra- and inter-token dependencies, the auto-regressive transformer sequentially predicts each sub-token's conditional probability using context-enriched vectors.
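The factorization in step 1 amounts to splitting each full-vocabulary token index into a pair of smaller sub-token indices, which keeps the embedding and output layers tractable. A minimal sketch follows; the specific 6-bit/12-bit split (sub-vocabularies of $2^{6}$ and $2^{12}$) is an illustrative assumption chosen only to show the asymmetry, not the paper's exact configuration.

```python
# Asymmetric token factorization: split an index from a large vocabulary
# into two sub-tokens over sub-vocabularies of different sizes.
# The 2^6 / 2^12 split below is an illustrative assumption.
V2 = 1 << 12  # size of the second (larger) sub-vocabulary

def factorize(index):
    """Map a full-vocabulary token to a pair of sub-tokens."""
    return index // V2, index % V2

def defactorize(t1, t2):
    """Recover the full-vocabulary token from its two sub-tokens."""
    return t1 * V2 + t2
```

Each sub-token is then embedded separately, the embeddings are summed as the transformer input, and next sub-token prediction models the sub-tokens of a position one after another so that the second sub-token is conditioned on the first.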

Results

Visual Reconstruction

Open-MAGVIT2 achieves state-of-the-art performance in visual reconstruction on ImageNet $256 \times 256$, outperforming other tokenizers such as VQGAN and LlamaGen, particularly in detailed image regeneration and precise facial and text reconstruction.

Visual Generation

In the auto-regressive visual generation tasks, Open-MAGVIT2 demonstrates superiority and scalability, yielding better FID and IS metrics than competing models. This performance underscores the potential of leveraging a super-large codebook in auto-regressive models.

Implications and Future Work

The replication and open-sourcing of the Lookup-Free Quantizer carry significant practical and theoretical implications: they democratize access to high-performance visual tokenization, potentially accelerating progress on scalable, high-quality auto-regressive visual generation models. Future research may extend this framework to larger training datasets and bigger models to fully exploit the representational capacity of a super-large codebook, with impact on broader applications such as text-conditional image and video generation.

Acknowledgments

The authors acknowledge contributions and discussions from their colleagues, and they emphasize their open-source effort as a means to foster more innovative and creative work in the field of auto-regressive visual generation. Their work highlights the importance of accessible high-performance models in advancing research and development in visual generation technology.
