The paper introduces UniTok, a unified discrete visual tokenizer designed to bridge the representation disparity between visual generation and understanding tasks, thereby facilitating the development of unified Multimodal LLMs (MLLMs). The key idea is to create a single framework capable of encoding fine-grained details for generation and capturing high-level semantics for understanding.
The authors address the challenge of loss conflicts in training such unified tokenizers by identifying the limited representational capacity of discrete tokens as the underlying bottleneck. They introduce multi-codebook quantization, which divides vector quantization across several independent sub-codebooks to expand the latent feature space while avoiding the training instability caused by overlarge codebooks. This approach raises the performance ceiling of unified discrete tokenizers to match or even surpass that of domain-specific continuous tokenizers. For example, UniTok achieves a reconstruction Fréchet Inception Distance (rFID) of 0.38 and a zero-shot classification accuracy of 78.6% on ImageNet.
Here's a more detailed breakdown:
- Introduction:
- The paper addresses the growing interest in unified MLLMs that integrate visual generation and understanding within a single autoregressive framework.
- A critical challenge lies in the disparity in visual tokenization requirements for understanding (high-level semantics) and generation (fine-grained details).
- The paper argues that the limitations of existing unified tokenizers stem from the limited representational capacity of discrete tokens rather than inherent loss conflicts.
- Related Work:
- Image Tokenization for Generation: Vector-quantized tokenizers are favored for their discrete latent space and compatibility with autoregressive or masked generative models.
- Image Tokenization for Understanding: A common choice of the vision tokenizer is the pretrained CLIP model, which undergoes alignment with language during its pretraining phase.
- Unified Vision-LLMs: Recent work has increasingly focused on unifying visual generation and understanding within one MLLM.
- Method:
- The UniTok framework is trained with a combination of a VQVAE-based reconstruction loss ($\mathcal{L}_{\text{recon}}$) and an image-text contrastive loss ($\mathcal{L}_{\text{contra}}$):
  $$\mathcal{L}_{\text{UniTok}} = \mathcal{L}_{\text{recon}} + \lambda \, \mathcal{L}_{\text{contra}}$$
  - $\mathcal{L}_{\text{UniTok}}$: the final loss term of UniTok
  - $\mathcal{L}_{\text{recon}}$: VQVAE-based reconstruction loss
  - $\mathcal{L}_{\text{contra}}$: image-text contrastive loss
  - $\lambda$: weight factor balancing the two loss terms
- The paper identifies token factorization and discretization as key factors that compromise the expressiveness of visual tokens.
- To expand the latent feature space, multi-codebook quantization (MCQ) is proposed (a minimal PyTorch sketch follows this section). The latent vector $f$ is first evenly split into chunks $\{f_1, f_2, \ldots, f_n\}$, where $f_i \in \mathbb{R}^{d/n}$. The subsequent quantization process is denoted as:
  $$\hat{f} = \mathrm{Concat}\big(Q(f_1; Z_1),\, Q(f_2; Z_2),\, \ldots,\, Q(f_n; Z_n)\big)$$
  - $\hat{f}$: the discretized latent vector
  - $Q(\cdot\,;\,\cdot)$: the code index lookup operation
  - $Z_i$: the $i$-th sub-codebook
- Attention factorization adapts multi-head attention modules to perform token factorization, preserving the rich semantics of the original tokens.
- A unified multimodal model is developed with UniTok, following the Liquid framework. The code embeddings of UniTok are reused by projecting them into the MLLM token space with an MLP projector.
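To make the core mechanics concrete, here is a minimal PyTorch sketch of multi-codebook quantization: the latent vector is split into chunks, each chunk is looked up in its own independent sub-codebook, and the sub-codes are concatenated back together; the combined objective appears only as a comment. Class and variable names (`MultiCodebookQuantizer`, `num_codebooks`, `codebook_size`) are illustrative assumptions rather than the official UniTok implementation, and details such as commitment terms, codebook updates, and the contrastive head are omitted.

```python
import torch
import torch.nn as nn


class MultiCodebookQuantizer(nn.Module):
    """Illustrative multi-codebook quantization (MCQ): the latent vector is
    split into n chunks, each quantized with its own independent sub-codebook."""

    def __init__(self, latent_dim: int = 64, num_codebooks: int = 8, codebook_size: int = 4096):
        super().__init__()
        assert latent_dim % num_codebooks == 0
        self.num_codebooks = num_codebooks
        self.chunk_dim = latent_dim // num_codebooks
        # One independent sub-codebook Z_i per chunk.
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, self.chunk_dim) for _ in range(num_codebooks)
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (batch, num_tokens, latent_dim)
        chunks = f.chunk(self.num_codebooks, dim=-1)      # split f into {f_1, ..., f_n}
        quantized = []
        for f_i, codebook in zip(chunks, self.codebooks):
            # Q(f_i; Z_i): nearest-neighbor code index lookup in the i-th sub-codebook.
            dists = (f_i.unsqueeze(-2) - codebook.weight).pow(2).sum(dim=-1)
            idx = dists.argmin(dim=-1)                    # (batch, num_tokens)
            q_i = codebook(idx)                           # (batch, num_tokens, chunk_dim)
            # Straight-through estimator so gradients reach the encoder.
            quantized.append(f_i + (q_i - f_i).detach())
        # Concatenate sub-codes to form the discretized latent vector f_hat.
        return torch.cat(quantized, dim=-1)


# Training objective (conceptual, sub-terms omitted):
#   total_loss = recon_loss + lambda_contra * contra_loss

quantizer = MultiCodebookQuantizer()
f = torch.randn(2, 256, 64)       # e.g. 2 images, 256 visual tokens, 64-d latents
f_hat = quantizer(f)
print(f_hat.shape)                # torch.Size([2, 256, 64])
```

Splitting the lookup across sub-codebooks keeps each individual codebook small while the concatenated sub-codes span an exponentially larger effective vocabulary per token, which is the capacity argument the paper makes.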
- Experiments:
- Tokenizer Setup: UniTok is instantiated with ViTamin-L/16 and configured with eight sub-codebooks, each containing 4,096 code entries with an 8-d code dimension (see the short calculation after this section).
- MLLM Setup: A unified MLLM is instantiated with the Llama-2-7B base model.
- UniTok excels in reconstruction quality compared to both unified and domain-specific tokenizers, recording an rFID of 0.38 on ImageNet.
- The unified MLLM showcases advantages when compared to other unified models that also utilize a discrete visual tokenizer on diverse VQA benchmarks.
- The text-to-image generation performance of the unified MLLM is evaluated on GenAI-Bench.
- The impact of the contrastive and reconstruction losses in UniTok training is ablated.
- The number of sub-codebooks is explored to gain deeper insights into multi-codebook quantization.
- The impact of CLIP weight initialization on visual understanding performance is ablated.
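As a quick, illustrative calculation based on the configuration quoted above (eight sub-codebooks of 4,096 entries with 8-d codes; the variable names are mine, not the paper's):

```python
num_codebooks = 8
codebook_size = 4096
code_dim = 8

latent_dim = num_codebooks * code_dim              # 64-d latent per visual token
effective_vocab = codebook_size ** num_codebooks   # 4096**8 = 2**96 code combinations per token

print(latent_dim, effective_vocab)
```

A single codebook would need an impractically large (and, per the paper, unstable to train) number of entries to cover a comparably rich latent space, which is the motivation for MCQ.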
- Ablation Studies:
- The paper includes ablation studies to evaluate the impact of different supervision types, the number of sub-codebooks in MCQ, and CLIP weight initialization.
- Results indicate that MCQ generally benefits vector-quantized models, independent of training objectives.
- Downstream VQA performance may not be highly correlated with ImageNet classification accuracy, and CLIP weight initialization may serve as a negative prior for unified tokenizers.
- Conclusion:
- The paper concludes that the challenge in unified tokenizers arises from the limited representational power of discrete tokens.
- Multi-codebook quantization and attention-based factorization can be used to build a unified tokenizer called UniTok.
- Discriminative and generative representation learning do not inherently conflict.
- Extending the training schedule could further benefit the tokenizer, especially in understanding performance.