The paper introduces UniTok, a unified discrete visual tokenizer designed to bridge the representation disparity between visual generation and understanding tasks, thereby facilitating the development of unified Multimodal LLMs (MLLMs). The key idea is to create a single framework capable of encoding fine-grained details for generation and capturing high-level semantics for understanding.
The authors address the challenge of loss conflicts in training such unified tokenizers by identifying the limited representational capacity of discrete tokens as the underlying bottleneck. They introduce multi-codebook quantization, which divides vector quantization across several independent sub-codebooks to expand the latent feature space while avoiding the training instability caused by overlarge codebooks. This approach raises the performance ceiling of unified discrete tokenizers to match or even surpass that of domain-specific continuous tokenizers. For example, UniTok achieves a reconstruction Fréchet Inception Distance (rFID) of 0.38 and a zero-shot classification accuracy of 78.6% on ImageNet.
Here's a more detailed breakdown:
- Introduction:
- The paper addresses the growing interest in unified MLLMs that integrate visual generation and understanding within a single autoregressive framework.
- A critical challenge lies in the disparity in visual tokenization requirements for understanding (high-level semantics) and generation (fine-grained details).
- The paper argues that the limitations of existing unified tokenizers stem from the limited representational capacity of discrete tokens rather than inherent loss conflicts.
- Related Work:
- Image Tokenization for Generation: Vector-quantized tokenizers are favored for their discrete latent space and compatibility with autoregressive or masked generative models.
- Image Tokenization for Understanding: A common choice of the vision tokenizer is the pretrained CLIP model, which undergoes alignment with language during its pretraining phase.
- Unified Vision-LLMs: Recent work has increasingly focused on unifying visual generation and understanding within one MLLM.
- Method:
- The UniTok framework is trained with a combination of a VQVAE-based reconstruction loss ($\mathcal{L}_{\text{recon}}$) and an image-text contrastive loss ($\mathcal{L}_{\text{contra}}$):
  $$\mathcal{L}_{\text{UniTok}} = \mathcal{L}_{\text{recon}} + \lambda \, \mathcal{L}_{\text{contra}}$$
  - $\mathcal{L}_{\text{UniTok}}$: the final loss term of UniTok
  - $\mathcal{L}_{\text{recon}}$: VQVAE-based reconstruction loss
  - $\mathcal{L}_{\text{contra}}$: image-text contrastive loss
  - $\lambda$: weight factor balancing the two loss terms
- The paper identifies token factorization and discretization as key factors that compromise the expressiveness of visual tokens.
- To expand the latent feature space, multi-codebook quantization (MCQ) is proposed (a minimal PyTorch sketch follows this section). The latent vector $f$ is first evenly split into chunks $\{f_1, f_2, \ldots, f_n\}$, where $f_i \in \mathbb{R}^{d/n}$. The subsequent quantization process is denoted as:
  $$\hat{f} = \mathrm{Concat}\big(Q(f_1; Z_1),\, Q(f_2; Z_2),\, \ldots,\, Q(f_n; Z_n)\big)$$
  - $\hat{f}$: the discretized latent vector
  - $Q(\cdot\,;\,\cdot)$: the code index lookup operation
  - $Z_i$: the $i$-th sub-codebook
- Attention factorization adapts multi-head attention modules to perform token factorization, preserving the rich semantics of the original tokens.
- A unified multimodal model is developed with UniTok, following the Liquid framework. The code embeddings of UniTok are reused by projecting them into the MLLM token space with an MLP projector.
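To make the core mechanics concrete, here is a minimal PyTorch sketch of multi-codebook quantization: the latent vector is split into chunks, each chunk is looked up in its own independent sub-codebook, and the sub-codes are concatenated back together; the combined objective appears only as a comment. Class and variable names (`MultiCodebookQuantizer`, `num_codebooks`, `codebook_size`) are illustrative assumptions rather than the official UniTok implementation, and details such as commitment terms, codebook updates, and the contrastive head are omitted.

```python
import torch
import torch.nn as nn


class MultiCodebookQuantizer(nn.Module):
    """Illustrative multi-codebook quantization (MCQ): the latent vector is
    split into n chunks, each quantized with its own independent sub-codebook."""

    def __init__(self, latent_dim: int = 64, num_codebooks: int = 8, codebook_size: int = 4096):
        super().__init__()
        assert latent_dim % num_codebooks == 0
        self.num_codebooks = num_codebooks
        self.chunk_dim = latent_dim // num_codebooks
        # One independent sub-codebook Z_i per chunk.
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, self.chunk_dim) for _ in range(num_codebooks)
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (batch, num_tokens, latent_dim)
        chunks = f.chunk(self.num_codebooks, dim=-1)      # split f into {f_1, ..., f_n}
        quantized = []
        for f_i, codebook in zip(chunks, self.codebooks):
            # Q(f_i; Z_i): nearest-neighbor code index lookup in the i-th sub-codebook.
            dists = (f_i.unsqueeze(-2) - codebook.weight).pow(2).sum(dim=-1)
            idx = dists.argmin(dim=-1)                    # (batch, num_tokens)
            q_i = codebook(idx)                           # (batch, num_tokens, chunk_dim)
            # Straight-through estimator so gradients reach the encoder.
            quantized.append(f_i + (q_i - f_i).detach())
        # Concatenate sub-codes to form the discretized latent vector f_hat.
        return torch.cat(quantized, dim=-1)


# Training objective (conceptual, sub-terms omitted):
#   total_loss = recon_loss + lambda_contra * contra_loss

quantizer = MultiCodebookQuantizer()
f = torch.randn(2, 256, 64)       # e.g. 2 images, 256 visual tokens, 64-d latents
f_hat = quantizer(f)
print(f_hat.shape)                # torch.Size([2, 256, 64])
```

Splitting the lookup across sub-codebooks keeps each individual codebook small while the concatenated sub-codes span an exponentially larger effective vocabulary per token, which is the capacity argument the paper makes.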
- Experiments:
- Tokenizer Setup: UniTok is instantiated with ViTamin-L/16 and configured with eight sub-codebooks, each containing 4,096 code entries with an 8-d code dimension (see the short calculation after this section).
- MLLM Setup: A unified MLLM is instantiated with the Llama-2-7B base model.
- UniTok excels in reconstruction quality compared to both unified and domain-specific tokenizers, recording an rFID of 0.38 on ImageNet.
- The unified MLLM showcases advantages when compared to other unified models that also utilize a discrete visual tokenizer on diverse VQA benchmarks.
- The text-to-image generation performance of the unified MLLM is evaluated on GenAI-Bench.
- The impact of the contrastive and reconstruction losses in UniTok training is ablated.
- The number of sub-codebooks is explored to gain deeper insights into multi-codebook quantization.
- The impact of CLIP weight initialization on visual understanding performance is ablated.
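As a quick, illustrative calculation based on the configuration quoted above (eight sub-codebooks of 4,096 entries with 8-d codes; the variable names are mine, not the paper's):

```python
num_codebooks = 8
codebook_size = 4096
code_dim = 8

latent_dim = num_codebooks * code_dim              # 64-d latent per visual token
effective_vocab = codebook_size ** num_codebooks   # 4096**8 = 2**96 code combinations per token

print(latent_dim, effective_vocab)
```

A single codebook would need an impractically large (and, per the paper, unstable to train) number of entries to cover a comparably rich latent space, which is the motivation for MCQ.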
- Ablation Studies:
- The paper includes ablation studies to evaluate the impact of different supervision types, the number of sub-codebooks in MCQ, and CLIP weight initialization.
- Results indicate that MCQ generally benefits vector-quantized models, independent of training objectives.
- Downstream VQA performance may not be highly correlated with ImageNet classification accuracy, and CLIP weight initialization may serve as a negative prior for unified tokenizers.
- Conclusion:
- The paper concludes that the challenge in unified tokenizers arises from the limited representational power of discrete tokens.
- Multi-codebook quantization and attention-based factorization can be used to build a unified tokenizer called UniTok.
- Discriminative and generative representation learning do not inherently conflict.
- Extending the training schedule could further benefit the tokenizer, especially in understanding performance.