An Overview of UniToken: Unified Visual Encoding for Multimodal Understanding and Generation
The paper presents "UniToken," a multimodal modeling approach designed to harmonize visual understanding and image generation through a unified visual encoding framework. Its key idea is to integrate discrete and continuous representations of visual inputs, breaking from the dual-path designs typically employed in the field. The result is a single architecture that can handle both high-level semantic comprehension and the generation of detailed, high-fidelity images.
The paper's premise builds on the transformative capabilities of LLMs extended to multimodal contexts, following their success in text-only domains. The researchers introduce UniToken as an autoregressive model with dual visual encoders: a VQ-GAN-based encoder for discrete tokenization and a SigLIP ViT for continuous representation. Combining these complementary encodings in a shared framework preserves information across tasks and avoids the task interference commonly observed in earlier decoupled designs.
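To make the encoding scheme concrete, the sketch below shows one plausible way to fuse discrete VQ tokens and continuous SigLIP features into a single visual token sequence for an autoregressive LLM. This is an illustrative reconstruction, not the authors' implementation: the encoder interfaces (`vq_encoder`, `siglip_encoder`), dimensions, and module names are assumptions.

```python
# Minimal sketch (assumed interfaces, not the authors' code) of unified visual
# encoding: an image is mapped to both discrete code indices and continuous
# patch features, which are projected into the LLM embedding space and
# concatenated into one visual token sequence.
import torch
import torch.nn as nn

class UnifiedVisualEncoder(nn.Module):
    def __init__(self, vq_encoder, siglip_encoder, llm_dim, vq_vocab_size, siglip_dim):
        super().__init__()
        self.vq_encoder = vq_encoder          # image -> discrete code indices (hypothetical interface)
        self.siglip_encoder = siglip_encoder  # image -> continuous patch features (hypothetical interface)
        self.vq_embed = nn.Embedding(vq_vocab_size, llm_dim)  # discrete ids -> LLM space
        self.proj = nn.Linear(siglip_dim, llm_dim)            # continuous features -> LLM space

    def forward(self, image):
        vq_ids = self.vq_encoder(image)               # (B, N_vq) integer codes
        vq_tok = self.vq_embed(vq_ids)                # (B, N_vq, llm_dim)
        cont = self.proj(self.siglip_encoder(image))  # (B, N_patch, llm_dim)
        # Concatenate both views into a single visual token sequence.
        return torch.cat([vq_tok, cont], dim=1)
```

The fused visual tokens would then be interleaved with text embeddings and passed to the autoregressive transformer, so that understanding and generation share one input pathway.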
Key experiments show UniToken performing strongly across benchmarks in both visual understanding and image generation, placing it above existing state-of-the-art approaches. Its architecture balances the demands of the two task families while remaining competitive in each individually. Particularly noteworthy are the empirical findings on task interference, a long-standing challenge in unified multimodal models, which the combined encoding strategy helps mitigate relative to prior decoupled designs. The paper also examines how the mix of understanding and generation data during training affects each capability, identifying distributions that preserve task-specific strengths without sacrificing overall performance.
UniToken's training follows a three-stage process aimed at incremental capability building. Stage I aligns the discrete and continuous visual encodings with the language model, Stage II broadens this foundation over a large multimodal dataset, and Stage III specializes the model on high-quality multimodal task data. This curriculum lets UniToken scale effectively and serve both task families without the burden of mode switching at inference time. A schematic of the curriculum is sketched below.
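The following configuration sketch restates the three-stage curriculum in code form. The stage goals follow the description above; the specific trainable-module lists, dataset labels, and the assumption that only lightweight adapters are updated in Stage I are illustrative placeholders, not the authors' actual settings.

```python
# Illustrative three-stage training curriculum (placeholder settings, not the
# paper's exact hyperparameters or datasets).
TRAINING_STAGES = [
    {
        "stage": "I: visual-semantic alignment",
        "goal": "align discrete (VQ) and continuous (SigLIP) encodings with the LLM",
        "trainable": ["vq_embed", "proj"],   # adapters only; backbone frozen (assumption)
        "data": "image-text pairs",
    },
    {
        "stage": "II: large-scale multimodal pretraining",
        "goal": "broaden understanding and generation over a large multimodal corpus",
        "trainable": ["all"],
        "data": "mixed understanding and generation data",
    },
    {
        "stage": "III: high-quality task specialization",
        "goal": "specialize on high-quality multimodal task data",
        "trainable": ["all"],
        "data": "curated instruction and generation sets",
    },
]

def run_curriculum(model, stages=TRAINING_STAGES):
    """Run each stage in order; the train() call is a placeholder."""
    for cfg in stages:
        print(f"Stage {cfg['stage']}: {cfg['goal']}")
        # train(model, data=cfg["data"], trainable=cfg["trainable"])
```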
Future work could explore deeper integration with diffusion models or further calibration of data proportions as model capacity and data volumes grow. The release of UniToken's code and model weights invites independent validation and adaptation, and should facilitate follow-up work on more comprehensive, generalizable multimodal models.
Overall, the UniToken framework pairs a practical approach to dual-task handling with useful insights into model training, setting a reference point for unified multimodal research. The paper makes a substantive contribution toward streamlined, efficient architectures for complex multimodal tasks.