Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

Published 23 Jun 2025 in cs.CV, cs.AI, cs.CL, and cs.MM | (2506.18898v1)

Abstract: This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a LLM's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. Code, models, and data are available at https://tar.csuhan.com

Abstract PDF Upgrade to Chat

Authors (9)

Summary

The paper introduces Tar, a unified multimodal framework that uses a novel Text-Aligned Tokenizer to align image tokens with LLM embeddings.
It demonstrates strong performance in both visual understanding and high-fidelity image generation, outperforming models at the 1.5B/7B scale.
Ablation studies confirm that initializing the visual codebook with LLM embeddings significantly boosts multimodal learning efficiency.

Tar: Unifying Vision and Language through Text-Aligned Representations

This paper introduces Tar, a multimodal LLM framework designed to unify visual understanding and generation through a shared discrete semantic representation. The core innovation is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a LLM's vocabulary. This approach aims to provide a unified interface for cross-modal input and output, eliminating the need for modality-specific designs and facilitating advanced multimodal reasoning.

Text-Aligned Tokenizer (TA-Tok)

TA-Tok is central to aligning visual representations with the LLM's latent space. The process involves encoding an input image $I$ into a continuous vector $\mathbf{z}_I = \mathcal{E}(I)$ using a visual encoder $\mathcal{E}$ . This vector is then mapped to the closest vector in a codebook $\mathcal{C}$ through a quantization process:

$\mathbf{z}_q = \mathrm{argmin}_{\mathbf{c}\in\mathcal{C}}\|\mathbf{z}_I-\mathbf{c}\|^2$

Figure 1: An input image is first encoded into continuous tokens using a SigLIP2 vision model.

Traditional VQ codebooks are randomly initialized; however, TA-Tok initializes the VQ codebook using token embeddings from a pre-trained LLM, denoted as $\mathbf{E} \in \mathbb{R}^{K \times D}$ , and a projection matrix $\mathbf{W} \in \mathbb{R}^{D \times D}$ . The codebook $\mathbf{C} \in \mathbb{R}^{K \times D}$ is defined as:

$\mathbf{C} = \mathbf{E}\mathbf{W} = \{\mathbf{E}\mathbf{w}_1, \mathbf{E}\mathbf{w}_2, \dots, \mathbf{E}\mathbf{w}_D\}$

This ensures that each codebook entry is a projected version of a corresponding LLM token embedding, thereby grounding the visual codebook in the LLM's latent space. The paper also introduces Scale-Adaptive Pooling (SAP) and Decoding (SAD) to accommodate varying needs for visual details across different tasks. TA-Tok is trained with a combination of feature reconstruction loss $\mathcal{L}_{rec}$ and codebook losses $\mathbf{z}_I = \mathcal{E}(I)$ 0. The reconstruction loss encourages semantic alignment, defined as:

$\mathbf{z}_I = \mathcal{E}(I)$ 1

where $\mathbf{z}_I = \mathcal{E}(I)$ 2 and $\mathbf{z}_I = \mathcal{E}(I)$ 3 are features from SAD and a SigLIP2 teacher, respectively.

Generative De-Tokenizer

To generate high-quality images from the quantized outputs of TA-Tok, the paper introduces a Generative De-Tokenizer with two variants: an autoregressive de-tokenizer (AR-DTok) and a diffusion de-tokenizer (Dif-DTok). AR-DTok formulates image decoding as an autoregressive generation task. Given image tokens $\mathbf{z}_I = \mathcal{E}(I)$ 4 from a VQVAE encoder and the parameters $\mathbf{z}_I = \mathcal{E}(I)$ 5 of AR-DTok, the loss function is:

$\mathbf{z}_I = \mathcal{E}(I)$ 6

Dif-DTok uses quantized tokens $\mathbf{z}_I = \mathcal{E}(I)$ 7 as conditioning input to a diffusion model $\mathbf{z}_I = \mathcal{E}(I)$ 8 parameterized by $\mathbf{z}_I = \mathcal{E}(I)$ 9. The training objective is:

$\mathcal{E}$ 0

Figure 2: Qualitative comparison of different representations (left) and de-tokenizers (right).

Unified Multimodal Modeling

Built on TA-Tok and Generative De-Tokenizers, Tar unifies visual understanding and generation using a simple autoregressive objective. The LLM's text embedding matrix $\mathcal{E}$ 1 is expanded with a visual token set $\mathcal{E}$ 2, using the Text-Aligned Codebook $\mathcal{E}$ 3 as visual embeddings. Tar is trained with the standard Cross-Entropy loss over a mixed sequence of text and visual tokens. Given a target sequence $\mathcal{E}$ 4 and model parameter $\mathcal{E}$ 5, the loss is:

$\mathcal{E}$ 6

Experimental Results

The Tar model demonstrates strong visual understanding performance across a range of benchmarks. Notably, the 1.5B model surpasses most understanding-only models and unified models at the 1.5B/7B scale. The 7B model matches the performance of Janus-Pro-7B, a state-of-the-art model with continuous visual tokens. In visual generation, Tar achieves strong performance on both GenEval [Ghosh et al., 2023] and DPG Bench.

Figure 3: Text-to-Image Generation Results, using Tar-7B and a 1024 pixel de-tokenizer.

Ablation Studies

Ablation studies validate the effectiveness of TA-Tok. Initializing LLM embeddings with the TA Codebook leads to better performance compared to random initialization. The generative de-tokenizer design also proves effective, with AR-DTok performing well across resolutions and scales, and Dif-DTok quickly adapting to generate high-fidelity images due to its pre-trained diffusion backbone. The design of SAP allows the model to produce multi-granularity visual tokens where visual understanding tasks benefit from more tokens to capture image details, but longer visual sequences do not significantly improve image generation performance. The unified pretraining tasks, including image-to-image (I $\mathcal{E}$ 7I) and text-image-to-image (TI $\mathcal{E}$ 8I), further improve generation performance, narrowing the gap between understanding and generation.

Conclusion

This paper presents Tar, a unified model that uses a shared, discrete, text-aligned representation to bridge visual understanding and generation. The alignment of image tokens with LLM embeddings via TA-Tok, along with scale-adaptive tokenization and generative de-tokenizers, enables strong performance across both tasks without modality-specific designs. This approach demonstrates that a fully discrete, semantic representation can facilitate efficient and effective multimodal learning.

Markdown Report Issue