- The paper introduces Pisces, a novel foundation model that decouples image understanding and generation using separate visual encoders.
- It employs tailored projection and pooling strategies for each task, matching or surpassing specialized models across a broad range of benchmarks.
- Joint training of image modalities enhances both tasks, and detailed caption pretraining boosts generation performance.
Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
This paper introduces \modelname{}, a unified multimodal foundation model designed for both image understanding and generation (2506.10395). \modelname{} addresses the limitations of existing multimodal models by employing a novel architecture that decouples visual representations for image understanding and generation, allowing each task to leverage distinct image encoders and projection layers. This approach enables the model to achieve state-of-the-art performance on a wide range of benchmarks, surpassing even specialized models in certain tasks.
Model Architecture and Design
Figure 1: The main architecture of \modelname{}, illustrating separate visual encoders and projection layers for image generation (left) and image understanding (right), sharing the same multimodal LLM.
The architecture of \modelname{} is inspired by recent multimodal models and incorporates a pretrained LLM, separate image encoders for image understanding (ϕ) and image generation (φ), and a diffusion model. A key innovation is the decoupled visual encoding architecture, which recognizes the intrinsic differences between the visual representations needed for image understanding and generation. Image understanding benefits from detailed, semantically rich information extracted from raw images, necessitating a long sequence of image vectors. In contrast, image generation requires compressing pixel-level information into a compact sequence of vectors optimized for autoregressive generation.
For image understanding, the encoder ϕ processes the input image I into a sequence of continuous image representations ϕ(I), which are then projected into the LLM's latent space:
$V_n = \mathrm{MLP}(\phi(I)) \in \mathbb{R}^{n \times d}$
where $n$ is the number of visual tokens and $d$ is the LLM's hidden dimension.
For image generation, the encoder φ processes the image I into a sequence of continuous vectors φ(I). To reduce the computational burden of autoregressively generating long sequences, average pooling is applied to decrease the number of visual tokens:
$V_m = \mathrm{MLP}(\mathrm{Pool}_{a \times a}(\varphi(I))) \in \mathbb{R}^{m \times d}$
where $\mathrm{Pool}_{a \times a}$ denotes 2D average pooling with a stride of $a$, resulting in $m$ tokens.
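The following is a minimal PyTorch sketch of how the two projection paths could be implemented; the module structure, dimensions, and names (e.g., `DecoupledVisualProjection`, `d_und`, `d_gen`, `d_llm`) are illustrative assumptions rather than the paper's actual implementation:

```python
import torch
import torch.nn as nn

class DecoupledVisualProjection(nn.Module):
    """Sketch of the two projection paths described above.

    `d_und` / `d_gen` are the output dimensions of the understanding encoder
    (phi) and the generation encoder (varphi); `d_llm` is the LLM hidden size.
    All names and sizes here are assumptions for illustration only.
    """

    def __init__(self, d_und: int, d_gen: int, d_llm: int, pool_stride: int = 3):
        super().__init__()
        # Understanding path: keep the full token sequence, project to d_llm.
        self.proj_und = nn.Sequential(
            nn.Linear(d_und, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm)
        )
        # Generation path: 2D average-pool the patch grid first, then project.
        self.pool = nn.AvgPool2d(kernel_size=pool_stride, stride=pool_stride)
        self.proj_gen = nn.Sequential(
            nn.Linear(d_gen, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm)
        )

    def forward_understanding(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, n, d_und)  ->  V_n: (B, n, d_llm)
        return self.proj_und(feats)

    def forward_generation(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, g*g, d_gen), assumed to come from a square g x g patch grid.
        b, n, d = feats.shape
        g = int(n ** 0.5)
        grid = feats.transpose(1, 2).reshape(b, d, g, g)   # (B, d_gen, g, g)
        pooled = self.pool(grid)                           # (B, d_gen, g//a, g//a)
        pooled = pooled.flatten(2).transpose(1, 2)         # (B, m, d_gen)
        return self.proj_gen(pooled)                       # V_m: (B, m, d_llm)
```

Both paths feed the same multimodal LLM, as shown in Figure 1.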
The LLM manages both tasks through a carefully designed training objective. For image understanding, the image vectors $V_n$ are prepended to the text embeddings $T$ to create $X = [V_n; T]$, while for image generation, the image vectors $V_m$ are appended to the text embeddings, forming $X = [T; V_m]$. The unified training objective is defined as:
$\mathcal{L} = - \sum_{X \in \mathcal{D}} \sum_{i=1}^{N} \log P_{\theta}(x_i \mid x_1, x_2, \ldots, x_{i-1})$
where $x_i$ represents either a discrete text token or a continuous image vector, and $\theta$ represents the parameters of the multimodal LLM. A conditional diffusion model acts as a decoder, reconstructing the image from the predicted visual vectors $V_m$.
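Below is a minimal, non-authoritative sketch of how the two input layouts and the unified objective could be assembled in PyTorch. The cross-entropy term for text tokens is standard; treating the continuous image-vector term as an MSE regression loss is an assumption made purely for illustration, since the summary does not specify how $P_{\theta}$ is modeled for continuous vectors:

```python
import torch
import torch.nn.functional as F

def build_sequences(V_n: torch.Tensor, V_m: torch.Tensor, text_emb: torch.Tensor):
    """Assemble the two input layouts described above (illustrative only)."""
    # Image understanding: image vectors are prepended to the text embeddings.
    x_understanding = torch.cat([V_n, text_emb], dim=1)   # X = [V_n; T]
    # Image generation: image vectors are appended after the text embeddings.
    x_generation = torch.cat([text_emb, V_m], dim=1)      # X = [T; V_m]
    return x_understanding, x_generation

def unified_loss(text_logits, text_targets, pred_image_vecs, target_image_vecs):
    """Conceptual stand-in for the unified objective L.

    The text term is standard next-token cross-entropy; the continuous
    image-vector term is shown as MSE regression, which is an assumption
    for illustration rather than the paper's stated formulation.
    """
    text_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)), text_targets.reshape(-1)
    )
    image_loss = F.mse_loss(pred_image_vecs, target_image_vecs)
    return text_loss + image_loss
```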
Training Methodology
The training process for \modelname{} is divided into three stages: multimodal pretraining, fine-grained multimodal pretraining, and instruction tuning for image understanding and generation.
- Multimodal Pretraining: The model is pretrained on 150 million high-quality image-caption pairs from Shutterstock for both image captioning and generation. Detailed captions generated by the Llama 3.2 model are used for image understanding.
- Fine-Grained Multimodal Pretraining: The model continues pretraining on 70 million image and detailed caption pairs. This stage enhances alignment between textual tokens and visual features.
- Instruction Tuning: The model is further refined on a curated instruction-tuning dataset comprising 8 million high-quality image-text pairs for image understanding and 4 million image-caption pairs for image generation.
The multimodal LLM is initialized with LLaMA-3.1-Instruct 8B, using siglip-so400m-patch14-384 as the vision encoder for image understanding and a CLIP model trained with MAE reconstruction loss and contrastive loss as the vision encoder for image generation. The SDXL model is used as an image decoder, reconstructing images from gen-CLIP image embeddings.
Experimental Results and Analysis
\modelname{}'s performance was evaluated on a comprehensive set of benchmarks for both image understanding and generation.
Image Understanding
On general multimodal benchmarks, \modelname{} achieves state-of-the-art performance among open-source unified models, even outperforming models two to four times its size. For instance, it scores 26.3\% higher on MMBench than EMU3 and 8.6\% higher on MME-P than Seed-X. \modelname{} also excels on domain-specific benchmarks, including vision-centric, knowledge-based, and OCR benchmarks, where it performs comparably to or better than specialized models.
Image Generation
Figure 2: Qualitative comparison of image generation results, demonstrating \modelname{}'s ability to follow complex user prompts and generate high-quality images.
\modelname{} achieves competitive performance on the GenEval benchmark among unified understanding and generation models, demonstrating its strong instruction-following capabilities in image generation. Qualitative results further illustrate the model's ability to generate high-quality images from complex user prompts.
Synergistic Relationship
Ablation studies reveal a synergistic relationship between image understanding and generation: when the two tasks are trained jointly within the unified multimodal framework, image understanding significantly enhances image generation performance, and conversely, image generation benefits image understanding. This is supported by ablations in which removing either task from training degrades performance on the other.
Benefits of Decoupled Vision Encoders
The effectiveness of decoupled visual encoders is verified through ablation studies. Using a single encoder (gen-CLIP) for both image understanding and generation yields on-par image generation performance but inferior image understanding performance. Conversely, training SDXL to decode SigLIP features did not reach the reconstruction quality of gen-CLIP+SDXL, highlighting the benefit of decoupling the image encoders.
Impact of the Number of Visual Tokens
Experiments varying the number of visual tokens for image generation show that longer sequences pose a significant challenge for the LLM. Reducing the number of visual tokens lowers the training loss, and an intermediate pooling stride of 3 provides the best balance between low loss and preserving crucial visual information, as illustrated below.
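As a rough illustration of how quickly pooling shrinks the autoregressive target sequence, the snippet below computes the token count $m$ for a hypothetical $g \times g$ patch grid; the grid size of 24 is an assumption, not a value reported in the paper:

```python
# Illustrative only: how the pooling stride a changes the number of visual
# tokens m for a hypothetical g x g patch grid (g = 24 is an assumption).
g = 24
for a in (1, 2, 3, 4):
    m = (g // a) ** 2
    print(f"stride {a}: {m} visual tokens")
# stride 1: 576, stride 2: 144, stride 3: 64, stride 4: 36
```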
Effect of Detailed Captions on Image Generation
The significance of the second-stage pretraining on detailed captions is illustrated by comparing models trained with short captions versus long captions. Incorporating more short captions does not yield further improvement in the model's FID, whereas using long captions consistently enhances image generation performance.
Conclusion
\modelname{} effectively bridges the gap between image understanding and generation by introducing an asymmetrical visual encoding architecture and task-specific training techniques. The model achieves strong performance across a diverse set of benchmarks for both tasks, demonstrating the potential of unified multimodal foundation models. The insights gained from \modelname{} can inspire future research in developing more robust and versatile approaches to multimodal modeling.