Show-o2: Improved Native Unified Multimodal Models (2506.15564v2)

Published 18 Jun 2025 in cs.CV

Abstract: This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on an LLM, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.

This paper introduces Show-o2, an improved native unified multimodal model designed to integrate multimodal understanding and generation capabilities across text, images, and videos. The model leverages autoregressive modeling for text processing and flow matching for visual generation, all built upon a 3D causal Variational Autoencoder (VAE) space.

Core Architecture and Methodology:

The Show-o2 framework processes (interleaved) text, image, or video inputs. Text is tokenized and embedded, while visual inputs are encoded into latent representations by a 3D causal VAE. A key innovation is the unified visual representation, constructed via a dual-path mechanism:

  1. Semantic Layers $\mathcal{S}(\cdot)$: These layers, based on SigLIP vision transformer blocks with a new $2 \times 2$ patch embedding, extract high-level semantic information. They are pre-distilled to mimic SigLIP's feature extraction on both clean and noised visual latents ($x_t = t \cdot x_1 + (1-t) \cdot x_0$).
  2. Projector $\mathcal{P}(\cdot)$: A 2D patch embedding layer that retains low-level visual details.

These high-level and low-level features are then combined through a spatial (-temporal) fusion (STF) mechanism, involving concatenation and MLP layers, to create the unified visual representations ($u = \text{STF}(\mathcal{S}(x_t), \mathcal{P}(x_t))$). A time step embedding $t$ is prepended for generative tasks.
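A minimal PyTorch-style sketch of this dual-path construction is given below. The module widths, the two-layer transformer stand-in for the distilled SigLIP blocks, and the fusion MLP shape are illustrative assumptions, not the released implementation, and the prepended time step embedding is omitted for brevity.

```python
import torch
import torch.nn as nn

class DualPathFusion(nn.Module):
    """Sketch of the dual-path unified visual representation (assumed shapes).

    Hypothetical dims: VAE latent channels c=16, patch size 2, model width d=1536.
    """
    def __init__(self, c=16, patch=2, d=1536):
        super().__init__()
        # High-level path: 2x2 patch embedding feeding distilled SigLIP-style blocks
        # (here a small TransformerEncoder stands in for the semantic layers S).
        self.sem_embed = nn.Conv2d(c, d, kernel_size=patch, stride=patch)
        self.semantic = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2)
        # Low-level path: plain 2D patch embedding (the projector P).
        self.proj = nn.Conv2d(c, d, kernel_size=patch, stride=patch)
        # Spatial(-temporal) fusion: concatenate both paths and mix with an MLP.
        self.fuse = nn.Sequential(nn.Linear(2 * d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, x_t):
        # x_t: clean or noised VAE latents, shape (B, c, H, W).
        hi = self.sem_embed(x_t).flatten(2).transpose(1, 2)   # (B, N, d)
        hi = self.semantic(hi)                                # semantic features S(x_t)
        lo = self.proj(x_t).flatten(2).transpose(1, 2)        # projector features P(x_t)
        u = self.fuse(torch.cat([hi, lo], dim=-1))            # unified visual tokens u
        return u
```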

Text embeddings and these unified visual representations are structured into a sequence (e.g., [BOS] {Text} [BOI / BOV] {Image / Video} [EOI / EOV] {Text} ... [EOS]). This sequence is processed by a pre-trained LLM equipped with two heads:

  • Language Head: Uses autoregressive modeling (causal attention) for text token prediction.
  • Flow Head: Employs flow matching (full attention within visual representations) to predict the velocity $v_t = dx_t/dt$ for image/video generation. It consists of transformer layers with time step modulation (adaLN-Zero blocks); a minimal velocity-target sketch follows this list.
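To make the flow-matching objective concrete, here is a small sketch under the linear interpolation defined above ($x_t = t \cdot x_1 + (1-t) \cdot x_0$), for which the velocity target is $v_t = x_1 - x_0$. The `flow_head(x_t, t)` signature is a hypothetical stand-in; in Show-o2 the flow head additionally conditions on the LLM hidden states.

```python
import torch

def flow_matching_targets(x0, x1, t):
    """Linear-interpolation flow matching, following x_t = t*x1 + (1-t)*x0.

    x0: noise sample, x1: clean VAE latents, t: per-sample time in [0, 1].
    The velocity target is v_t = dx_t/dt = x1 - x0 (constant in t for this path).
    """
    t = t.view(-1, *([1] * (x1.dim() - 1)))   # broadcast t over latent dims
    x_t = t * x1 + (1 - t) * x0
    v_target = x1 - x0
    return x_t, v_target

def flow_matching_loss(flow_head, x1):
    # Sketch of L_FM for a hypothetical velocity predictor flow_head(x_t, t).
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], device=x1.device)
    x_t, v_target = flow_matching_targets(x0, x1, t)
    v_pred = flow_head(x_t, t)
    return torch.mean((v_pred - v_target) ** 2)
```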

The omni-attention mechanism is used to allow causal attention along the sequence while maintaining full attention within the unified visual representations. The final output is decoded by a text de-tokenizer and a 3D causal VAE decoder.
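As an illustration of this attention pattern, the sketch below builds a boolean mask that is causal over the whole sequence but bidirectional inside each visual span. The span-marking convention and mask semantics (True = attention allowed) are assumptions for the example, not the released code.

```python
import torch

def omni_attention_mask(seq_len, visual_spans):
    """Return a (seq_len, seq_len) boolean mask where True means attention is allowed.

    visual_spans: list of (start, end) index pairs (end exclusive) covering the
    unified visual tokens; all other positions follow causal attention.
    """
    # Start from a standard lower-triangular (causal) mask.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Open up full (bidirectional) attention inside each visual span.
    for start, end in visual_spans:
        mask[start:end, start:end] = True
    return mask

# Example: a 12-token sequence where tokens 3..8 are image tokens.
mask = omni_attention_mask(12, visual_spans=[(3, 9)])
```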

Two-Stage Training Recipe:

To effectively train the model and retain language knowledge without massive text corpora, a two-stage training recipe is proposed:

  1. Stage 1: Pre-training Flow Head & Visual Components:
    • Trainable components: Projector, spatial (-temporal) fusion, and flow head. Semantic layers $\mathcal{S}(\cdot)$ are pre-distilled beforehand.
    • Data: ~66M image-text pairs, progressively adding interleaved data (OmniCorpus) and video-text pairs (WebVid, Panda-70M).
    • Objective: Autoregressive modeling for text ($\mathcal{L}_\text{NTP}$) and flow matching for visuals ($\mathcal{L}_\text{FM}$), with total loss $\mathcal{L} = \alpha \mathcal{L}_\text{NTP} + \mathcal{L}_\text{FM}$ (a combined-loss sketch follows this list).
  2. Stage 2: Full Model Fine-tuning:
    • Trainable components: Full model (excluding VAE).
    • Data: 9M high-quality multimodal understanding instruction data (Densefusion-1M, LLaVA-OneVision) and 16M high-quality visual generation data (filtered from Stage 1 image-text pairs). For video/mixed-modality, high-quality video-text (OpenVid-1M, internal data) and interleaved data (VIST, CoMM) are added.
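A minimal sketch of how the two objectives can be combined per batch is shown below; the token-level masking and the `logits`/`v_pred` argument names are illustrative assumptions rather than the paper's exact bookkeeping.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, text_targets, text_mask, v_pred, v_target, alpha=0.2):
    """L = alpha * L_NTP + L_FM (alpha = 0.2 in Stage 1, 1.0 in Stage 2).

    logits:          (B, T, vocab) next-token predictions from the language head
    text_targets:    (B, T) target token ids; text_mask marks positions scored by NTP
    v_pred/v_target: velocity prediction and target from the flow head
    """
    # Next-token prediction: cross-entropy over text positions only.
    ce = F.cross_entropy(logits.transpose(1, 2), text_targets, reduction="none")
    l_ntp = (ce * text_mask).sum() / text_mask.sum().clamp(min=1)
    # Flow matching: mean-squared error on the velocity field.
    l_fm = F.mse_loss(v_pred, v_target)
    return alpha * l_ntp + l_fm
```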

Scaling Up: For larger models (e.g., 7B LLM parameters), the pre-trained flow head from a smaller model (e.g., 1.5B) is resumed, and a lightweight MLP transformation aligns hidden sizes.
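A sketch of this reuse is given below, assuming hidden sizes of 1536 for the 1.5B LLM and 3584 for the 7B LLM; the exact widths, the adapter depth, and the flow-head interface are assumptions for illustration.

```python
import torch.nn as nn

class AlignedFlowHead(nn.Module):
    """Reuse a flow head pre-trained at a smaller hidden size behind a larger LLM."""
    def __init__(self, pretrained_flow_head, llm_dim=3584, flow_dim=1536):
        super().__init__()
        # Lightweight MLP that maps the larger LLM's hidden states down to the
        # hidden size the pre-trained flow head expects.
        self.adapter = nn.Sequential(
            nn.Linear(llm_dim, flow_dim), nn.GELU(), nn.Linear(flow_dim, flow_dim))
        self.flow_head = pretrained_flow_head  # weights resumed from the smaller run

    def forward(self, llm_hidden, t):
        return self.flow_head(self.adapter(llm_hidden), t)
```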

Implementation Details:

  • Base LLMs: Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct.
  • VAE: 3D causal VAE from Wan2.1 (8x spatial, 4x temporal compression).
  • Stage 1 (1.5B model): 150K iterations, LR 0.0001, 66M image-text pairs (432x432 resolution), context length 1024. Total batch sizes: 128 (understanding), 384 (generation). $\alpha=0.2$. Caption drop probability 0.1 for classifier-free guidance (see the CFG sketch after this list). Then, 40K iterations with 16M HQ generation data.
  • Stage 2 (1.5B model): ~35K iterations with 9M instruction data and 16M HQ generation data. $\alpha=1.0$.
  • Video Data: Randomly sampled 2s 480p or 432x432 clips (17 frames, 3-frame interval). Context length 7006.
  • Improvements: Further training on higher resolution images (512x512, 1024x1024) and TextAtlas subset for better text rendering.
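The classifier-free guidance setup referenced above can be sketched as follows: captions are randomly dropped during training, and at inference the conditional and unconditional velocity predictions are blended with a guidance scale. The function names are illustrative, not the released API.

```python
import torch

def maybe_drop_caption(caption, null_caption="", p_drop=0.1):
    """Training-time caption dropping that enables classifier-free guidance."""
    return null_caption if torch.rand(()) < p_drop else caption

def guided_velocity(v_cond, v_uncond, cfg_scale):
    """Inference-time CFG: push the velocity away from the unconditional branch."""
    return v_uncond + cfg_scale * (v_cond - v_uncond)
```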

Experimental Results:

  • Multimodal Understanding: Show-o2 (1.5B and 7B) achieved state-of-the-art or competitive results on benchmarks like MME, GQA, SEED-Bench, MM-Bench, MMMU, MMStar, and AI2D, outperforming many existing models, including larger ones in some cases.
  • Image Generation: On GenEval and DPG-Bench, Show-o2 surpassed most compared approaches, achieving strong results with 66M image-text pairs. The 7B model generally outperformed the 1.5B model.
  • Video Generation: The 2B parameter Show-o2 model (LLM + flow head) outperformed larger models like Show-1, Emu3, and VILA-U on VBench (text-to-video) and showed competitive performance against CogVideoX and Step-Video-T2V. It also performed well on image-to-video tasks.
  • Mixed-Modality Generation: Demonstrated capability in visual storytelling by generating coherent, interleaved text and images.
  • Ablation Studies: Confirmed the positive impact of:
    • Spatial (-temporal) fusion for both understanding and generation.
    • Increasing Classifier-Free Guidance (CFG) scale and inference steps.
    • The two-stage training recipe, with Stage 2 significantly improving performance.

Limitations and Broader Impacts:

  • Limitations: Initial models struggled with text rendering in images and with fine detail in small objects; this was addressed by further training on higher-resolution and text-rich data.
  • Broader Impacts: Potential for misuse in creating fake information. Dataset content includes celebrities and copyrighted materials, posing IP infringement risks.

Contributions:

  1. An improved native unified multimodal model (Show-o2) integrating autoregressive modeling and flow matching for text, image, and video understanding and generation.
  2. A unified visual representation based on a 3D causal VAE space using a dual-path spatial (-temporal) fusion, scalable to images and videos.
  3. A two-stage training pipeline for effective learning while retaining language knowledge and enabling scaling.
  4. State-of-the-art performance demonstrated on various multimodal understanding and visual generation benchmarks.
Authors (3)
  1. Jinheng Xie
  2. Zhenheng Yang
  3. Mike Zheng Shou