Show-o2 Model: A Unified Multimodal Transformer

Updated 22 June 2025

Show-o2 is a unified multimodal transformer architecture designed for understanding and generation across text, image, and video modalities. It handles multiple modalities natively by integrating autoregressive modeling for language with flow matching for visual generation. Central to Show-o2 is a scalable dual-path fusion mechanism within a 3D causal variational autoencoder (VAE) latent space, supporting both spatial and temporal data and enabling seamless extension from images to videos. This design makes the model adaptable to a broad spectrum of multimodal tasks, from visual question answering to coherent visual storytelling. The complete implementation and model weights are openly available at https://github.com/showlab/Show-o.

1. Model Architecture

Show-o2 introduces a 3D causal VAE as its foundational latent space, unifying the representation of both static images and dynamic video sequences. The encoder processes raw images or video frames, outputting latent tensors that reflect both spatial and temporal structure. For generative purposes, noise scheduling interpolates between sampled noise and the clean latent code $x_1$:

$$x_t = t \cdot x_1 + (1 - t) \cdot x_0$$

where $x_0 \sim \mathcal{N}(0, 1)$ and $t$ ranges from 0 to 1. Decoding leverages a learned 3D causal VAE decoder, reconstructing visual data while preserving consistency with the input's modality and structure.
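A minimal PyTorch sketch of this interpolation step (tensor shapes, the helper name, and the velocity-target convention are illustrative assumptions, not the released Show-o2 code):

```python
import torch

def interpolate_latents(x1: torch.Tensor, t: torch.Tensor):
    """Interpolate between Gaussian noise x0 and clean VAE latents x1.

    x1: clean latents from the 3D causal VAE encoder, e.g. shape (B, C, T, H, W)
    t:  per-sample timesteps in [0, 1], shape (B,)
    """
    x0 = torch.randn_like(x1)                   # x0 ~ N(0, I)
    t = t.view(-1, *([1] * (x1.dim() - 1)))     # broadcast t over the latent dims
    xt = t * x1 + (1.0 - t) * x0                # x_t = t * x1 + (1 - t) * x0
    v_target = x1 - x0                          # dx_t/dt, the flow-matching regression target
    return xt, v_target

# Example: a batch of 4 video latents with 8 latent frames.
xt, v = interpolate_latents(torch.randn(4, 16, 8, 32, 32), torch.rand(4))
```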

The architecture supports interleaved sequences of text and unified visual representations. Text tokens are embedded via a dedicated language branch; visual tokens—comprising both semantic and low-level features—are fused and interleaved with textual data, enabling causal multimodal attention through the sequence.

2. Dual-Path Spatial-Temporal Fusion Mechanism

To ensure rich multimodal representations, Show-o2 employs a dual-path fusion strategy:

  • Semantic Path ($\mathcal{S}(\cdot)$): Utilizes Vision Transformer layers, adapted from SigLIP, applied to 3D VAE latents for extracting high-level semantics. Distillation loss aligns these features with pre-trained SigLIP outputs for the same image:

$$\mathcal{L}_{\text{distill}} = -\sum \log \text{sim}(\mathcal{S}(x_t), \text{SigLIP}(X))$$

where $\text{sim}$ denotes cosine similarity and $X$ is the raw input image.

  • Projector Path ($\mathcal{P}(\cdot)$): Embeds low-level spatial (and, for video, temporal) features from the VAE tensor.

Fusion proceeds by concatenating outputs along the feature dimension and processing them through normalization (RMSNorm) and multilayer perceptrons, resulting in unified visual representations $u$:

$$u = \text{STF}(\mathcal{S}(x_t), \mathcal{P}(x_t))$$

For video data, fusion extends across both spatial and temporal axes, allowing coherent modeling of visual sequences.
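A schematic PyTorch sketch of this dual-path fusion; the linear stand-ins for the SigLIP-style semantic layers and the projector, the layer sizes, and the exact form of the distillation term are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Root-mean-square normalization over the feature dimension.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class DualPathFusion(nn.Module):
    def __init__(self, latent_dim=16, sem_dim=512, proj_dim=512, out_dim=1024):
        super().__init__()
        self.semantic = nn.Linear(latent_dim, sem_dim)    # stand-in for SigLIP-style ViT layers S(.)
        self.projector = nn.Linear(latent_dim, proj_dim)  # low-level projector P(.)
        self.norm = RMSNorm(sem_dim + proj_dim)
        self.mlp = nn.Sequential(
            nn.Linear(sem_dim + proj_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, xt, siglip_feats=None):
        # xt: VAE latents flattened to tokens, shape (B, N, latent_dim)
        s = self.semantic(xt)
        p = self.projector(xt)
        u = self.mlp(self.norm(torch.cat([s, p], dim=-1)))  # unified visual tokens

        distill_loss = None
        if siglip_feats is not None:
            # Distillation: align semantic-path features with frozen SigLIP features
            # via cosine similarity (the log term of the paper's loss is omitted here).
            distill_loss = 1.0 - F.cosine_similarity(s, siglip_feats, dim=-1).mean()
        return u, distill_loss
```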

The resulting unified sequence, e.g., [BOS] {Text} [BOI] {Image} [EOI] {Text} ... [EOS], is processed by a transformer with omni-attention, enabling causal modeling over the entire multimodal context.
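One plausible reading of omni-attention is a causal mask over the interleaved sequence with full (bidirectional) attention inside each visual segment; the sketch below builds such a mask and is an assumption about the exact masking scheme, not the released implementation:

```python
import torch

def omni_attention_mask(num_tokens: int, visual_spans):
    """Boolean mask where True means 'query may attend to key'."""
    mask = torch.tril(torch.ones(num_tokens, num_tokens)).bool()  # causal base
    for start, end in visual_spans:            # end is exclusive
        mask[start:end, start:end] = True      # full attention within one image/video segment
    return mask

# Example: [BOS] plus 3 text tokens, a 6-token image span, then 3 trailing text tokens.
mask = omni_attention_mask(13, visual_spans=[(4, 10)])
```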

3. Native Multimodal Generation: Autoregressive and Flow Matching Heads

Show-o2 decouples learning for language and visual generation using specialized heads:

  • Language Head (Autoregressive Modeling):
    • A standard next-token prediction objective with causal attention over textual tokens.
  • Flow Head (Flow Matching for Image/Video Generation):
    • Models the time-derivative of visual latents ($v_t = \frac{dx_t}{dt}$), employing transformer blocks with temporal conditioning (adaLN-Zero, as in DiT).
    • The flow head supports full attention over unified visual tokens, promoting consistency and coherence in generated visual outputs.

The combined training objective is:

$$\mathcal{L} = \alpha \mathcal{L}_{\text{NTP}} + \mathcal{L}_{\text{FM}}$$

where $\mathcal{L}_{\text{NTP}}$ is the next-token-prediction (language modeling) loss, $\mathcal{L}_{\text{FM}}$ is the flow-matching loss, and $\alpha$ is a balancing coefficient.
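A hedged sketch of this combined objective, assuming a cross-entropy next-token loss over shifted text targets and a mean-squared-error flow-matching loss on the predicted velocity (both standard choices, not confirmed details of the paper):

```python
import torch
import torch.nn.functional as F

def combined_loss(text_logits, text_targets, v_pred, v_target, alpha: float = 1.0):
    """L = alpha * L_NTP + L_FM; the alpha value here is arbitrary."""
    # Next-token prediction: logits at position i predict token i + 1.
    l_ntp = F.cross_entropy(
        text_logits[:, :-1].reshape(-1, text_logits.size(-1)),
        text_targets[:, 1:].reshape(-1),
    )
    # Flow matching: regress the velocity v_t = dx_t/dt = x1 - x0.
    l_fm = F.mse_loss(v_pred, v_target)
    return alpha * l_ntp + l_fm
```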

4. Scalable Two-Stage Training Paradigm

Show-o2 employs a two-stage training strategy to ensure both performance and scalability:

  1. Stage 1: Visual Generative Pretraining
    • Trains the projector, spatial-temporal fusion, and flow head components solely on visual generation objectives.
    • Uses approximately 66M image-text pairs, expanding with interleaved and video-text data.
    • Freezes the core language branch, maintaining language capabilities while enhancing visual generation.
  2. Stage 2: Full Model Fine-Tuning
    • Activates all model parameters (except for frozen VAE encoder/decoder).
    • Trains on 9M multimodal instruction data and 16M high-quality generation examples, integrating multimodal reasoning, temporal understanding, and joint sequence generation.
    • Flow heads can be initialized from smaller model checkpoints and adapted when scaling the language backbone (e.g., from 1.5B to 7B parameters) using lightweight adapters.

This two-stage approach allows efficient transfer of trained components to larger models, facilitating rapid scalability without retraining all parameters from scratch.
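As a concrete illustration of the Stage 1 setup, the sketch below freezes everything except the projector, fusion, and flow-head modules; the attribute names and optimizer settings are hypothetical:

```python
import torch

def configure_stage1(model: torch.nn.Module):
    # Freeze all parameters, including the core language branch.
    for p in model.parameters():
        p.requires_grad = False
    # Unfreeze only the visual-generation components trained in Stage 1.
    for name in ("projector", "spatial_temporal_fusion", "flow_head"):  # hypothetical attribute names
        module = getattr(model, name, None)
        if module is not None:
            for p in module.parameters():
                p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)  # learning rate is illustrative
```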

5. Multimodal Task Coverage and Performance

Show-o2 demonstrates versatility across a spectrum of benchmarks:

  • Multimodal Understanding: Outperforms or matches larger state-of-the-art models on datasets such as MME, GQA, SEED, MM-Bench, MMMU, MMStar, and AI2D, even at the 1.5B and 7B parameter scales.
  • Visual Generation: Excels on GenEval, DPG-Bench, and VBench for both image and video generation, often surpassing models trained on larger or more modality-specific corpora.
  • Mixed-Modality and Visual Storytelling: Capable of generating coherent, interleaved sequences of text and visual content, including image-to-video and text-to-video generation with temporal and content consistency.
  • Bilingual Support: Shows strong proficiency in both English and Chinese across understanding and generation scenarios.

6. Open Resources and Applicability

All code, pretrained model weights, and relevant development scripts are publicly maintained at https://github.com/showlab/Show-o, supporting further research and deployment. The unified, extensible approach supports a range of applications—from image and video captioning, question answering, and visual task reasoning to native multimodal generation pipelines for both academic research and industrial deployment.

Summary Table

| Component | Description |
| --- | --- |
| 3D Causal VAE | Unified latent representation for images and videos |
| Dual-Path Fusion | Semantic (SigLIP) + spatial projector paths; spatial-temporal fusion mechanism |
| Autoregressive Modeling | Next-token prediction for text via the language head |
| Flow Matching | Image/video generative modeling via the flow head and velocity prediction |
| Two-Stage Training | Stage 1: visual generative pretraining; Stage 2: full fine-tuning for understanding and generation |
| Multimodal Capabilities | Text/image/video understanding, mixed-modality generation, temporal reasoning |
| Code Availability | https://github.com/showlab/Show-o |

Conclusion

Show-o2 establishes a native, scalable framework for joint multimodal generation and understanding, with a modular design that incorporates causal VAE-based visual representation, dual-path fusion, and complementary language and flow objectives. Through a carefully orchestrated training process and release of public resources, Show-o2 is positioned as a foundational model for unified multimodal research and deployment across text, image, and video domains.