Show-o2: Unified Multimodal Transformer

Updated 30 June 2025
  • Show-o2 is an advanced unified multimodal transformer that fuses autoregressive text modeling with flow matching for visual generation in a 3D causal VAE framework.
  • The architecture employs a dual-path fusion mechanism that integrates semantic and spatial-temporal features, ensuring coherent handling of text, image, and video modalities.
  • Its scalable two-stage training paradigm and open resources deliver robust performance on tasks like visual question answering, storytelling, and mixed-modality data understanding.

The Show-o2 Model is an advanced unified multimodal transformer architecture designed for effective understanding and generation across text, image, and video modalities. It achieves native multimodal handling by integrating autoregressive modeling for language with flow matching for visual generative processes. Central to Show-o2 is a scalable dual-path fusion mechanism within a 3D causal variational autoencoder (VAE) space, supporting both spatial and temporal data, and enabling seamless extension from images to videos. This design facilitates high adaptability for a broad spectrum of multimodal tasks, from visual question answering to coherent visual storytelling. The complete implementation and model weights are openly available at https://github.com/showlab/Show-o.

1. Model Architecture

Show-o2 introduces a 3D causal VAE as its foundational latent space, unifying the representation of both static images and dynamic video sequences. The encoder processes raw images or video frames, producing latent tensors that reflect both spatial and temporal structure. For generative purposes, noise scheduling interpolates between sampled noise $x_0$ and the clean latent code $x_1$: $x_t = t \cdot x_1 + (1 - t) \cdot x_0$, where $x_0 \sim \mathcal{N}(0, 1)$ and $t$ ranges from 0 to 1. Decoding leverages a learned 3D causal VAE decoder, reconstructing visual data while preserving consistency with the input's modality and structure.
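As a concrete illustration of the interpolation above, here is a minimal PyTorch sketch; the tensor layout (batch, channel, time, height, width) and the uniform sampling of $t$ are assumptions for illustration, not details from the released code.

```python
import torch

def interpolate_latents(x1: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Linearly interpolate between Gaussian noise x0 and clean VAE latents x1.

    x1: clean latents from the 3D causal VAE encoder, assumed shape (B, C, T, H, W).
    t:  interpolation times in [0, 1], shape (B,).
    Returns x_t = t * x1 + (1 - t) * x0 with x0 ~ N(0, I).
    """
    x0 = torch.randn_like(x1)           # sampled Gaussian noise
    t = t.view(-1, 1, 1, 1, 1)          # broadcast over the latent dimensions
    return t * x1 + (1.0 - t) * x0

# Example usage with illustrative shapes (batch=2, 16 channels, 4 frames, 32x32 latents).
x1 = torch.randn(2, 16, 4, 32, 32)
t = torch.rand(2)
x_t = interpolate_latents(x1, t)
print(x_t.shape)  # torch.Size([2, 16, 4, 32, 32])
```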

The architecture supports interleaved sequences of text and unified visual representations. Text tokens are embedded via a dedicated language branch; visual tokens—comprising both semantic and low-level features—are fused and interleaved with textual data, enabling causal multimodal attention through the sequence.

2. Dual-Path Spatial-Temporal Fusion Mechanism

To ensure rich multimodal representations, Show-o2 employs a dual-path fusion strategy:

  • Semantic Path ($\mathcal{S}(\cdot)$): Utilizes Vision Transformer layers, adapted from SigLIP, applied to the 3D VAE latents to extract high-level semantics. A distillation loss aligns these features with pre-trained SigLIP outputs for the same image (see the sketch after this list):

$$\mathcal{L}_{\text{distill}} = -\sum \log \text{sim}\big(\mathcal{S}(x_t), \text{SigLIP}(X)\big)$$

where $\text{sim}$ denotes cosine similarity and $X$ is the raw input image.

  • Projector Path ($\mathcal{P}(\cdot)$): Embeds low-level spatial (and, for video, temporal) features from the VAE latent tensor.
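The sketch referenced in the Semantic Path item above: a minimal rendering of the distillation objective, assuming token-aligned (B, N, D) features from the semantic path and a frozen SigLIP encoder. The mean reduction and the similarity clamping are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(semantic_feats: torch.Tensor,
                      siglip_feats: torch.Tensor,
                      eps: float = 1e-6) -> torch.Tensor:
    """Negative log cosine similarity between semantic-path features S(x_t)
    and frozen SigLIP features of the raw image X, averaged over tokens.

    Both inputs: (B, N, D) token features. Clamping keeps log() well-defined
    when a similarity happens to be non-positive.
    """
    sim = F.cosine_similarity(semantic_feats, siglip_feats, dim=-1)  # (B, N)
    sim = sim.clamp(min=eps)
    return -(sim.log()).mean()

# Illustrative call with random features.
s = torch.randn(2, 256, 768)
g = torch.randn(2, 256, 768)
print(distillation_loss(s, g))
```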

Fusion proceeds by concatenating the two paths' outputs along the feature dimension and processing them through normalization (RMSNorm) and multilayer perceptrons, resulting in unified visual representations $u$:

$$u = \text{STF}\big(\mathcal{S}(x_t), \mathcal{P}(x_t)\big)$$

For video data, fusion extends across both spatial and temporal axes, allowing coherent modeling of visual sequences.
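A minimal sketch of this fusion step: only the concatenate, RMSNorm, MLP pattern is taken from the description above, while the hidden sizes, MLP depth, and activation are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpatialTemporalFusion(nn.Module):
    """Fuse semantic-path and projector-path features into unified visual tokens u.

    Concatenates the two streams along the feature dimension, then applies
    RMSNorm (nn.RMSNorm requires PyTorch >= 2.4) and an MLP. Sizes are illustrative.
    """
    def __init__(self, sem_dim: int = 768, proj_dim: int = 768, out_dim: int = 2048):
        super().__init__()
        self.norm = nn.RMSNorm(sem_dim + proj_dim)
        self.mlp = nn.Sequential(
            nn.Linear(sem_dim + proj_dim, out_dim),
            nn.SiLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, sem: torch.Tensor, proj: torch.Tensor) -> torch.Tensor:
        # sem, proj: (B, N, D) visual token features from the two paths.
        fused = torch.cat([sem, proj], dim=-1)   # concatenate along the feature dim
        return self.mlp(self.norm(fused))        # unified representation u

stf = SpatialTemporalFusion()
u = stf(torch.randn(2, 256, 768), torch.randn(2, 256, 768))
print(u.shape)  # torch.Size([2, 256, 2048])
```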

The resulting unified sequence, e.g., [BOS] {Text} [BOI] {Image} [EOI] {Text} ... [EOS], is processed by a transformer with omni-attention, enabling causal modeling over the entire multimodal context.
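The following sketch shows one plausible way to assemble such an interleaved sequence before it reaches the omni-attention transformer; the special-token ids and the dictionary-based representation are hypothetical stand-ins for the actual tokenizer and packing logic.

```python
from typing import List, Union
import torch

# Hypothetical special-token ids for illustration; real ids come from the tokenizer.
BOS, BOI, EOI, EOS = 0, 1, 2, 3

def build_interleaved_sequence(segments: List[Union[List[int], torch.Tensor]]) -> List[dict]:
    """Arrange text token ids and unified visual tokens u into one causal sequence.

    Text segments are lists of token ids; visual segments are (N, D) tensors of
    unified visual tokens and get wrapped in [BOI]/[EOI].
    """
    sequence = [{"type": "special", "id": BOS}]
    for seg in segments:
        if isinstance(seg, torch.Tensor):                       # visual segment
            sequence.append({"type": "special", "id": BOI})
            sequence.append({"type": "visual", "tokens": seg})
            sequence.append({"type": "special", "id": EOI})
        else:                                                   # text segment
            sequence.extend({"type": "text", "id": tok} for tok in seg)
    sequence.append({"type": "special", "id": EOS})
    return sequence

seq = build_interleaved_sequence([[10, 11, 12], torch.randn(256, 2048), [13, 14]])
print(len(seq))  # 10 entries: BOS, 3 text, BOI, visual block, EOI, 2 text, EOS
```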

3. Native Multimodal Generation: Autoregressive and Flow Matching Heads

Show-o2 decouples learning for language and visual generation using specialized heads:

  • Language Head (Autoregressive Modeling):
    • Standard next-token prediction objective applies causal attention for textual tokens.
  • Flow Head (Flow Matching for Image/Video Generation):
    • Models the time derivative of the visual latents, $v_t = \frac{dx_t}{dt}$, employing transformer blocks with temporal conditioning (adaLN-Zero, as in DiT).
    • The flow head applies full attention over the unified visual tokens, promoting consistency and coherence in generated visual outputs (a block-level sketch follows this list).
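The sketch referenced above: one flow-head transformer block with adaLN-Zero-style time conditioning. The layer widths, the use of nn.MultiheadAttention, and the conditioning MLP are illustrative assumptions rather than the released architecture.

```python
import torch
import torch.nn as nn

class FlowHeadBlock(nn.Module):
    """One transformer block of a flow head predicting the velocity v_t = dx_t/dt.

    adaLN-Zero style: a time embedding produces shift/scale/gate parameters,
    with the final projection zero-initialized so the block starts near identity.
    Full (non-causal) self-attention runs over the unified visual tokens.
    """
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Time-conditioning MLP -> 6 modulation vectors (shift/scale/gate for attn and mlp).
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.ada[-1].weight)
        nn.init.zeros_(self.ada[-1].bias)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) visual tokens; t_emb: (B, D) embedding of the flow time t.
        s1, sc1, g1, s2, sc2, g2 = self.ada(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + sc1.unsqueeze(1)) + s1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + sc2.unsqueeze(1)) + s2.unsqueeze(1)
        return x + g2.unsqueeze(1) * self.mlp(h)

block = FlowHeadBlock()
v = block(torch.randn(2, 256, 512), torch.randn(2, 512))
print(v.shape)  # torch.Size([2, 256, 512])
```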

The combined training objective is $\mathcal{L} = \alpha \mathcal{L}_{\text{NTP}} + \mathcal{L}_{\text{FM}}$, where $\mathcal{L}_{\text{NTP}}$ is the next-token-prediction (language modeling) loss, $\mathcal{L}_{\text{FM}}$ is the flow-matching loss, and $\alpha$ is a balancing coefficient.
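A hedged sketch of this combined objective: the cross-entropy and MSE formulations and $\alpha = 1.0$ are illustrative assumptions, while the flow-matching target $x_1 - x_0$ follows directly from the interpolation $x_t = t \cdot x_1 + (1 - t) \cdot x_0$ defined earlier.

```python
import torch
import torch.nn.functional as F

def combined_loss(text_logits: torch.Tensor,
                  text_targets: torch.Tensor,
                  v_pred: torch.Tensor,
                  x0: torch.Tensor,
                  x1: torch.Tensor,
                  alpha: float = 1.0) -> torch.Tensor:
    """L = alpha * L_NTP + L_FM.

    L_NTP: next-token cross-entropy over text positions.
    L_FM:  regression of the predicted velocity onto the flow-matching target;
           with x_t = t*x1 + (1-t)*x0 the target velocity is dx_t/dt = x1 - x0.
    """
    l_ntp = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    l_fm = F.mse_loss(v_pred, x1 - x0)
    return alpha * l_ntp + l_fm

# Illustrative shapes: batch of 2, 16 text positions, 32k vocab; (2, 16, 4, 32, 32) latents.
logits = torch.randn(2, 16, 32000)
targets = torch.randint(0, 32000, (2, 16))
v_pred = torch.randn(2, 16, 4, 32, 32)
x0, x1 = torch.randn_like(v_pred), torch.randn_like(v_pred)
print(combined_loss(logits, targets, v_pred, x0, x1))
```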

4. Scalable Two-Stage Training Paradigm

Show-o2 employs a two-stage training strategy to ensure both performance and scalability:

  1. Stage 1: Visual Generative Pretraining
    • Trains the projector, spatial-temporal fusion, and flow head components solely on visual generation objectives.
    • Uses approximately 66M image-text pairs, expanding with interleaved and video-text data.
    • Freezes the core language branch, maintaining language capabilities while enhancing visual generation.
  2. Stage 2: Full Model Fine-Tuning
    • Activates all model parameters (except for frozen VAE encoder/decoder).
    • Trains on 9M multimodal instruction examples and 16M high-quality generation examples, integrating multimodal reasoning, temporal understanding, and joint sequence generation.
    • Flow heads can be initialized from smaller model checkpoints and adapted when scaling the language backbone (e.g., from 1.5B to 7B parameters) using lightweight adapters.

This two-stage approach allows efficient transfer of trained components to larger models, facilitating rapid scalability without retraining all parameters from scratch.
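As an illustration of this staging, the sketch below freezes and unfreezes parameter groups by name; the submodule names (vae, projector, fusion, flow_head, language branch) are hypothetical stand-ins for the actual module layout.

```python
import torch.nn as nn

def set_stage(model: nn.Module, stage: int) -> None:
    """Freeze/unfreeze parameter groups per training stage.

    Stage 1: train only the projector, spatial-temporal fusion, and flow head;
             the language branch and the VAE stay frozen.
    Stage 2: unfreeze everything except the VAE encoder/decoder.
    Module names here are hypothetical, not taken from the released code.
    """
    trainable_stage1 = ("projector", "fusion", "flow_head")
    for name, param in model.named_parameters():
        if name.startswith("vae"):
            param.requires_grad = False                    # VAE frozen in both stages
        elif stage == 1:
            param.requires_grad = name.startswith(trainable_stage1)
        else:
            param.requires_grad = True                     # stage 2: full fine-tuning

# Usage: set_stage(show_o2_model, stage=1) before visual generative pretraining,
# then set_stage(show_o2_model, stage=2) for full fine-tuning.
```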

5. Multimodal Task Coverage and Performance

Show-o2 demonstrates versatility across a spectrum of benchmarks:

  • Multimodal Understanding: Outperforms or matches larger state-of-the-art models on datasets such as MME, GQA, SEED, MM-Bench, MMMU, MMStar, and AI2D, even at the 1.5B and 7B parameter scales.
  • Visual Generation: Excels on GenEval, DPG-Bench, and VBench for both image and video generation, often surpassing models trained on larger or more modality-specific corpora.
  • Mixed-Modality and Visual Storytelling: Capable of generating coherent, interleaved sequences of text and visual content, including image-to-video and text-to-video generation with temporal and content consistency.
  • Bilingual Support: Shows strong proficiency in both English and Chinese across understanding and generation scenarios.

6. Open Resources and Applicability

All code, pretrained model weights, and relevant development scripts are publicly maintained at https://github.com/showlab/Show-o, supporting further research and deployment. The unified, extensible approach supports a range of applications—from image and video captioning, question answering, and visual task reasoning to native multimodal generation pipelines for both academic research and industrial deployment.

Summary Table

| Component | Description |
|---|---|
| 3D Causal VAE | Unified latent representation for images and videos |
| Dual-Path Fusion | Semantic (SigLIP) + spatial projectors; spatial-temporal fusion mechanism |
| Autoregressive Modeling | Next-token prediction for text via language head |
| Flow Matching | Image/video generative modeling via flow head and velocity prediction |
| Two-Stage Training | Stage 1: visual generative pretraining; Stage 2: full fine-tuning for understanding |
| Multimodal Capabilities | Text/image/video understanding, mixed-modality generation, temporal reasoning |
| Code Availability | https://github.com/showlab/Show-o |

Conclusion

Show-o2 establishes a native, scalable framework for joint multimodal generation and understanding, with a modular design that incorporates causal VAE-based visual representation, dual-path fusion, and complementary language and flow objectives. Through a carefully orchestrated training process and release of public resources, Show-o2 is positioned as a foundational model for unified multimodal research and deployment across text, image, and video domains.
