Show-o2 Model: A Unified Multimodal Transformer
The Show-o2 Model is an advanced unified multimodal transformer architecture designed for effective understanding and generation across text, image, and video modalities. It achieves native multimodal handling by integrating autoregressive modeling for language with flow matching for visual generative processes. Central to Show-o2 is a scalable dual-path fusion mechanism within a 3D causal variational autoencoder (VAE) space, supporting both spatial and temporal data, and enabling seamless extension from images to videos. This design facilitates high adaptability for a broad spectrum of multimodal tasks, from visual question answering to coherent visual storytelling. The complete implementation and model weights are openly available at https://github.com/showlab/Show-o.
1. Model Architecture
Show-o2 introduces a 3D causal VAE as its foundational latent space, unifying the representation of both static images and dynamic video sequences. The encoder processes raw images or video frames, outputting latent tensors that reflect both spatial and temporal structure. For generative purposes, noise scheduling linearly interpolates between sampled noise and the latent code, $x_t = (1 - t)\,x_0 + t\,x_1$, where $x_0$ is sampled noise, $x_1$ is the clean latent code, and $t$ ranges from 0 to 1. Decoding leverages a learned 3D causal VAE decoder, reconstructing visual data while preserving consistency with the input's modality and structure.
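As a minimal sketch of this interpolation in the latent space (the tensor shapes, the schedule direction, and the `interpolate_latents` helper are illustrative assumptions, not the released implementation):

```python
import torch

def interpolate_latents(x1: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Linearly interpolate between Gaussian noise x0 and a clean VAE latent x1.

    x1: clean 3D causal VAE latent, shape (B, C, T, H, W); T = 1 for still images.
    t:  per-sample timestep in [0, 1], shape (B,).
    Returns x_t = (1 - t) * x0 + t * x1, the noised latent fed to the flow head.
    """
    x0 = torch.randn_like(x1)            # sampled Gaussian noise
    t = t.view(-1, 1, 1, 1, 1)           # broadcast over (C, T, H, W)
    return (1.0 - t) * x0 + t * x1

# Example: a batch of 2 video latents with 8 temporal steps.
x1 = torch.randn(2, 16, 8, 32, 32)
x_t = interpolate_latents(x1, torch.rand(2))
```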
The architecture supports interleaved sequences of text and unified visual representations. Text tokens are embedded via a dedicated language branch; visual tokens—comprising both semantic and low-level features—are fused and interleaved with textual data, enabling causal multimodal attention through the sequence.
2. Dual-Path Spatial-Temporal Fusion Mechanism
To ensure rich multimodal representations, Show-o2 employs a dual-path fusion strategy:
- Semantic Path: Utilizes Vision Transformer layers, adapted from SigLIP and applied to the 3D VAE latents, to extract high-level semantics. A distillation loss aligns these features with pre-trained SigLIP outputs for the same image, $\mathcal{L}_{\text{distill}} = 1 - \mathrm{sim}\big(f_{\text{sem}}(z), \mathrm{SigLIP}(I)\big)$, where $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity and $I$ is the raw input image.
- Projector Path: Embeds low-level spatial (and, for video, temporal) features from the VAE latent tensor.
Fusion proceeds by concatenating the two paths' outputs along the feature dimension and processing them through RMSNorm and multilayer perceptrons, yielding unified visual representations $u = \mathrm{MLP}\big(\mathrm{RMSNorm}([f_{\text{sem}};\, f_{\text{proj}}])\big)$, as sketched below. For video data, fusion extends across both spatial and temporal axes, allowing coherent modeling of visual sequences.
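The following sketch shows one way the dual-path fusion and the SigLIP distillation alignment could be wired up; the `DualPathFusion` class, layer sizes, and exact loss form are assumptions rather than the released code (`nn.RMSNorm` requires PyTorch 2.4 or newer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathFusion(nn.Module):
    """Sketch of dual-path fusion: semantic and low-level projector features are
    concatenated along the feature dimension, RMS-normalized, and mixed by MLPs
    into unified visual tokens. Module choices here are illustrative assumptions.
    """
    def __init__(self, latent_dim: int = 768, hidden_dim: int = 2048):
        super().__init__()
        # Semantic path: stand-in for the SigLIP-adapted ViT layers.
        self.semantic = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=8, batch_first=True)
        # Projector path: embeds low-level spatial(-temporal) VAE features.
        self.projector = nn.Linear(latent_dim, latent_dim)
        self.norm = nn.RMSNorm(2 * latent_dim)       # PyTorch >= 2.4
        self.mlp = nn.Sequential(
            nn.Linear(2 * latent_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim))

    def forward(self, z_tokens: torch.Tensor):
        # z_tokens: flattened 3D causal VAE latents, shape (B, N, latent_dim).
        f_sem = self.semantic(z_tokens)
        f_low = self.projector(z_tokens)
        fused = torch.cat([f_sem, f_low], dim=-1)    # concat along features
        u = self.mlp(self.norm(fused))               # unified visual tokens
        return u, f_sem

def distillation_loss(f_sem: torch.Tensor, siglip_feat: torch.Tensor) -> torch.Tensor:
    """Align semantic-path features with frozen SigLIP features of the raw image
    via cosine similarity (the exact loss form is an assumption)."""
    return 1.0 - F.cosine_similarity(f_sem, siglip_feat, dim=-1).mean()

# Example usage with random latent tokens.
fusion = DualPathFusion()
u, f_sem = fusion(torch.randn(2, 64, 768))
```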
The resulting unified sequence—e.g., {Text} {Image} {Text} ... —is processed by a transformer with omni-attention, enabling causal modeling over the entire multimodal context.
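One plausible reading of this omni-attention pattern is causal attention over the interleaved sequence plus full attention within each image or video segment; the mask builder below is a sketch under that assumption, not the released implementation:

```python
import torch

def omni_attention_mask(segment_ids: torch.Tensor) -> torch.Tensor:
    """Build a boolean attention mask (True = attend) for an interleaved sequence.

    segment_ids: (L,) tensor with 0 for text tokens and a distinct positive id
    for the tokens of each image/video. Text follows causal attention; visual
    tokens additionally attend bidirectionally within their own image/video.
    """
    L = segment_ids.shape[0]
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
    same_visual = (segment_ids[:, None] == segment_ids[None, :]) & (segment_ids[:, None] > 0)
    return causal | same_visual

# Example: 3 text tokens, a 4-token image, then 2 more text tokens.
mask = omni_attention_mask(torch.tensor([0, 0, 0, 1, 1, 1, 1, 0, 0]))
```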
3. Native Multimodal Generation: Autoregressive and Flow Matching Heads
Show-o2 decouples learning for language and visual generation using specialized heads:
- Language Head (Autoregressive Modeling):
- A standard next-token prediction objective with causal attention over textual tokens.
- Flow Head (Flow Matching for Image/Video Generation):
- Models the velocity (time derivative) of the visual latents, $v_t = \mathrm{d}x_t/\mathrm{d}t$, employing transformer blocks with timestep conditioning (adaLN-Zero, as in DiT).
- The flow head supports full attention over unified visual tokens, promoting consistency and coherence in generated visual outputs.
The combined training objective is $\mathcal{L} = \mathcal{L}_{\text{NTP}} + \alpha\,\mathcal{L}_{\text{flow}}$, where $\mathcal{L}_{\text{NTP}}$ is the next-token-prediction (language modeling) loss, $\mathcal{L}_{\text{flow}}$ is the flow-matching loss, and $\alpha$ is a balancing coefficient.
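A minimal sketch of this joint objective, assuming the interpolation $x_t = (1 - t)\,x_0 + t\,x_1$ from Section 1 (so the velocity target is $x_1 - x_0$); the `combined_loss` helper and the default value of $\alpha$ are illustrative placeholders:

```python
import torch
import torch.nn.functional as F

def combined_loss(text_logits, text_targets, v_pred, x0, x1, alpha: float = 1.0):
    """Joint objective sketch: next-token prediction for text plus flow matching
    for visual latents. Exact weighting and target form are assumptions.
    """
    # Autoregressive language-modeling loss over text positions.
    # text_logits: (B, L, V), text_targets: (B, L)
    l_ntp = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    # Flow-matching loss: regress the constant velocity x1 - x0 of the path.
    l_flow = F.mse_loss(v_pred, x1 - x0)
    return l_ntp + alpha * l_flow
```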
4. Scalable Two-Stage Training Paradigm
Show-o2 employs a two-stage training strategy to ensure both performance and scalability:
- Stage 1: Visual Generative Pretraining
- Trains the projector, spatial-temporal fusion, and flow head components solely on visual generation objectives.
- Uses approximately 66M image-text pairs, expanding with interleaved and video-text data.
- Freezes the core language branch, maintaining language capabilities while enhancing visual generation.
- Stage 2: Full Model Fine-Tuning
- Activates all model parameters (except for frozen VAE encoder/decoder).
- Trains on 9M multimodal instruction examples and 16M high-quality generation examples, integrating multimodal reasoning, temporal understanding, and joint sequence generation.
- Flow heads can be initialized from smaller model checkpoints and adapted when scaling the language backbone (e.g., from 1.5B to 7B parameters) using lightweight adapters.
This two-stage approach allows efficient transfer of trained components to larger models, facilitating rapid scalability without retraining all parameters from scratch.
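As an illustrative sketch of how the two stages might be configured in code (the `configure_stage` helper and the submodule names `vae` and `language_model` are assumptions):

```python
import torch.nn as nn

def configure_stage(model: nn.Module, stage: int) -> nn.Module:
    """Set per-stage trainability; submodule names are illustrative assumptions."""
    for p in model.parameters():
        p.requires_grad = True
    # The 3D causal VAE encoder/decoder stays frozen in both stages.
    for p in model.vae.parameters():
        p.requires_grad = False
    if stage == 1:
        # Stage 1: train only the projector, fusion, and flow head on visual
        # generation objectives; keep the pre-trained language branch frozen.
        for p in model.language_model.parameters():
            p.requires_grad = False
    # Stage 2: all parameters except the frozen VAE are trainable.
    return model
```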
5. Multimodal Task Coverage and Performance
Show-o2 demonstrates versatility across a spectrum of benchmarks:
- Multimodal Understanding: Outperforms or matches larger state-of-the-art models on benchmarks such as MME, GQA, SEED, MMBench, MMMU, MMStar, and AI2D, even at the 1.5B and 7B parameter scales.
- Visual Generation: Excels on GenEval, DPG-Bench, and VBench for both image and video generation, often surpassing models trained on larger or more modality-specific corpora.
- Mixed-Modality and Visual Storytelling: Capable of generating coherent, interleaved sequences of text and visual content, including image-to-video and text-to-video generation with temporal and content consistency.
- Bilingual Support: Shows strong proficiency in both English and Chinese across understanding and generation scenarios.
6. Open Resources and Applicability
All code, pretrained model weights, and relevant development scripts are publicly maintained at https://github.com/showlab/Show-o, supporting further research and deployment. The unified, extensible approach supports a range of applications—from image and video captioning, question answering, and visual task reasoning to native multimodal generation pipelines for both academic research and industrial deployment.
Summary Table
| Component | Description |
|---|---|
| 3D Causal VAE | Unified latent representation for images and videos |
| Dual-Path Fusion | Semantic (SigLIP) + low-level projector paths; spatial-temporal fusion mechanism |
| Autoregressive Modeling | Next-token prediction for text via the language head |
| Flow Matching | Image/video generative modeling via the flow head and velocity prediction |
| Two-Stage Training | Stage 1: visual generative pretraining; Stage 2: full fine-tuning for understanding and generation |
| Multimodal Capabilities | Text/image/video understanding, mixed-modality generation, temporal reasoning |
| Code Availability | https://github.com/showlab/Show-o |
Conclusion
Show-o2 establishes a native, scalable framework for joint multimodal generation and understanding, with a modular design that incorporates causal VAE-based visual representation, dual-path fusion, and complementary language and flow objectives. Through a carefully orchestrated training process and release of public resources, Show-o2 is positioned as a foundational model for unified multimodal research and deployment across text, image, and video domains.