Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
The paper introduces Mogao, a unified foundation model designed to advance interleaved multi-modal generation. In contrast to existing models constrained to single-modal outputs, Mogao adopts a causal, sequence-level approach that enables coherent, interleaved generation of text and images within a single stream. The model combines recent advances in architecture design and training strategy to perform strongly at both understanding and generating multi-modal content.
Architecture and Design
Mogao leverages a hybrid architecture that combines autoregressive modeling for text generation with diffusion modeling for image synthesis. The two are fused inside a unified transformer backbone through modality-specific query-key-value (QKV) attention projections and feed-forward networks (FFNs), so text and visual tokens are processed by separate parameters while attending jointly over the full interleaved sequence. The paper highlights several architectural innovations (illustrative code sketches follow the list below):
- Dual Visual Encoders: Mogao employs both a Vision Transformer (ViT) for semantic understanding and a Variational Autoencoder (VAE) for image generation. The ViT features support high-level semantic reasoning over visual inputs, while the VAE latents provide the representation on which images are synthesized.
- Interleaved Rotary Position Embeddings: A position-embedding scheme is introduced to capture both the sequence order of the interleaved text-image stream and the spatial structure within each image, improving contextual alignment across modalities (one assumed layout is sketched below).
- Multi-modal Classifier-Free Guidance: By guiding generation with multiple conditioning signals rather than a single prompt, Mogao sharpens image generation and mitigates the repetitive outputs that plain classifier-free guidance can produce in interleaved settings (see the sketch below).
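To make the deep-fusion design above concrete, here is a minimal sketch of a transformer block with per-modality QKV projections and FFNs but a single joint attention over the interleaved sequence. It is written in PyTorch; the layer sizes, normalization placement, and the omission of rotary embeddings and of the ViT/VAE tokenizers are simplifying assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ModalitySpecificBlock(nn.Module):
    """One transformer block with separate QKV projections and FFNs per modality
    but a single joint attention over the full interleaved sequence (a sketch of
    the deep-fusion idea; details are assumptions, not the paper's exact design)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Modality-specific parameters: index 0 = text, index 1 = image.
        self.qkv = nn.ModuleList([nn.Linear(dim, 3 * dim) for _ in range(2)])
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])
        self.ffn = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(2)
        ])

    def forward(self, x: torch.Tensor, modality: torch.Tensor, attn_mask=None):
        # x: (B, L, D); modality: (B, L) with 0 for text tokens, 1 for image tokens.
        B, L, D = x.shape
        h = self.norm1(x)

        # Route each token through the QKV projection of its own modality.
        qkv = torch.zeros(B, L, 3 * D, device=x.device, dtype=x.dtype)
        for m in (0, 1):
            sel = modality == m
            qkv[sel] = self.qkv[m](h[sel])
        q, k, v = qkv.view(B, L, 3, self.n_heads, D // self.n_heads).unbind(dim=2)

        # Joint attention over the whole interleaved sequence.
        attn = nn.functional.scaled_dot_product_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), attn_mask=attn_mask
        ).transpose(1, 2).reshape(B, L, D)

        out = torch.zeros_like(x)
        for m in (0, 1):
            sel = modality == m
            out[sel] = self.proj[m](attn[sel])
        x = x + out

        # Modality-specific feed-forward networks.
        h = self.norm2(x)
        out = torch.zeros_like(x)
        for m in (0, 1):
            sel = modality == m
            out[sel] = self.ffn[m](h[sel])
        return x + out
```

In a full model, the text positions would typically hold language-embedding vectors while the image positions would hold ViT features (for understanding) or noised VAE latents (for generation), consistent with the dual-encoder design above.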
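The interleaved rotary position embedding can be thought of as assigning each token coordinates along several axes before the rotary rotation is applied. The sketch below assigns (sequence, row, column) ids in the spirit of multi-axis RoPE schemes such as M-RoPE; treating this layout as Mogao's exact scheme is an assumption.

```python
import torch

def interleaved_position_ids(segments, image_grid=(16, 16)):
    """Assign (sequence, row, column) position ids to an interleaved sequence.

    `segments` is a list like [("text", 12), ("image", 256), ("text", 7)].
    Text tokens advance the sequence axis and carry zero spatial offsets; image
    patches share one sequence position per image but spread over a 2-D grid.
    An assumed reading of "interleaved rotary position embeddings", not the
    paper's exact scheme.
    """
    seq_ids, row_ids, col_ids = [], [], []
    pos = 0
    for kind, length in segments:
        if kind == "text":
            for _ in range(length):
                seq_ids.append(pos)
                row_ids.append(0)
                col_ids.append(0)
                pos += 1
        else:
            rows, cols = image_grid
            assert length == rows * cols, "image segment must match the patch grid"
            for r in range(rows):
                for c in range(cols):
                    seq_ids.append(pos)
                    row_ids.append(r)
                    col_ids.append(c)
            pos += 1  # the whole image advances the sequence axis by one step
    return torch.tensor([seq_ids, row_ids, col_ids])  # shape (3, total_tokens)
```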
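For multi-modal classifier-free guidance, one common recipe combines an unconditional prediction, a text-only prediction, and a fully conditioned prediction with separate guidance scales. The sketch below follows that recipe; the `model` signature, the choice of conditioning signals, and the scale values are illustrative assumptions rather than Mogao's published formulation.

```python
import torch

def multimodal_cfg(model, x_t, t, text_cond, image_cond,
                   text_scale=7.5, image_scale=1.5):
    """Multi-condition classifier-free guidance over a noised latent x_t at step t.

    Combines unconditional, text-conditioned, and fully conditioned noise
    predictions.  The nested combination mirrors the common multi-condition CFG
    recipe; whether Mogao uses exactly this rule (or these scales) is an assumption.
    """
    eps_uncond = model(x_t, t, text=None, image=None)
    eps_text = model(x_t, t, text=text_cond, image=None)
    eps_full = model(x_t, t, text=text_cond, image=image_cond)

    return (eps_uncond
            + text_scale * (eps_text - eps_uncond)
            + image_scale * (eps_full - eps_text))
```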
Training Strategy
Mogao is trained on a large-scale, in-house curated dataset of interleaved text and image sequences. The training regimen is characterized by:
- Adaptive Loss Balancing: Next-token prediction loss on textual tokens is combined with diffusion loss on visual tokens, with the two terms balanced so that neither modality dominates optimization (a sketch of the combined objective follows this list).
- Dynamic Causal Attention with Efficient Complete Teacher Forcing (ECTF): Because images are noised during training for the diffusion loss but should serve as clean context at inference, a training-inference gap arises; ECTF bridges this gap while keeping long interleaved sequences efficient to process (one possible masking scheme is sketched after this list).
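A minimal sketch of the combined objective: cross-entropy over text positions plus a weighted noise-prediction (diffusion) loss over image positions. The per-modality averaging and the `diffusion_weight` factor are illustrative assumptions rather than the paper's exact weighting.

```python
import torch
import torch.nn.functional as F

def interleaved_loss(text_logits, text_targets, noise_pred, noise_target,
                     text_mask, image_mask, diffusion_weight=1.0):
    """Combine next-token prediction loss on text tokens with diffusion loss on
    visual tokens.  Shapes: text_logits (B, L, V), noise_pred/noise_target
    (B, L, D), masks (B, L) booleans marking which positions belong to which
    modality.  The weighting scheme here is an assumption, not the paper's."""
    # Next-token prediction: cross-entropy only over positions holding text tokens.
    ce = F.cross_entropy(
        text_logits[text_mask], text_targets[text_mask], reduction="mean"
    )
    # Diffusion loss: mean-squared error between predicted and true noise,
    # only over positions holding (noised) image latents.
    mse = F.mse_loss(noise_pred[image_mask], noise_target[image_mask], reduction="mean")
    return ce + diffusion_weight * mse
```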
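One naive way to realize complete teacher forcing is to place two copies of every image in the training sequence: a noised copy that receives the diffusion loss and a clean copy that later tokens attend to, with an attention mask hiding the noised copy from the rest of the sequence. The sketch below builds such a mask; it is an assumed, unoptimized reading of ECTF, whose "efficient" variant presumably avoids this duplication overhead.

```python
import torch

def ectf_attention_mask(segments):
    """Build a boolean attention mask (True = may attend) for an interleaved
    training sequence in which every image appears twice: a *noised* copy
    (receiving the diffusion loss) and a *clean* copy (serving as context for
    later tokens).  `segments` is a list of (kind, length) tuples with kind in
    {"text", "img_noised", "img_clean"}.  A naive sketch only; the exact ECTF
    layout is an assumption.
    """
    bounds, kinds = [], []
    offset = 0
    for kind, length in segments:
        bounds.append((offset, offset + length))
        kinds.append(kind)
        offset += length

    mask = torch.zeros(offset, offset, dtype=torch.bool)
    for qi, (qs, qe) in enumerate(bounds):
        for ki, (ks, ke) in enumerate(bounds):
            if ki > qi:
                continue  # segment-level causality: never attend to future segments
            if ki == qi:
                if kinds[qi] == "text":
                    # Text attends causally within its own segment.
                    mask[qs:qe, ks:ke] = torch.ones(qe - qs, ke - ks).tril().bool()
                else:
                    # Image patches attend bidirectionally within their own image.
                    mask[qs:qe, ks:ke] = True
            elif kinds[ki] == "img_noised":
                continue  # later tokens see only the clean copy of each image
            else:
                mask[qs:qe, ks:ke] = True
    return mask
```

For example, `[("text", 5), ("img_noised", 16), ("img_clean", 16), ("text", 4)]` yields a mask in which the final text tokens attend to the first text span and the clean image, but never to the noised copy, matching how the model would condition on a fully generated image at inference.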
Experimental Results
Mogao's performance is evaluated against state-of-the-art benchmarks for both image generation and multi-modal understanding. The model achieves strong results across diverse evaluation suites, including GenEval, DPG-Bench, and GenAI-Bench, demonstrating robust text-to-image alignment, object comprehension, and compositional and logical reasoning.
For multi-modal understanding, Mogao delivers strong performance on POPE, MME, and SEEDBench, among other benchmarks, underscoring its ability to integrate visual and textual modalities effectively. Its emergent zero-shot image-editing capability further highlights its practical applicability and robustness in real-world scenarios.
Implications and Future Directions
Mogao sets a precedent for future research in omni-modal systems, showcasing the potential of unified models to transcend traditional single-modal limitations. Its capability to seamlessly engage with interleaved multi-modal tasks paves the way for more sophisticated artificial intelligence applications that require a deep integration of heterogeneous data types.
The developments seen in Mogao suggest promising avenues for future exploration, including enhancing model scalability, refining training paradigms to further reduce computational overhead, and building upon its existing architecture to diversify the range of interleaved modalities it can handle.