- The paper introduces decoupled text and image decoding pathways—autoregressive for text and diffusion-based for images—to enhance multimodal generation quality.
- It pairs comprehensive data curation, including video-derived training pairs and reflection data, with the new OmniContext benchmark to strengthen and evaluate in-context generation and image editing.
- Empirical results show OmniGen2 delivers competitive text-to-image synthesis, state-of-the-art or near state-of-the-art image editing, and the strongest in-context generation among open-source models, all with only 4B trainable parameters.
OmniGen2: A Unified Framework for Advanced Multimodal Generation
OmniGen2 presents a significant advancement in the design and implementation of unified multimodal generative models, targeting a broad spectrum of tasks including text-to-image synthesis, image editing, and in-context (subject-driven) generation. The model is open-source and emphasizes both architectural innovation and comprehensive data curation, with a focus on practical deployment and extensibility.
Architectural Innovations
OmniGen2 departs from the parameter-sharing paradigm of its predecessor, OmniGen, by introducing two distinct decoding pathways for text and image modalities. The text pathway is autoregressive, while the image pathway is diffusion-based, and the two share no parameters. This decoupling is motivated by the empirical finding that parameter sharing between text and image branches degrades image generation quality, even when the model is initialized from a strong MLLM such as Qwen2.5-VL.
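The split can be pictured as two modules that never share weights. Below is a minimal sketch, assuming a Hugging Face-style MLLM interface and a generic diffusion decoder; the class and method names are illustrative rather than OmniGen2's actual code.

```python
import torch
import torch.nn as nn

class DecoupledGenerator(nn.Module):
    """Illustrative only: text and image decoding use separate, unshared modules."""

    def __init__(self, mllm: nn.Module, diffusion_decoder: nn.Module):
        super().__init__()
        self.mllm = mllm                            # autoregressive pathway (Qwen2.5-VL-style MLLM)
        self.diffusion_decoder = diffusion_decoder  # diffusion pathway with its own parameters

    @torch.no_grad()
    def generate_text(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Text is decoded autoregressively by the MLLM alone, leaving its
        # language and understanding abilities untouched.
        return self.mllm.generate(input_ids)

    @torch.no_grad()
    def generate_image(self, input_ids: torch.Tensor, latent_noise: torch.Tensor) -> torch.Tensor:
        # The MLLM only supplies conditioning; denoising happens in the separate
        # diffusion decoder (see "Conditional Diffusion Decoding" below).
        out = self.mllm(input_ids, output_hidden_states=True)
        condition = out.hidden_states[-1]           # full hidden states, not compressed query tokens
        return self.diffusion_decoder(latent_noise, condition=condition)
```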
Key architectural features include:
- Decoupled Image Tokenizer: The VAE acts as an image tokenizer only for the diffusion image decoder, while the MLLM (initialized from Qwen2.5-VL-3B) processes images through its ViT encoder. This avoids re-adapting the MLLM to VAE features and preserves its original text generation and multimodal understanding capabilities.
- Conditional Diffusion Decoding: Instead of compressing all conditional information into a fixed set of learnable query tokens (as in MetaQuery and BLIP-3o), OmniGen2 leverages the full hidden states of the MLLM as conditioning for the diffusion decoder. This approach improves the model's ability to handle long and complex prompts without information loss.
- Omni-RoPE Position Embedding: A novel 3D rotary position embedding is introduced, decomposing position into a sequence/modality identifier and 2D spatial coordinates. This design enables precise distinction between different images and supports spatial consistency, which is critical for image editing and in-context generation tasks.
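To make the Omni-RoPE scheme concrete, the sketch below assigns each token a three-component position (sequence/modality id, height, width): text tokens advance the id one by one with zero spatial coordinates, while every image receives a single shared id and local 2D coordinates starting at (0, 0). Only this decomposition follows the paper's description; the function name and how frequencies would be split across components are assumptions.

```python
from typing import List, Tuple
import torch

def omni_rope_positions(segments: List[Tuple[str, int, int]]) -> torch.Tensor:
    """segments: ("text", num_tokens, 0) or ("image", grid_height, grid_width) entries.
    Returns a (3, total_tokens) tensor of (id, h, w) positions."""
    ids, hs, ws = [], [], []
    next_id = 0
    for kind, a, b in segments:
        if kind == "text":
            for _ in range(a):                      # each text token gets its own id
                ids.append(next_id); hs.append(0); ws.append(0)
                next_id += 1
        else:
            for h in range(a):                      # one shared id per image entity,
                for w in range(b):                  # plus local (h, w) starting at (0, 0)
                    ids.append(next_id); hs.append(h); ws.append(w)
            next_id += 1
    return torch.tensor([ids, hs, ws])

# Example: a 3-token prompt, a 2x2-token source image, and a 2x2-token target image.
# Corresponding patches in the two images share (h, w) but differ in id, which is
# what supports spatial consistency in editing while keeping the images distinct.
pos = omni_rope_positions([("text", 3, 0), ("image", 2, 2), ("image", 2, 2)])
print(pos.shape)  # torch.Size([3, 11])
```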
Data Construction and Benchmarking
Recognizing the limitations of existing open-source datasets—particularly for image editing and in-context generation—OmniGen2 introduces comprehensive data pipelines:
- Video-based Data Generation: By mining video data, the pipeline constructs training pairs that show the same subject under diverse conditions, supporting robust in-context and editing capabilities. It combines object detection (GroundingDINO), segmentation (SAM2), and VLM-based filtering to ensure subject consistency and diversity; a skeleton of this flow is sketched after the list.
- Reflection Data: Inspired by recent advances in LLM self-reflection, OmniGen2 curates a reflection dataset where the model iteratively generates images, critiques its outputs, and proposes corrections. This enables the integration of reasoning and self-improvement mechanisms into multimodal generation.
- OmniContext Benchmark: To address the lack of standardized evaluation for in-context generation, the OmniContext benchmark is introduced. It covers eight task categories across character, object, and scene contexts, with metrics for prompt following and subject consistency, evaluated using GPT-4.1.
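The video-derived pipeline referenced above can be skeletonized as follows. This is a hedged sketch: `detect`, `segment`, and `keep_pair` are placeholder callables standing in for GroundingDINO, SAM2, and the VLM filter, and pairing adjacent keyframes is an illustrative simplification, not necessarily the paper's exact strategy.

```python
from typing import Any, Callable, List, Optional, Tuple

def build_subject_pairs(
    frames: List[Any],
    subject_phrase: str,
    detect: Callable[[Any, str], Optional[Any]],   # open-vocabulary detector (GroundingDINO's role)
    segment: Callable[[Any, Any], Any],            # promptable segmenter (SAM2's role)
    keep_pair: Callable[[Any, Any, str], bool],    # VLM filter: same subject, diverse context
) -> List[Tuple[Any, Any, Any]]:
    """Pair keyframes that show the same subject under different conditions."""
    pairs = []
    for ref, tgt in zip(frames, frames[1:]):
        ref_box = detect(ref, subject_phrase)
        tgt_box = detect(tgt, subject_phrase)
        if ref_box is None or tgt_box is None:
            continue                                # the subject must be visible in both frames
        ref_mask = segment(ref, ref_box)            # isolate the reference subject
        if keep_pair(ref, tgt, subject_phrase):     # drop inconsistent or near-duplicate pairs
            pairs.append((ref, ref_mask, tgt))      # (reference image, subject mask, target image)
    return pairs
```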
Empirical Results
OmniGen2 posts strong quantitative results across multiple task domains:
- Text-to-Image Generation: On GenEval, OmniGen2 achieves an overall score of 0.86 (with LLM rewriter), surpassing UniWorld-V1 and approaching the performance of larger models like BAGEL, despite using only 4B trainable parameters and a fraction of the training data.
- Image Editing: The model attains state-of-the-art or near state-of-the-art results on Emu-Edit, GEdit-Bench-EN, and ImgEdit-Bench, excelling in both instruction-following and preservation of unedited regions.
- In-Context Generation: On the OmniContext benchmark, OmniGen2 achieves an overall score of 7.18, outperforming all open-source baselines in both prompt following and subject consistency across single, multiple, and scene-based tasks.
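For context on how such OmniContext numbers are obtained, the sketch below illustrates a GPT-as-judge evaluation in the spirit of the benchmark. The 0-10 scale, the prompt wording, and the geometric-mean aggregation are assumptions made for illustration; the released evaluation code defines the actual rubric.

```python
import math
from typing import Callable, Dict

def overall_score(prompt_following: float, subject_consistency: float) -> float:
    """Combine per-sample PF and SC ratings into a single score (geometric mean here)."""
    return math.sqrt(prompt_following * subject_consistency)

def judge_sample(ask_judge: Callable[[str], float], instruction: str) -> Dict[str, float]:
    # `ask_judge` is a placeholder for a call to a judge model (e.g. GPT-4.1)
    # that sees the reference image(s) plus the generated image and returns a rating.
    pf = ask_judge(f"Rate 0-10 how faithfully the generated image follows: {instruction}")
    sc = ask_judge("Rate 0-10 how consistent the generated subject is with the reference image(s).")
    return {"prompt_following": pf,
            "subject_consistency": sc,
            "overall": overall_score(pf, sc)}
```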
Practical Implications
OmniGen2's decoupled architecture and efficient parameterization make it highly suitable for real-world deployment, especially in resource-constrained environments. The model's ability to maintain strong text generation and multimodal understanding, while excelling in image generation and editing, positions it as a versatile foundation for downstream applications such as:
- Interactive creative tools (e.g., AI-assisted design, photo editing)
- Personalized content generation (e.g., subject-driven avatars, product customization)
- Automated visual reasoning and report generation in enterprise settings
The open-sourcing of models, code, datasets, and data pipelines further facilitates reproducibility and community-driven research.
Limitations and Future Directions
Despite its strengths, OmniGen2 exhibits several limitations:
- Language Bias: The model performs better on English prompts than on Chinese, indicating a need for more balanced multilingual data.
- Generalization Gaps: It struggles with certain instructions (e.g., body shape modification) due to data scarcity.
- Input Sensitivity: Output quality degrades with low-quality or ambiguous input images.
- Ambiguity in Multi-Image Inputs: Explicit prompt-image correspondence is required for optimal performance.
The reflection mechanism, while promising, is limited by the scale of the MLLM and the amount of reflection data. Over-reflection and failure to revise outputs based on self-critique are observed failure modes.
Future work should explore:
- Scaling the MLLM backbone and expanding reflection data to enhance reasoning and correction capabilities.
- Incorporating reinforcement learning for online self-improvement.
- Improving multilingual and cross-domain generalization.
- Developing more robust handling of ambiguous or low-quality inputs.
Theoretical and Broader Impact
OmniGen2's decoupled design challenges the prevailing trend of parameter sharing in unified multimodal models, providing empirical evidence that modality-specific pathways yield superior performance for complex generation tasks. The integration of reflection mechanisms opens new avenues for self-improving generative systems, bridging the gap between LLM reasoning and visual generation.
The release of high-quality, diverse datasets and standardized benchmarks like OmniContext is expected to catalyze further research in controllable, reference-based image generation and editing, fostering progress toward more general and reliable multimodal AI systems.