Unified Image Generation with OmniGen: An Expert Overview
The paper "OmniGen: Unified Image Generation," authored by Shitao Xiao et al., introduces a pioneering approach in the field of visual generation models. The research addresses a significant gap by proposing a unified model framework, OmniGen, which is capable of handling a diverse array of image generation tasks. This work sets a precedent by illustrating the feasibility and advantages of a generalized approach in image generation, akin to the versatility demonstrated by LLMs in NLP.
Key Features of OmniGen
OmniGen distinguishes itself through three primary features: unification, simplicity, and knowledge transfer.
- Unification: OmniGen performs a variety of tasks, including text-to-image generation, image editing, subject-driven generation, and visually conditioned generation, within a single model. The same framework also recasts traditional computer vision tasks as image generation tasks. This contrasts with plug-in extensions such as ControlNet and IP-Adapter, which attach task-specific modules to a base diffusion model.
- Simplicity: The architecture is deliberately streamlined, pairing a Variational Autoencoder (VAE) with a single transformer and omitting the additional encoders that other pipelines depend on. The model accepts arbitrarily interleaved text and image inputs, which removes extra preprocessing steps and makes the workflow simpler and more cost-efficient (a minimal sketch of this input format follows this list).
- Knowledge Transfer: Training on data unified under a single format allows OmniGen to transfer knowledge across tasks. This lets the model handle unseen tasks and domains and gives rise to novel abilities, including reasoning and in-context learning, reminiscent of capabilities seen in LLMs.
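To make the interleaved input format concrete, the following is a minimal, illustrative sketch in Python, not the authors' code, of how a prompt mixing free text with image placeholders can be split into ordered segments before the text portions go to the tokenizer and the image portions go through the VAE. The `<img><|image_k|></img>` placeholder syntax mirrors the project's released reference implementation; the `Segment` and `parse_prompt` names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Segment:
    kind: str                  # "text" or "image"
    content: Union[str, int]   # raw text, or a 0-based index into the input image list

def parse_prompt(prompt: str) -> List[Segment]:
    """Split a prompt containing <img><|image_k|></img> placeholders into
    text and image segments, preserving their order."""
    pattern = re.compile(r"<img><\|image_(\d+)\|></img>")
    segments: List[Segment] = []
    pos = 0
    for m in pattern.finditer(prompt):
        if m.start() > pos:
            segments.append(Segment("text", prompt[pos:m.start()]))
        segments.append(Segment("image", int(m.group(1)) - 1))  # 1-based -> 0-based
        pos = m.end()
    if pos < len(prompt):
        segments.append(Segment("text", prompt[pos:]))
    return segments

# Downstream (not shown): text segments go through the ordinary tokenizer,
# image segments go through the VAE into latent patch tokens, and both are
# concatenated into a single sequence for the transformer.
print(parse_prompt("Make the dog in <img><|image_1|></img> wear a red hat."))
```

Because both streams end up in one transformer sequence, no task-specific encoder or external preprocessing module is required, which is precisely the simplicity the authors emphasize.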
Performance Evaluation and Results
The efficacy of OmniGen is underscored by its strong performance across multiple benchmarks and tasks:
- Text-to-Image Generation: On the GenEval benchmark, OmniGen achieves results competitive with state-of-the-art models such as Stable Diffusion 3 (SD3) and DALL-E 3, despite having fewer parameters and less training data, which points to efficient use of model capacity.
- Image Editing: On the EMU-Edit test set, OmniGen performs on par with specialized models such as EMU-Edit itself, particularly in preserving the unedited parts of an image while adhering to the textual instruction.
- Subject-Driven Generation: On DreamBench, OmniGen exhibits superior subject fidelity and competitive text fidelity relative to methods that require per-subject fine-tuning, underscoring its ability to generalize to new subjects without additional training.
- Visual Conditional Controls: OmniGen maintains strong performance on visually conditioned tasks, such as generation from segmentation masks and edge maps, outperforming ControlNet and ControlNet++ on specific benchmarks (a usage sketch follows this list).
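For a sense of how a visual condition is supplied in practice, here is a usage sketch based on the authors' publicly released reference implementation (the `OmniGen` package and its `OmniGenPipeline`); treat the exact argument names and default values as assumptions that may vary across versions, and note that `./edges.png` is a hypothetical local file.

```python
from OmniGen import OmniGenPipeline

# Load the released checkpoint (package and API per the project repository;
# details may differ across versions).
pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# A visual condition (here, a hypothetical edge map at ./edges.png) is passed
# like any other input image and referenced via a placeholder in the prompt.
images = pipe(
    prompt="Following the edge map <img><|image_1|></img>, generate a photo "
           "of a cozy living room with a sofa and warm lighting.",
    input_images=["./edges.png"],
    height=1024,
    width=1024,
    guidance_scale=2.5,      # text guidance strength
    img_guidance_scale=1.6,  # image-condition guidance strength
    seed=0,
)
images[0].save("living_room.png")
```

Note that the condition image is passed exactly like any other input image; no dedicated control branch, as in ControlNet, is involved.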
Emerging Capabilities and Reasoning
OmniGen's architecture and training paradigm endow it with several emergent capabilities:
- Task Composition: The model successfully follows composite instructions that span multiple tasks within a single prompt, showcasing its versatility (illustrative prompt patterns follow this list).
- Implicit Task Combination: By leveraging its learned knowledge, OmniGen can chain tasks implicitly, without explicit preprocessing, for example generating an image that follows the human pose in a reference photo without first running an external pose detector. This reduces the need for additional model components and pipeline operations.
- In-Context Learning for Unseen Tasks: Given example-based prompts, the model demonstrates effective in-context learning, extending its application to novel tasks and improving performance in new domains.
- Reasoning and Chain-of-Thought (CoT): OmniGen exhibits reasoning capabilities, identifying and manipulating specific objects based on textual instructions alone. A preliminary exploration of step-by-step generation suggests that CoT methodologies may carry over to image generation, although further optimization is required.
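To illustrate how such composite and example-based instructions might be phrased, the prompts below are hypothetical and use the same placeholder syntax as above; they are not reproduced from the paper. Each would be passed to the pipeline together with an `input_images` list supplying the referenced images in order.

```python
# Hypothetical composite instruction: two edits requested in one prompt.
composite_prompt = (
    "In <img><|image_1|></img>, remove the car on the left "
    "and change the sky to a sunset."
)

# Hypothetical in-context prompt for an unseen task: one worked
# input/output example, followed by a query image.
in_context_prompt = (
    "Example input: <img><|image_1|></img>. "
    "Example output: <img><|image_2|></img>. "
    "Apply the same transformation to <img><|image_3|></img>."
)
```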
Implications and Future Directions
OmniGen's unified approach paves the way for more integrative and efficient systems in AI-driven image generation. The simplified architecture and its versatility in handling a wide range of tasks present substantial practical benefits, particularly in reducing complexity and cost in real-world applications. The model's capabilities in emergent tasks and reasoning suggest promising directions for future research, including deeper exploration of process supervision and CoT methods to enhance image generation quality and complexity handling. Additionally, the model's framework could be extended to incorporate text generation, further blending the capabilities of LLMs and image generation models into a truly universal generative foundation.
In conclusion, "OmniGen: Unified Image Generation" represents a significant contribution to the field of AI-driven visual generation, offering a robust and flexible solution that challenges and extends the boundaries of current diffusion models. The insights and methods proposed in this paper hold substantial potential for further advancements in both theoretical and practical aspects of AI and image generation technologies.