- The paper demonstrates that ACE unifies visual creation and editing by integrating various tasks into a single diffusion Transformer framework.
- It introduces a Long-context Condition Unit and streamlined tokenizing methods to effectively process multi-modal inputs and historical context.
- The evaluation establishes strong performance via quantitative metrics and qualitative studies, setting a benchmark for future generative models.
The paper introduces ACE (All-round Creator and Editor), a generative model built on a diffusion Transformer. ACE aims to unify a wide range of visual generation and editing tasks under a single framework, drawing a parallel with the all-round capabilities that LLMs such as GPT-4 exhibit in NLP. The model uses a Long-context Condition Unit (LCU) to accommodate diverse inputs and tasks, achieving notable generalization across visual generation and editing. ACE defines a multi-modal input paradigm, builds its training data through a detailed extraction and synthesis pipeline, and establishes an evaluation benchmark to gauge its performance against specialized models.
Unified Framework Proposal
The unification in ACE is achieved via the definition of a Long-context Condition Unit (LCU). This unit serves to standardize inputs across different tasks:
- Text-guided generation: generating images primarily from textual descriptions.
- Controllable generation: conditioning on inputs such as segmentation maps and depth maps to steer generation more precisely.
- Semantic editing: adjusting semantic attributes within images, such as facial features.
- Object and text editing: precisely inserting or deleting specific elements or text within images.
- Repainting and layer editing: inpainting selected regions, and decomposing or merging image layers.
- Reference generation: using one or more reference images for more complex tasks.
This standardized input format provides flexibility and uniformity across a wide array of visual generation tasks, something previous foundation models could not support simultaneously.
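To make the LCU concrete, here is a minimal sketch of what such a unit could look like as a data structure, assuming a turn-based layout in which each request bundles an instruction with its conditional images and masks; the class and field names are illustrative, not the paper's interface.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Turn:
    """One request: a textual instruction plus its optional conditional images."""
    instruction: str                                   # e.g. "remove the lamp in {image}"
    images: List[Any] = field(default_factory=list)    # reference image, depth map, segmentation map, ...
    masks: List[Optional[Any]] = field(default_factory=list)  # optional editing mask per image


@dataclass
class LongContextConditionUnit:
    """Illustrative container: the current request plus all previous turns."""
    history: List[Turn] = field(default_factory=list)  # earlier instructions, inputs, and outputs
    current: Optional[Turn] = None                     # the request to fulfil now

    def flatten(self) -> Turn:
        """Concatenate history and the current turn into one long conditioning sequence."""
        turns = self.history + ([self.current] if self.current else [])
        return Turn(
            instruction=" ".join(t.instruction for t in turns),
            images=[img for t in turns for img in t.images],
            masks=[m for t in turns for m in t.masks],
        )
```

Flattening the history and the current request into a single sequence is what allows one model interface to serve every task listed above.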
Architecture and Method Implementation
ACE is built around a Diffusion Transformer model that utilizes a specially designed tokenizing and embedding strategy:
- Condition Tokenizing: Transforms various input formats into a unified visual sequence and textual sequence by tokenizing images and instructions.
- Image Indicator Embedding: Distinguishes between multiple images in textual instructions using indicator embeddings, thereby addressing ambiguities in multi-image tasks.
- Long-context Attention Block: Applies attention across the full multimodal sequence, combining spatial and frame-level position embeddings so that historical context in multi-turn tasks is processed effectively.
By integrating these components, ACE can handle multiple tasks concurrently, leveraging past interactions to better understand and fulfill user instructions.
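A rough sense of how such an attention block might work can be conveyed in a few lines of PyTorch. The sketch below is a simplified stand-in, not the paper's implementation: the module names, dimensions, and the way spatial and frame-level embeddings are added are all assumptions.

```python
import torch
import torch.nn as nn
from typing import List


class LongContextAttentionBlock(nn.Module):
    """Illustrative block: joint attention over visual tokens from all frames
    plus text tokens, with spatial and frame-level position embeddings."""

    def __init__(self, dim: int = 512, heads: int = 8, max_frames: int = 8, max_tokens: int = 1024):
        super().__init__()
        self.spatial_pos = nn.Embedding(max_tokens, dim)   # position within a frame
        self.frame_pos = nn.Embedding(max_frames, dim)     # which frame (turn) a token comes from
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: List[torch.Tensor], text_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: list of (B, N_i, dim) visual token sequences, one per frame/turn
        # text_tokens:  (B, T, dim) tokenized instruction sequence
        visual = []
        for f_idx, tokens in enumerate(frame_tokens):
            n = tokens.shape[1]
            pos = self.spatial_pos(torch.arange(n, device=tokens.device))
            frm = self.frame_pos(torch.tensor(f_idx, device=tokens.device))
            visual.append(tokens + pos + frm)              # add both embeddings
        seq = torch.cat(visual + [text_tokens], dim=1)     # one long multimodal sequence
        out, _ = self.attn(self.norm(seq), self.norm(seq), self.norm(seq))
        return seq + out                                   # residual connection
```

Because every token, regardless of which turn or image it came from, attends to every other token, a block of this kind naturally conditions the current generation on the full interaction history.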
Data Collection and Processing
A significant contribution of the paper is the sophisticated data pipeline employed:
- Synthesizing: Using powerful open-source models to generate high-quality training pairs.
- Pairing from Databases: Implementing hierarchical clustering methods to mine paired images from large-scale datasets and ensure content fidelity.
- Instruction Labeling: Leveraging multimodal large language models (MLLMs) to automate instruction writing, achieving both diversity and precision in instructions across tasks.
This pipeline addresses the common challenge that diverse training data for visual generation is scarce, ultimately supplying ACE with around 0.7 billion data pairs spanning the supported tasks.
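As a rough illustration of the pairing step, one could embed candidate images, cluster the embeddings hierarchically, and keep near-duplicate pairs within each cluster as candidate source/target edits. The clustering settings, similarity thresholds, and helper name below are assumptions for illustration, not the paper's actual pipeline.

```python
import numpy as np
from itertools import combinations
from typing import List, Tuple
from sklearn.cluster import AgglomerativeClustering  # hierarchical clustering


def mine_image_pairs(embeddings: np.ndarray, n_clusters: int = 100,
                     low: float = 0.80, high: float = 0.98) -> List[Tuple[int, int]]:
    """Group images hierarchically, then keep pairs that are similar enough to
    depict the same content but different enough to constitute an 'edit'."""
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(embeddings)
    # normalise so dot products are cosine similarities
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    pairs = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        for i, j in combinations(idx, 2):
            sim = float(normed[i] @ normed[j])
            if low <= sim <= high:          # same scene, but not identical
                pairs.append((int(i), int(j)))
    return pairs
```

The upper similarity bound discards exact duplicates, while the lower bound keeps only pairs that plausibly show the same content before and after an edit.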
Evaluation and Benchmarking
The paper establishes the ACE benchmark, a dataset encompassing a range of tasks to more comprehensively evaluate the performance of generative models. Compared with existing benchmarks like MagicBrush and Emu Edit, ACE offers broader task coverage and more comprehensive evaluations:
- Quantitative Metrics: L1 distance, CLIP similarity, and DINO similarity are used to assess the quality and fidelity of generated images.
- Qualitative User Studies: High marks for prompt-following ability and image aesthetics indicate competitive performance in both single-turn and multi-turn tasks.
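The quantitative metrics are straightforward to reproduce with off-the-shelf models. The sketch below uses publicly available CLIP and DINOv2 checkpoints from Hugging Face as stand-ins; the exact models and preprocessing used in the paper may differ.

```python
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino = AutoModel.from_pretrained("facebook/dinov2-base")
dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")


def l1_distance(pred: Image.Image, target: Image.Image) -> float:
    """Mean absolute pixel difference between output and ground-truth images."""
    a = np.asarray(pred.resize(target.size), dtype=np.float32) / 255.0
    b = np.asarray(target, dtype=np.float32) / 255.0
    return float(np.abs(a - b).mean())


def clip_similarity(pred: Image.Image, text: str) -> float:
    """Cosine similarity between the image embedding and the prompt embedding."""
    inputs = clip_proc(text=[text], images=pred, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    return float(F.cosine_similarity(img, txt).item())


def dino_similarity(pred: Image.Image, target: Image.Image) -> float:
    """Cosine similarity between DINO CLS features of output and ground truth."""
    with torch.no_grad():
        f1 = dino(**dino_proc(images=pred, return_tensors="pt")).last_hidden_state[:, 0]
        f2 = dino(**dino_proc(images=target, return_tensors="pt")).last_hidden_state[:, 0]
    return float(F.cosine_similarity(f1, f2).item())
```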
Implications and Future Work
The implications of ACE are multifaceted:
- Practical: It introduces a streamlined method for visual content generation and editing, significantly reducing the complexity compared to multi-model pipelines.
- Theoretical: It pushes the boundaries of generative models by proposing a unified framework that could potentially be expanded to other modalities.
Future developments could include scaling the model and its training to larger, higher-quality datasets and supporting more complex input conditions, further enhancing the model's capability to handle intricate visual generation tasks.
Conclusion
ACE demonstrates a significant step towards unifying visual generation tasks with a comprehensive, flexible model architecture. By amalgamating a variety of tasks into a single model framework and emphasizing the importance of high-quality data and instruction diversity, ACE sets the stage for future advancements in AI-driven content creation. This paper not only presents robust methodology and promising results but also lays a solid foundation for future research in multi-modal generative modeling.
This work opens up new avenues in the efficient deployment of generative models across a broad spectrum of tasks, indicative of a future where AI models like ACE become integral tools in creative and professional settings.