- The paper introduces IterComp, an iterative feedback learning framework that synthesizes composition-aware preferences from multiple diffusion models to enhance compositional text-to-image generation.
- It constructs image-rank pair datasets from human evaluations on key compositional metrics (attribute binding, spatial relationships, and non-spatial relationships), achieving superior performance on T2I-CompBench.
- The approach demonstrates robustness by enabling multi-reward feedback learning and generalizing with models like RPG and Omost, offering a comprehensive optimization solution.
IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation
The paper addresses the challenges of compositional text-to-image generation, emphasizing the disparate strengths of existing diffusion models such as RPG, Stable Diffusion 3, and FLUX. Each of these models excels in particular compositional aspects, such as attribute binding or spatial relationships, but none covers them all, motivating an integrated framework that harmonizes their strengths. IterComp, a novel iterative feedback learning framework, is introduced to achieve this comprehensive compositional enhancement.
IterComp synthesizes composition-aware preferences from multiple models, enabling iterative self-refinement of both the base diffusion model and the reward models. This process is founded on a curated gallery of six open-source diffusion models, evaluated across three key compositional metrics: attribute binding, spatial relationships, and non-spatial relationships. The framework constructs a dataset of image-rank pairs to train composition-aware reward models, facilitating closed-loop enhancement of compositionality over several iterations.
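Training a reward model on image-rank pairs typically uses a pairwise ranking objective. The following is a minimal sketch of a Bradley-Terry style loss over already-computed reward scores; it is an illustration of the general technique, not a reproduction of the paper's actual reward architecture or objective:

```python
import numpy as np

def pairwise_rank_loss(r_preferred, r_rejected):
    """Bradley-Terry style ranking loss over image-rank pairs:
    -log sigmoid(r_preferred - r_rejected), averaged over the batch.
    Lower loss means the reward model assigns higher scores to the
    human-preferred image in each pair."""
    diff = np.asarray(r_preferred, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log sigmoid(diff) == log(1 + exp(-diff)), computed via log1p
    return float(np.mean(np.log1p(np.exp(-diff))))
```

When the reward model scores both images equally, the loss is log 2; it shrinks toward zero as the margin in favor of the preferred image grows, which is what drives the reward model to internalize the human rankings.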
IterComp's methodology begins with human rankings of each gallery model's outputs on the core compositional aspects, yielding a comprehensive composition-aware model preference dataset. Reward models trained on this dataset then drive a multi-reward feedback learning process that optimizes the base diffusion model across complex generation scenarios.
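One common way to realize multi-reward feedback is to aggregate the per-metric reward scores into a single training signal. The sketch below assumes a simple weighted sum; the metric names mirror the paper's three compositional aspects, but the weights and scores are illustrative and the actual aggregation used by IterComp may differ:

```python
import numpy as np

# Hypothetical per-metric reward scores for a batch of three generated
# images, one reward model per compositional metric.
rewards = {
    "attribute_binding": np.array([0.8, 0.4, 0.6]),
    "spatial":           np.array([0.5, 0.7, 0.2]),
    "non_spatial":       np.array([0.9, 0.3, 0.5]),
}
weights = {"attribute_binding": 0.4, "spatial": 0.3, "non_spatial": 0.3}

def multi_reward(rewards, weights):
    """Weighted sum of composition-aware rewards, used as the
    scalar feedback signal when fine-tuning the base model."""
    return sum(w * rewards[k] for k, w in weights.items())
```

With these example numbers, the first image scores highest overall (0.74), so the feedback step would push the base model's distribution toward outputs like it.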
The experimental results underscore IterComp's significant advancement over state-of-the-art methods, particularly in multi-category object composition and semantic alignment. The paper also supplies theoretical support, including complete proofs of the iterative feedback learning procedure's effectiveness. Notably, IterComp's architecture generalizes broadly, integrating efficiently with other models such as RPG and Omost.
Quantitative evaluation on T2I-CompBench illustrates IterComp's superiority, with significant improvements in color, shape, and texture binding as well as in spatial and non-spatial relationships. Higher CLIP and Aesthetic scores further confirm gains in realism and aesthetic quality.
A key aspect of IterComp's innovation lies in its iterative feedback learning, which progressively refines the base diffusion model by leveraging diverse preferences. The unified optimization framework, with derived gradient objectives, shows how the approach distinguishes between high- and low-quality outputs, continuously aligning the model's generation process with preferred compositional outcomes.
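The flavor of such reward-guided iterative refinement can be shown with a deliberately tiny toy: a one-parameter "model" is nudged toward samples the reward prefers, using a REINFORCE-style update. Everything here (the quadratic reward, learning rate, sample counts) is invented for illustration, and the step where IterComp also re-trains its reward models on fresh rankings each round is elided:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_reward(x):
    """Stand-in for a learned composition-aware reward;
    it peaks at x = 1.0 (purely illustrative)."""
    return -(x - 1.0) ** 2

def iterate(theta, n_iters=3, lr=0.5, n_samples=64, sigma=0.1):
    """Sketch of the closed loop: sample outputs around the current
    parameter, score them with the reward, and move toward the
    above-average-reward samples (a REINFORCE-style toy update)."""
    for _ in range(n_iters):
        samples = theta + sigma * rng.standard_normal(n_samples)
        r = toy_reward(samples)
        adv = r - r.mean()  # centered rewards as advantages
        # score-function gradient estimate of the expected reward
        theta += lr * np.mean(adv * (samples - theta)) / sigma**2
    return theta
```

Starting from `theta = 0.0`, the loop pulls the parameter toward the reward's peak at 1.0 within a few iterations, mirroring (in miniature) how multi-reward feedback steers the base diffusion model toward preferred compositional outputs.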
While Diffusion-DPO and ImageReward have made strides in aligning diffusion models with human preferences, IterComp distinctly emphasizes compositional awareness, making it a robust backbone for tackling complex text-to-image generation tasks. By iteratively enhancing the base model and integrating multiple composition-aware models, it provides an efficient and comprehensive solution to the longstanding challenges in this domain.
The paper concludes with a vision for incorporating more complex modalities and extending IterComp's application in practical scenarios, marking a promising direction for the future of AI-based generative models. This research enriches the theoretical and practical understanding of diffusion model optimization, offering substantial insights and methodologies for future applications.