- The paper introduces IterComp, an iterative feedback learning framework that synthesizes composition-aware preferences from multiple diffusion models to enhance compositional text-to-image generation.
- It constructs image-rank pair datasets from human evaluations on key compositional metrics (attribute binding, spatial relationships, and non-spatial relationships), achieving superior performance on T2I-CompBench.
- The approach demonstrates robustness by enabling multi-reward feedback learning and generalizing with models like RPG and Omost, offering a comprehensive optimization solution.
IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation
The paper addresses the challenges of compositional text-to-image generation, emphasizing the disparate strengths of existing diffusion models such as RPG, Stable Diffusion 3, and FLUX. Each of these models excels in particular compositional aspects, such as attribute binding or spatial relationships, but none covers them all, motivating an integrated framework that harmonizes their strengths. IterComp, a novel iterative feedback learning framework, is introduced to achieve this comprehensive compositional enhancement.
IterComp synthesizes composition-aware preferences from multiple models, enabling iterative self-refinement of both the base diffusion model and the reward models. This process is founded on a curated gallery of six open-source diffusion models, evaluated across three key compositional metrics: attribute binding, spatial relationships, and non-spatial relationships. The framework constructs a dataset of image-rank pairs to train composition-aware reward models, facilitating closed-loop enhancement of compositionality over several iterations.
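Training a reward model on image-rank pairs typically uses a pairwise ranking objective. The following is a minimal sketch of a Bradley-Terry style loss over already-computed reward scores; it is an illustration of the general technique, not a reproduction of the paper's actual reward architecture or objective:

```python
import numpy as np

def pairwise_rank_loss(r_preferred, r_rejected):
    """Bradley-Terry style ranking loss over image-rank pairs:
    -log sigmoid(r_preferred - r_rejected), averaged over the batch.
    Lower loss means the reward model assigns higher scores to the
    human-preferred image in each pair."""
    diff = np.asarray(r_preferred, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log sigmoid(diff) == log(1 + exp(-diff)), computed via log1p
    return float(np.mean(np.log1p(np.exp(-diff))))
```

When the reward model scores both images equally, the loss is log 2; it shrinks toward zero as the margin in favor of the preferred image grows, which is what drives the reward model to internalize the human rankings.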
IterComp's methodology begins with human rankings of each gallery model's outputs on the core compositional aspects, yielding a comprehensive composition-aware model preference dataset. Reward models trained on this dataset then drive a multi-reward feedback learning process that optimizes the base diffusion model across complex generation scenarios.
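One common way to realize multi-reward feedback is to aggregate the per-metric reward scores into a single training signal. The sketch below assumes a simple weighted sum; the metric names mirror the paper's three compositional aspects, but the weights and scores are illustrative and the actual aggregation used by IterComp may differ:

```python
import numpy as np

# Hypothetical per-metric reward scores for a batch of three generated
# images, one reward model per compositional metric.
rewards = {
    "attribute_binding": np.array([0.8, 0.4, 0.6]),
    "spatial":           np.array([0.5, 0.7, 0.2]),
    "non_spatial":       np.array([0.9, 0.3, 0.5]),
}
weights = {"attribute_binding": 0.4, "spatial": 0.3, "non_spatial": 0.3}

def multi_reward(rewards, weights):
    """Weighted sum of composition-aware rewards, used as the
    scalar feedback signal when fine-tuning the base model."""
    return sum(w * rewards[k] for k, w in weights.items())
```

With these example numbers, the first image scores highest overall (0.74), so the feedback step would push the base model's distribution toward outputs like it.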
The experimental results underscore IterComp's significant advancement over state-of-the-art methods, particularly in multi-category object composition and semantic alignment. The paper also supplies theoretical support, including complete proofs of the iterative feedback learning procedure's effectiveness. Notably, IterComp's architecture generalizes broadly, integrating efficiently with other models such as RPG and Omost.
Quantitative evaluation on T2I-CompBench illustrates IterComp's superiority, with significant improvements in color, shape, and texture binding as well as in spatial and non-spatial relationships. Higher CLIP and Aesthetic scores further confirm gains in realism and aesthetic quality.
A key aspect of IterComp's innovation lies in its iterative feedback learning, which progressively refines the base diffusion model by leveraging diverse preferences. The unified optimization framework, with derived gradient objectives, shows how the approach distinguishes between high- and low-quality outputs, continuously aligning the model's generation process with preferred compositional outcomes.
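The flavor of such reward-guided iterative refinement can be shown with a deliberately tiny toy: a one-parameter "model" is nudged toward samples the reward prefers, using a REINFORCE-style update. Everything here (the quadratic reward, learning rate, sample counts) is invented for illustration, and the step where IterComp also re-trains its reward models on fresh rankings each round is elided:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_reward(x):
    """Stand-in for a learned composition-aware reward;
    it peaks at x = 1.0 (purely illustrative)."""
    return -(x - 1.0) ** 2

def iterate(theta, n_iters=3, lr=0.5, n_samples=64, sigma=0.1):
    """Sketch of the closed loop: sample outputs around the current
    parameter, score them with the reward, and move toward the
    above-average-reward samples (a REINFORCE-style toy update)."""
    for _ in range(n_iters):
        samples = theta + sigma * rng.standard_normal(n_samples)
        r = toy_reward(samples)
        adv = r - r.mean()  # centered rewards as advantages
        # score-function gradient estimate of the expected reward
        theta += lr * np.mean(adv * (samples - theta)) / sigma**2
    return theta
```

Starting from `theta = 0.0`, the loop pulls the parameter toward the reward's peak at 1.0 within a few iterations, mirroring (in miniature) how multi-reward feedback steers the base diffusion model toward preferred compositional outputs.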
While Diffusion-DPO and ImageReward have made strides in aligning diffusion models with human preferences, IterComp distinctly emphasizes compositional awareness, making it a robust backbone for tackling complex text-to-image generation tasks. By iteratively enhancing the base model and integrating multiple composition-aware models, it provides an efficient and comprehensive solution to the longstanding challenges in this domain.
The paper concludes with a vision for incorporating more complex modalities and extending IterComp's application in practical scenarios, marking a promising direction for the future of AI-based generative models. This research enriches the theoretical and practical understanding of diffusion model optimization, offering substantial insights and methodologies for future applications.