Relation Rectification in Diffusion Model (2403.20249v1)
Abstract: Despite their exceptional generative abilities, large text-to-image diffusion models, much like skilled but careless artists, often fail to depict visual relationships between objects accurately. Our analysis traces this issue to a misaligned text encoder that struggles to interpret specific relationships and to distinguish the logical order of the associated objects. We therefore introduce a new task, Relation Rectification, which aims to refine the model so that it accurately represents a given relationship it initially fails to generate. As a solution, we propose a Heterogeneous Graph Convolutional Network (HGCN) that models the directional relationships between relation terms and their corresponding objects within the input prompts. Specifically, we optimize the HGCN on a pair of prompts with identical relational words but reversed object orders, supplemented by a few reference images. The lightweight HGCN adjusts the text embeddings generated by the text encoder so that the textual relation is accurately reflected in the embedding space. Crucially, our method leaves the parameters of the text encoder and diffusion model unchanged, preserving the model's robust performance on unrelated descriptions. We validate our approach on a newly curated dataset of diverse relational data, demonstrating both quantitative and qualitative improvements in generating images with precise visual relations. Project page: https://wuyinwei-hah.github.io/rrnet.github.io/.
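To make the directional idea in the abstract concrete, the following is a minimal sketch, not the paper's HGCN. It assumes toy three-dimensional token vectors, hand-picked scalar edge weights in place of learned matrices, and a hypothetical prompt pair ("book on bowl" versus "bowl on book"). It contrasts an order-insensitive sum of token embeddings with one round of typed message passing into the relation node, where subject and object edges carry different weights.

```python
# Illustrative sketch only: toy vectors and hand-picked weights (all assumed),
# not the HGCN described in the paper.

def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]

def scale(u, s):
    """Multiply every component of a vector by a scalar."""
    return [a * s for a in u]

# Hypothetical toy embeddings for two objects and one relation term.
EMB = {
    "book": [1.0, 0.0, 0.0],
    "bowl": [0.0, 1.0, 0.0],
    "on":   [0.0, 0.0, 1.0],
}

# Edge-type-specific weights (assumed scalars; a trained model would learn these).
W_SUBJ_TO_REL = 0.5
W_OBJ_TO_REL = -0.5

def bag_of_words(subject, relation, obj):
    """Order-insensitive baseline: sum of the three token embeddings."""
    return add(add(EMB[subject], EMB[relation]), EMB[obj])

def directed_relation_node(subject, relation, obj):
    """One round of typed message passing into the relation node."""
    msg = add(scale(EMB[subject], W_SUBJ_TO_REL), scale(EMB[obj], W_OBJ_TO_REL))
    return add(EMB[relation], msg)

forward = ("book", "on", "bowl")   # "book on bowl"
swapped = ("bowl", "on", "book")   # same relation word, reversed object order

# The unordered sum cannot tell the two prompts apart ...
print(bag_of_words(*forward) == bag_of_words(*swapped))
# ... while the direction-aware update gives each prompt its own relation vector.
print(directed_relation_node(*forward), directed_relation_node(*swapped))
```

In this toy setting the two prompts collapse to the same vector under the unordered sum, whereas the typed edges separate them. That separation is the property a pair of prompts with reversed object orders is meant to enforce.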