
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching (2404.03653v3)

Published 4 Apr 2024 in cs.CV, cs.AI, and cs.CL

Abstract: Diffusion models have demonstrated great success in the field of text-to-image generation. However, alleviating the misalignment between the text prompts and images is still challenging. The root reason behind the misalignment has not been extensively investigated. We observe that the misalignment is caused by inadequate token attention activation. We further attribute this phenomenon to the diffusion model's insufficient condition utilization, which is caused by its training paradigm. To address the issue, we propose CoMat, an end-to-end diffusion model fine-tuning strategy with an image-to-text concept matching mechanism. We leverage an image captioning model to measure image-to-text alignment and guide the diffusion model to revisit ignored tokens. A novel attribute concentration module is also proposed to address the attribute binding problem. Without any image or human preference data, we use only 20K text prompts to fine-tune SDXL to obtain CoMat-SDXL. Extensive experiments show that CoMat-SDXL significantly outperforms the baseline model SDXL on two text-to-image alignment benchmarks and achieves state-of-the-art performance.

Authors (8)
  1. Dongzhi Jiang (13 papers)
  2. Guanglu Song (45 papers)
  3. Xiaoshi Wu (10 papers)
  4. Renrui Zhang (100 papers)
  5. Dazhong Shen (22 papers)
  6. Zhuofan Zong (14 papers)
  7. Yu Liu (784 papers)
  8. Hongsheng Li (340 papers)
Citations (11)