RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment (2305.19599v5)

Published 31 May 2023 in cs.CV and cs.AI

Abstract: Recent advances in text-to-image diffusion models have achieved remarkable success in generating high-quality, realistic images from textual descriptions. However, these approaches have faced challenges in precisely aligning the generated visual content with the textual concepts described in the prompts. In this paper, we propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff, aimed at improving the alignment between text and images in text-to-image diffusion models. In the coarse semantic re-alignment phase, a novel caption reward, leveraging the BLIP-2 model, is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt. Subsequently, the fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view. Experimental results on the MS-COCO and ViLG-300 datasets demonstrate that the proposed two-stage coarse-to-fine semantic re-alignment method outperforms other baseline re-alignment techniques by a substantial margin in both visual quality and semantic similarity with the input prompt.
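
For intuition, the sketch below illustrates the coarse-stage caption reward described in the abstract: a generated image is captioned with BLIP-2, and that caption is scored against the original prompt. The checkpoint names, the sentence-encoder similarity metric, and the reward scale are illustrative assumptions, not the paper's exact configuration; in the full method such a reward would steer fine-tuning of the diffusion model, with the fine stage then refining images locally via dense captions and attention re-weighting.

```python
# Minimal sketch of a BLIP-2 caption reward, assuming HuggingFace
# transformers and sentence-transformers. Model checkpoints and the
# cosine-similarity scorer are illustrative stand-ins, not the
# authors' exact setup.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from sentence_transformers import SentenceTransformer, util

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# BLIP-2 captioner; the checkpoint name is an assumption.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

# Off-the-shelf sentence encoder, used here as a stand-in measure of
# the semantic discrepancy between caption and prompt.
encoder = SentenceTransformer("all-MiniLM-L6-v2", device=device)

def caption_reward(image: Image.Image, prompt: str) -> float:
    """Return a similarity score in [-1, 1] between the BLIP-2
    caption of `image` and `prompt`; higher means better alignment."""
    inputs = processor(images=image, return_tensors="pt").to(device, dtype)
    ids = captioner.generate(**inputs, max_new_tokens=30)
    caption = processor.batch_decode(ids, skip_special_tokens=True)[0].strip()
    emb = encoder.encode([caption, prompt], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```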

References (52)
  1. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
  2. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021.
  3. Cogview2: Faster and better text-to-image generation via hierarchical transformers. arXiv preprint arXiv:2204.14217, 2022.
  4. Semantic image synthesis via adversarial learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 5706–5714, 2017.
  5. Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023.
  6. Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis. Advances in Neural Information Processing Systems, 34:3518–3532, 2021.
  7. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.
  8. Training-free structured diffusion guidance for compositional text-to-image synthesis. In The Eleventh International Conference on Learning Representations, 2023.
  9. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  10. Optimizing prompts for text-to-image generation. arXiv preprint arXiv:2212.09611, 2022.
  11. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  12. Clipscore: A reference-free evaluation metric for image captioning, 2022.
  13. Gans trained by a two time-scale update rule converge to a local nash equilibrium, 2018.
  14. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  15. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23(47):1–33, 2022.
  16. Scaling up gans for text-to-image synthesis. arXiv preprint arXiv:2303.05511, 2023.
  17. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022.
  18. Dense text-to-image generation with attention modulation. arXiv preprint arXiv:2308.12964, 2023.
  19. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022.
  20. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
  21. Controllable text-to-image generation. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 2065–2075, 2019.
  22. Semantic-guided image augmentation with pre-trained models. arXiv preprint arXiv:2302.02070, 2023.
  23. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
  24. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  25. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  26. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  27. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  28. OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  29. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  30. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  31. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
  32. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  33. Generating diverse high-fidelity images with vq-vae-2. In Advances in neural information processing systems, pages 14866–14876, 2019.
  34. Generative adversarial text to image synthesis. In International Conference on Machine Learning, pages 1060–1069. PMLR, 2016.
  35. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
  36. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  37. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  38. Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis. arXiv preprint arXiv:2301.09515, 2023.
  39. What the DAAM: Interpreting stable diffusion using cross attention. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023.
  40. Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv preprint arXiv:2008.05865, 2020.
  41. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
  42. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
  43. Better aligning text-to-image models with human preference. arXiv preprint arXiv:2303.14420, 2023.
  44. Imagereward: Learning and evaluating human preferences for text-to-image generation, 2023.
  45. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1316–1324, 2018.
  46. Versatile diffusion: Text, images and variations all in one diffusion model. arXiv preprint arXiv:2211.08332, 2022.
  47. Improving text-to-image synthesis using contrastive learning. arXiv preprint arXiv:2107.02423, 2021.
  48. When and why vision-language models behave like bags-of-words, and what to do about it?, 2023.
  49. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE transactions on pattern analysis and machine intelligence, 41(8):1947–1962, 2018.
  50. Recognize anything: A strong image tagging model. arXiv preprint arXiv:2306.03514, 2023.
  51. M6-ufc: Unifying multi-modal controls for conditional image synthesis. arXiv preprint arXiv:2105.14211, 2021.
  52. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5802–5810, 2019.
Authors (8)
  1. Guian Fang (9 papers)
  2. Zutao Jiang (8 papers)
  3. Jianhua Han (49 papers)
  4. Guansong Lu (20 papers)
  5. Hang Xu (205 papers)
  6. Shengcai Liao (46 papers)
  7. Xiaodan Liang (318 papers)
  8. Xiaojun Chang (148 papers)
Citations (1)