CONFORM: Contrast is All You Need For High-Fidelity Text-to-Image Diffusion Models (2312.06059v1)

Published 11 Dec 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Images produced by text-to-image diffusion models might not always faithfully represent the semantic intent of the provided text prompt, where the model might overlook or entirely fail to produce certain objects. Existing solutions often require custom-tailored functions for each of these problems, leading to sub-optimal results, especially for complex prompts. Our work introduces a novel perspective by tackling this challenge in a contrastive context. Our approach intuitively promotes the segregation of objects in attention maps while also maintaining that pairs of related attributes are kept close to each other. We conduct extensive experiments across a wide variety of scenarios, each involving unique combinations of objects, attributes, and scenes. These experiments effectively showcase the versatility, efficiency, and flexibility of our method in working with both latent and pixel-based diffusion models, including Stable Diffusion and Imagen. Moreover, we publicly share our source code to facilitate further research.

Summary

  • The paper introduces a training-free contrastive method that improves text-to-image fidelity by accurately associating objects with attributes.
  • It leverages attention maps as features in a contrastive loss to enable model-agnostic enhancements on pre-trained diffusion models.
  • Empirical tests show superior CLIP and TIFA scores along with strong user preference for images that reflect complex prompt semantics.

Introduction

Generating images that accurately reflect the objects and attributes in a textual description has been a fundamental challenge for text-to-image diffusion models. While recent advancements like Stable Diffusion and Imagen have broken new ground in image generation quality, they often stumble when it comes to faithfully representing the semantic intent of complex text prompts. Drawbacks such as missing objects, misattributed characteristics, and incorrect quantities remain pervasive issues that hinder the reliability of these otherwise impressive generative models.

Prior work has approached the problem with various solutions, such as optimizing cross-attention maps to emphasize object presence or employing dual loss functions to clearly delineate attention areas. While these methods have made strides in improving fidelity, they fall short on complex prompts because their bespoke objective functions require sub-optimal, prompt-specific tuning.

Methodology

Our proposal, CONFORM, addresses these limitations within a contrastive framework that intuitively maintains the relationship between objects and their attributes while segregating unrelated elements. By treating attributes of a specific object as positive pairs and contrasting them against unrelated objects or attributes, our method enhances the accuracy and detail of object representation considerably.
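
As an illustration of this grouping idea, the sketch below implements a generic InfoNCE-style contrastive loss over per-token attention maps; the function name, tensor layout, temperature, and soft handling of empty groups are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_grouping_loss(attn_maps, groups, temperature=0.5):
    """attn_maps: (T, H, W) cross-attention map per text token.
    groups: lists of token indices, e.g. [[cat_idx, black_idx], [dog_idx, white_idx]];
    tokens in the same group are treated as positives, tokens across groups as negatives."""
    feats = F.normalize(attn_maps.flatten(1), dim=-1)      # (T, H*W) unit-norm features
    sim = feats @ feats.T / temperature                     # pairwise similarity logits
    losses = []
    for group in groups:
        negatives = [j for other in groups if other is not group for j in other]
        for anchor in group:
            for positive in group:
                if positive == anchor or not negatives:
                    continue
                logits = torch.cat([sim[anchor, positive].unsqueeze(0), sim[anchor, negatives]])
                target = torch.zeros(1, dtype=torch.long)   # the positive sits at index 0
                losses.append(F.cross_entropy(logits.unsqueeze(0), target))
    if not losses:                                          # e.g. a single group with no negatives
        return attn_maps.sum() * 0.0
    return torch.stack(losses).mean()
```

In practice, the group indices would come from parsing the prompt so that each object token is paired with its attribute tokens, matching the positive/negative construction described above.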

CONFORM is a training-free approach leveraging a contrastive objective combined with test-time optimization. This means it can be applied to pre-trained models without additional training requirements, yielding improvements in existing setups. Importantly, the technique is model-agnostic and has been tested extensively on leading models like Stable Diffusion and Imagen.
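
The test-time optimization can be pictured as a guidance step inserted into a standard diffusers-style denoising loop, as in the hedged sketch below. `collect_cross_attention` and `step_size` are hypothetical placeholders, and the update rule follows the general recipe of gradient-based attention guidance rather than the paper's exact procedure.

```python
import torch

def guided_denoising_step(unet, scheduler, latents, t, text_emb, loss_fn, step_size=20.0):
    # Nudge the latent along the gradient of the guidance loss before the scheduler step.
    latents = latents.detach().requires_grad_(True)
    unet(latents, t, encoder_hidden_states=text_emb)        # forward pass so attention hooks fire
    attn_maps = collect_cross_attention(unet)               # hypothetical hook-reader; must keep the graph
    loss = loss_fn(attn_maps)                               # e.g. the contrastive loss sketched above
    grad = torch.autograd.grad(loss, latents)[0]
    latents = (latents - step_size * grad).detach()         # update the latent at this timestep
    with torch.no_grad():                                   # then take the usual denoising step
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```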

The core technical innovation lies in our use of attention maps, which we treat as features in our contrastive loss. These maps, delineating the interface between the input text and the generated pixels, guide the generation process toward images that more faithfully adhere to the given prompt.
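
The following sketch shows one plausible way to turn a raw cross-attention tensor into per-token feature vectors for such a loss; the (heads, pixels, tokens) layout is an assumption about the attention tensor, not a documented interface.

```python
import torch
import torch.nn.functional as F

def token_attention_features(cross_attn, height, width, token_indices):
    # cross_attn: (heads, H*W, T) attention of image positions over text tokens (assumed layout).
    maps = cross_attn.mean(dim=0)                                   # average over heads -> (H*W, T)
    maps = maps[:, token_indices].T.reshape(-1, height, width)      # (K, H, W): one map per selected token
    return F.normalize(maps.flatten(1), dim=-1)                     # unit-norm feature vector per token
```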

Results and Conclusion

Empirical evidence from extensive experiments across various datasets and scenarios demonstrates the efficiency and effectiveness of CONFORM. For instance, when tasked with generating images for complex prompts involving multiple objects and attributes, our method not only produces all of the requested objects but also correctly binds attributes to their respective subjects, surpassing other state-of-the-art methods.

In terms of quantitative performance, our approach consistently achieves superior image-text similarity, as measured by CLIP scores, and outperforms competitors on TIFA, a metric that evaluates text-to-image fidelity via question answering.
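
For context, a CLIP image-text similarity score of the kind reported here can be computed as in the sketch below; the checkpoint and prompt are placeholders, and the paper's exact evaluation protocol may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum(dim=-1))       # cosine similarity between image and prompt

# e.g. clip_score(Image.open("sample.png"), "a black cat and a white dog")
```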

A user study further confirms these findings, with participants overwhelmingly choosing images generated by our method as the most accurate representations of the given text prompts. These results reinforce our method's capacity to align closely with semantic intent across various content generation tasks.

In summary, the flexibility and robustness of CONFORM mark a significant step toward addressing fidelity issues in text-to-image models. By publicly releasing our source code, we invite the research community to build upon and extend this work, further improving models' ability to understand and visualize complex human language.
