TIAM -- A Metric for Evaluating Alignment in Text-to-Image Generation (2307.05134v2)

Published 11 Jul 2023 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: The progress in the generation of synthetic images has made it crucial to assess their quality. While several metrics have been proposed to assess the rendering of images, Text-to-Image (T2I) models, which generate images based on a prompt, require additional aspects to be considered, such as the extent to which the generated image matches the important content of the prompt. Moreover, although the generated images usually result from a random starting point, the influence of this starting point is generally not considered. In this article, we propose a new metric based on prompt templates to study the alignment between the content specified in the prompt and the corresponding generated images. It allows us to better characterize the alignment in terms of the type of the specified objects, their number, and their color. We conducted a study of several recent T2I models covering various aspects. An additional interesting result we obtained with our approach is that image quality can vary drastically depending on the noise used as a seed for the images. We also quantify the influence of the number of concepts in the prompt, their order, as well as their (color) attributes. Finally, our method allows us to identify some seeds that produce better images than others, opening novel directions of research on this understudied topic.
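
Operationally, the metric described in the abstract amounts to: generate images from object-centric prompt templates under several fixed seeds, run a detector on each image, and measure the fraction of generations in which all prompted objects (and, optionally, their counts or colors) are recovered. The sketch below illustrates this idea only; it is not the authors' implementation, and `generate_image` and `detect_objects` are hypothetical placeholders standing in for a T2I pipeline and an open-vocabulary detector.

```python
from itertools import product
from typing import Callable, Iterable, List, Set, Tuple


def tiam_style_score(
    prompts_with_objects: Iterable[Tuple[str, List[str]]],  # (prompt, expected object labels)
    seeds: Iterable[int],                                    # fixed seeds reused across prompts
    generate_image: Callable[[str, int], object],            # placeholder: prompt, seed -> image
    detect_objects: Callable[[object], Set[str]],            # placeholder: image -> detected labels
) -> float:
    """Fraction of (prompt, seed) pairs whose image contains every prompted object."""
    successes, total = 0, 0
    for (prompt, expected), seed in product(prompts_with_objects, seeds):
        image = generate_image(prompt, seed)    # e.g. a diffusion pipeline call with a fixed seed
        detected = detect_objects(image)        # e.g. an open-vocabulary detector's label set
        # The generation "succeeds" only if every prompted object is detected.
        successes += int(all(label in detected for label in expected))
        total += 1
    return successes / total if total else 0.0


# Illustrative usage (placeholders, not real APIs):
# prompts = [("a photo of a dog and a cat", ["dog", "cat"]),
#            ("a photo of a red car", ["car"])]
# score = tiam_style_score(prompts, seeds=range(8),
#                          generate_image=my_t2i, detect_objects=my_detector)
```

Averaging over many fixed seeds is what makes the per-seed comparison in the paper possible: the same seed set can be scored per prompt or per seed to reveal seeds that systematically yield better-aligned images.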

Authors (4)
  1. Paul Grimal (2 papers)
  2. Hervé Le Borgne (26 papers)
  3. Olivier Ferret (11 papers)
  4. Julien Tourille (2 papers)
Citations (7)
