A Survey on Quality Metrics for Text-to-Image Models (2403.11821v4)

Published 18 Mar 2024 in cs.CV, cs.AI, and cs.GR

Abstract: Recent AI-based text-to-image models not only excel at generating realistic images, they also give designers more and more fine-grained control over the image content. Consequently, these approaches have gathered increased attention within the computer graphics research community, which has been historically devoted towards traditional rendering techniques that offer precise control over scene parameters such as objects, materials, and lighting, when generating realistic images. While the quality of rendered images is traditionally assessed through well-established image quality metrics, such as SSIM or PSNR, the unique challenges presented by text-to-image models, which in contrast to rendering interweave the control of scene and rendering parameters, necessitate the development of novel image quality metrics. Therefore, within this survey, we provide a comprehensive overview of existing text-to-image quality metrics addressing their nuances and the need for alignment with human preferences. Based on our findings, we propose a new taxonomy for categorizing these metrics, which is grounded in the assumption that there are two main quality criteria, namely compositionality and generality, which ideally map to human preferences. Ultimately, we derive guidelines for practitioners conducting text-to-image evaluation, discuss open challenges of evaluation mechanisms, and surface limitations of current metrics.

Evaluating Text-to-Image Synthesis: A Survey and Taxonomy of Image Quality Metrics

Introduction

The field of text-conditioned image generation has seen significant advances, enabled by models that couple language and vision through training on large-scale image-text datasets. This progress has increased the demand for high-quality image generation in which the generated images cohere with their textual prompts. In response, novel evaluation metrics have been developed that aim to mimic human judgment when validating both the quality of generated images and their alignment with the input text. This work presents a comprehensive survey of existing text-to-image (T2I) evaluation metrics, alongside a proposed taxonomy for categorizing them. Additionally, the paper explores promising approaches for optimizing T2I synthesis and discusses ongoing challenges and limitations of current evaluation frameworks.

Taxonomy Development

The core contribution of this work is a new taxonomy for T2I evaluation metrics. Before the emergence of diffusion-based image generation, evaluation focused predominantly on image quality measures such as the Inception Score (IS) and the Fréchet Inception Distance (FID). The proposed taxonomy addresses the need for a structured approach to evaluating the more complex compositional relationship between text and images. It differentiates metrics into two principal categories, pure image-based metrics and text-conditioned image quality metrics, which are further subdivided according to whether they measure general image quality or compositional quality.
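
As an editorial aid (not a figure from the paper), the taxonomy can be restated as a nested mapping that mirrors the sections below; the example metrics in each leaf are drawn from the text.

```python
# The proposed taxonomy restated as a nested mapping (an editorial sketch,
# not an artifact of the paper); example metrics are taken from the survey.
T2I_METRIC_TAXONOMY = {
    "image metrics": {
        "distribution-based": ["Inception Score (IS)", "Fréchet Inception Distance (FID)"],
        "single image": ["learned predictors of aesthetic/visual quality"],
    },
    "text-image alignment metrics": {
        "embedding-based": ["CLIP/BLIP cosine similarity"],
        "content-based": ["object accuracy", "spatial relations", "attribute alignment"],
    },
}
```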

Image Metrics

  • Distribution-based metrics: These metrics, including IS and FID, use statistical measures to quantify the difference between the distributions of real and generated images, focusing solely on image quality without considering the text condition (a minimal FID sketch follows this list).
  • Single-image metrics: Unlike distribution-based metrics, these assess the quality of individual images based on their structural and semantic composition. Recent approaches employ deep learning models trained to predict human judgments of aesthetic and visual quality.
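
For concreteness, here is a minimal sketch of computing FID from two sets of precomputed Inception features, following the standard formula FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2(C_r C_g)^{1/2}); the feature-extraction step is assumed to have happened elsewhere.

```python
# A minimal FID sketch over precomputed Inception features.
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2})."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```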

Text-Image Alignment Metrics

  • Embedding-based metrics: Assess general text-image alignment using learned embedding representations of the vision and language inputs, with models such as CLIP and BLIP used to compute the cosine similarity between text and image embeddings (see the sketch after this list).
  • Content-based metrics: Delve deeper into the qualitative aspects of generated images by examining compositional quality through content analysis, covering properties such as object accuracy, spatial relations, and attribute alignment.
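
As an illustration of the embedding-based family, the following is a minimal CLIPScore-style sketch, assuming the Hugging Face transformers CLIP API; the checkpoint name and the rescaling factor of 2.5 follow common usage rather than a prescription of the survey.

```python
# A CLIPScore-style alignment sketch; checkpoint and rescaling factor are
# common choices, not mandated by the survey.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment_score(image: Image.Image, prompt: str) -> float:
    """Rescaled cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    cos = (img_emb * txt_emb).sum(dim=-1).item()
    return max(0.0, 2.5 * cos)  # CLIPScore-style rescaling, clipped at zero
```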

Evaluation Metrics Overview

The review highlights the evolution of metrics tailored to specific aspects of T2I synthesis. Embedding-based metrics leverage pre-trained models to assess the alignment between text and image representations. In contrast, content-based metrics offer a more granular evaluation by decomposing the prompt into distinct components and measuring how well each is realized in the image. Approaches such as visual question answering (VQA) models and object detection techniques are employed to validate the compositionality between textual descriptions and their visual counterparts; a VQA-style sketch follows.
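
Below is a hedged sketch in the spirit of TIFA-like faithfulness metrics: the prompt is decomposed into yes/no questions (hand-written here; such metrics typically generate them with a LLM), a VQA model answers them against the image, and the fraction answered correctly is the score. The pipeline task and checkpoint are illustrative choices, not the survey's.

```python
# A TIFA-style faithfulness check using a VQA model; questions are
# hand-written here, and the checkpoint is an illustrative choice.
from transformers import pipeline

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

def faithfulness_score(image_path: str, questions: list[str]) -> float:
    """Fraction of prompt-derived yes/no questions the VQA model affirms."""
    hits = sum(vqa(image=image_path, question=q)[0]["answer"].lower() == "yes"
               for q in questions)
    return hits / len(questions)

# Example, for the prompt "a red cube to the left of a blue sphere":
# faithfulness_score("sample.png", ["Is there a red cube?",
#                                   "Is there a blue sphere?",
#                                   "Is the cube to the left of the sphere?"])
```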

Optimization Approaches

In discussing optimization methods for T2I synthesis, the paper emphasizes the significance of incorporating human judgments into the modeling process. Techniques such as fine-tuning generators on high-quality samples selected by reward models, as well as applying reinforcement learning, show potential to enhance text-image alignment and bring generated images closer to human preferences; a conceptual sketch follows.
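
As a conceptual illustration of reward-based fine-tuning (in the spirit of reward-ranked approaches such as RAFT), the sketch below samples several images per prompt, keeps the highest-scoring ones under a reward model, and fine-tunes on the retained pairs; `generator`, `reward_model`, and `finetune` are placeholders for the user's own stack, not a specific library API.

```python
# A conceptual sketch of reward-ranked finetuning; all callables are
# placeholders, not a specific library API.
def raft_round(generator, reward_model, finetune, prompts,
               n_samples=8, keep=1):
    """One round: sample, rank by reward, fine-tune on the top images."""
    selected = []
    for prompt in prompts:
        images = [generator(prompt) for _ in range(n_samples)]
        ranked = sorted(images, key=lambda img: reward_model(prompt, img),
                        reverse=True)
        selected.extend((prompt, img) for img in ranked[:keep])
    finetune(selected)  # supervised update on the high-reward subset
```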

Challenges and Future Directions

A key challenge addressed is the development of evaluation frameworks that can account for the intricate and diverse aspects of image quality in relation to the text. The paper underscores the need for evaluation metrics that offer detailed, component-level insights, as well as the importance of constructing more comprehensive and complex benchmark datasets. It further discusses adapting existing models and metrics to understand and assess visio-linguistic compositionality more effectively as an avenue for future research.

Conclusion

By establishing a new taxonomy for T2I evaluation metrics and scrutinizing existing metrics and optimization approaches, this work lays a foundation for future advances in T2I synthesis evaluation. In addressing current limitations and proposing directions for future research, the paper contributes to the evolving landscape of generative AI, pushing toward models that generate images that are not only of high quality but also compositionally aligned with their textual descriptions.

Authors (9)
  1. Sebastian Hartwig
  2. Dominik Engel
  3. Leon Sick
  4. Hannah Kniesel
  5. Tristan Payer
  6. Timo Ropinski
  7. Poonam Poonam
  8. Michael Glöckler
  9. Alex Bäuerle