The Challenges of Image Generation Models in Generating Multi-Component Images (2311.13620v1)
Abstract: Recent advances in text-to-image generators have led to substantial capabilities in image generation. However, the complexity of prompts acts as a bottleneck in the quality of images generated. A particular under-explored facet is the ability of generative models to create high-quality images comprising multiple components given as a prior. In this paper, we propose and validate a metric called Components Inclusion Score (CIS) to evaluate the extent to which a model can correctly generate multiple components. Our results reveal that the evaluated models struggle to incorporate all the visual elements from prompts with multiple components (8.53% drop in CIS per component for all evaluated models). We also identify a significant decline in the quality of the images and context awareness within an image as the number of components increased (15.91% decrease in inception Score and 9.62% increase in Frechet Inception Distance). To remedy this issue, we fine-tuned Stable Diffusion V2 on a custom-created test dataset with multiple components, outperforming its vanilla counterpart. To conclude, these findings reveal a critical limitation in existing text-to-image generators, shedding light on the challenge of generating multiple components within a single image using a complex prompt.
- How faithful is your synthetic data? sample-level metrics for evaluating and auditing generative models. In International Conference on Machine Learning, pages 290–306. PMLR, 2022.
- Specifying object attributes and relations in interactive scene generation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4561–4569, 2019.
- Gans. In International Conference on Learning Representations (ICLR), 2018.
- Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
- Effectively unbiased fid and inception score and where to find them. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6070–6079, 2020.
- Discovering the hidden vocabulary of dalle-2. arXiv preprint arXiv:2206.00169, 2022.
- Dall·e mini, 2021.
- Cogview2: Faster and better text-to-image generation via hierarchical transformers. Advances in Neural Information Processing Systems, 35:16890–16902, 2022.
- Context-aware layout to image generation with enhanced object appearance. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15049–15058, 2021.
- CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, Dominican Republic, 2021a. Association for Computational Linguistics.
- CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, Dominican Republic, 2021b. Association for Computational Linguistics.
- Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
- Generating multiple objects at spatially distinct locations. In International Conference on Learning Representations, 2018.
- Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Tiger: Text-to-image grounding for image caption evaluation. arXiv preprint arXiv:1909.02050, 2019.
- Analyzing and improving the image quality of StyleGAN. In Proc. CVPR, 2020.
- Image inpainting guided by coherence priors of semantics and textures. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6539–6548, 2021.
- Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pages 423–439. Springer, 2022.
- An improved evaluation framework for generative adversarial networks. arXiv preprint arXiv:1803.07474, 2018.
- Raphaël Millière. Adversarial attacks on image generation with made-up words. arXiv preprint arXiv:2208.04135, 2022.
- Generating images with sparse representations. arXiv preprint arXiv:2103.03841, 2021.
- Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
- Toward verifiable and reproducible human evaluation for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14277–14286, 2023.
- Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
- Dual contradistinctive generative autoencoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 823–832, 2021.
- On aliased resizing and surprising subtleties in gan evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11410–11420, 2022.
- Learn, imagine and create: Text-to-image generation from prior knowledge. Advances in neural information processing systems, 32, 2019a.
- Mirrorgan: Learning text-to-image generation by redescription. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1505–1514, 2019b.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Zero-shot text-to-image generation, 2021.
- Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
- High-resolution image synthesis with latent diffusion models, 2021.
- Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
- Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
- Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
- Stylegan-xl: Scaling stylegan to large diverse datasets. In ACM SIGGRAPH 2022 conference proceedings, pages 1–10, 2022.
- Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
- Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1316–1324, 2018.
- High-resolution image inpainting with iterative confidence feedback and guided upsampling. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIX 16, pages 1–17. Springer, 2020.
- Cr-fill: Generative image inpainting with auxiliary contextual reconstruction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 14164–14173, 2021.