Diffusion Beats Autoregressive: An Evaluation of Compositional Generation in Text-to-Image Models (2410.22775v2)
Abstract: Text-to-image (T2I) generative models, such as Stable Diffusion and DALL-E, have shown remarkable proficiency in producing high-quality, realistic, and natural images from textual descriptions. However, these models sometimes fail to accurately capture all the details specified in the input prompt, particularly concerning entities, attributes, and spatial relationships. This issue becomes more pronounced when the prompt contains novel or complex compositions, leading to what are known as compositional generation failure modes. Recently, a new open-source diffusion-based T2I model, FLUX, has been introduced, demonstrating strong performance in high-quality image generation. Additionally, autoregressive T2I models such as LlamaGen have claimed visual quality competitive with diffusion-based models. In this study, we evaluate the compositional generation capabilities of these newly introduced models against established models using the T2I-CompBench benchmark. Our findings reveal that LlamaGen, as a vanilla autoregressive model, is not yet on par with state-of-the-art diffusion models for compositional generation under comparable conditions of model size and inference time. On the other hand, the open-source diffusion-based model FLUX exhibits compositional generation capabilities comparable to those of the state-of-the-art closed-source model DALL-E 3.
- Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- A-star: Test-time attention segregation and retention for text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2283–2293, 2023.
- Hrs-bench: Holistic, reliable and scalable benchmark for text-to-image models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20041–20053, 2023.
- Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
- Getting it right: Improving spatial consistency in text-to-image models. arXiv preprint arXiv:2404.01197, 2024.
- Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
- Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.
- Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.
- Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023.
- Faith and fate: Limits of transformers on compositionality. Advances in Neural Information Processing Systems, 36, 2024.
- Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
- Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.
- Training-free structured diffusion guidance for compositional text-to-image synthesis. In The Eleventh International Conference on Learning Representations, 2023.
- Benchmarking spatial relationships in text-to-image generation. ArXiv, abs/2212.10015, 2022.
- Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.
- Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023.
- Counting guidance for high fidelity text-to-image synthesis. arXiv preprint arXiv:2306.17567, 2023.
- Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
- Black Forest Lab. https://www.blackforestlab.com/blog, 2024. Accessed: 2024-09.
- Black Forest Lab. Flux: A diffusion-based text-to-image (t2i) model. https://github.com/blackforestlab/flux, 2024. Accessed: 2024-09.
- LAION. Laion-coco 600m. https://laion.ai/blog/laion-coco, 2022.
- Genai-bench: Evaluating and improving compositional text-to-visual generation. arXiv preprint arXiv:2406.13743, 2024.
- Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
- Divide & bind your attention for improved generative semantic nursing. In 34th British Machine Vision Conference 2023, BMVC 2023, 2023.
- Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
- Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
- Compositional abilities emerge multiplicatively: Exploring diffusion models on a synthetic task. Advances in Neural Information Processing Systems, 36, 2024.
- OpenAI. https://www.openai.com, 2024. Accessed: 2024-09.
- Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
- Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
- Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
- Compositional capabilities of autoregressive transformers: A study on synthetic, interpretable tasks. In Forty-first International Conference on Machine Learning, 2024.
- Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment. Advances in Neural Information Processing Systems, 36, 2024.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III, pages 234–241. Springer, 2015.
- Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- Predicated diffusion: Predicate logic-based attention guidance for text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8651–8660, 2024.
- Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Dvae#: Discrete variational autoencoders with relaxed boltzmann priors. Advances in Neural Information Processing Systems, 31, 2018.
- Compositional text-to-image synthesis with attention map control of diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5544–5552, 2024.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- Towards better text-to-image generation alignment via attention modulation. arXiv preprint arXiv:2404.13899, 2024.
- Iterative object count optimization for text-to-image diffusion models. arXiv preprint arXiv:2408.11721, 2024.
- Enhancing semantic fidelity in text-to-image synthesis: Attention regulation in diffusion models. arXiv preprint arXiv:2403.06381, 2024.
- Simple multi-dataset detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7571–7580, 2022.