ConceptLab: Creative Concept Generation using VLM-Guided Diffusion Prior Constraints (2308.02669v2)
Abstract: Recent text-to-image generative models have enabled us to transform our words into vibrant, captivating imagery. The surge of personalization techniques that followed has further allowed us to imagine unique concepts in new scenes. However, an intriguing question remains: How can we generate a new, imaginary concept that has never been seen before? In this paper, we present the task of creative text-to-image generation, where we seek to generate new members of a broad category (e.g., generating a pet that differs from all existing pets). We leverage the under-studied Diffusion Prior models and show that the creative generation problem can be formulated as an optimization process over the output space of the diffusion prior, resulting in a set of "prior constraints". To keep our generated concept from converging into existing members, we incorporate a question-answering Vision-Language Model (VLM) that adaptively adds new constraints to the optimization problem, encouraging the model to discover increasingly unique creations. Finally, we show that our prior constraints can also serve as a strong mixing mechanism, allowing us to create hybrids between generated concepts and introducing even more flexibility into the creative process.
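The optimization the abstract describes can be illustrated with a toy sketch: a candidate embedding is pulled toward a broad-category target ("a pet") while hinge-style negative constraints push it away from known members; in the full method, the VLM would name the closest existing member and its embedding would be appended to the negative set. This is not the paper's implementation — the four-dimensional vectors, thresholds, and numeric gradient descent below are hypothetical stand-ins for CLIP embeddings and backpropagation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors (stand-in for CLIP-space similarity).
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def prior_constraint_loss(x, pos, negs, lam_pos=0.9, lam_neg=0.3):
    # Positive constraint: stay close to the broad-category embedding.
    loss = max(0.0, lam_pos - cosine(x, pos))
    # Negative constraints: stay away from each known-member embedding.
    # In the full method, a question-answering VLM would adaptively append
    # the closest existing member to `negs` during optimization.
    for n in negs:
        loss += max(0.0, cosine(x, n) - lam_neg)
    return loss

def optimize(x, pos, negs, steps=400, lr=0.05, eps=1e-4):
    # Numerical gradient descent (a stand-in for backprop through the prior).
    x = list(x)
    for _ in range(steps):
        base = prior_constraint_loss(x, pos, negs)
        grad = []
        for i in range(len(x)):
            x[i] += eps
            grad.append((prior_constraint_loss(x, pos, negs) - base) / eps)
            x[i] -= eps
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x

# Toy example: the broad category and two known members in a 4-d space.
pos = [1.0, 0.0, 0.0, 0.0]
negs = [[0.7071, 0.7071, 0.0, 0.0], [0.7071, 0.0, 0.7071, 0.0]]
x0 = [0.5, 0.5, 0.5, 0.5]
x_opt = optimize(x0, pos, negs)
```

After optimization, `x_opt` is similar to the category target but dissimilar to every known member — the toy analogue of a "new pet" that is recognizably a pet yet matches no existing one.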