ConceptLab: Creative Concept Generation using VLM-Guided Diffusion Prior Constraints (2308.02669v2)

Published 3 Aug 2023 in cs.CV

Abstract: Recent text-to-image generative models have enabled us to transform our words into vibrant, captivating imagery. The surge of personalization techniques that has followed has also allowed us to imagine unique concepts in new scenes. However, an intriguing question remains: How can we generate a new, imaginary concept that has never been seen before? In this paper, we present the task of creative text-to-image generation, where we seek to generate new members of a broad category (e.g., generating a pet that differs from all existing pets). We leverage the under-studied Diffusion Prior models and show that the creative generation problem can be formulated as an optimization process over the output space of the diffusion prior, resulting in a set of "prior constraints". To keep our generated concept from converging into existing members, we incorporate a question-answering Vision-Language Model (VLM) that adaptively adds new constraints to the optimization problem, encouraging the model to discover increasingly unique creations. Finally, we show that our prior constraints can also serve as a strong mixing mechanism, allowing us to create hybrids between generated concepts, introducing even more flexibility into the creative process.
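
The abstract compresses the core mechanics, so a short sketch may help make the loop concrete. Below is a minimal, hypothetical PyTorch rendering of the described process: a learned concept embedding is optimized through a frozen diffusion prior under embedding-space "prior constraints" (pull toward the broad category, push away from known members), and a question-answering VLM periodically names what the current concept resembles so that the answer joins the negative set. All interfaces here (encode_text, diffusion_prior, vlm_name_concept) are placeholder stubs standing in for a real CLIP text encoder, a frozen diffusion prior, and a VLM; this is an illustration of the abstract's description, not the authors' implementation.

```python
"""Schematic sketch (not the authors' code) of VLM-guided creative
concept optimization under diffusion prior constraints."""
import torch
import torch.nn.functional as F

DIM = 768  # assumed CLIP embedding width

def encode_text(prompt: str) -> torch.Tensor:
    # Stub for a CLIP text encoder; returns a deterministic unit-norm vector.
    g = torch.Generator().manual_seed(abs(hash(prompt)) % (2**31))
    return F.normalize(torch.randn(1, DIM, generator=g), dim=-1)

def diffusion_prior(token_embed: torch.Tensor) -> torch.Tensor:
    # Stub for a frozen diffusion prior mapping a learned text-side embedding
    # to a CLIP image embedding; a normalization stands in for the real model.
    return F.normalize(token_embed, dim=-1)

def vlm_name_concept(image_embed: torch.Tensor) -> str:
    # Stub for asking a question-answering VLM what the generation resembles.
    return "hamster"

def prior_constraint_loss(image_embed, positives, negatives, lam=1.0):
    """Positive constraints pull the prior's output toward the broad category;
    negative constraints push it away from known category members."""
    pos = torch.stack([1 - F.cosine_similarity(image_embed, p, dim=-1)
                       for p in positives]).mean()
    neg = torch.stack([F.cosine_similarity(image_embed, n, dim=-1)
                       for n in negatives]).mean()
    return pos + lam * neg

concept = torch.randn(1, DIM, requires_grad=True)      # learned concept embedding
opt = torch.optim.Adam([concept], lr=1e-3)
positives = [encode_text("a photo of a pet")]          # broad category
negatives = [encode_text(w) for w in ("dog", "cat")]   # seed members to avoid

for step in range(500):
    image_embed = diffusion_prior(concept)
    loss = prior_constraint_loss(image_embed, positives, negatives)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 100 == 99:                               # adaptive VLM step:
        name = vlm_name_concept(image_embed)           # ask what it resembles,
        negatives.append(encode_text(name))            # then forbid that answer
```

Under the same reading, the mixing mechanism mentioned at the end of the abstract would amount to running a fresh optimization in which the embeddings of two previously learned concepts serve together as positive constraints.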
