ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models (2305.16225v3)
Abstract: Personalizing generative models offers a way to guide image generation with user-provided references. Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models. However, representing and editing specific visual attributes such as material, style, and layout remains a challenge, leading to a lack of disentanglement and editability. To address this problem, we propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low- to high-frequency information, providing a new perspective on representing, generating, and editing images. We develop the Prompt Spectrum Space P*, an expanded textual conditioning space, and a new image representation method called ProSpect. ProSpect represents an image as a collection of inverted textual token embeddings encoded from per-stage prompts, where each prompt corresponds to a specific generation stage (i.e., a group of consecutive steps) of the diffusion model. Experimental results demonstrate that P* and ProSpect offer better disentanglement and controllability compared to existing methods. We apply ProSpect in various personalized attribute-aware image generation applications, such as image-guided or text-driven manipulations of materials, style, and layout, achieving previously unattainable results from a single image input without fine-tuning the diffusion models. Our source code is available at https://github.com/zyxElsa/ProSpect.
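The per-stage conditioning described in the abstract can be illustrated with a minimal sketch. It is not taken from the ProSpect code base: the function names `stage_for_step` and `conditioning_schedule`, the equal-width split of steps into stages, and the toy values (20 steps, 5 stages, strings standing in for token embeddings) are all assumptions made for illustration. It only shows how one prompt embedding per stage could be looked up for each denoising step.

```python
from typing import List, Sequence


def stage_for_step(step: int, num_steps: int, num_stages: int) -> int:
    """Map a denoising step index (0 = first step taken) to a stage index.

    Assumption: stages are contiguous groups of steps of (nearly) equal size.
    """
    if not 0 <= step < num_steps:
        raise ValueError("step out of range")
    # Integer arithmetic keeps the groups contiguous and ordered.
    return step * num_stages // num_steps


def conditioning_schedule(stage_embeddings: Sequence, num_steps: int) -> List:
    """Return the conditioning to use at every step, one entry per step."""
    num_stages = len(stage_embeddings)
    return [
        stage_embeddings[stage_for_step(s, num_steps, num_stages)]
        for s in range(num_steps)
    ]


# Toy example: strings stand in for the inverted token embeddings.
schedule = conditioning_schedule(["p0", "p1", "p2", "p3", "p4"], num_steps=20)
```

Under these assumptions, editing a single attribute would amount to swapping the entry for one stage (for example, an early stage for layout or a late stage for fine texture) while leaving the others unchanged.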
- Yuxin Zhang
- Weiming Dong
- Fan Tang
- Nisha Huang
- Haibin Huang
- Chongyang Ma
- Tong-Yee Lee
- Oliver Deussen
- Changsheng Xu