PALP: Prompt Aligned Personalization of Text-to-Image Models (2401.06105v1)
Abstract: Content creators often aim to create personalized images using personal subjects that go beyond the capabilities of conventional text-to-image models. Additionally, they may want the resulting image to encompass a specific location, style, ambiance, and more. Existing personalization methods may compromise personalization ability or the alignment to complex textual prompts. This trade-off can impede the fulfillment of user prompts and subject fidelity. We propose a new approach focusing on personalization methods for a *single* prompt to address this issue. We term our approach prompt-aligned personalization. While this may seem restrictive, our method excels in improving text alignment, enabling the creation of images with complex and intricate prompts, which may pose a challenge for current techniques. In particular, our method keeps the personalized model aligned with a target prompt using an additional score distillation sampling term. We demonstrate the versatility of our method in multi- and single-shot settings and further show that it can compose multiple subjects or use inspiration from reference images, such as artworks. We compare our approach quantitatively and qualitatively with existing baselines and state-of-the-art techniques.
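For context, the score distillation sampling (SDS) term mentioned in the abstract is usually written, following DreamFusion, as the gradient below. This is the generic form and not necessarily the exact term PALP uses. The notation is an assumption made for illustration: $\theta$ denotes the parameters being optimized, $x = g(\theta)$ the generated image, $x_t$ its noised version at timestep $t$, $y$ the target prompt, $\hat{\epsilon}_\phi$ a frozen pretrained denoiser, and $w(t)$ a timestep weighting.

```latex
% Generic SDS gradient (DreamFusion form); symbols are illustrative assumptions.
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
x_t = \sqrt{\bar{\alpha}_t}\,x + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\quad \epsilon \sim \mathcal{N}(0, I).
```

Roughly, the residual between the predicted noise and the injected noise pushes $x$ toward images that the pretrained model considers likely under the prompt $y$, which is how such a term can help keep a personalized model aligned with a target prompt.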
Authors:
- Moab Arar
- Andrey Voynov
- Amir Hertz
- Omri Avrahami
- Shlomi Fruchter
- Yael Pritch
- Daniel Cohen-Or
- Ariel Shamir