U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation (2403.20231v1)

Published 29 Mar 2024 in cs.CV

Abstract: Concept personalization methods enable large text-to-image models to learn specific subjects (e.g., objects/poses/3D models) and synthesize renditions in new contexts. Because image references are highly biased toward particular visual attributes, state-of-the-art personalization models tend to overfit the whole subject and cannot disentangle visual characteristics in pixel space. In this study, we propose a more challenging setting, namely fine-grained visual appearance personalization. Unlike existing methods, ours allows users to provide a sentence describing the desired attributes. A novel decoupled self-augmentation strategy is proposed to generate target-related and non-target samples for learning user-specified visual attributes. These augmented data refine the model's understanding of the target attribute while mitigating the impact of unrelated attributes. At the inference stage, adjustments are made in the semantic space through the learned target and non-target embeddings to further disentangle the target attributes. Extensive experiments on various kinds of visual attributes, compared against SOTA personalization methods, show that the proposed method can mimic target visual appearances in novel contexts, improving the controllability and flexibility of personalization.
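The inference-stage adjustment described above (strengthening a learned target-attribute embedding while suppressing a learned non-target embedding in the semantic space) can be pictured with a short sketch. The paper's exact formulation is not given in this summary, so the PyTorch snippet below is only an illustrative assumption: the function name, the projection step, and the `alpha`/`beta` weights are all hypothetical.

```python
import torch
import torch.nn.functional as F

def adjust_semantic_embedding(prompt_emb: torch.Tensor,
                              target_emb: torch.Tensor,
                              non_target_emb: torch.Tensor,
                              alpha: float = 1.0,
                              beta: float = 0.5) -> torch.Tensor:
    """Hypothetical semantic-space adjustment at inference time.

    Pushes the prompt embedding toward the learned target-attribute
    direction and removes its component along the learned non-target
    direction, approximating the disentanglement step the abstract
    describes.
    """
    # Unit direction carrying the user-specified attribute.
    target_dir = F.normalize(target_emb, dim=-1)
    # Unit direction carrying unrelated (non-target) attributes.
    non_target_dir = F.normalize(non_target_emb, dim=-1)

    # Project out the non-target component from the prompt embedding.
    proj = (prompt_emb * non_target_dir).sum(-1, keepdim=True) * non_target_dir
    disentangled = prompt_emb - beta * proj

    # Strengthen the target attribute.
    return disentangled + alpha * target_dir
```

In use, `prompt_emb` would be the text encoder's output for the inference prompt, and `alpha`/`beta` would be tuned per attribute to trade off attribute fidelity against prompt adherence.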

Authors (6)
  1. You Wu (60 papers)
  2. Kean Liu (2 papers)
  3. Xiaoyue Mi (9 papers)
  4. Fan Tang (46 papers)
  5. Juan Cao (73 papers)
  6. Jintao Li (44 papers)
Citations (3)