Cross Initialization for Personalized Text-to-Image Generation (2312.15905v1)

Published 26 Dec 2023 in cs.CV

Abstract: Recently, there has been a surge in face personalization techniques, benefiting from the advanced capabilities of pretrained text-to-image diffusion models. Among these, a notable method is Textual Inversion, which generates personalized images by inverting given images into textual embeddings. However, methods based on Textual Inversion still struggle with balancing the trade-off between reconstruction quality and editability. In this study, we examine this issue through the lens of initialization. Upon closely examining traditional initialization methods, we identified a significant disparity between the initial and learned embeddings in terms of both scale and orientation. The scale of the learned embedding can be up to 100 times greater than that of the initial embedding. Such a significant change in the embedding could increase the risk of overfitting, thereby compromising the editability. Driven by this observation, we introduce a novel initialization method, termed Cross Initialization, that significantly narrows the gap between the initial and learned embeddings. This method not only improves both reconstruction and editability but also reduces the optimization steps from 5000 to 320. Furthermore, we apply a regularization term to keep the learned embedding close to the initial embedding. We show that when combined with Cross Initialization, this regularization term can effectively improve editability. We provide comprehensive empirical evidence to demonstrate the superior performance of our method compared to the baseline methods. Notably, in our experiments, Cross Initialization is the only method that successfully edits an individual's facial expression. Additionally, a fast version of our method allows for capturing an input image in roughly 26 seconds, while surpassing the baseline methods in terms of both reconstruction and editability. Code will be made publicly available.
