LCM-Lookahead for Encoder-based Text-to-Image Personalization (2404.03620v1)

Published 4 Apr 2024 in cs.CV and cs.GR

Abstract: Recent advancements in diffusion models have introduced fast sampling methods that can effectively produce high-quality images in just one or a few denoising steps. Interestingly, when these are distilled from existing diffusion models, they often maintain alignment with the original model, retaining similar outputs for similar prompts and seeds. These properties present opportunities to leverage fast sampling methods as a shortcut-mechanism, using them to create a preview of denoised outputs through which we can backpropagate image-space losses. In this work, we explore the potential of using such shortcut-mechanisms to guide the personalization of text-to-image models to specific facial identities. We focus on encoder-based personalization approaches, and demonstrate that by tuning them with a lookahead identity loss, we can achieve higher identity fidelity, without sacrificing layout diversity or prompt alignment. We further explore the use of attention sharing mechanisms and consistent data generation for the task of personalization, and find that encoder training can benefit from both.
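To make the "lookahead" idea concrete, below is a minimal sketch of how an image-space identity loss could be backpropagated through a one-step preview from a distilled fast sampler (e.g. an LCM), as the abstract describes. All names here (`fast_denoiser`, `decoder`, `face_embedder`, `target_id_embedding`) are hypothetical placeholders standing in for the distilled model, a latent decoder, and a face-recognition embedder; this is not the paper's actual implementation.

```python
# Sketch of a lookahead identity loss, assuming a distilled few-step sampler
# that stays aligned with the base diffusion model. Placeholder callables,
# not the paper's code.
import torch
import torch.nn.functional as F


def lookahead_identity_loss(noisy_latent, timestep, cond,
                            fast_denoiser, decoder, face_embedder,
                            target_id_embedding):
    # One-step "shortcut" prediction of the clean latent x0 from the noisy latent.
    x0_preview = fast_denoiser(noisy_latent, timestep, cond)

    # Decode the previewed latent to image space (e.g. with a lightweight decoder).
    image_preview = decoder(x0_preview)

    # Embed the previewed face and compare it to the target identity embedding.
    id_pred = F.normalize(face_embedder(image_preview), dim=-1)
    id_target = F.normalize(target_id_embedding, dim=-1)

    # Cosine-distance identity loss; gradients flow back through the preview
    # into whatever produced `cond` (here, the personalization encoder).
    return (1.0 - (id_pred * id_target).sum(dim=-1)).mean()
```

In this sketch the encoder being tuned enters only through `cond`, so minimizing the loss pushes the encoder toward conditions whose previewed denoised images match the target identity, without touching the base model's sampling path.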
