SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation (2312.16272v2)

Published 26 Dec 2023 in cs.CV

Abstract: Recent advancements in subject-driven image generation have led to zero-shot generation, yet precise selection and focus on crucial subject representations remain challenging. Addressing this, we introduce the SSR-Encoder, a novel architecture designed for selectively capturing any subject from single or multiple reference images. It responds to various query modalities including text and masks, without necessitating test-time fine-tuning. The SSR-Encoder combines a Token-to-Patch Aligner that aligns query inputs with image patches and a Detail-Preserving Subject Encoder for extracting and preserving fine features of the subjects, thereby generating subject embeddings. These embeddings, used in conjunction with original text embeddings, condition the generation process. Characterized by its model generalizability and efficiency, the SSR-Encoder adapts to a range of custom models and control modules. Enhanced by the Embedding Consistency Regularization Loss for improved training, our extensive experiments demonstrate its effectiveness in versatile and high-quality image generation, indicating its broad applicability. Project page: https://ssr-encoder.github.io

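The abstract describes the core mechanism only in prose; a minimal sketch may make it concrete. The snippet below illustrates the token-to-patch alignment idea (query tokens, e.g. from text, cross-attending over reference-image patch features to pull out subject embeddings) together with a cosine-style consistency regularizer in the spirit of the Embedding Consistency Regularization Loss. It assumes PyTorch; every class name, function name, and dimension here is illustrative, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenToPatchAligner(nn.Module):
    """Illustrative cross-attention: query tokens (text or mask embeddings)
    attend over reference-image patch features to yield subject embeddings.
    Dimensions are placeholders, not the paper's configuration."""
    def __init__(self, query_dim=768, patch_dim=1024, dim=768):
        super().__init__()
        self.to_q = nn.Linear(query_dim, dim, bias=False)
        self.to_k = nn.Linear(patch_dim, dim, bias=False)
        self.to_v = nn.Linear(patch_dim, dim, bias=False)

    def forward(self, query_tokens, patch_feats):
        # query_tokens: (B, Nq, query_dim); patch_feats: (B, Np, patch_dim)
        q = self.to_q(query_tokens)
        k = self.to_k(patch_feats)
        v = self.to_v(patch_feats)
        # Scaled dot-product attention over image patches.
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        subject_emb = attn @ v   # (B, Nq, dim): one embedding per query token
        return subject_emb, attn # attn localizes the queried subject on patches

def embedding_consistency_loss(emb_a, emb_b):
    """Hypothetical regularizer: pull together subject embeddings extracted
    under different conditions (e.g. different references) for the same
    subject, via mean cosine distance."""
    return (1.0 - F.cosine_similarity(emb_a, emb_b, dim=-1)).mean()
```

In a full pipeline, the resulting subject embeddings would be supplied alongside the original text embeddings as extra conditioning for the diffusion model's cross-attention layers; the Detail-Preserving Subject Encoder, which would produce the fine-grained patch features consumed above, is not sketched here.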
