FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation (2403.06775v1)

Published 11 Mar 2024 in cs.CV

Abstract: Subject-driven generation has garnered significant interest recently due to its ability to personalize text-to-image generation. Typical works focus on learning the new subject's private attributes. However, an important fact has been overlooked: a subject is not an isolated new concept, but a specialization of a certain category in the pre-trained model. As a result, the subject fails to comprehensively inherit the attributes of its category, causing poor attribute-related generations. In this paper, motivated by object-oriented programming, we model the subject as a derived class whose base class is its semantic category. This modeling enables the subject to inherit public attributes from its category while learning its private attributes from the user-provided example. Specifically, we propose a plug-and-play method, Subject-Derived regularization (SuDe). It constructs the base-derived class modeling by constraining the subject-driven generated images to semantically belong to the subject's category. Extensive experiments under three baselines and two backbones on various subjects show that SuDe enables imaginative attribute-related generations while maintaining subject fidelity. Code will be open-sourced soon at FaceChain (https://github.com/modelscope/facechain).

Authors (6)
  1. Pengchong Qiao (8 papers)
  2. Lei Shang (21 papers)
  3. Chang Liu (864 papers)
  4. Baigui Sun (41 papers)
  5. Xiangyang Ji (159 papers)
  6. Jie Chen (602 papers)

Summary

Enhancing Subject-Driven Generation with Subject-Derived Regularization

Introduction

Subject-driven generation is a growing niche within text-to-image generation that personalizes a model for specific subjects, such as pets or characters, from minimal user-provided examples. This paper tackles a persistent problem in that setting: existing methods fail to capture the full breadth of a subject's attributes, particularly when only a single example image is provided. The proposed method, Subject-Derived regularization (SuDe), frames the problem in terms of object-oriented programming, letting a subject inherit attributes from its broader semantic category to fill the gaps left by limited user-provided data.

Core Proposal

At the heart of SuDe is the modeling of a subject as a derived class that inherits public attributes from a base class: its semantic category, already represented in the pre-trained model. This dual focus ensures that specific, private attributes are learned directly from the provided subject image, while a wider range of generalized, public attributes is inherited from the category, improving attribute-related generation. The modeling addresses a common failure mode in which models cannot generate a subject performing actions, or displaying attributes, that are absent from the example image yet typical of the subject's category. The analogy is sketched below.
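To make the analogy concrete, here is a purely illustrative Python sketch, not code from the paper: the base class stands for the semantic category the pre-trained model already knows, the derived class for the user's subject, and all class names and attributes are hypothetical.

```python
# Illustrative analogy only, not the paper's code: the pre-trained model
# already "knows" the category (base class), so its public attributes
# (actions, contexts) come for free; only the subject's private
# attributes must be learned from the one-shot example.

class Dog:
    """Base class: the semantic category in the pre-trained model."""

    def run(self) -> str:
        return "a dog running"    # public attribute: behavior the model knows

    def swim(self) -> str:
        return "a dog swimming"   # public attribute: behavior the model knows


class Spike(Dog):
    """Derived class: the user's specific subject."""

    def __init__(self) -> None:
        # Private attribute: appearance learned from the single example.
        self.appearance = "brown fur, white chest, floppy ears"


spike = Spike()
print(spike.run())        # inherited from the category ("Dog")
print(spike.appearance)   # learned from the user-provided image
```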

Subject-Derived Regularization

SuDe is implemented as a regularization that constrains generated images of a subject to semantically belong to its category, for example, ensuring that images of "Spike," a specific dog, are still recognized as belonging to the broader "dog" category. The method relies on revealing the implicit classifier inside the diffusion model used for generation, exploiting the model's inherent understanding of categories to guide training. A complementary strategy, termed loss truncation, prevents over-optimization of the regularizer, respecting the intrinsic uncertainty at each step of the diffusion process and preserving the model's stability and fidelity to the subject. A hedged sketch of this training objective follows.
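Below is a minimal sketch of one plausible instantiation of this objective, assuming a DreamBooth-style fine-tuning loop. It is a paraphrase, not the authors' implementation: `eps_model`, `subj_emb`, `cat_emb`, and the truncation threshold `tau` are hypothetical placeholders, and the regularizer here simply pulls the subject-conditioned noise prediction toward the category-conditioned one as a proxy for "belongs to the category."

```python
# A hedged sketch of a SuDe-style training step, not the authors' code.
# `eps_model` is the diffusion UNet's noise predictor; `subj_emb` and
# `cat_emb` are text embeddings for the subject prompt ("a [V] dog")
# and the category prompt ("a dog"). All names are placeholders.
import torch


def sude_train_step(eps_model, x_t, t, noise, subj_emb, cat_emb, tau=0.1):
    # Standard subject-driven denoising loss: learn the subject's
    # private attributes from the user-provided example.
    eps_subj = eps_model(x_t, t, subj_emb)
    loss_subj = torch.mean((eps_subj - noise) ** 2)

    # Derived-class regularizer (proxy): the prediction under the
    # subject prompt should stay consistent with the category prompt,
    # so generations of the subject remain semantically in-category.
    # The category branch is detached so only the subject branch moves.
    with torch.no_grad():
        eps_cat = eps_model(x_t, t, cat_emb)
    loss_sude = torch.mean((eps_subj - eps_cat) ** 2)

    # Loss truncation (one reading of the paper's safeguard): once the
    # regularizer is small enough, stop optimizing it to avoid
    # over-constraining the model at noisy diffusion steps.
    if loss_sude.item() < tau:
        loss_sude = loss_sude.detach()

    return loss_subj + loss_sude
```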

Experimental Validation

Extensive experiments across three baseline methods and two backbones confirm SuDe's effectiveness at enabling imaginative, attribute-rich generation while preserving subject fidelity. Because SuDe is plug-and-play, it can be attached to different baseline models, where it delivers consistent improvements in both attribute alignment and subject fidelity. The gains are especially notable in one-shot scenarios, a widely acknowledged challenge in the field.

Theoretical Insights

Beyond the implementation, the paper offers a theoretical analysis showing how SuDe models the conditional distribution of generating a subject with both private and inherited attributes, grounding its empirical success in a solid foundation. One way to read this factorization is given below.
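In the following notation (a hedged paraphrase, not the paper's exact formulation), the derived-class transition can be read as a Bayes-rule factorization that combines the subject-conditioned denoiser with an implicit category classifier:

```latex
% Hedged paraphrase of the factorization behind SuDe; notation ours.
\[
  p\left(x_{t-1} \mid x_t,\, c_{\mathrm{sub}},\, c_{\mathrm{cat}}\right)
  \;\propto\;
  \underbrace{p\left(x_{t-1} \mid x_t,\, c_{\mathrm{sub}}\right)}_{\text{private attributes, learned from the example}}
  \,\cdot\,
  \underbrace{p\left(c_{\mathrm{cat}} \mid x_{t-1}\right)}_{\text{inherited attributes, enforced by the regularizer}}
\]
```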

Future Directions

The introduction of SuDe not only addresses a current limitation in subject-driven generation but also opens avenues for future research. The paper's object-oriented framing introduces a novel perspective that could inspire subsequent methods in both generative AI and other domains. Furthermore, the practical and theoretical implications of this work hint at broader applications, potentially extending beyond image generation to areas like personalized content creation or adaptive learning systems.

Conclusion

In summary, this paper presents a significant advance in subject-driven generation through its intuitive yet powerful Subject-Derived regularization method. By enabling subjects to inherit attributes from their broader categories, SuDe enriches the generative model's capacity for attribute-related imagery, underscoring the potential of integrating object-oriented concepts into generative AI.
