FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation (2403.06775v1)
Abstract: Subject-driven generation has garnered significant interest recently due to its ability to personalize text-to-image generation. Typical works focus on learning the new subject's private attributes. However, an important fact has been overlooked: a subject is not an isolated new concept but a specialization of a certain category in the pre-trained model. As a result, the subject fails to comprehensively inherit the attributes of its category, leading to poor attribute-related generations. In this paper, motivated by object-oriented programming, we model the subject as a derived class whose base class is its semantic category. This modeling enables the subject to inherit public attributes from its category while learning its private attributes from the user-provided example. Specifically, we propose a plug-and-play method, Subject-Derived regularization (SuDe). It constructs the base-derived class modeling by constraining subject-driven generated images to semantically belong to the subject's category. Extensive experiments on three baselines and two backbones across various subjects show that SuDe enables imaginative attribute-related generations while maintaining subject fidelity. Code will be open-sourced soon at FaceChain (https://github.com/modelscope/facechain).
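To make the object-oriented analogy concrete, the sketch below is a minimal, purely illustrative Python example (not the SuDe implementation, and all class and attribute names are hypothetical): the semantic category plays the role of a base class that supplies public attributes, while the user's subject is a derived class that inherits those attributes and adds private ones learned from the single user-provided example.

```python
# Illustrative sketch of the base/derived class analogy from the abstract.
# This is NOT the SuDe method itself; the classes and attributes are hypothetical.

class Dog:
    """Base class: the semantic category already known to the pre-trained model."""

    # Public attributes shared by every member of the category.
    public_attributes = {"running", "jumping", "opening its mouth"}

    def generate(self, attribute: str) -> str:
        if attribute not in self.public_attributes:
            raise ValueError(f"Unknown category attribute: {attribute}")
        return f"an image of a dog {attribute}"


class MyDog(Dog):
    """Derived class: the user's subject, a specialization of the category."""

    def __init__(self, private_appearance: str):
        # Private attributes learned from the one-shot user-provided example,
        # e.g. this particular dog's fur color and markings.
        self.private_appearance = private_appearance

    def generate(self, attribute: str) -> str:
        # Inherit the category's public attribute (e.g. "running") while
        # keeping the subject's private appearance for fidelity.
        base_image = super().generate(attribute)
        return base_image.replace("a dog", f"my dog ({self.private_appearance})")


# The subject inherits "running" from its category even though the single
# reference image never showed it running.
print(MyDog("brown fur, white chest").generate("running"))
```

In this analogy, SuDe's regularization corresponds to enforcing the inheritance link: generations of the subject are constrained to remain semantically valid instances of the base category, so attribute-related prompts (e.g. the subject running or jumping) stay achievable.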
Authors: Pengchong Qiao, Lei Shang, Chang Liu, Baigui Sun, Xiangyang Ji, Jie Chen