High-fidelity Person-centric Subject-to-Image Synthesis (2311.10329v5)
Abstract: Current subject-driven image generation methods encounter significant challenges in person-centric image generation. The reason is that they learn semantic scene and person generation by fine-tuning a common pre-trained diffusion model, which involves an irreconcilable training imbalance. Specifically, to generate realistic persons, they need to tune the pre-trained model sufficiently, which inevitably causes the model to forget the rich semantic scene prior and makes scene generation overfit to the training data. Moreover, even with sufficient fine-tuning, these methods still cannot generate high-fidelity persons, since joint learning of scene and person generation also leads to a quality compromise. In this paper, we propose Face-diffuser, an effective collaborative generation pipeline that eliminates the above training imbalance and quality compromise. Specifically, we first develop two specialized pre-trained diffusion models, i.e., the Text-driven Diffusion Model (TDM) and the Subject-augmented Diffusion Model (SDM), for scene and person generation, respectively. The sampling process is divided into three sequential stages, i.e., semantic scene construction, subject-scene fusion, and subject enhancement. The first and last stages are performed by TDM and SDM, respectively. The subject-scene fusion stage is a collaboration between the two models achieved through a novel and highly effective mechanism, Saliency-adaptive Noise Fusion (SNF). It is based on our key observation that there is a robust link between classifier-free guidance responses and the saliency of generated images. At each time step, SNF leverages the unique strengths of each model and automatically blends the predicted noises from both models spatially in a saliency-aware manner. Extensive experiments confirm the impressive effectiveness and robustness of Face-diffuser.
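To make the fusion stage concrete, below is a minimal PyTorch-style sketch of saliency-aware noise blending at one denoising step. It is illustrative only, written from the abstract rather than the paper's exact formulation: the helper names (`cfg_response`, `saliency_adaptive_noise_fusion`), the `model(x_t, t, cond)` call signature, the use of the mean absolute conditional/unconditional difference as the saliency signal, and the softmax-with-temperature masking are all assumptions.

```python
import torch
import torch.nn.functional as F

def cfg_response(model, x_t, t, cond, uncond, guidance_scale=7.5):
    """Classifier-free guidance step (assumed model signature: model(x_t, t, cond)).

    Returns the guided noise prediction and a per-pixel saliency signal taken
    here as the magnitude of the conditional/unconditional difference.
    """
    eps_uncond = model(x_t, t, uncond)
    eps_cond = model(x_t, t, cond)
    guided = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    saliency = (eps_cond - eps_uncond).abs().mean(dim=1, keepdim=True)  # (B, 1, H, W)
    return guided, saliency

def saliency_adaptive_noise_fusion(tdm, sdm, x_t, t, scene_cond, subject_cond,
                                   uncond, guidance_scale=7.5, temperature=0.1):
    """One fused denoising step: each spatial location favors the noise predicted
    by whichever model (scene TDM or subject SDM) responds more saliently there."""
    eps_tdm, sal_tdm = cfg_response(tdm, x_t, t, scene_cond, uncond, guidance_scale)
    eps_sdm, sal_sdm = cfg_response(sdm, x_t, t, subject_cond, uncond, guidance_scale)

    # Normalize the two saliency maps, then turn them into a near-binary spatial mask.
    saliency = torch.stack([sal_tdm, sal_sdm], dim=0)            # (2, B, 1, H, W)
    saliency = saliency / (saliency.sum(dim=0, keepdim=True) + 1e-8)
    mask = F.softmax(saliency / temperature, dim=0)

    # Spatially blend the two predicted noises according to the mask.
    eps_fused = mask[0] * eps_tdm + mask[1] * eps_sdm
    return eps_fused
```

As the temperature approaches zero the soft mask becomes a winner-take-all selection, so each spatial region of the latent is effectively denoised by only one of the two specialized models.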
Authors: Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin