Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation (2403.05239v1)
Abstract: Vanilla text-to-image diffusion models struggle with generating accurate human images, commonly resulting in imperfect anatomies such as unnatural postures or disproportionate limbs. Existing methods address this issue mostly by fine-tuning the model with extra images or by adding additional controls -- human-centric priors such as pose or depth maps -- during the image generation phase. This paper explores integrating these human-centric priors directly into the model fine-tuning stage, essentially eliminating the need for extra conditions at inference. We realize this idea by proposing a human-centric alignment loss that strengthens human-related information from the textual prompts within the cross-attention maps. To ensure semantic detail richness and human structural accuracy during fine-tuning, we introduce scale-aware and step-wise constraints within the diffusion process, informed by an in-depth analysis of the cross-attention layers. Extensive experiments show that our method significantly improves over state-of-the-art text-to-image models in synthesizing high-quality human images from user-written prompts. Project page: \url{https://hcplayercvpr2024.github.io}.
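The abstract does not specify the exact form of the human-centric alignment loss, so the PyTorch-style sketch below is only a rough illustration of how a cross-attention alignment term of this kind could look. The function name, the `human_token_ids` / `human_mask` inputs, and the MSE formulation are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a human-centric cross-attention alignment loss.
# Assumptions (not from the paper): the target is a spatial human-region mask
# derived from a human-centric prior (e.g. a pose or parsing map), and the
# alignment is measured with an MSE over normalized attention distributions.
import torch
import torch.nn.functional as F

def human_centric_alignment_loss(attn_maps, human_token_ids, human_mask):
    """
    attn_maps:       (B, heads, H*W, num_tokens) cross-attention probabilities
                     from a chosen UNet layer (assumed already softmaxed over tokens).
    human_token_ids: indices of prompt tokens describing the human
                     (e.g. "woman", "arms"); assumed to be known from the prompt.
    human_mask:      (B, H*W) binary mask of the human region at the latent
                     resolution, assumed to come from a human-centric prior.
    """
    # Average over attention heads, then gather the maps of human-related tokens.
    attn = attn_maps.mean(dim=1)                      # (B, H*W, num_tokens)
    human_attn = attn[..., human_token_ids].sum(-1)   # (B, H*W)

    # Normalize both the attention and the target mask into spatial distributions.
    human_attn = human_attn / (human_attn.sum(dim=-1, keepdim=True) + 1e-8)
    target = human_mask / (human_mask.sum(dim=-1, keepdim=True) + 1e-8)

    # Encourage attention from human-related tokens to concentrate on the human region.
    return F.mse_loss(human_attn, target)
```

In a training loop, such a term would presumably be added to the standard denoising objective, e.g. L = L_denoise + λ·L_align, and, following the abstract's scale-aware and step-wise constraints, applied only at selected cross-attention resolutions and denoising timesteps.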
Authors:
- Junyan Wang
- Zhenhong Sun
- Zhiyu Tan
- Xuanbai Chen
- Weihua Chen
- Hao Li
- Cheng Zhang
- Yang Song