
Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation (2403.05239v1)

Published 8 Mar 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Vanilla text-to-image diffusion models struggle to generate accurate human images, commonly producing imperfect anatomies such as unnatural postures or disproportionate limbs. Existing methods address this issue mostly by fine-tuning the model with extra images or by adding additional controls -- human-centric priors such as pose or depth maps -- during the image generation phase. This paper explores integrating these human-centric priors directly into the model fine-tuning stage, essentially eliminating the need for extra conditions at inference. We realize this idea by proposing a human-centric alignment loss that strengthens human-related information from the textual prompt within the cross-attention maps. To ensure semantic detail richness and human structural accuracy during fine-tuning, we introduce scale-aware and step-wise constraints within the diffusion process, informed by an in-depth analysis of the cross-attention layers. Extensive experiments show that our method substantially improves over state-of-the-art text-to-image models in synthesizing high-quality human images from user-written prompts. Project page: \url{https://hcplayercvpr2024.github.io}.
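The core idea of the abstract -- a loss that concentrates cross-attention mass on human-related prompt tokens during fine-tuning, weighted per diffusion step -- can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name, the negative-log objective, and the `step_weight` parameter are assumptions for exposition.

```python
import numpy as np

def human_centric_alignment_loss(attn_maps, human_token_idx, step_weight=1.0, eps=1e-8):
    """Hypothetical sketch of a human-centric alignment loss.

    attn_maps: (num_pixels, num_tokens) row-softmaxed cross-attention weights
               from one cross-attention layer.
    human_token_idx: indices of human-related tokens (e.g. "woman", "arms")
                     identified in the prompt.
    step_weight: scalar emphasizing particular diffusion timesteps,
                 standing in for the paper's step-wise constraint.
    """
    attn_maps = np.asarray(attn_maps, dtype=np.float64)
    # Per-pixel attention mass assigned to human-related tokens.
    human_mass = attn_maps[:, human_token_idx].sum(axis=1)
    # Encourage high average mass on human tokens: minimizing the
    # negative log pushes the mean mass toward 1.
    return float(step_weight * -np.log(human_mass.mean() + eps))
```

In a real fine-tuning loop this term would be computed on attention maps at multiple scales (the scale-aware constraint) and added to the standard diffusion denoising loss; the sketch only shows the alignment signal itself.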

Authors (8)
  1. Junyan Wang
  2. Zhenhong Sun
  3. Zhiyu Tan
  4. Xuanbai Chen
  5. Weihua Chen
  6. Hao Li
  7. Cheng Zhang
  8. Yang Song
Citations (2)