
Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis (2402.18078v2)

Published 28 Feb 2024 in cs.CV

Abstract: The diffusion model is a promising approach to image generation and has been employed for Pose-Guided Person Image Synthesis (PGPIS) with competitive performance. While existing methods simply align the person appearance to the target pose, they are prone to overfitting due to the lack of a high-level semantic understanding of the source person image. In this paper, we propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for PGPIS. In the absence of image-caption pairs and textual prompts, we develop a novel training paradigm purely based on images to control the generation process of a pre-trained text-to-image diffusion model. A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt. This allows for the decoupling of fine-grained appearance and pose information controls at different stages, thereby circumventing the potential overfitting problem. To generate more realistic texture details, a hybrid-granularity attention module is proposed to encode multi-scale fine-grained appearance features as bias terms to augment the coarse-grained prompt. Both quantitative and qualitative experimental results on the DeepFashion benchmark demonstrate the superiority of our method over the state of the art for PGPIS. Code is available at https://github.com/YanzuoLu/CFLD.


Summary

  • The paper introduces a novel diffusion-based framework that decouples high-level appearance from pose information to improve person image synthesis.
  • It leverages a hybrid-granularity attention module to integrate multi-scale texture details, yielding superior quantitative and qualitative results.
  • The method demonstrates enhanced generalization and reduced overfitting on benchmarks like DeepFashion, indicating potential for broader digital content applications.

Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis: A Formal Overview

The paper "Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis" introduces a novel diffusion-based methodology tailored for the complex task of Pose-Guided Person Image Synthesis (PGPIS). This study builds upon the recent advancements in diffusion models, leveraging them to address the limitations inherent in traditional Generative Adversarial Network (GAN)-based approaches.

Pose-Guided Person Image Synthesis (PGPIS) is the task of generating an image of a person in a designated pose while remaining consistent with the appearance of a source image. The task has broad applicability, including film production, virtual reality, and the fashion industry. While GANs have traditionally been employed for it, they suffer from unstable min-max training and struggle to maintain high fidelity when synthesizing an image in a single generative pass. This paper therefore adopts diffusion models, which synthesize high-resolution images through a progressive, step-by-step denoising process, as an alternative framework.
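
To make the contrast with single-pass GAN generation concrete, here is a minimal, illustrative sketch of the iterative denoising that underlies DDPM-style diffusion models. The linear noise schedule, step count, and the toy `denoiser` interface are generic textbook choices, not details taken from CFLD.

```python
# Illustrative DDPM-style forward noising and iterative sampling.
# Schedule and step count are textbook defaults, not the paper's settings.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # cumulative alpha products

def add_noise(x0, t, eps):
    # Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

@torch.no_grad()
def sample(denoiser, shape):
    # Reverse process: start from pure noise and refine step by step,
    # which is what makes diffusion "progressive" rather than single-pass.
    x = torch.randn(shape)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = denoiser(x, t_batch)         # predicted noise
        ab, a, b = alpha_bars[t], alphas[t], betas[t]
        x = (x - b / (1.0 - ab).sqrt() * eps_hat) / a.sqrt()
        if t > 0:
            x = x + b.sqrt() * torch.randn_like(x)
    return x
```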

The proposed Coarse-to-Fine Latent Diffusion (CFLD) method does not rely on image-caption pairs, which text-to-image diffusion models such as Stable Diffusion ordinarily require. Instead, it introduces a training paradigm based purely on images, using a perception-refined decoder to progressively refine a set of learnable queries into a high-level semantic representation of the source person image. This representation serves as a coarse-grained prompt, enabling a strategic decoupling of high-level appearance and pose information control at different stages, and thereby avoiding the overfitting issues often encountered with alignment-centric models.
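
As a rough sketch of how such a decoder could be wired, the snippet below refines a bank of learnable queries against frozen image-encoder tokens with standard cross-attention layers. The layer counts, dimensions, and module names are assumptions made for illustration, not details of the released CFLD code.

```python
# Hypothetical sketch of a perception-refined decoder: learnable queries
# are progressively refined by cross-attending to frozen image-encoder
# tokens, yielding a coarse-grained "prompt" for a frozen diffusion U-Net.
# Layer counts, dimensions, and names are assumptions, not CFLD's code.
import torch
import torch.nn as nn

class PerceptionRefinedDecoder(nn.Module):
    def __init__(self, num_queries=77, dim=768, num_layers=4, num_heads=8):
        super().__init__()
        # Learnable queries stand in for CLIP text-token embeddings.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, image_feats):
        # image_feats: (B, N, dim) tokens from a frozen source-image encoder.
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        return self.decoder(tgt=q, memory=image_feats)

# The output would replace the text-prompt embeddings of the frozen U-Net:
#   coarse_prompt = decoder(encoder_tokens)
#   unet(latents, t, encoder_hidden_states=coarse_prompt)
```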

A core component of this approach is the hybrid-granularity attention module, which enriches the generated images with multi-scale, fine-grained texture features. Unlike conventional models, where fine details may be lost in strict alignment processes, this method complements the coarse-grained prompt with intricate textural detail, enhancing the realism of the synthesized images. Experimental results on the DeepFashion benchmark show that CFLD outperforms existing state-of-the-art models both quantitatively, on metrics such as FID, LPIPS, SSIM, and PSNR, and qualitatively, as evidenced by user studies.
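
One plausible way to realize "fine-grained features as bias terms" is to add zero-initialized projections of the multi-scale appearance features onto the keys and values of the U-Net's cross-attention, so training starts from the coarse prompt alone. The sketch below is a hypothetical rendering of that idea, not the paper's exact module; it also assumes the fine features have been resampled to the same token count as the coarse prompt.

```python
# Hypothetical biased cross-attention: fine-grained appearance features
# enter as additive key/value bias terms on top of the coarse prompt.
# Shapes and projections are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasedCrossAttention(nn.Module):
    def __init__(self, dim=320, ctx_dim=768, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v = nn.Linear(ctx_dim, dim, bias=False)
        # Zero-initialized so the fine features start as a no-op bias.
        self.k_bias = nn.Linear(ctx_dim, dim, bias=False)
        self.v_bias = nn.Linear(ctx_dim, dim, bias=False)
        nn.init.zeros_(self.k_bias.weight)
        nn.init.zeros_(self.v_bias.weight)

    def forward(self, x, coarse_prompt, fine_feats):
        # x: (B, L, dim) U-Net tokens; coarse_prompt and fine_feats:
        # (B, N, ctx_dim), assumed resampled to the same token count N.
        q = self.to_q(x)
        k = self.to_k(coarse_prompt) + self.k_bias(fine_feats)
        v = self.to_v(coarse_prompt) + self.v_bias(fine_feats)
        B, L, D = q.shape
        h, d = self.heads, D // self.heads
        q, k, v = (t.view(B, -1, h, d).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(B, L, D)
```

On the quantitative side, the four reported metrics can be computed with standard tooling. Below is a minimal example using `torchmetrics` (class names as of recent versions, with toy tensors standing in for real generated and ground-truth batches; in practice FID must be accumulated over the full test set).

```python
# Sketch of computing FID / LPIPS / SSIM / PSNR with torchmetrics.
# Assumes `pip install torchmetrics[image]`; inputs are floats in [0, 1].
import torch
from torchmetrics.image import (
    FrechetInceptionDistance, LearnedPerceptualImagePatchSimilarity,
    StructuralSimilarityIndexMeasure, PeakSignalNoiseRatio)

fake = torch.rand(16, 3, 256, 176)   # generated images (toy data here)
real = torch.rand(16, 3, 256, 176)   # ground-truth targets (toy data)

fid = FrechetInceptionDistance(normalize=True)   # accepts [0, 1] floats
fid.update(real, real=True)
fid.update(fake, real=False)

lpips = LearnedPerceptualImagePatchSimilarity(net_type='alex', normalize=True)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
psnr = PeakSignalNoiseRatio(data_range=1.0)

print('FID:', fid.compute())          # meaningful only over full test set
print('LPIPS:', lpips(fake, real))
print('SSIM:', ssim(fake, real))
print('PSNR:', psnr(fake, real))
```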

One of the implications of the paper's CFLD framework is its demonstrated capability to prevent overfitting and enhance generalization, particularly when generating images with poses significantly divergent from the training data. This advancement suggests a notable improvement in the adaptability and reliability of PGPIS models. Future research can explore extending the CFLD methodology to other domains in image synthesis, potentially enhancing applications like animation, digital content creation, and augmented reality.

Overall, the paper presents a foundational step forward for diffusion models in PGPIS, offering a novel approach that overcomes several long-standing issues related to semantic understanding and texture-detail retention. By attending to both high-level semantics and multi-scale granular attributes, the CFLD methodology may pave the way for further innovations in controllable image synthesis.
