Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis (2402.18078v2)
Abstract: Diffusion models are a promising approach to image generation and have been employed for Pose-Guided Person Image Synthesis (PGPIS) with competitive performance. While existing methods simply align the person appearance to the target pose, they are prone to overfitting due to the lack of a high-level semantic understanding of the source person image. In this paper, we propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for PGPIS. In the absence of image-caption pairs and textual prompts, we develop a novel training paradigm purely based on images to control the generation process of a pre-trained text-to-image diffusion model. A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt. This allows for the decoupling of fine-grained appearance and pose information controls at different stages, thus circumventing the potential overfitting problem. To generate more realistic texture details, a hybrid-granularity attention module is proposed to encode multi-scale fine-grained appearance features as bias terms to augment the coarse-grained prompt. Both quantitative and qualitative experimental results on the DeepFashion benchmark demonstrate the superiority of our method over the state of the art for PGPIS. Code is available at https://github.com/YanzuoLu/CFLD.
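To make the idea of progressively refining learnable queries more concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the linked repository for that). It assumes a simplified, projection-free cross-attention with a residual update, applied once per feature level. The function names, the number of queries, the feature dimension, and the random stand-in features are all hypothetical choices for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, features):
    # One residual cross-attention step: each query reads from the image tokens.
    # Simplification (assumption): no learned Q/K/V projections, single head.
    d = queries.shape[-1]
    scores = queries @ features.T / np.sqrt(d)   # (n_queries, n_tokens)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return queries + weights @ features          # residual update of the queries

def refine_queries(queries, feature_levels):
    # Refine the queries progressively, one feature level at a time.
    for features in feature_levels:
        queries = cross_attend(queries, features)
    return queries

rng = np.random.default_rng(0)
dim = 8
queries = rng.normal(size=(4, dim))              # 4 hypothetical learnable queries
# Random stand-ins for image features at two scales (16 and 64 tokens).
feature_levels = [rng.normal(size=(n, dim)) for n in (16, 64)]
prompt = refine_queries(queries, feature_levels)
print(prompt.shape)  # (4, 8)
```

In this sketch the refined queries play the role of the coarse-grained prompt that would condition the diffusion model in place of text embeddings. In a trained model the queries and attention projections would be learned parameters rather than random values.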