
Synthesizing Moving People with 3D Control (2401.10889v2)

Published 19 Jan 2024 in cs.CV and cs.AI

Abstract: In this paper, we present a diffusion model-based framework for animating people from a single image for a given target 3D motion sequence. Our approach has two core components: a) learning priors about invisible parts of the human body and clothing, and b) rendering novel body poses with proper clothing and texture. For the first part, we learn an in-filling diffusion model to hallucinate unseen parts of a person given a single image. We train this model on texture map space, which makes it more sample-efficient since it is invariant to pose and viewpoint. Second, we develop a diffusion-based rendering pipeline, which is controlled by 3D human poses. This produces realistic renderings of novel poses of the person, including clothing, hair, and plausible in-filling of unseen regions. This disentangled approach allows our method to generate a sequence of images that are faithful to the target motion in the 3D pose and to the input image in terms of visual similarity. In addition, the 3D control allows various synthetic camera trajectories to render a person. Our experiments show that our method is resilient in generating prolonged motions and varied, challenging, and complex poses compared to prior methods. Please check our website for more details: https://boyiliee.github.io/3DHM.github.io/.


Summary

  • The paper introduces the 3DHM framework that synthesizes full-body texture maps and realistic poses from a single image.
  • It employs an in-filling diffusion model to hallucinate unseen body parts and clothing, and a diffusion-based rendering pipeline that produces plausible clothing and hair motion.
  • Experimental results show that 3DHM outperforms state-of-the-art methods by achieving temporally coherent and highly detailed human animations.

Introduction

The 3DHM framework synthesizes human motion from a single static image according to a target 3D motion sequence. It works as a two-part process: it first learns to in-paint the person's texture map, and then renders the human form in novel poses with realistic detail.

Learning to Visualize the Unseen

The initial stage of 3DHM focuses on creating a complete picture, a full texture map, of an individual from a single image. A photograph typically shows only a partial view of a person, such as the front or the back, yet animating that person in 3D requires texture for the entire body. To address this, 3DHM uses an in-filling diffusion model that hallucinates the unseen portions of the individual's clothing and body from priors learned on available data. Because the model operates in texture-map (UV) space, which is invariant to human pose and viewpoint, it is sample-efficient and can learn from a comparatively limited set of examples.
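
To make this concrete, below is a minimal sketch of texture-map in-filling using an off-the-shelf inpainting pipeline from Hugging Face diffusers. Note the assumptions: the paper trains its own in-filling model on texture-map space rather than reusing a public checkpoint, and the file names `partial_texture.png` and `visibility_mask.png` are hypothetical placeholders for a partially observed UV map and its visibility mask.

```python
# Illustrative sketch only: the paper trains a dedicated in-filling
# diffusion model on texture-map (UV) space; here an off-the-shelf
# inpainting pipeline approximates the idea.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

# partial_texture.png (hypothetical): UV texture map holding only the
# pixels visible in the input photo.
# visibility_mask.png (hypothetical): white where the texture is unknown
# (back of the person, occluded regions) and must be hallucinated.
partial_texture = Image.open("partial_texture.png").convert("RGB").resize((512, 512))
visibility_mask = Image.open("visibility_mask.png").convert("L").resize((512, 512))

# The diffusion model fills the masked UV regions with plausible skin and
# clothing texture. Because UV space is pose- and view-invariant, the
# completed map can be reused unchanged for every frame of the animation.
completed = pipe(
    prompt="full-body clothing and skin texture map of a person",
    image=partial_texture,
    mask_image=visibility_mask,
).images[0]
completed.save("completed_texture.png")
```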

Rendering Realism in Motion

The second phase is where 3DHM showcases its distinctive abilities: it transforms the texture-mapped figure into lifelike poses, including plausible clothing and hair motion such as the swirl of a skirt, the ripple of a shirt, or the sway of hair. This is accomplished through a diffusion-based rendering pipeline conditioned on 3D human poses. Where many existing methods struggle to capture such nuances or require large datasets, 3DHM's 3D control lets it produce faithful, temporally coherent renderings of complex human actions.
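
The sketch below shows the general conditioning pattern rather than the paper's exact pipeline: a latent diffusion model generates each frame while guided by a per-frame rendering of the posed 3D body. As assumptions, a public OpenPose ControlNet stands in for the paper's custom 3D-pose conditioning, and the `pose_frames/` directory of conditioning images is hypothetical.

```python
# General pattern (not the paper's exact method): frame-by-frame
# pose-conditioned image generation with a ControlNet-style adapter.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

num_frames = 16  # length of the target motion sequence
frames = []
for i in range(num_frames):
    # pose_frames/ (hypothetical): per-frame conditioning images, e.g.
    # renderings of the texture-mapped 3D body in each target pose.
    pose_render = Image.open(f"pose_frames/{i:04d}.png")
    frame = pipe(
        prompt="a photorealistic person, consistent clothing and lighting",
        image=pose_render,
        num_inference_steps=30,
    ).images[0]
    frames.append(frame)
```

Conditioning on renderings of the posed 3D body, rather than on 2D keypoints alone, is what enables the 3D control the abstract describes, including rendering the same person along synthetic camera trajectories.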

Experimental Validation and Results

The validation of 3DHM involved extensive testing against other state-of-the-art methods. The framework was trained and tested on a large collection of human videos and evaluated with several standard image- and video-quality metrics. On these metrics, 3DHM outperformed prior methods at generating both individual frames and entire motion sequences, capturing fine details of appearance and pose. Notably, even on previously unseen images and videos, including clips taken from YouTube, it animated the depicted person with impressive realism.
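
As an illustration of frame-level evaluation in this setting, the sketch below averages two standard image-quality metrics, PSNR and SSIM, over aligned generated and ground-truth frames; it is not the paper's full evaluation protocol, and the function name is hypothetical.

```python
# Minimal frame-level evaluation sketch: average PSNR and SSIM across a
# generated sequence and its ground-truth reference.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_sequence(generated, reference):
    """Average PSNR/SSIM over aligned uint8 RGB frames (H x W x 3 arrays)."""
    psnrs, ssims = [], []
    for gen, ref in zip(generated, reference):
        psnrs.append(peak_signal_noise_ratio(ref, gen, data_range=255))
        ssims.append(structural_similarity(ref, gen, channel_axis=-1, data_range=255))
    return float(np.mean(psnrs)), float(np.mean(ssims))
```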

Technological Implications and Future Directions

3DHM does more than demonstrate a remarkable ability to synthesize human motion from still imagery; it represents a step forward in animating photographs and generating realistic video. Limitations remain: consistency across a long video sequence is not guaranteed, and unique attributes such as logos on clothing are not perfectly reproduced, owing to limitations of the training data. Even so, 3DHM's ability to operate with limited data and its advances in realistic rendering position it as a potential game-changer in areas ranging from virtual reality to film production. Its methodological approach sidesteps the need for vast data troves, leveraging 3D rendering and machine learning in creative new ways and offering a glimpse into the future of human animation technology.
