DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans (2404.00485v1)

Published 30 Mar 2024 in cs.CV

Abstract: We present DiffHuman, a probabilistic method for photorealistic 3D human reconstruction from a single RGB image. Despite the ill-posed nature of this problem, most methods are deterministic and output a single solution, often resulting in a lack of geometric detail and blurriness in unseen or uncertain regions. In contrast, DiffHuman predicts a probability distribution over 3D reconstructions conditioned on an input 2D image, which allows us to sample multiple detailed 3D avatars that are consistent with the image. DiffHuman is implemented as a conditional diffusion model that denoises pixel-aligned 2D observations of an underlying 3D shape representation. During inference, we may sample 3D avatars by iteratively denoising 2D renders of the predicted 3D representation. Furthermore, we introduce a generator neural network that approximates rendering with considerably reduced runtime (55x speed up), resulting in a novel dual-branch diffusion framework. Our experiments show that DiffHuman can produce diverse and detailed reconstructions for the parts of the person that are unseen or uncertain in the input image, while remaining competitive with the state-of-the-art when reconstructing visible surfaces.


Summary

  • The paper introduces a probabilistic diffusion model that generates diverse, high-fidelity 3D human reconstructions from a single RGB image.
  • It employs a dual-branch diffusion framework in which a generator network approximates rendering, yielding a roughly 55x speed-up over explicit rendering while preserving geometric detail and color accuracy.
  • Empirical results show that DiffHuman produces diverse, detailed reconstructions of unseen or uncertain regions while remaining competitive with the state of the art on visible surfaces, supporting applications such as VR, gaming, and digital content production.

Unveiling DiffHuman: A Probabilistic Approach to Photorealistic 3D Human Reconstruction from a Single Image

Introduction

The task of photorealistic 3D human reconstruction from a single RGB image presents an ill-posed yet vital challenge for various applications, including virtual and mixed reality, gaming, and digital content production. Traditional deterministic methods approach this problem by generating a single solution, which can lead to blurred details and inaccurate reconstructions, particularly in regions that are not visible in the input image. This paper introduces a novel probabilistic method, termed DiffHuman, which leverages conditional denoising diffusion models to generate multiple plausible and detailed 3D human reconstructions from a single input image, significantly enhancing the quality and diversity of the generated models.

Key Contributions

DiffHuman's methodological contributions can be summarized as follows:

  • Introduction of a probabilistic diffusion model tailored for the photorealistic 3D reconstruction of humans, capable of generating a distribution of plausible reconstructions conditioned on a single input image.
  • A novel dual-branch diffusion framework that integrates an image generation network, reducing runtime by a factor of roughly 55 compared to explicitly rendering the 3D representation at every denoising step (a minimal sketch of this idea follows the list).
  • Empirical demonstrations of improved geometric detail and color accuracy in unseen or uncertain regions of the person in the input image, while remaining competitive with state-of-the-art methods on visible surfaces.
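
To make the dual-branch idea concrete, the following is a minimal sketch, not the authors' implementation. At each reverse-diffusion step the model needs a clean estimate of the pixel-aligned 2D observations (e.g. normals, depth, albedo): the slow branch obtains it by reconstructing an implicit 3D shape and rendering it, while the fast branch approximates that render with a feed-forward generator. All function names (`denoiser`, `reconstruct_3d`, `render`, `generator`) and the DDIM-style update are assumptions for illustration.

```python
import torch

def ddim_step(x_t, t_cur, t_prev, image, denoiser, reconstruct_3d, render,
              generator, alphas_cumprod, use_generator=True):
    """One reverse-diffusion step over pixel-aligned 2D observations x_t.
    All callables are hypothetical placeholders, not the paper's API."""
    # Predict the clean observation set (e.g. normals, depth, albedo).
    x0_raw = denoiser(x_t, t_cur, image)

    if use_generator:
        # Fast branch: a learned generator approximates the render of the
        # 3D shape implied by x0_raw (the source of the reported ~55x speed-up).
        x0_obs = generator(x0_raw, image)
    else:
        # Slow branch: recover an implicit 3D shape and render it, keeping the
        # denoised observations consistent with a single underlying 3D avatar.
        shape_3d = reconstruct_3d(x0_raw, image)
        x0_obs = render(shape_3d)

    # Deterministic DDIM-style update from timestep t_cur to t_prev.
    a_cur, a_prev = alphas_cumprod[t_cur], alphas_cumprod[t_prev]
    eps = (x_t - a_cur.sqrt() * x0_obs) / (1.0 - a_cur).sqrt()
    return a_prev.sqrt() * x0_obs + (1.0 - a_prev).sqrt() * eps
```

How often each branch is invoked during training and sampling is a design choice of the paper that this sketch does not reproduce.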

Method Overview

At the heart of DiffHuman is a conditional diffusion model that maps noisy, pixel-aligned 2D observations of an underlying 3D representation to their clean (denoised) counterparts. Inference proceeds through a series of reverse diffusion steps that iteratively refine the prediction. Unlike existing models that rely predominantly on deterministic one-to-one mappings, DiffHuman treats reconstruction as a distribution prediction task: diverse 3D avatars can be sampled that remain faithful to the visible parts of the input while varying plausibly in regions that are not directly observable.
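
As an illustration of how sampling could work under the assumed interface above (again a sketch under stated assumptions, not the paper's code), the loop below starts from Gaussian noise and iteratively denoises the observation set conditioned on the input image; running it with different random seeds yields distinct avatars that agree on visible surfaces but differ in unseen regions. It reuses the hypothetical `ddim_step` from the previous sketch.

```python
import torch

@torch.no_grad()
def sample_avatars(image, denoiser, reconstruct_3d, render, generator,
                   alphas_cumprod, obs_shape, timesteps, n_samples=4):
    """Draw several plausible 3D avatars for one input image.
    Relies on the hypothetical `ddim_step` defined in the earlier sketch."""
    avatars = []
    for _ in range(n_samples):
        # Each sample starts from independent noise, so unseen body regions
        # (e.g. the back of the person) can be completed differently.
        x_t = torch.randn(obs_shape)
        # `timesteps` is a decreasing schedule, e.g. [999, 950, ..., 0].
        for t_cur, t_prev in zip(timesteps[:-1], timesteps[1:]):
            x_t = ddim_step(x_t, t_cur, t_prev, image, denoiser,
                            reconstruct_3d, render, generator,
                            alphas_cumprod, use_generator=True)
        # Lift the final, fully denoised observation set to an explicit
        # 3D reconstruction (e.g. an implicit surface plus texture).
        avatars.append(reconstruct_3d(x_t, image))
    return avatars
```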

Implications and Future Directions

The probabilistic nature of DiffHuman introduces a paradigm shift in 3D human reconstruction. The ability to produce multiple detailed avatars from a single 2D image opens up new avenues for personalized content creation and interactive applications where user engagement benefits from variability and realism. Moreover, the efficiency gains from the dual-branch diffusion framework suggest that such probabilistic models can be feasibly integrated into real-time applications, breaking the computational barriers associated with high-quality 3D reconstruction.

Looking ahead, refining the diffusion process and integrating other modalities (e.g., text descriptions or partial depth information) could further improve the model's applicability and accuracy. Additionally, investigating unsupervised or semi-supervised training regimes might reduce the reliance on large-scale annotated datasets and expand the range of reconstructable human poses and appearances.

Conclusion

DiffHuman represents a significant step forward in the photorealistic 3D reconstruction of humans from monocular images. By embracing a probabilistic modeling approach, it not only addresses the intrinsic ambiguities present in this task but also enriches the toolbox available for digital human modeling with a flexible and efficient solution. Moving forward, the continued development in this direction promises to further bridge the gap between the virtual and real, enriching digital experiences with more lifelike human avatars.
