Deformable 3D Gaussian Splatting for Animatable Human Avatars (2312.15059v1)

Published 22 Dec 2023 in cs.CV and cs.AI

Abstract: Recent advances in neural radiance fields enable novel view synthesis of photo-realistic images in dynamic settings, which can be applied to scenarios with human animation. Commonly used implicit backbones to establish accurate models, however, require many input views and additional annotations such as human masks, UV maps and depth maps. In this work, we propose ParDy-Human (Parameterized Dynamic Human Avatar), a fully explicit approach to construct a digital avatar from as little as a single monocular sequence. ParDy-Human introduces parameter-driven dynamics into 3D Gaussian Splatting where 3D Gaussians are deformed by a human pose model to animate the avatar. Our method is composed of two parts: a first module that deforms canonical 3D Gaussians according to SMPL vertices and a consecutive module that further takes their designed joint encodings and predicts per-Gaussian deformations to deal with dynamics beyond SMPL vertex deformations. Images are then synthesized by a rasterizer. ParDy-Human constitutes an explicit model for realistic dynamic human avatars that requires significantly fewer training views and images. Our avatar learning is free of additional annotations such as masks and can be trained with variable backgrounds while inferring full-resolution images efficiently even on consumer hardware. We provide experimental evidence to show that ParDy-Human outperforms state-of-the-art methods on ZJU-MoCap and THUman4.0 datasets both quantitatively and visually.


Summary

  • The paper introduces ParDy-Human, a fully explicit method that animates human avatars via deformable 3D Gaussian splatting driven by a parametric pose model.
  • It reduces input requirements: canonical Gaussians are first deformed according to SMPL vertices, then a second module predicts per-Gaussian deformations from designed joint encodings.
  • Experiments demonstrate efficient full-resolution inference on consumer-grade hardware, while noting potential artifacts on uniformly colored garments.

Introduction to 3D Avatars and Rendering

Creating realistic 3D human avatars from images is a central task in visual media, with applications spanning animation, virtual reality, and interactive gaming. Traditionally, generating animatable avatars has required numerous camera viewpoints and additional annotations such as human masks, UV maps, and depth maps.

Pioneering a New Avatar Approach

The paper introduces ParDy-Human, a fully explicit approach to generating animatable human avatars from as little as a single monocular sequence. Whereas existing solutions often depend on dense camera views and complex annotations, ParDy-Human requires considerably fewer inputs. It does so by introducing deformable 3D Gaussian Splatting, in which 3D Gaussians are deformed by a human pose model to animate the avatar. The method comprises two parts: a first module deforms canonical 3D Gaussians according to the vertices of the SMPL (Skinned Multi-Person Linear) model, while a second module takes designed joint encodings and predicts per-Gaussian deformations that account for dynamics beyond SMPL vertex manipulation. Images are then synthesized by a rasterizer.
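To make the two-stage design concrete, below is a minimal PyTorch sketch of the deformation pipeline under stated assumptions: plain linear blend skinning for the SMPL-driven stage, an additive positional offset predicted by a small MLP for the second stage, and placeholder names throughout (PerGaussianDeformer, deform_gaussians, joint_enc). The paper's actual architecture, its joint-encoding format, and its treatment of Gaussian rotations and covariances are not reproduced here.

```python
import torch
import torch.nn as nn

class PerGaussianDeformer(nn.Module):
    """Second stage (hypothetical sketch): an MLP mapping a Gaussian's
    canonical position plus a pose-dependent joint encoding to a residual
    offset. Layer widths and the encoding format are assumptions, not the
    paper's published architecture."""
    def __init__(self, joint_enc_dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + joint_enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # per-Gaussian positional offset
        )

    def forward(self, canon_xyz, joint_enc):
        # canon_xyz: (N, 3) canonical Gaussian centers
        # joint_enc: (N, E) per-Gaussian joint encodings
        return self.mlp(torch.cat([canon_xyz, joint_enc], dim=-1))


def deform_gaussians(canon_xyz, skin_weights, bone_transforms, joint_enc, refiner):
    """Stage 1: pose canonical Gaussian centers with SMPL bone transforms via
    linear blend skinning; Stage 2: add the MLP-predicted residual to capture
    dynamics beyond the SMPL vertex deformation."""
    # (N, B) skinning weights x (B, 4, 4) bone transforms -> (N, 4, 4)
    T = torch.einsum('nb,bij->nij', skin_weights, bone_transforms)
    ones = torch.ones_like(canon_xyz[:, :1])
    posed = torch.einsum('nij,nj->ni', T, torch.cat([canon_xyz, ones], dim=-1))
    return posed[:, :3] + refiner(canon_xyz, joint_enc)
```

The posed centers (together with correspondingly transformed covariances, omitted above) would then be handed to a differentiable Gaussian rasterizer to synthesize the image.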

Model Training and Efficiency

ParDy-Human can be trained without human segmentation masks and with variable backgrounds, using significantly fewer camera views than previous methods. Experimental evidence demonstrates that it generates realistic avatars from both densely and sparsely captured input images. Notably, the method renders full-resolution images efficiently even on consumer-grade hardware.
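To illustrate why segmentation masks are unnecessary, here is a minimal hypothetical training step, assuming the renderer composites the avatar over the background and returns a full frame so the loss can be taken against the raw video frame directly. All names are placeholders, and the plain L1 photometric loss is a simplification: the standard 3D Gaussian Splatting recipe also mixes in a D-SSIM term.

```python
import torch

def training_step(render_fn, optimizer, gaussians, camera, pose_params, gt_frame):
    """Hypothetical mask-free training step. The renderer is assumed to
    produce a full frame (avatar composited over the scene background), so
    the loss is computed against the raw video frame with no segmentation
    mask. render_fn, gaussians, camera and pose_params are placeholders."""
    optimizer.zero_grad()
    pred = render_fn(gaussians, camera, pose_params)  # (3, H, W), values in [0, 1]
    loss = (pred - gt_frame).abs().mean()             # plain L1 photometric loss
    loss.backward()
    optimizer.step()
    return loss.item()
```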

Innovations and Contributions

This work's contributions are manifold: deformable 3D Gaussian splatting yields a parameterized, fully explicit representation for dynamic human avatar animation with reduced training-data needs, and it significantly accelerates inference, producing full-resolution renderings quickly even on consumer hardware. The paper also acknowledges limitations, such as potential artifacts on uniformly colored garments, and the ethical concerns associated with digital human replication. The proposed framework nonetheless offers a novel pathway for producing animatable human representations, setting the stage for future research and applications in avatar generation and visual effects.
