GVA: Reconstructing Vivid 3D Gaussian Avatars from Monocular Videos (2402.16607v2)

Published 26 Feb 2024 in cs.CV

Abstract: In this paper, we present a novel method that facilitates the creation of vivid 3D Gaussian avatars from monocular video inputs (GVA). Our innovation lies in addressing the intricate challenges of delivering high-fidelity human body reconstructions and aligning 3D Gaussians with human skin surfaces accurately. The key contributions of this paper are twofold. Firstly, we introduce a pose refinement technique to improve hand and foot pose accuracy by aligning normal maps and silhouettes. Precise pose is crucial for correct shape and appearance reconstruction. Secondly, we address the problems of unbalanced aggregation and initialization bias that previously diminished the quality of 3D Gaussian avatars, through a novel surface-guided re-initialization method that ensures accurate alignment of 3D Gaussian points with avatar surfaces. Experimental results demonstrate that our proposed method achieves high-fidelity and vivid 3D Gaussian avatar reconstruction. Extensive experimental analyses validate the performance qualitatively and quantitatively, demonstrating that it achieves state-of-the-art performance in photo-realistic novel view synthesis while offering fine-grained control over the human body and hand pose. Project page: https://3d-aigc.github.io/GVA/.

Summary

  • The paper presents a novel approach combining pose refinement and surface-guided Gaussian re-initialization to enhance 3D avatar fidelity.
  • It utilizes normal maps, silhouette cues, and resampling techniques to improve accuracy in challenging regions like hands and feet.
  • Evaluations on datasets such as ZJU-MoCap and People-Snapshot demonstrate superior photorealism and efficient pose control compared to previous methods.

An Examination of "GVA: Reconstructing Vivid 3D Gaussian Avatars from Monocular Videos"

The paper "GVA: Reconstructing Vivid 3D Gaussian Avatars from Monocular Videos" introduces a method that addresses key challenges in generating high-fidelity 3D avatars from monocular video inputs. The authors propose innovations targeting the alignment accuracy of 3D Gaussians with human skin surfaces, tackling issues of pose accuracy and unbalanced Gaussian point distributions. The work builds upon recent advances in 3D Gaussian splatting and neural radiance fields to improve visual rendering quality and computational efficiency in avatar reconstruction.

Methodology Overview

The core contribution of the paper is a novel approach to building 3D Gaussian avatars, consisting primarily of two key enhancements: pose refinement and surface-guided Gaussian point re-initialization.

  1. Pose Refinement: This component increases the precision of hand and foot poses by aligning rendered normal maps and silhouettes with cues extracted from the input frames. By leveraging these auxiliary signals, the method refines the initial pose estimates obtained from off-the-shelf estimators, reducing the alignment errors that commonly occur in intricate regions such as the hands and feet (see the first sketch after this list).
  2. Surface-Guided Gaussian Re-Initialization: To counteract unbalanced aggregation and initialization bias, the authors apply a resampling step guided by the surface mesh of the human model, iteratively redistributing Gaussian points to cover the target surface more evenly and thereby mitigating artifacts when avatars undergo novel pose transformations (see the second sketch after this list).
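The paper does not ship reference code, so the following is only a minimal PyTorch sketch of the pose refinement idea under stated assumptions: `render_fn` is a hypothetical differentiable renderer mapping body pose and shape parameters to a normal map and silhouette, and `target_normals`/`target_mask` stand in for the off-the-shelf normal and segmentation predictions.

```python
import torch

def refine_pose(pose, betas, render_fn, target_normals, target_mask,
                steps=200, lr=1e-3, w_normal=1.0, w_sil=1.0):
    """Illustrative pose refinement: nudge pose parameters so the rendered
    normal map and silhouette agree with predicted cues. `render_fn` is a
    hypothetical differentiable renderer returning (normals HxWx3, sil HxW)."""
    pose = pose.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        normals, sil = render_fn(pose, betas)
        # Normal alignment: L1 on pixels covered by both silhouettes.
        overlap = (sil * target_mask).unsqueeze(-1)
        loss_normal = (overlap * (normals - target_normals).abs()).mean()
        # Silhouette alignment: L1 between rendered and predicted masks.
        loss_sil = (sil - target_mask).abs().mean()
        (w_normal * loss_normal + w_sil * loss_sil).backward()
        opt.step()
    return pose.detach()
```

The loss weighting and optimizer here are guesses; the point is that both cues enter a single differentiable objective over the pose parameters.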
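Similarly, a minimal sketch of surface-guided re-initialization, assuming it amounts to area-weighted uniform sampling on the avatar mesh with attributes carried over from the nearest existing Gaussians (the nearest-neighbor transfer is an illustrative simplification, not the authors' exact procedure):

```python
import numpy as np
from scipy.spatial import cKDTree

def resample_gaussians_on_surface(verts, faces, old_centers, old_attrs, n_samples):
    """Re-seed Gaussian centers uniformly (by area) on the mesh surface,
    then copy each new point's attributes from its nearest old Gaussian."""
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    # Triangle areas -> sampling probabilities, so coverage is uniform
    # over the surface rather than clumped on small triangles.
    areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
    tri = np.random.choice(len(faces), size=n_samples, p=areas / areas.sum())
    # Uniform barycentric coordinates inside each selected triangle.
    r1, r2 = np.random.rand(n_samples, 1), np.random.rand(n_samples, 1)
    s = np.sqrt(r1)
    new_centers = (1 - s) * v0[tri] + s * (1 - r2) * v1[tri] + s * r2 * v2[tri]
    # Transfer attributes (color, opacity, scale, ...) from nearest old points.
    _, idx = cKDTree(old_centers).query(new_centers)
    return new_centers, old_attrs[idx]
```

In the paper this redistribution is applied iteratively during training, so the Gaussians stay anchored to the skin surface rather than drifting or clumping.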

Results and Implications

The method, tested extensively on datasets like ZJU-MoCap and People-Snapshot, has been shown to produce avatars with enhanced fidelity and rendering performance. Quantitative metrics including PSNR, SSIM, and LPIPS demonstrate the proposed method's superiority over existing NeRF-based and Gaussian splatting-based approaches, particularly in terms of rendering photorealistic avatars and efficient pose control.
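For context, PSNR (the metric most directly tied to reconstruction error) is a fixed function of per-pixel MSE; a minimal reference implementation using the standard formulation (not the paper's evaluation code) looks like this:

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB between a rendered image and its
    ground truth, both with values in [0, max_val]; higher is better."""
    mse = np.mean((np.asarray(img, np.float64) - np.asarray(ref, np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```

SSIM and LPIPS are typically computed with standard libraries (e.g., scikit-image's `structural_similarity` and the `lpips` package); PSNR and SSIM are higher-is-better, while LPIPS is a learned perceptual distance where lower is better.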

These results carry implications for both practice and future research. Practically, such advances matter for fields like virtual reality, digital broadcasting, and virtual try-on, where lifelike avatars can enhance user experience and engagement. Theoretically, the paper suggests directions for making neural representation models more robust to diverse and complex dynamic poses.

Speculations on Future Developments

Given the current trajectory of AI development, a few speculative directions can be sketched. As computing resources continue to evolve, the integration of physics-informed models with real-time rendering capabilities will likely be a key area of growth for methods like this. Moreover, advances in neural rendering and learning-driven optimization may further reduce the computational cost of avatar generation while enhancing the realism of synthetic characters.

Additionally, incorporating semantic understanding and intuitive interaction capabilities into these models could pave the way for more interactive and adaptive virtual beings. This would not only broaden the usability of avatars in interactive media and simulations but would also raise new questions about representation and identity in digital spaces.

In conclusion, the approach outlined in the paper marks a significant advance in 3D avatar reconstruction from monocular videos, offering a blend of real-time efficiency and high-fidelity reproduction that aligns with the growing demands of digital media applications. As the technology progresses, such representations will become increasingly integral in bridging the gap between human perception and digital interaction.
