Animatable Implicit Neural Representations for Creating Realistic Avatars from Videos (2203.08133v4)

Published 15 Mar 2022 in cs.CV

Abstract: This paper addresses the challenge of reconstructing an animatable human model from a multi-view video. Some recent works have proposed to decompose a non-rigidly deforming scene into a canonical neural radiance field and a set of deformation fields that map observation-space points to the canonical space, thereby enabling them to learn the dynamic scene from images. However, they represent the deformation field as a translational vector field or an SE(3) field, which makes the optimization highly under-constrained. Moreover, these representations cannot be explicitly controlled by input motions. Instead, we introduce a pose-driven deformation field based on the linear blend skinning algorithm, which combines the blend weight field and the 3D human skeleton to produce observation-to-canonical correspondences. Since 3D human skeletons are more observable, they can regularize the learning of the deformation field. Moreover, the pose-driven deformation field can be controlled by input skeletal motions to generate new deformation fields to animate the canonical human model. Experiments show that our approach significantly outperforms recent human modeling methods. The code is available at https://zju3dv.github.io/animatable_nerf/.

Citations (18)

Summary

  • The paper introduces a pose-driven deformation field integrated with a canonical neural radiance field to enable explicit skeletal motion control.
  • It leverages neural blend weight fields and signed distance fields to enhance geometry learning and reduce noise in the avatar reconstruction process.
  • Experimental results on datasets like Human3.6M and ZJU-MoCap demonstrate superior image synthesis and 3D shape generation with higher PSNR and SSIM values.

Animatable Implicit Neural Representations for Creating Realistic Avatars from Videos

The paper explores reconstructing animatable human models from multi-view videos using implicit neural representations. This approach addresses existing challenges in rendering realistic avatars without relying on complex hardware setups or extensive manual intervention, which are typical of traditional modeling pipelines.

Core Methodological Contributions

At the heart of the paper is the decomposition of a dynamically deforming human body into two components: a canonical neural radiance field and a pose-driven deformation field. This framework contrasts with existing methods that represent the deformation as a translational vector field or an SE(3) field, which are prone to under-constrained optimization and cannot be explicitly controlled by input motions. The proposed methodology addresses these shortcomings in three ways:

  1. Pose-Driven Deformation Field: By leveraging the linear blend skinning (LBS) algorithm, the paper introduces a deformation field that combines a blend weight field with the 3D human skeleton. This not only establishes observation-to-canonical correspondences but also allows explicit skeletal motion control to animate the canonical model (a minimal code sketch follows this list).
  2. Neural and Pose-Dependent Fields: The paper optimizes neural blend weight fields to represent deformations that the SMPL model alone cannot capture, particularly clothing and other non-rigid effects.
  3. Implicit Neural Representations: Utilizing signed distance fields (SDFs) together with the canonical neural field, the method improves geometry learning by providing a well-defined zero-level surface that reduces noise in the recovered geometry.
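
To ground the deformation model, here is a minimal PyTorch sketch of the inverse linear blend skinning step. The `BlendWeightField` module, its sizes, and the helper names are hypothetical stand-ins for the paper's learned blend weight field; the bone transforms would come from the posed 3D skeleton. This is an illustrative sketch under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BlendWeightField(nn.Module):
    """Hypothetical MLP mapping a 3D point to skinning weights over
    K bones (a simplified stand-in for the learned blend weight field)."""
    def __init__(self, num_bones: int = 24, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_bones),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Softmax keeps the weights positive and summing to one.
        return torch.softmax(self.mlp(x), dim=-1)

def observation_to_canonical(x_obs, bone_transforms, weight_field):
    """Warp observation-space points to the canonical space by
    inverting the blended bone transforms (inverse LBS).

    x_obs:           (N, 3) sampled points in the observation space.
    bone_transforms: (K, 4, 4) canonical-to-observation transforms
                     derived from the posed skeleton.
    """
    w = weight_field(x_obs)                                    # (N, K)
    blended = torch.einsum('nk,kij->nij', w, bone_transforms)  # (N, 4, 4)
    ones = torch.ones_like(x_obs[:, :1])
    x_h = torch.cat([x_obs, ones], dim=-1)                     # (N, 4)
    x_can = torch.einsum('nij,nj->ni', torch.inverse(blended), x_h)
    return x_can[:, :3]

# Illustrative usage with K = 24 bones at the identity pose,
# where canonical and observation spaces coincide:
field = BlendWeightField(num_bones=24)
x_obs = torch.randn(1024, 3)
G = torch.eye(4).repeat(24, 1, 1)
x_can = observation_to_canonical(x_obs, G, field)  # (1024, 3)
```

In the paper, the learned weights are additionally regularized toward body-model (SMPL) skinning weights, which is part of why the skeleton-driven field is better constrained than a free-form translational or SE(3) field, and why new skeletal poses can directly drive new deformation fields.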

Experimental Validation

The experimental sections establish that the proposed method surpasses contemporary human modeling techniques on datasets such as Human3.6M, MonoCap, and ZJU-MoCap. Key results demonstrate superior performance in both image synthesis and 3D reconstruction tasks. Notably, the method attains significant accuracy improvements in novel view synthesis and 3D shape generation under novel human poses, as indicated by higher PSNR and SSIM values compared to baselines.
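
As a reference for the reported metrics, the sketch below shows how PSNR and SSIM are commonly computed with scikit-image; it reflects standard evaluation practice rather than the paper's exact evaluation script.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_view(pred: np.ndarray, gt: np.ndarray):
    """PSNR/SSIM for one rendered view; pred and gt are (H, W, 3)
    float arrays scaled to [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    return psnr, ssim
```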

Theoretical and Practical Implications

The implications of this research span practical implementations in areas requiring high-fidelity human models like video games, virtual reality, and telepresence systems. Theoretically, the integration of SDF with implicit neural representations offers a promising avenue for more stable and defined geometry learning. Additionally, pose-driven deformation fields provide a robust framework for representing complex deformations without the need for extensive manual adjustments.
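
To illustrate why an SDF yields more stable geometry than a raw density field, one widely used formulation (the Laplace-CDF transform from VolSDF; the paper's exact variant may differ) converts signed distance into volume density so that the zero-level set acts as a sharp surface during volume rendering:

```python
import torch

def sdf_to_density(sdf: torch.Tensor, beta: float = 0.01) -> torch.Tensor:
    """Laplace-CDF mapping from signed distance to density (VolSDF-style).

    Points inside the surface (sdf < 0) approach the maximum density
    1/beta; points outside decay smoothly to zero, so the zero-level
    set defines a crisp, well-regularized surface.
    """
    alpha = 1.0 / beta
    half = 0.5 * torch.exp(-sdf.abs() / beta)
    psi = torch.where(sdf >= 0, half, 1.0 - half)
    return alpha * psi
```

Because density is a monotone function of signed distance, the noisy floaters that plague free-form density fields are suppressed, which matches the stability benefit the paper attributes to SDF-based geometry.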

Future Directions

The paper opens several avenues for future research, notably generalizing the learned neural representations across different subjects and speeding up model convergence. More efficient ways of learning non-rigid deformations, particularly complex garment deformations, also remain a promising area for further innovation.

Conclusion

The proposed method marks an advance in creating animatable avatars by addressing fundamental limitations of current implicit modeling techniques. Through novel mathematical formulations and architectural choices, the work improves both the rendering quality and the efficiency of video-based human modeling. The insights gathered here could inform subsequent research in computer vision and graphics, particularly on animatable neural representations.