- The paper introduces A-NeRF, a neural radiance field that learns human shape, pose, and appearance from monocular videos without relying on 3D templates.
- It uses a novel skeleton-relative encoding that maps 3D queries to bone-relative coordinates, enabling joint refinement of pose along with shape and appearance.
- Experimental validation across multiple datasets shows improvements in metrics like PA-MPJPE, demonstrating robust performance in modeling complex articulated motions.
A-NeRF: Articulated Neural Radiance Fields for Learning Human Shape, Appearance, and Pose
The paper introduces Articulated Neural Radiance Fields (A-NeRF), a novel approach that extends the Neural Radiance Fields (NeRF) framework to model time-varying, articulated human motion from unlabeled monocular videos. The method learns human body shape, appearance, and pose without requiring ground-truth 3D labels or prior geometric templates, removing a dependency that limits many existing 3D human models.
Core Contributions and Approach
A-NeRF integrates a skeleton-based representation to model the body's articulated nature. The work builds on NeRF, which parameterizes a static scene as a Multi-Layer Perceptron (MLP) mapping a 3D position and view direction to density and color, and renders novel viewpoints via volumetric ray integration. This implicit representation of shape and color sidesteps the resolution and topology limitations of volumetric grids and mesh-based representations.
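To make the rendering step concrete, the sketch below shows the standard NeRF compositing quadrature applied to per-sample densities and colors along a single ray. This is a minimal illustration under generic assumptions, not the authors' implementation, and the function names are our own.

```python
import numpy as np

def volume_render(densities, colors, deltas):
    """Composite per-sample density/color along a ray (NeRF quadrature).

    densities: (N,) non-negative sigma at each sample
    colors:    (N, 3) RGB at each sample
    deltas:    (N,) distance between adjacent samples
    """
    alpha = 1.0 - np.exp(-densities * deltas)                      # opacity per segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance to each sample
    weights = alpha * trans                                        # contribution per sample
    return (weights[:, None] * colors).sum(axis=0)                 # expected ray color

# Toy usage: 64 random samples along one ray (placeholder values)
n = 64
rgb = volume_render(np.random.rand(n), np.random.rand(n, 3), np.full(n, 0.05))
```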
A-NeRF introduces a skeleton-relative encoding that contrasts with surface-based explicit models such as SMPL, which rely on forward kinematics and skinning weights. Instead of assigning each world-space point to a single bone, an ill-posed inverse problem under noisy poses, the encoding re-expresses every 3D query position and view direction relative to each bone of the articulated skeleton. This overcomplete reparameterization makes the model robust to noise in the initial pose estimates and eliminates the need for a pre-existing template.
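A minimal sketch of this coordinate change, assuming each bone carries a world-from-bone rigid transform; the paper's full embedding also includes distance and direction cues, so treat this as an illustration of the re-expression step only.

```python
import numpy as np

def skeleton_relative(query, bone_rotations, bone_translations):
    """Express a world-space query point in every bone's local frame.

    query:             (3,) point in world coordinates
    bone_rotations:    (B, 3, 3) world-from-bone rotation matrices
    bone_translations: (B, 3) bone origins in world coordinates
    Returns (B, 3): the query re-expressed relative to each of the B bones,
    so the MLP sees pose-stable local coordinates instead of world ones.
    """
    # x_local = R^T (x_world - t) for each bone
    return np.einsum('bij,bj->bi',
                     bone_rotations.transpose(0, 2, 1),
                     query[None] - bone_translations)

# Illustrative usage with 24 bones (SMPL-like count, identity poses)
feats = skeleton_relative(np.zeros(3),
                          np.tile(np.eye(3), (24, 1, 1)),
                          np.random.rand(24, 3))  # -> (24, 3)
```

Because every bone contributes its own local view of the query, the network can learn which bones matter where, rather than committing to a hard point-to-bone assignment up front.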
Experimental Validation and Results
The authors validate A-NeRF on multiple datasets, including Human3.6M, MPI-INF-3DHP, and real-world monocular sequences from MonoPerfCap. They report improvements in PA-MPJPE (Procrustes-aligned mean per-joint position error), particularly for hard-to-estimate joints such as wrists and ankles, with consistent gains over baselines like SPIN. Refining the initial SPIN pose estimates reduces PA-MPJPE by a clear margin, indicating that the model can improve articulation from 2D input alone.
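For reference, PA-MPJPE first aligns the prediction to the ground truth with the optimal similarity transform (Umeyama/Procrustes) and then averages per-joint errors; a self-contained sketch:

```python
import numpy as np

def pa_mpjpe(pred, gt):
    """Procrustes-aligned mean per-joint position error.

    pred, gt: (J, 3) predicted and ground-truth joint positions.
    Finds the similarity transform (scale, rotation, translation) that
    best aligns pred to gt, then averages per-joint Euclidean error.
    """
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation via SVD of the cross-covariance (Kabsch/Umeyama)
    u, s, vt = np.linalg.svd(p.T @ g)
    d = np.sign(np.linalg.det(vt.T @ u.T))      # guard against reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    scale = (s * [1.0, 1.0, d]).sum() / (p ** 2).sum()
    aligned = scale * p @ r.T + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()
```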
Visual quality is assessed with PSNR and SSIM, showing higher rendering quality than prior methods such as NeuralBody, especially on sequences with complex motions. Notably, the model performs volumetric reconstruction from monocular video, learning detailed geometry without surface meshes or multi-view calibration, a setting that is traditionally challenging for single-view methods.
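PSNR, one of the two reported image metrics, is straightforward to compute (SSIM is typically taken from a library such as skimage.metrics.structural_similarity); a minimal version:

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio between a rendering and a reference frame.

    img, ref: float arrays in [0, max_val], same shape (H, W, 3).
    Higher is better; identical images give +inf.
    """
    mse = np.mean((img - ref) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```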
Implications and Future Directions
A-NeRF's capacity to learn detailed geometry and articulated motion from 2D videos suggests several practical applications, from motion capture systems to VR/AR content creation. The skeleton-relative encoding opens avenues for implicit models to handle articulated motion more naturally and invites further exploration of neural radiance fields in dynamic scenes.
The authors suggest extending the approach beyond per-subject models to generalize across multiple individuals, reducing the need for extensive subject-specific data. As a further direction, combining dynamic neural rendering with physically based lighting models would improve realism under varying illumination.
Conclusion
The A-NeRF framework represents a significant stride in leveraging neural rendering for motion capture from monocular video without explicit geometric templates or multi-view data. Skeleton-relative coordinates not only address key challenges in modeling articulated bodies but also bridge traditional discriminative pose estimators and modern implicit models. As implicit representations like A-NeRF mature, their role in democratizing access to high-quality 3D motion and shape data from readily available 2D video will likely expand, driving both theoretical advances and practical adoption.