
Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans (2012.15838v2)

Published 31 Dec 2020 in cs.CV

Abstract: This paper addresses the challenge of novel view synthesis for a human performer from a very sparse set of camera views. Some recent works have shown that learning implicit neural representations of 3D scenes achieves remarkable view synthesis quality given dense input views. However, the representation learning will be ill-posed if the views are highly sparse. To solve this ill-posed problem, our key idea is to integrate observations over video frames. To this end, we propose Neural Body, a new human body representation which assumes that the learned neural representations at different frames share the same set of latent codes anchored to a deformable mesh, so that the observations across frames can be naturally integrated. The deformable mesh also provides geometric guidance for the network to learn 3D representations more efficiently. To evaluate our approach, we create a multi-view dataset named ZJU-MoCap that captures performers with complex motions. Experiments on ZJU-MoCap show that our approach outperforms prior works by a large margin in terms of novel view synthesis quality. We also demonstrate the capability of our approach to reconstruct a moving person from a monocular video on the People-Snapshot dataset. The code and dataset are available at https://zju3dv.github.io/neuralbody/.

Citations (630)

Summary

  • The paper presents a structured framework that anchors latent codes to a deformable human mesh to enable novel view synthesis from sparse multi-view videos.
  • It employs SparseConvNet and MLP-based density and color regression to propagate and refine 3D representations, achieving superior PSNR and SSIM results.
  • The method significantly enhances photorealism and consistency across frames, paving the way for applications in VR, film production, and telepresence.

Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans

The paper "Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans" presents a sophisticated method designed to address the challenging problem of novel view synthesis from sparse multi-view video of human performers. This work stands out in its handling of ill-posed problems associated with highly sparse camera views.

Method Overview

The authors introduce an approach termed Neural Body, which integrates observations across video frames through a shared set of latent codes anchored to a deformable human mesh model. This design allows temporal data to be fused naturally, a significant advantage over methods that learn a separate representation for each frame.

The process consists of several key components:

  1. Structured Latent Codes: Latent codes are anchored to the vertices of the deformable SMPL body model, so the representation adapts to the observed human pose across video frames. The codes' spatial positions are updated from per-frame pose estimates before density and color are regressed for novel view synthesis.
  2. Code Diffusion and Latent Code Volume: SparseConvNet is leveraged to propagate latent codes from surface vertices into the surrounding 3D space, facilitating effective representation even in sparsely defined areas.
  3. Density and Color Regression: MLP networks predict density and color at arbitrary points in space, conditioned on the diffused latent codes, spatial location, and viewing direction, which allows flexible rendering under varying viewpoints.
  4. Volume Rendering: The methodology employs volume rendering techniques to accumulate densities and colors along camera rays, translating 3D representations into coherent 2D images for any viewpoint.
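Steps 1 and 2 above can be sketched in a few lines of numpy. The code dimension, vertex count, grid resolution, and especially the per-voxel averaging used for "diffusion" are illustrative stand-ins only; the paper uses a learned SparseConvNet for this step, and real SMPL vertices rather than random points:

```python
import numpy as np

# Illustrative sizes: one 16-dim latent code per SMPL vertex (SMPL has
# 6890 vertices); the codes are shared across all frames of the video.
rng = np.random.default_rng(0)
n_verts, code_dim = 6890, 16
latent_codes = rng.normal(size=(n_verts, code_dim))

# Per-frame posed vertex positions (random stand-in for SMPL output),
# assumed to lie in the unit cube [-1, 1]^3.
verts = rng.uniform(-1.0, 1.0, size=(n_verts, 3))

# "Code diffusion": scatter vertex codes into a coarse voxel grid and
# average per voxel. This crude averaging only shows the data flow; the
# paper propagates codes into nearby space with a SparseConvNet.
res = 32
grid = np.zeros((res, res, res, code_dim))
count = np.zeros((res, res, res, 1))
idx = np.clip(((verts + 1.0) / 2.0 * res).astype(int), 0, res - 1)
for (i, j, k), c in zip(idx, latent_codes):
    grid[i, j, k] += c
    count[i, j, k] += 1
grid = grid / np.maximum(count, 1)

# Query: a 3D point picks up the code of its voxel; an MLP would then
# regress density and view-dependent color from (code, point, viewdir).
def query_code(x):
    i, j, k = np.clip(((x + 1.0) / 2.0 * res).astype(int), 0, res - 1)
    return grid[i, j, k]

code = query_code(np.array([0.1, -0.2, 0.3]))
```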

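The volume rendering step reduces to the standard NeRF-style quadrature: per-sample opacities are computed from density and sample spacing, weighted by accumulated transmittance, and used to composite color along the ray. The densities and colors below are random placeholders for what the Neural Body MLPs would predict:

```python
import numpy as np

# One camera ray with 64 samples; sigma and rgb stand in for MLP outputs.
rng = np.random.default_rng(0)
n_samples = 64
deltas = np.full(n_samples, 2.0 / n_samples)      # spacing between samples
sigma = rng.uniform(0.0, 5.0, size=n_samples)     # predicted densities
rgb = rng.uniform(0.0, 1.0, size=(n_samples, 3))  # predicted colors

alpha = 1.0 - np.exp(-sigma * deltas)             # per-sample opacity
# Transmittance: probability the ray reaches sample i unoccluded.
trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
weights = trans * alpha
pixel = (weights[:, None] * rgb).sum(axis=0)      # composited pixel color
```

The weights sum to at most 1, so a ray that exits the volume without hitting dense material contributes a correspondingly dimmer pixel.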
Experimental Setup and Results

The authors substantiate the efficacy of Neural Body with a newly assembled multi-view dataset, ZJU-MoCap, capturing complex human motions via 21 cameras. The results underscore the method’s superior performance in view synthesis metrics, demonstrating significant improvements in PSNR and SSIM compared to state-of-the-art techniques like Neural Volumes, Neural Textures, and models based on NeRF.
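Of the two reported metrics, PSNR is simply a log-scaled mean-squared error between the rendered and ground-truth images (SSIM is structural and more involved). A minimal sketch, with illustrative values rather than the paper's results:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy example: a uniform error of 0.1 gives MSE = 0.01, hence 20 dB.
gt = np.zeros((4, 4))
pred = gt + 0.1
print(psnr(pred, gt))  # → 20.0
```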

Qualitative assessments reveal that the method excels in rendering photorealistic views that maintain consistency across frames and viewpoints. Additionally, Neural Body exhibits robust capacity for detailed 3D human reconstruction, surpassing competing techniques such as PIFuHD, particularly in scenarios involving complex poses and loose clothing.

Implications and Future Work

Practically, this method holds promise for applications in virtual reality, film production, and telepresence, where high-quality, dynamic human representations are requisite. Theoretically, Neural Body contributes significantly to the field of neural rendering, presenting a scalable approach to dynamic scene representation with limited observation data.

The authors suggest potential areas for enhancement, including improving computational efficiency and exploring semi-supervised learning paradigms to augment training data. Future directions might also involve integrating this method with real-time systems to support interactive applications.

In summary, the "Neural Body" paper presents a distinct approach to solving the sparse view synthesis problem, delivering substantial improvements in rendering quality and extending the capabilities of implicit neural representations for dynamic human modeling.
