Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition (2302.11566v1)

Published 22 Feb 2023 in cs.CV

Abstract: We present Vid2Avatar, a method to learn human avatars from monocular in-the-wild videos. Reconstructing humans that move naturally from monocular in-the-wild videos is difficult. Solving it requires accurately separating humans from arbitrary backgrounds. Moreover, it requires reconstructing detailed 3D surface from short video sequences, making it even more challenging. Despite these challenges, our method does not require any groundtruth supervision or priors extracted from large datasets of clothed human scans, nor do we rely on any external segmentation modules. Instead, it solves the tasks of scene decomposition and surface reconstruction directly in 3D by modeling both the human and the background in the scene jointly, parameterized via two separate neural fields. Specifically, we define a temporally consistent human representation in canonical space and formulate a global optimization over the background model, the canonical human shape and texture, and per-frame human pose parameters. A coarse-to-fine sampling strategy for volume rendering and novel objectives are introduced for a clean separation of dynamic human and static background, yielding detailed and robust 3D human geometry reconstructions. We evaluate our methods on publicly available datasets and show improvements over prior art.

Citations (73)

Summary

  • The paper Vid2Avatar presents a self-supervised method for reconstructing high-fidelity 3D human avatars directly from monocular videos captured in the wild without relying on templates or segmentation.
  • The method employs a dual neural field representation to effectively decompose the scene into foreground (human) and background, using canonical space for consistent human modeling and novel optimization objectives.
  • Experimental results show Vid2Avatar outperforms baselines in segmentation, novel view synthesis, and 3D reconstruction on standard and new datasets, enabling practical applications in AR/VR and HCI.

Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition

The paper "Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition" presents a novel approach to creating detailed 3D avatars from monocular videos captured in uncontrolled environments. By implementing self-supervised scene decomposition, this method effectively addresses challenges inherent in dynamic human reconstruction without relying on pre-existing templates or supervised segmentation.

Methodology Overview

Vid2Avatar distinguishes itself by reconstructing avatars directly from video input, bypassing the need for ground-truth supervision or large datasets of clothed human scans. The reconstruction is based on a dual-space design that models the foreground (human) and the background with separate neural fields. A canonical-space representation of the human, combined with a global optimization strategy, yields a clean decoupling from arbitrary backgrounds and precise 3D reconstructions. The authors also introduce novel objectives for opacity sparsity, which improve segmentation accuracy by suppressing unwanted background content in the human field.
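To make the dual-field idea concrete, the minimal sketch below shows how per-ray samples from a human field and a background field could be alpha-composited into a pixel colour and a soft human mask. It assumes simple MLP fields and a shared sample set for both fields; the names (`TinyField`, `composite_ray`) are illustrative and not taken from the paper's implementation, which uses an SDF-based human representation and a NeRF++-style background.

```python
import torch
import torch.nn as nn

class TinyField(nn.Module):
    """Illustrative stand-in for a neural field: maps 3D points to (density, RGB)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 1 density channel + 3 colour channels
        )

    def forward(self, pts):
        out = self.mlp(pts)
        return torch.relu(out[..., :1]), torch.sigmoid(out[..., 1:])  # density, colour

def composite_ray(pts, deltas, human_field, bg_field):
    """Alpha-composite samples from the human and background fields along one ray."""
    sigma_h, rgb_h = human_field(pts)
    sigma_b, rgb_b = bg_field(pts)
    sigma = sigma_h + sigma_b                              # combined density per sample
    rgb = (sigma_h * rgb_h + sigma_b * rgb_b) / sigma.clamp_min(1e-8)
    alpha = 1.0 - torch.exp(-sigma * deltas)               # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha[:-1]]), dim=0
    )                                                      # accumulated transmittance
    weights = alpha * trans                                # rendering weights
    pixel = (weights * rgb).sum(dim=0)                     # rendered pixel colour
    fg_mask = (weights * sigma_h / sigma.clamp_min(1e-8)).sum(dim=0)  # soft human mask
    return pixel, fg_mask

# Example: 64 samples along a single ray
pts = torch.linspace(0.0, 1.0, 64).unsqueeze(-1).repeat(1, 3)
deltas = torch.full((64, 1), 1.0 / 64)
pixel, mask = composite_ray(pts, deltas, TinyField(), TinyField())
```

The soft human mask produced as a by-product of rendering is what allows the method to learn a foreground/background separation without any external segmentation module.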

Technical Contributions

The technical framework of Vid2Avatar involves several key contributions:

  1. Dual Neural Field Representation: This method utilizes separate neural fields to model the human and background, efficiently segmenting foreground dynamics from static elements within the scene.
  2. Canonical Space Consistency: Providing a temporally consistent representation for human shape and texture, the canonical space approach allows for robust surface reconstruction, accommodating varying poses and garments.
  3. Volume Rendering Techniques: Building on NeRF++ for the background, Vid2Avatar employs surface-guided volume rendering for the human, concentrating ray accumulation around the subject's surface rather than on the background.
  4. Self-supervised Scene Decomposition: Rather than relying on external segmentation, the authors reduce segmentation errors with opacity sparsity and ray classification losses that cleanly separate the subject from the scene (a rough sketch of these objectives follows this list).
  5. Global Optimization Mechanism: The implementation of a comprehensive optimization strategy aligns background models, human shape and texture, and pose estimates across sequence frames, culminating in high-fidelity 3D avatars.
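The decomposition objectives referenced in item 4 can be pictured with the sketch below: an opacity-sparsity penalty that discourages human opacity outside a coarse subject region, and an entropy-style ray-classification term that pushes each ray's soft human mask toward 0 or 1. The paper's exact formulations and weights differ; everything here (`box_mask`, the 0.1 weight) is an illustrative assumption, not the authors' code.

```python
import torch

def opacity_sparsity_loss(fg_mask, box_mask):
    """Penalize human opacity that leaks outside a rough subject region.
    box_mask is 1 where the ray can plausibly hit the human, 0 elsewhere."""
    return (fg_mask * (1.0 - box_mask)).mean()

def ray_classification_loss(fg_mask, eps=1e-6):
    """Encourage each ray to be decisively human or background (mask near 0 or 1)
    via a binary-entropy-style penalty; a simplification of the paper's objective."""
    m = fg_mask.clamp(eps, 1.0 - eps)
    return (-(m * torch.log(m) + (1.0 - m) * torch.log(1.0 - m))).mean()

# Usage with the soft masks produced by the renderer sketched earlier (shapes assumed)
fg_mask = torch.rand(1024, 1)                       # one soft mask value per sampled ray
box_mask = (torch.rand(1024, 1) > 0.3).float()      # hypothetical coarse subject region
loss = opacity_sparsity_loss(fg_mask, box_mask) + 0.1 * ray_classification_loss(fg_mask)
```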

Experimental Validation

Vid2Avatar was evaluated on publicly available datasets, including MonoPerfCap, NeuMan, and 3DPW, with additional testing on the newly proposed SynWild dataset. The method showed superior performance in human segmentation, novel view synthesis, and 3D reconstruction tasks compared to baseline methods. Notably, the new dataset enables quantitative analysis of monocular human reconstruction in realistic environments, establishing Vid2Avatar as a reliable method for detailed spatial modeling.

Implications and Future Work

The advancement brought forth by Vid2Avatar significantly impacts fields such as AR/VR, human-computer interaction, and cinematography by simplifying the capture and reproduction of detailed, animated 3D avatars directly from video data. Future developments could explore enhancing capture fidelity in challenging scenarios involving loose clothing or fast-moving actions. Additionally, deeper integration with real-time application frameworks may uncover new possibilities for interactive avatar utilization.

In summary, Vid2Avatar marks substantial progress in avatar creation technology, providing a practical solution that avoids extensive manual preprocessing and thus helps make human digitization accessible and widespread. The model suggests promising directions for autonomous systems that can handle complex visual environments without compromising the accuracy of 3D representations.
