BANMo: Building Animatable 3D Neural Models from Many Casual Videos (2112.12761v3)

Published 23 Dec 2021 in cs.CV and cs.GR

Abstract: Prior work for articulated 3D shape reconstruction often relies on specialized sensors (e.g., synchronized multi-camera systems), or pre-built 3D deformable models (e.g., SMAL or SMPL). Such methods are not able to scale to diverse sets of objects in the wild. We present BANMo, a method that requires neither a specialized sensor nor a pre-defined template shape. BANMo builds high-fidelity, articulated 3D models (including shape and animatable skinning weights) from many monocular casual videos in a differentiable rendering framework. While the use of many videos provides more coverage of camera views and object articulations, they introduce significant challenges in establishing correspondence across scenes with different backgrounds, illumination conditions, etc. Our key insight is to merge three schools of thought; (1) classic deformable shape models that make use of articulated bones and blend skinning, (2) volumetric neural radiance fields (NeRFs) that are amenable to gradient-based optimization, and (3) canonical embeddings that generate correspondences between pixels and an articulated model. We introduce neural blend skinning models that allow for differentiable and invertible articulated deformations. When combined with canonical embeddings, such models allow us to establish dense correspondences across videos that can be self-supervised with cycle consistency. On real and synthetic datasets, BANMo shows higher-fidelity 3D reconstructions than prior works for humans and animals, with the ability to render realistic images from novel viewpoints and poses. Project webpage: banmo-www.github.io .
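
One way to picture the canonical embeddings and cycle consistency mentioned in the abstract is as a soft 2D-to-3D matching followed by a reprojection check. The sketch below is a simplified illustration rather than the authors' implementation: pixel_feats, canon_embeds, canon_pts, warp_to_camera, and project are assumed placeholders for learned pixel features, embeddings of canonical surface points, and the model's deformation and camera modules.

    import torch
    import torch.nn.functional as F

    def soft_match(pixel_feats, canon_embeds, canon_pts, temperature=0.1):
        # pixel_feats: (P, C) features of sampled pixels
        # canon_embeds: (N, C) embeddings of canonical surface points; canon_pts: (N, 3)
        sim = pixel_feats @ canon_embeds.t() / temperature   # feature similarity, (P, N)
        weights = F.softmax(sim, dim=-1)                     # soft assignment over canonical points
        return weights @ canon_pts                           # expected canonical 3D match, (P, 3)

    def cycle_loss(matched_canon, warp_to_camera, project, pixel_xy):
        # Warp matched canonical points back to the camera frame, project to 2D,
        # and penalize deviation from the pixels they were matched from.
        reproj = project(warp_to_camera(matched_canon))      # (P, 2)
        return ((reproj - pixel_xy) ** 2).sum(-1).mean()

Because the match, the warp, and the projection are all differentiable, such a cycle term can be minimized jointly with the rendering losses, which is the self-supervision idea the abstract refers to.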

Citations (160)

Summary

  • The paper introduces a unified framework that integrates neural implicit representations with canonical embeddings to generate animatable 3D models from casual videos.
  • The approach leverages neural blend skinning to manage pose variations and non-rigid deformations without relying on pre-registered cameras or templates.
  • Experimental results show superior fidelity in reconstructing human and animal models, outperforming existing dynamic NeRF methods.

Overview of BANMo: Building Animatable 3D Neural Models from Many Casual Videos

The paper "BANMo: Building Animatable 3D Neural Models from Many Casual Videos" introduces a novel framework designed to generate high-fidelity, animatable 3D models derived from casual RGB videos. This approach diverges from previous methodologies that often necessitated specialized sensors or pre-existing 3D model templates, which are impractical for varied real-world datasets.

The method, BANMo, combines ideas from three lines of work: classic deformable shape models, canonical embeddings, and neural radiance fields (NeRFs). The integration produces detailed 3D models that are optimized end to end through differentiable rendering.
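
Concretely, the integration can be pictured as a NeRF defined in a canonical space, queried through a per-frame warp and rendered with standard volume rendering; the error of the rendered pixels is what gradient-based optimization minimizes. The sketch below is a minimal illustration under these assumptions; warp_to_canonical and canonical_nerf are hypothetical stand-ins for BANMo's deformation and canonical-shape modules.

    import torch

    def render_ray(samples_cam, deltas, warp_to_canonical, canonical_nerf):
        # samples_cam: (S, 3) points along one camera ray; deltas: (S,) sample spacings.
        # Warp camera-space samples into the canonical frame, query the canonical MLP,
        # then alpha-composite along the ray.
        pts_canonical = warp_to_canonical(samples_cam)       # per-frame backward warp
        sigma, rgb = canonical_nerf(pts_canonical)           # densities (S,), colors (S, 3)

        alpha = 1.0 - torch.exp(-sigma * deltas)             # per-sample opacity
        trans = torch.cumprod(
            torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0
        )                                                    # accumulated transmittance
        weights = alpha * trans
        return (weights[:, None] * rgb).sum(0)               # composited pixel color

A photometric loss such as ((rendered - observed) ** 2).mean() over sampled rays, together with the additional consistency terms the paper describes, then drives the optimization.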

Core Contributions

  1. Unified Framework: BANMo seamlessly blends traditional articulated models with neural implicit representations and canonical embeddings. This integration is critical for establishing dense correspondences across video frames in a self-supervised manner, an approach that bypasses the need for pre-registered cameras or template models.
  2. Neural Blend Skinning: The paper introduces neural blend skinning models that make articulated deformations both differentiable and invertible (a minimal sketch appears after this list). This lets the model handle pose variations and non-rigid deformations without known camera parameters, extending beyond the capabilities of existing dynamic NeRF approaches.
  3. Enhanced Fidelity: Experimental evaluations demonstrate that BANMo achieves higher-fidelity reconstructions on both human and animal datasets, outperforming previous methods. The framework renders realistic images from novel viewpoints and poses, underscoring its usefulness for virtual and augmented reality content creation.
  4. Canonical Embeddings and Registration: By optimizing a canonical feature embedding for dense matching across video frames, BANMo establishes robust pixel-level correspondences that are key to accurate reconstructions. This approach is bolstered by self-supervised learning mechanisms, ensuring reliable geometric and photometric consistency across varying conditions.
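
Item 2 can be made concrete with a small sketch of blend skinning driven by learned bones. Assuming each bone contributes a skinning weight that falls off like a 3D Gaussian around its center, a point is deformed by blending per-bone rigid transforms; BANMo's exact parameterization differs, so bone_centers, bone_precisions, and the transform format below are illustrative assumptions.

    import torch

    def gaussian_skinning_weights(pts, bone_centers, bone_precisions):
        # pts: (P, 3); bone_centers: (B, 3); bone_precisions: (B, 3) diagonal precisions.
        # Each bone's weight at a point decays with its Mahalanobis distance to the bone center.
        diff = pts[:, None, :] - bone_centers[None, :, :]        # (P, B, 3)
        mdist = (diff ** 2 * bone_precisions[None]).sum(-1)      # (P, B)
        return torch.softmax(-mdist, dim=-1)                     # normalized skinning weights

    def blend_skinning(pts, rotations, translations, weights):
        # rotations: (B, 3, 3); translations: (B, 3); weights: (P, B).
        # Forward linear blend skinning: deform points by a weighted sum of per-bone rigid transforms.
        per_bone = torch.einsum('bij,pj->pbi', rotations, pts) + translations[None]  # (P, B, 3)
        return (weights[..., None] * per_bone).sum(1)            # deformed points, (P, 3)

BANMo's actual formulation pairs forward and backward warps and encourages them to invert each other; the code above only shows the forward direction of that machinery.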

Numerical Results and Claims

The paper substantiates its claims through experimental validation on real and synthetic datasets. Quantitatively, BANMo recovers finer 3D shape detail than state-of-the-art techniques. These results underscore the framework's advantage in fidelity and rendering quality, particularly for non-rigid objects.

Implications and Future Directions

The research presents meaningful implications for the development of AI-driven applications in content creation for virtual environments. The framework's ability to construct accurate and animatable 3D models from casually collected videos represents a significant leap towards democratizing access to powerful modeling tools.

Practically, the implications extend to enhanced virtual reality (VR) and augmented reality (AR) applications, allowing users to turn casually captured video into 3D content. Theoretically, this work paves the way for more general algorithms that could handle broader object categories with minimal user intervention.

Future developments could focus on improving BANMo's computational efficiency, reducing its reliance on extensive computing power. Root-pose estimation that does not depend on pretrained models would also simplify the pipeline further.

In summary, the BANMo framework represents a substantive step forward in the field of 3D modeling from casual video inputs, offering insights and directions for both theoretical exploration and practical application within AI and computer vision fields.