- The paper introduces MagicPony, a novel method using a hybrid neural-mesh representation and self-supervised features to reconstruct articulated 3D animal models from single images.
- MagicPony employs a hybrid implicit-explicit representation combining neural fields with meshes and integrates DINO-ViT features for robust shape and pose prediction.
- Evaluations show MagicPony substantially outperforms prior methods in reconstruction quality (Chamfer Distance on toy-bird scans, PCK for keypoint transfer) while requiring far less supervision, and it generalizes even to abstract drawings.
MagicPony: Learning Articulated 3D Animals in the Wild
The authors present MagicPony, a method that predicts the 3D shape, articulation, viewpoint, texture, and lighting of an articulated animal from a single image. It advances the state of the art in category-specific 3D reconstruction (e.g., horses and birds), learning from single-view images without heavy geometric supervision.
MagicPony employs a hybrid representation combining neural fields and meshes, balancing the expressiveness of volumetric models with the efficiency of explicit meshes. Central to the approach is a self-supervised vision transformer (DINO-ViT), whose features are distilled into the 3D representation to enable robust inference of shape and pose. To escape local optima in viewpoint estimation, a multi-hypothesis sampling scheme is introduced that adds little training cost.
Core Methodology
- Implicit-Explicit Representation:
- The object shape is encoded using a signed distance function (SDF) which is transformed into an explicit mesh using Differentiable Marching Tetrahedra. This hybrid approach captures fine geometric details while avoiding common pitfalls such as mesh folding.
- Articulated Model Prediction:
- The system is designed to model instance-specific deformations and general poses using blend skinning, estimating bone rotations and vertex displacements based on image features extracted by DINO-ViT.
- Self-Supervised Features:
- Distilling DINO-ViT features into the neural representation provides self-supervised correspondences, reducing dependency on the 2D keypoints or template shapes traditionally used in related works.
- Viewpoint Sampling:
- MagicPony maintains multiple viewpoint hypotheses and samples among them in proportion to scores predicted during training, helping the optimization escape poor local optima in viewpoint estimation.
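The SDF-to-mesh step above can be illustrated by the core operation of marching tetrahedra: finding where the SDF crosses zero along the edges of a tetrahedron. The sketch below uses an analytic sphere SDF and hypothetical function names; it is not the paper's implementation, which uses a learned SDF and a differentiable variant over a full tetrahedral grid.

```python
import numpy as np

def sphere_sdf(p, radius=1.0):
    """Signed distance to a sphere centred at the origin (toy stand-in for a learned SDF)."""
    return np.linalg.norm(p, axis=-1) - radius

def tet_surface_points(verts, sdf_vals):
    """Linearly interpolate the zero-crossing on each sign-changing edge of one tetrahedron."""
    edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    points = []
    for i, j in edges:
        a, b = sdf_vals[i], sdf_vals[j]
        if a * b < 0:  # edge straddles the surface
            t = a / (a - b)  # parameter where the linearly interpolated SDF hits zero
            points.append(verts[i] + t * (verts[j] - verts[i]))
    return np.array(points)

# One tetrahedron with one vertex inside the unit sphere and three outside.
verts = np.array([[0.5, 0.0, 0.0], [2.0, 0.0, 0.0], [0.5, 2.0, 0.0], [0.5, 0.0, 2.0]])
vals = sphere_sdf(verts)
pts = tet_surface_points(verts, vals)
# the three extracted points lie on (or near) the sphere surface
```

Repeating this per tetrahedron over a grid yields a watertight triangle mesh; making the vertex positions and SDF values differentiable is what lets gradients from a rendering loss shape the surface.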
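The articulation model relies on blend skinning, where each vertex moves as a weighted blend of bone transforms. A minimal linear-blend-skinning sketch with a two-bone toy chain follows; the function names, weights, and pivots are illustrative, not from the paper.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def blend_skinning(vertices, weights, rotations, pivots):
    """Linear blend skinning: weights is (V, B); each bone rotates about its pivot."""
    out = np.zeros_like(vertices)
    for b, (R, p) in enumerate(zip(rotations, pivots)):
        transformed = (vertices - p) @ R.T + p  # rigid transform under bone b
        out += weights[:, [b]] * transformed    # weighted blend per vertex
    return out

# Three vertices along a "limb"; the second bone bends 90 degrees at x = 1.
vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])  # rows sum to 1
rotations = [np.eye(3), rot_z(np.pi / 2)]
pivots = [np.zeros(3), np.array([1.0, 0.0, 0.0])]
posed = blend_skinning(vertices, weights, rotations, pivots)
```

In MagicPony the bone rotations (and additional per-vertex displacements) are predicted from image features rather than set by hand, but the deformation machinery is of this form.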
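The multi-hypothesis viewpoint scheme can be sketched as score-weighted sampling over a small set of candidate viewpoints: the network scores each hypothesis, and one is drawn in proportion to softmax of the scores, so good hypotheses dominate while alternatives stay reachable. All names and values below are illustrative.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def sample_hypothesis(scores, rng):
    """Pick one viewpoint hypothesis index with probability proportional to its score."""
    probs = softmax(scores)
    idx = rng.choice(len(scores), p=probs)
    return idx, probs

rng = np.random.default_rng(0)
scores = np.array([2.0, 0.1, -1.0, 0.5])  # e.g. scores for four coarse azimuth bins
idx, probs = sample_hypothesis(scores, rng)
# the highest-scoring hypothesis is sampled most often, but none has zero probability
```

This kind of exploration is what lets training recover from a confidently wrong initial viewpoint, which a single deterministic prediction tends to get stuck in.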
Empirical Evaluation
The efficacy of MagicPony is demonstrated through extensive experiments on datasets of horses, birds, and other animals, including generalization from real photographs to abstract drawings. Quantitatively, it substantially outperforms prior methods in Chamfer Distance (on toy bird scans) and PCK scores, indicating higher reconstruction quality achieved with far less supervision.
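The Chamfer Distance used in these comparisons measures how well two point clouds cover each other. A brute-force sketch (O(N·M), symmetric variant; the paper's exact normalization may differ) is:

```python
import numpy as np

def chamfer_distance(a, b):
    """Mean nearest-neighbour distance from a to b plus from b to a.

    a: (N, 3) and b: (M, 3) point clouds.
    """
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
# identical clouds -> zero distance; shifting b makes it strictly positive
```

In practice the reconstructed mesh and the ground-truth scan are sampled into point clouds before this metric is computed.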
Implications and Future Directions
The significant reduction in supervision requirements paves the way for more scalable solutions in 3D modeling, especially when dealing with diverse object categories. However, the approach does require a predefined skeleton topology, which may not be universally applicable across arbitrary objects. Future research could explore automatic topology discovery and enhancement of texture quality through advanced generative models. Additionally, the reliance on self-supervised features underscores the potential of leveraging rich pre-trained models to facilitate complex tasks in computer vision.
MagicPony holds promise for applications ranging from animation to AR/VR content creation, where demand for realistic and versatile 3D assets continues to grow. The introduced techniques may serve as building blocks for automated, high-fidelity 3D reconstruction pipelines, in contrast to traditional methods that rely on extensive multi-view capture setups.