Self-supervised Single-view 3D Reconstruction via Semantic Consistency (2003.06473v1)

Published 13 Mar 2020 in cs.CV

Abstract: We learn a self-supervised, single-view 3D reconstruction model that predicts the 3D mesh shape, texture and camera pose of a target object with a collection of 2D images and silhouettes. The proposed method does not necessitate 3D supervision, manually annotated keypoints, multi-view images of an object or a prior 3D template. The key insight of our work is that objects can be represented as a collection of deformable parts, and each part is semantically coherent across different instances of the same category (e.g., wings on birds and wheels on cars). Therefore, by leveraging self-supervisedly learned part segmentation of a large collection of category-specific images, we can effectively enforce semantic consistency between the reconstructed meshes and the original images. This significantly reduces ambiguities during joint prediction of shape and camera pose of an object, along with texture. To the best of our knowledge, we are the first to try and solve the single-view reconstruction problem without a category-specific template mesh or semantic keypoints. Thus our model can easily generalize to various object categories without such labels, e.g., horses, penguins, etc. Through a variety of experiments on several categories of deformable and rigid objects, we demonstrate that our unsupervised method performs comparably if not better than existing category-specific reconstruction methods learned with supervision.

Authors (7)
  1. Xueting Li (32 papers)
  2. Sifei Liu (64 papers)
  3. Kihwan Kim (67 papers)
  4. Shalini De Mello (45 papers)
  5. Varun Jampani (125 papers)
  6. Ming-Hsuan Yang (377 papers)
  7. Jan Kautz (215 papers)
Citations (153)

Summary

Self-supervised Single-view 3D Reconstruction via Semantic Consistency

The paper "Self-supervised Single-view 3D Reconstruction via Semantic Consistency" introduces a novel approach to address the challenge of reconstructing 3D shapes, textures, and camera poses from single-view images using self-supervision. This approach circumvents traditional dependencies on annotated 3D data, keypoints, or multi-view images by leveraging the semantic coherence of object parts across different instances of the same category.

Methodological Overview

The core insight underpinning this research is that objects can be viewed as a collection of semantically consistent parts, such as wings on birds or wheels on cars. The authors propose a framework that employs self-supervised co-part segmentation to decompose 2D images into consistent semantic parts. This is achieved with SCOPS (Self-supervised Co-Part Segmentation), which discovers semantically coherent part segments across a large collection of category-specific images, as sketched below.
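To make the role of co-part segmentation concrete, the following sketch shows one plausible interface for such a network in PyTorch: images in, per-pixel part probabilities out. `PartSegNet` and its tiny two-layer architecture are illustrative placeholders and not the actual SCOPS implementation, which is considerably more involved.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in for a SCOPS-style co-part segmentation network.
# Only the input/output interface is sketched; the real model uses a
# deeper backbone and additional self-supervised training losses.
class PartSegNet(torch.nn.Module):
    def __init__(self, num_parts=4):
        super().__init__()
        self.backbone = torch.nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.head = torch.nn.Conv2d(32, num_parts + 1, kernel_size=1)  # +1 background channel

    def forward(self, images):
        feats = F.relu(self.backbone(images))
        logits = self.head(feats)           # (B, K+1, H, W) part logits
        return F.softmax(logits, dim=1)     # per-pixel part probabilities

net = PartSegNet(num_parts=4)
images = torch.rand(2, 3, 128, 128)         # a mini-batch of category-specific images
part_probs = net(images)                    # soft part assignments
part_labels = part_probs.argmax(dim=1)      # hard per-pixel part segmentation
```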

The reconstruction pipeline learns a single-view model that jointly predicts an object's 3D mesh shape, texture, and camera pose. The innovative aspect of the approach lies in its self-supervised nature: it primarily relies on enforcing semantic consistency between 2D images and their reconstructed 3D meshes. By ensuring that semantic part labels remain invariant on these reconstructions, the framework mitigates the "camera-shape ambiguity" problem, in which incorrect combinations of predicted shape and pose can still yield plausible renderings that do not reflect the true 3D structure.
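The sketch below illustrates, under simplifying assumptions, how a semantic consistency term could be computed: part probabilities rendered from the predicted mesh (via a differentiable renderer, not shown) are compared against the co-part segmentation of the input image. The function name, the squared-error form, and the silhouette masking are illustrative choices rather than the paper's exact formulation.

```python
import torch

def semantic_consistency_loss(rendered_part_probs, image_part_probs, silhouette=None):
    """Penalize disagreement between part labels rendered from the predicted
    3D mesh and part labels estimated on the 2D image.

    rendered_part_probs: (B, K, H, W) probabilities obtained by projecting
        per-vertex part assignments through a differentiable renderer (not shown).
    image_part_probs:    (B, K, H, W) probabilities from the co-part segmentation
        network, used here as a pseudo ground truth.
    silhouette: optional (B, 1, H, W) foreground mask restricting the loss
        to the object region.
    """
    target = image_part_probs.detach()          # do not back-propagate into segmentation
    diff = (rendered_part_probs - target) ** 2  # per-pixel squared error
    if silhouette is not None:
        diff = diff * silhouette
    return diff.mean()

# Usage with random tensors standing in for real network outputs.
rendered = torch.softmax(torch.randn(2, 5, 64, 64, requires_grad=True), dim=1)
observed = torch.softmax(torch.randn(2, 5, 64, 64), dim=1)
mask = torch.ones(2, 1, 64, 64)
loss = semantic_consistency_loss(rendered, observed, mask)
loss.backward()  # gradients can flow back to shape, texture and camera predictors
```

Because the rendered part layout depends on both the predicted shape and the predicted camera, a term of this kind penalizes incorrect pose even when the silhouette alone looks plausible, which is how semantic consistency helps resolve the camera-shape ambiguity.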

Strong Numerical Results

A notable aspect of this work is that the proposed method matches or surpasses traditional supervised, category-specific reconstruction methods. The authors report competitive results across several categories of both rigid and non-rigid objects, without requiring the category-specific template meshes or keypoint annotations conventionally used in supervised frameworks.

Implications and Future Directions

The implications of this research are multifold. Theoretically, it pushes the boundaries of understanding the potential of self-supervised learning in overcoming data annotation challenges, particularly in 3D vision. Practically, it provides a viable pathway for deploying 3D reconstruction models in scenarios where labeled data is sparse or unavailable, enhancing applications in fields such as robotics, augmented reality, and computer graphics.

In terms of future developments, this approach paves the way for exploring more generalized frameworks that can handle diverse object categories beyond the rigid and deformable dichotomy. Additionally, integrating this system with advanced differentiable rendering techniques could further enhance accuracy and robustness. Expanding the ability of models to learn from minimal data while maintaining high fidelity in 3D reconstructions remains an intriguing area of research.

The potential of this framework to generalize across various object categories suggests possible integrations with other learning paradigms, such as semi-supervised or unsupervised learning, to refine and improve object part segmentation and alignment in complex scenes.
