- The paper introduces a method that represents each object as a learned category-level mean shape plus instance-specific deformations, reconstructing detailed 3D meshes from a single image while training only on annotated 2D image collections.
- It jointly estimates camera parameters and predicts texture, casting texture prediction as sampling image pixels into a canonical appearance space, which improves the fidelity of the recovered appearance.
- Evaluations on the CUB and PASCAL3D+ datasets show competitive performance (IoU up to 0.46), suggesting applications in AR, robotics, and graphics.
Overview of "Learning Category-Specific Mesh Reconstruction from Image Collections"
The paper "Learning Category-Specific Mesh Reconstruction from Image Collections" presents a method that, given a single 2D image, reconstructs a 3D shape, estimates the camera pose, and predicts texture. The approach is trained from image collections with 2D annotations only, without ground-truth 3D data or multiple views per instance, making it a step towards 3D understanding from limited supervision.
Key Contributions
The core idea is to represent object shapes as deformable 3D meshes: each instance is expressed as a learned category-level mean shape combined with deformations predicted from the input image. This strategy recovers detailed 3D structure while requiring only single-view annotated images for training.
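As a rough illustration of this representation, the sketch below is an assumption-based simplification rather than the authors' code: the module name `ShapePredictor`, the feature dimension, and the vertex count are hypothetical. It shows the basic pattern of a shared learned mean shape plus per-instance vertex offsets predicted from an image feature.

```python
import torch
import torch.nn as nn

class ShapePredictor(nn.Module):
    """Hypothetical sketch: learned category mean shape + per-instance deformation."""

    def __init__(self, feat_dim: int = 256, num_verts: int = 642):
        super().__init__()
        # Learned category-level mean shape, shared across all instances.
        self.mean_shape = nn.Parameter(torch.randn(num_verts, 3) * 0.1)
        # Small MLP mapping an image feature to per-vertex offsets.
        self.deform_head = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, num_verts * 3),
        )

    def forward(self, img_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, feat_dim) image encoding from any CNN backbone.
        delta = self.deform_head(img_feat).view(-1, self.mean_shape.shape[0], 3)
        # Instance shape = shared mean + instance-specific deformation.
        return self.mean_shape.unsqueeze(0) + delta


# Usage: one feature vector per image -> one deformed mesh per image.
verts = ShapePredictor()(torch.randn(4, 256))   # (4, 642, 3)
```

Because every instance is a deformation of the same mesh, vertices correspond across instances, which is what makes the shared texture space described below possible.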
Technical Approach
- Shape Representation: Each object is modeled as a 3D mesh whose vertices are a learned category-specific mean shape plus instance-specific deformations predicted from the image. Since all instances share the same mesh topology, the representation is memory-efficient and provides implicit correspondences across predicted shapes.
- Camera and Texture Prediction: The framework jointly learns the camera parameters and infers texture. Texture prediction is handled as an image-sampling task in a canonical appearance space (see the sketch after this list).
- Training without 3D Supervision: The model is trained using a collection of annotated images with foreground masks and semantic keypoint labels, without requiring 3D ground-truth data or multiple views per instance.
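As one way to realize the image-sampling view of texture prediction, the sketch below copies input-image pixels into a canonical UV texture map with a predicted sampling grid. This is an illustrative assumption, not the paper's implementation: the function name, the `flow` tensor, and the UV resolution are hypothetical, while `grid_sample` is standard PyTorch.

```python
import torch
import torch.nn.functional as F

def sample_uv_texture(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Sample input-image pixels into a canonical UV texture map.

    image: (B, 3, H, W) input image.
    flow:  (B, uv_h, uv_w, 2) predicted sampling coordinates in [-1, 1];
           each UV texel says where in the image to copy its color from.
    Returns a (B, 3, uv_h, uv_w) texture map in the canonical UV space.
    """
    return F.grid_sample(image, flow, mode="bilinear",
                         padding_mode="border", align_corners=True)


# Usage: a network would predict `flow`; random coordinates stand in here.
image = torch.rand(2, 3, 256, 256)
flow = torch.rand(2, 64, 128, 2) * 2 - 1     # hypothetical 64x128 UV map
texture = sample_uv_texture(image, flow)     # (2, 3, 64, 128)
```

Sampling colors rather than regressing them keeps textures sharp, and because the UV space is shared across the category, a texture sampled from one instance can be pasted onto another predicted mesh.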
Results and Evaluation
The paper demonstrates the efficacy of the approach on the CUB-200-2011 and PASCAL3D+ datasets, showing that it predicts diverse object shapes and textures. Qualitative results illustrate successful 3D reconstructions and texture predictions, including cross-instance texture transfer. Quantitatively, the approach is competitive with existing methods, achieving an Intersection over Union (IoU) of up to 0.46 for specific categories.
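For reference, IoU for mesh predictions is commonly computed by voxelizing the predicted and reference shapes and comparing occupancy. The snippet below is a minimal sketch of that metric under the assumption that boolean occupancy grids are already available; it is not the paper's evaluation code.

```python
import torch

def voxel_iou(pred_occ: torch.Tensor, gt_occ: torch.Tensor) -> float:
    """Intersection over Union between two boolean occupancy grids.

    pred_occ, gt_occ: (D, H, W) boolean tensors from voxelizing the meshes.
    """
    intersection = (pred_occ & gt_occ).sum().item()
    union = (pred_occ | gt_occ).sum().item()
    return intersection / union if union > 0 else 0.0


# Usage with random occupancy grids standing in for voxelized meshes.
pred = torch.rand(32, 32, 32) > 0.5
gt = torch.rand(32, 32, 32) > 0.5
print(voxel_iou(pred, gt))
```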
Implications and Future Directions
The method provides a promising pathway for 3D shape and texture recovery from minimal inputs, potentially enhancing applications in augmented reality, robotics, and graphics. Future research could explore:
- Enhancements for handling occlusions and complex articulations.
- Extending the framework to other object categories or less-structured shapes.
- Incorporating additional modalities such as temporal sequences or depth hints.
Conclusion
This paper advances mesh-based 3D reconstruction from limited annotations, relaxing the common requirement for extensive multi-view imagery or 3D ground truth. In doing so, it opens avenues for more general single-view 3D learning frameworks and contributes practical tools to the field.