
Shape and Viewpoint without Keypoints (2007.10982v1)

Published 21 Jul 2020 in cs.CV, cs.LG, and eess.IV

Abstract: We present a learning framework that learns to recover the 3D shape, pose and texture from a single image, trained on an image collection without any ground truth 3D shape, multi-view, camera viewpoints or keypoint supervision. We approach this highly under-constrained problem in an "analysis by synthesis" framework where the goal is to predict the likely shape, texture and camera viewpoint that could produce the image with various learned category-specific priors. Our particular contribution in this paper is a representation of the distribution over cameras, which we call "camera-multiplex". Instead of picking a point estimate, we maintain a set of camera hypotheses that are optimized during training to best explain the image given the current shape and texture. We call our approach Unsupervised Category-Specific Mesh Reconstruction (U-CMR), and present qualitative and quantitative results on CUB, Pascal 3D and new web-scraped datasets. We obtain state-of-the-art camera prediction results and show that we can learn to predict diverse shapes and textures across objects using an image collection without any keypoint annotations or 3D ground truth. Project page: https://shubham-goel.github.io/ucmr

Authors (3)
  1. Shubham Goel (9 papers)
  2. Angjoo Kanazawa (84 papers)
  3. Jitendra Malik (211 papers)
Citations (105)

Summary

  • The paper presents an unsupervised framework for 3D reconstruction that eliminates the need for keypoint annotations by using a novel camera-multiplex representation.
  • It employs a deformable 3D mesh and analysis-by-synthesis strategy to accurately predict shape, pose, and texture from single images.
  • Empirical results on datasets like CUB and Pascal 3D demonstrate state-of-the-art camera prediction accuracy and robustness against local minima issues.

An Evaluation of U-CMR: Unsupervised Category-Specific Mesh Reconstruction

The paper "Shape and Viewpoint without Keypoints" introduces an unsupervised learning framework, named Unsupervised Category-Specific Mesh Reconstruction (U-CMR), for reconstructing 3D shape, pose, and texture of objects from single images, without relying on ground truth 3D shapes, multi-view supervision, or keypoint annotations. The approach is primarily evaluated on datasets like CUB and Pascal 3D, focusing on object categories such as birds, cars, motorcycles, and a web-scraped set of shoes.

Key Contributions

U-CMR's central contribution is a novel representation of the distribution over cameras, termed the "camera-multiplex": rather than committing to a single point estimate, the model maintains a set of camera hypotheses per training image, each optimized during training to best explain the observed image given the current shape and texture. This stands in contrast to conventional methods that predict a single camera, and is reminiscent of particle filtering, where multiple hypotheses are tracked and re-weighted over time.
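The camera-multiplex idea can be sketched in a few lines. The following is a toy illustration, not the authors' implementation: `render_loss` is a hypothetical stand-in for the real differentiable-rendering reconstruction loss, and the quadratic "camera space" replaces actual azimuth/elevation/roll/translation parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def render_loss(camera, target):
    # Stand-in for the silhouette/texture reconstruction loss obtained by
    # rendering the current shape and texture under `camera` and comparing
    # against the observed image (represented here by `target`).
    return float(np.sum((camera - target) ** 2))

K = 8                              # number of camera hypotheses per image
cameras = rng.normal(size=(K, 6))  # e.g. rotation (3) + translation (3)
target = np.zeros(6)               # toy "true" camera for illustration

for step in range(200):
    # Every hypothesis in the multiplex is optimized independently on its
    # own reconstruction loss (here: an analytic gradient step on the toy
    # quadratic loss, standing in for backprop through a renderer).
    cameras -= 0.05 * 2 * (cameras - target)

# At the end of training, the best-explaining hypothesis is selected as the
# predicted camera for this image.
losses = np.array([render_loss(c, target) for c in cameras])
best = cameras[np.argmin(losses)]
```

Because each of the `K` hypotheses descends its own loss surface from a different starting point, at least one tends to escape the poor local minima that trap a single point estimate.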

The methodology involves leveraging a deformable 3D mesh representation and is capable of learning diverse shapes and textures from image collections via an "analysis by synthesis" paradigm. This approach distinguishes itself by not requiring keypoint supervision, which traditionally aids similar tasks by providing critical cues for camera and shape inference. Instead, the authors show how to replace keypoint needs with a single template mesh per category, effectively bypassing the labor-intensive task of marking keypoints.
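The deformable-mesh parameterization described above can be illustrated with a minimal sketch: each instance's mesh is the shared category template plus per-vertex offsets regressed from the image. `predict_deltas` is a hypothetical stand-in for the learned network, and the three-vertex "mesh" is purely illustrative.

```python
import numpy as np

# Shared category-level template mesh (here a tiny 3-vertex stand-in; the
# actual method uses one full template mesh per object category).
template = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])

def predict_deltas(image_feat):
    # Stand-in for the learned shape regressor; here just a fixed linear
    # map from an image feature vector to per-vertex 3D offsets.
    W = np.full((image_feat.shape[0], template.size), 0.01)
    return (image_feat @ W).reshape(template.shape)

feat = np.ones(4)                          # toy image feature
instance_mesh = template + predict_deltas(feat)   # V = V_template + ΔV
```

Because every instance deforms the same template, the template supplies the coarse category prior that keypoints would otherwise provide, while the predicted offsets capture instance-specific shape.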

Experimental Results

The authors report state-of-the-art camera prediction accuracy despite the weak supervision. The results show that U-CMR learns a plausible and diverse set of 3D shapes and textures without any camera-viewpoint or keypoint supervision, and, quantitatively, it approaches the performance of keypoint-supervised methods on camera estimation, as measured by rotation error and the distribution of predicted camera poses.
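The rotation-error metric used for camera evaluation is typically the geodesic distance between the predicted and ground-truth rotations. A minimal sketch of that computation, assuming rotations are given as 3x3 matrices:

```python
import numpy as np

def geodesic_rotation_error(R_pred, R_gt):
    # Angle (in degrees) of the relative rotation R_pred^T R_gt, i.e. the
    # geodesic distance between the two rotations on SO(3).
    R_rel = R_pred.T @ R_gt
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))

# Example: a 90-degree rotation about the z-axis vs. the identity.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
err = geodesic_rotation_error(np.eye(3), Rz)   # → 90.0
```

The `np.clip` guards against floating-point values of the trace slightly outside [-1, 1], which would otherwise make `arccos` return NaN.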

The authors also describe empirical observations where naively trained models succumb to local minima, creating degenerate "planar" shapes due to the absence of constraints from keypoints. U-CMR's camera-multiplex effectively mitigates these issues, enabling it to recover from such local minima.

Implications and Future Directions

The key implications of this research are in the domains of 3D computer vision and unsupervised learning. Practically, U-CMR could be particularly useful in contexts where collecting annotated 3D data is not feasible, allowing for more scalable and cost-effective 3D modeling solutions. Theoretically, it underscores the potential of utilizing learned priors and distributions over direct supervision to achieve complex tasks, thereby fostering further exploration of similar approaches where supervision is minimal or unavailable.

Looking ahead, expanding the framework to accommodate articulated objects or to incorporate temporal information for video inputs could enhance the adaptability and robustness of U-CMR. Moreover, further refinement in texture prediction, potentially through integration with generative models, might resolve some of the observed limitations in texture fidelity.

In conclusion, U-CMR represents a significant stride toward unsupervised 3D reconstruction and provides a compelling framework that blends innovations in learning distributional representations with traditional shape inference paradigms. The paper lays a solid foundation for subsequent advances in the construction of 3D models from 2D images, with broad implications for applications across several visual and interactive domains.
