- The paper introduces a two-stage model that first recovers 2.5D sketches (depth, surface normal, and silhouette maps) and then reconstructs full 3D shapes from them.
- It employs encoder-decoder architectures and a differentiable reprojection consistency loss, which enables self-supervised fine-tuning that bridges the gap between synthetic training data and real-world images.
- Experimental results show improved IoU and visual quality across datasets, though challenges remain with thin or complex structures.
Overview of "MarrNet: 3D Shape Reconstruction via 2.5D Sketches"
The paper introduces MarrNet, an approach to reconstructing 3D shapes from single 2D images, a notoriously ill-posed problem in computer vision. The authors present an end-to-end trainable model that decomposes the task into two stages: recovery of a 2.5D sketch, comprising depth, surface normal, and silhouette maps, followed by reconstruction of a 3D shape from that sketch. This separation offers significant advantages for handling the domain gap typically encountered when models trained on synthetic data are applied to real-world images.
Technical Approach
The MarrNet framework is structured in three parts:
- 2.5D Sketch Estimation: This initial stage extracts intrinsic geometric properties from a single-view RGB image by predicting depth, surface normal, and silhouette maps. The encoder-decoder architecture used in this phase abstracts away appearance details such as texture and lighting, which transfer poorly from synthetic to real images (a minimal sketch of such a network follows this list).
- 3D Shape Estimation: The second module regresses a full 3D shape, in a voxel representation, from the recovered 2.5D sketches. Because the sketches are largely independent of appearance, this stage can be trained solely on synthetically generated data, mitigating domain adaptation challenges (a toy version appears after this list).
- Reprojection Consistency: This key innovation enforces agreement between the estimated 3D shape and the 2.5D sketches through differentiable loss functions: the reconstructed voxels, when projected into the image, must reproduce the observed depth, normals, and silhouette. Differentiability lets gradients flow from these constraints back into the network, which is what makes self-supervised fine-tuning possible (a simplified version of these losses is sketched after this list).
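To make the first stage concrete, here is a minimal PyTorch sketch of an encoder-decoder that maps an RGB image to the three sketch channels. This is an illustration under simplified assumptions, not the paper's network: the actual model uses a ResNet-18 encoder, and the class name `SketchEstimator` and the channel sizes below are invented for this example.

```python
import torch
import torch.nn as nn

class SketchEstimator(nn.Module):
    """Toy encoder-decoder for 2.5D sketch prediction. Layer sizes are
    illustrative; the paper uses a ResNet-18 encoder with an
    upconvolutional decoder."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 5, 4, stride=2, padding=1),  # depth(1) + normals(3) + silhouette(1)
        )

    def forward(self, rgb):
        out = self.decoder(self.encoder(rgb))
        depth = out[:, 0:1]                      # 1 channel: depth
        normals = torch.tanh(out[:, 1:4])        # 3 channels: surface normal components
        silhouette = torch.sigmoid(out[:, 4:5])  # 1 channel: object mask
        return depth, normals, silhouette
```

Because the down- and up-sampling strides match, the predicted maps retain the input's spatial resolution (e.g., 256x256, the paper's input size).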
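The second stage can be sketched in the same spirit. Only the overall structure (encode the silhouette-masked sketches to a latent code, decode to an occupancy grid) follows the paper; the class name `ShapeEstimator`, the latent size, and the 32^3 output resolution are illustrative choices (the paper decodes a 128^3 grid).

```python
class ShapeEstimator(nn.Module):
    """Toy sketch-to-voxel network: encodes silhouette-masked depth and
    normal maps into a latent code, then decodes an occupancy grid."""

    def __init__(self, latent_dim=200):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.to_latent = nn.Linear(128, latent_dim)
        self.from_latent = nn.Linear(latent_dim, 128 * 4 * 4 * 4)
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, depth, normals, silhouette):
        # As in the paper, depth and normals are masked by the silhouette
        # before being fed to the shape network.
        x = torch.cat([depth, normals], dim=1) * silhouette
        z = self.to_latent(self.encoder(x).flatten(1))
        v = self.from_latent(z).view(-1, 128, 4, 4, 4)
        return torch.sigmoid(self.decoder(v)).squeeze(1)  # (B, 32, 32, 32)
```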
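Finally, a simplified version of the depth and silhouette consistency terms is sketched below, assuming an orthographic camera aligned with the grid's first spatial axis and depths quantized to voxel indices. The paper's formulation additionally handles projection geometry in full and includes a surface-normal term, which is omitted here; the function name `reprojection_consistency` is ours.

```python
import torch

def reprojection_consistency(voxels, depth_idx, silhouette):
    """Simplified depth and silhouette reprojection losses.

    voxels:     (B, D, H, W) occupancy probabilities in [0, 1]
    depth_idx:  (B, H, W) long tensor, voxel index of the visible surface
    silhouette: (B, H, W) float tensor, 1.0 inside the object mask
    """
    B, D, H, W = voxels.shape
    z = torch.arange(D, device=voxels.device).view(1, D, 1, 1)
    d = depth_idx.unsqueeze(1)        # (B, 1, H, W)
    inside = silhouette.unsqueeze(1)  # (B, 1, H, W)

    # Depth term: voxels strictly in front of the observed surface should
    # be empty, and the voxel at the observed depth should be occupied.
    front = (z < d).float() * inside
    at_depth = (z == d).float() * inside
    loss_depth = (voxels ** 2 * front).sum() + ((1.0 - voxels) ** 2 * at_depth).sum()

    # Silhouette term: rays outside the mask should hit no occupied voxel.
    outside = 1.0 - inside
    loss_sil = (voxels ** 2 * outside).sum()

    return (loss_depth + loss_sil) / (B * H * W)
```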
The model's training paradigm is a two-step process: the components are first trained individually on synthetic data (e.g., ShapeNet renderings), then fine-tuned on real image datasets (e.g., PASCAL 3D+) in a self-supervised manner, with the reprojection consistency loss standing in for ground-truth annotations.
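A fine-tuning step might look roughly as follows, reusing the toy modules and loss sketched above (`finetune_step` is our name). Note that the paper fine-tunes only selected components so that predictions stay close to the learned shape prior, whereas this sketch simply updates whatever parameters the optimizer holds.

```python
import torch.nn.functional as F

def finetune_step(sketch_net, shape_net, rgb, optimizer):
    """One self-supervised fine-tuning step on an unannotated real image.
    No 3D ground truth is used: the model's own 2.5D predictions supervise
    the voxel estimate through the reprojection consistency loss."""
    depth, normals, silhouette = sketch_net(rgb)
    voxels = shape_net(depth, normals, silhouette)  # (B, D, G, G)
    B, D, G, _ = voxels.shape

    # Resample the sketches to the grid's lateral resolution so each pixel
    # maps to one ray; depth is assumed to be in voxel units here.
    d = F.interpolate(depth, size=(G, G), mode='nearest')
    s = F.interpolate(silhouette, size=(G, G), mode='nearest')
    d_idx = d.squeeze(1).clamp(0, D - 1).round().long()
    mask = (s.squeeze(1) > 0.5).float()

    # Quantizing depth blocks gradients to the sketch estimator, so only
    # the shape network is trained here; the paper's formulation is also
    # differentiable with respect to the sketches.
    loss = reprojection_consistency(voxels, d_idx, mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```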
Experimental Evaluation
The efficacy of MarrNet is demonstrated on both synthetic and real images, including the ShapeNet, PASCAL 3D+, and IKEA datasets. MarrNet outperforms baseline models qualitatively and quantitatively, achieving higher IoU and being preferred by human evaluators in direct comparisons. Notably, the paper highlights MarrNet's ability to preserve detail and smoothness in its 3D reconstructions, which matters for real-world applications.
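For reference, voxel IoU, the quantitative metric referred to above, can be computed as follows. The binarization threshold is a free parameter and the paper's exact evaluation protocol may differ; the function name `voxel_iou` is ours.

```python
import torch

def voxel_iou(pred, target, threshold=0.5):
    """Intersection-over-Union between a predicted occupancy grid and a
    binary ground-truth grid. `threshold` binarizes the prediction."""
    p = pred > threshold
    t = target > 0.5
    intersection = (p & t).sum().item()
    union = (p | t).sum().item()
    return intersection / union if union > 0 else 1.0
```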
The authors also address specific limitations and failure cases: MarrNet struggles with thin or complex structures and is sensitive to inaccuracies in the estimated silhouettes. Despite these challenges, it shows robustness across a variety of object classes, including chairs, airplanes, and cars.
Implications and Future Work
The proposed method bridges significant gaps in single-view 3D reconstruction by effectively transferring learned knowledge from synthetic to real-world scenarios. The ability to train without extensive 3D shape annotations is a practical advantage for scaling the model's applicability. Additionally, the use of 2.5D sketches as intermediate representations is informed by theories of human visual perception, in particular David Marr's proposal of the 2.5D sketch, from which the model takes its name; this connection may guide further exploration in cognitive modeling and multi-view reconstruction.
Future research directions could focus on enhancing MarrNet's capabilities in handling occlusions and more intricate geometries, as well as extending its application to additional contexts such as dynamic scenes or interactive environments. Integration with newer volumetric representations and further optimization of the end-to-end training process may also enhance performance and adaptability.