
Marr Revisited: 2D-3D Alignment via Surface Normal Prediction (1604.01347v1)

Published 5 Apr 2016 in cs.CV

Abstract: We introduce an approach that leverages surface normal predictions, along with appearance cues, to retrieve 3D models for objects depicted in 2D still images from a large CAD object library. Critical to the success of our approach is the ability to recover accurate surface normals for objects in the depicted scene. We introduce a skip-network model built on the pre-trained Oxford VGG convolutional neural network (CNN) for surface normal prediction. Our model achieves state-of-the-art accuracy on the NYUv2 RGB-D dataset for surface normal prediction, and recovers fine object detail compared to previous methods. Furthermore, we develop a two-stream network over the input image and predicted surface normals that jointly learns pose and style for CAD model retrieval. When using the predicted surface normals, our two-stream network matches prior work using surface normals computed from RGB-D images on the task of pose prediction, and achieves state of the art when using RGB-D input. Finally, our two-stream network allows us to retrieve CAD models that better match the style and pose of a depicted object compared with baseline approaches.

Authors (3)
  1. Aayush Bansal (20 papers)
  2. Bryan Russell (36 papers)
  3. Abhinav Gupta (178 papers)
Citations (217)

Summary

Understanding 3D Object Recognition Through Surface Normal Predictions

This paper presents a methodical deep learning framework aimed at enhancing the accuracy of 3D object recognition from 2D images. The authors approach the complex problem of bridging 2D image inputs to 3D model outputs by leveraging surface normal predictions, which act as an intermediate representation known as a 2.5D sketch. They incorporate these predictions with visual cues from images to retrieve closely matched 3D models from extensive CAD databases.

The paper's methodology rests on two key contributions. First, the authors introduce a skip-network architecture built on the pre-trained Oxford VGG convolutional neural network (CNN) to predict surface normals. The model achieves state-of-the-art results for surface normal prediction on the NYUv2 RGB-D dataset, capturing fine object detail often missed by previous methods, and significantly reduces both mean and median angular error relative to prior approaches.
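The core skip-network idea is to sample features from several CNN depths at each pixel and regress a normal from the concatenated "hypercolumn". A minimal, framework-free sketch of that fusion step (the layer shapes are illustrative, not the authors' actual VGG configuration):

```python
import numpy as np

def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    c, h, w = fmap.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return fmap[:, rows][:, :, cols]

def hypercolumn(feature_maps, out_h, out_w):
    """Concatenate per-pixel features from several layers (a 'skip' fusion).

    feature_maps: list of (C_i, H_i, W_i) arrays from different CNN depths.
    Returns a (sum(C_i), out_h, out_w) stack of upsampled features, from
    which a small regressor could predict a 3-vector normal per pixel.
    """
    ups = [upsample_nearest(f, out_h, out_w) for f in feature_maps]
    return np.concatenate(ups, axis=0)

# Toy feature maps standing in for conv outputs at three depths.
f1 = np.random.rand(4, 32, 32)   # shallow layer: fine spatial detail
f2 = np.random.rand(8, 16, 16)   # mid layer
f3 = np.random.rand(16, 8, 8)    # deep layer: semantic context
hc = hypercolumn([f1, f2, f3], 32, 32)
print(hc.shape)  # (28, 32, 32)
```

Combining shallow and deep layers this way is what lets the predictor keep fine object detail while still using high-level context.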

Second, the paper develops a two-stream network architecture that integrates image appearance and predicted surface normals to jointly learn object pose and style for CAD model retrieval. This model shows competitive performance on pose estimation, achieving results on par with, and in some configurations exceeding, existing systems that rely on RGB-D data. This is particularly valuable because predicting surface normals directly from RGB input makes the approach applicable in environments where depth data is unavailable.
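The two-stream pattern described above — one stream per modality, fused before a shared prediction head — can be sketched as a minimal late-fusion classifier. All layer sizes, weight names, and the number of pose bins below are illustrative assumptions, not the authors' architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative weights: two independent streams, then a fused pose head.
W_rgb    = rng.standard_normal((64, 128))   # RGB-appearance stream
W_normal = rng.standard_normal((64, 128))   # predicted-surface-normal stream
W_pose   = rng.standard_normal((16, 128))   # fused 128-d -> 16 pose bins

def two_stream_pose(rgb_feat, normal_feat):
    """Late fusion: process each modality separately, concatenate, classify."""
    h_rgb = relu(W_rgb @ rgb_feat)
    h_nrm = relu(W_normal @ normal_feat)
    fused = np.concatenate([h_rgb, h_nrm])   # joint representation
    return softmax(W_pose @ fused)           # distribution over pose bins

probs = two_stream_pose(rng.standard_normal(128), rng.standard_normal(128))
print(probs.shape)  # (16,)
```

Because the normal stream consumes *predicted* normals rather than sensor depth, the same head can be trained and deployed on plain RGB images.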

Quantitative evaluations underscore the model's efficacy. The authors divide the evaluation into global scene layout and local object layout (focusing on categories such as chair, sofa, and bed). The results surpass prior benchmarks in surface normal accuracy, recovering noticeably finer detail, and thereby align well with Marr's theoretical framework for visual perception.
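Surface-normal accuracy in this line of work is conventionally reported as per-pixel angular error: the mean and median angle between predicted and ground-truth normals, plus the fraction of pixels within 11.25°, 22.5°, and 30°. A minimal sketch of those metrics:

```python
import numpy as np

def normal_angular_errors(pred, gt):
    """Per-pixel angle (degrees) between predicted and ground-truth normals.

    pred, gt: (..., 3) arrays of surface normals (need not be unit length).
    """
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    cos = np.clip((pred * gt).sum(axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def summarize(errors):
    """Standard mean/median and within-threshold statistics."""
    e = errors.ravel()
    return {
        "mean": e.mean(),
        "median": np.median(e),
        "within_11.25": (e < 11.25).mean(),
        "within_22.5": (e < 22.5).mean(),
        "within_30": (e < 30.0).mean(),
    }

gt = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
pred = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]])  # second normal is 45° off
print(summarize(normal_angular_errors(pred, gt)))
```

Lower mean/median and higher within-threshold fractions indicate better normal prediction, which is the axis on which the paper reports its improvements.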

The implications of this research are both practical and theoretical. Practically, the model's ability to faithfully reproduce detailed 2.5D representations enables more accurate and efficient retrieval and reconstruction of 3D models across diverse applications in graphics and robotics. Theoretically, the revival of Marr's sequential processing model enhances our understanding of perception and draws attention to the reconciliation of intermediate structure (2.5D) with volumetric representation (3D).

Looking forward, advances in the proposed domain open avenues for further research into unsupervised learning of surface normals and exploring broader applications such as augmented reality where real-time processing is crucial. The integration of additional sensory information could refine and expand the capabilities of such systems, pushing the envelope towards a fuller understanding of 3D world reconstruction from 2D images.

In summary, this paper provides a substantial contribution to the field of computer vision by establishing a bridge from 2D images to 3D model retrieval through the adept use of surface normal predictions. This framework not only achieves impressive numerical results but reinforces foundational theories of visual perception, thereby guiding future research directions in 3D scene understanding.