Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture (1411.4734v4)

Published 18 Nov 2014 in cs.CV

Abstract: In this paper we address three different computer vision tasks using a single basic architecture: depth prediction, surface normal estimation, and semantic labeling. We use a multiscale convolutional network that is able to adapt easily to each task using only small modifications, regressing from the input image to the output map directly. Our method progressively refines predictions using a sequence of scales, and captures many image details without any superpixels or low-level segmentation. We achieve state-of-the-art performance on benchmarks for all three tasks.

Citations (2,612)

Summary

  • The paper leverages a common multi-scale CNN architecture to integrate depth prediction, surface normal estimation, and semantic labeling.
  • It employs a coarse-to-fine approach across three scales to capture global context and fine details, achieving superior benchmark performance.
  • The model demonstrates significant improvements in accuracy and computational efficiency, enabling advances in robotics, augmented reality, and autonomous systems.

Predicting Depth, Surface Normals, and Semantic Labels with a Common Multi-Scale Convolutional Architecture

Introduction

The paper, authored by David Eigen and Rob Fergus, presents a unified approach to three fundamental tasks in computer vision: depth prediction, surface normal estimation, and semantic labeling. All three are handled by a single multiscale convolutional neural network (CNN) that regresses directly from the input image to the desired output map, requiring only minimal task-specific modifications to the underlying architecture. The method achieves state-of-the-art performance across multiple evaluation benchmarks, demonstrating its versatility.

Model Architecture

The authors propose a multi-scale deep network that starts with a coarse global output and progressively refines it using finer-scale local networks (a minimal code sketch follows the list below). The architecture is divided into three scales:

  1. Scale 1: Full-Image View - The initial scale captures features using a large, full-image field of view. Its output is coarse but spatially varying, and is upsampled to quarter-scale resolution.
  2. Scale 2: Predictions - The second scale produces mid-level resolution predictions by incorporating detailed, local image views combined with full-image features from Scale 1.
  3. Scale 3: Higher Resolution Refinement - The final scale refines the predictions to a higher resolution by integrating finer-level details from the input image.
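
The coarse-to-fine structure can be summarized in a short PyTorch-style sketch. This is a minimal illustration under stated assumptions, not the authors' exact configuration: the layer sizes, channel counts, and small convolution stacks are placeholders, and the paper initializes the coarse scale from pretrained AlexNet or VGG weights rather than the toy stack shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleNet(nn.Module):
    """Illustrative coarse-to-fine network; layer sizes are placeholders."""
    def __init__(self, out_channels=1):
        super().__init__()
        # Scale 1: large, full-image field of view (the paper uses a
        # pretrained AlexNet/VGG stack here; this toy stack is an assumption).
        self.scale1 = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        # Scale 2: local image features concatenated with upsampled
        # full-image features from Scale 1.
        self.scale2 = nn.Sequential(
            nn.Conv2d(3 + 64, 64, 5, padding=2), nn.ReLU(),
            nn.Conv2d(64, out_channels, 3, padding=1),
        )
        # Scale 3: refines Scale-2 predictions with finer image detail.
        self.scale3 = nn.Sequential(
            nn.Conv2d(3 + out_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_channels, 3, padding=1),
        )

    def forward(self, img):
        h, w = img.shape[-2:]
        # Coarse, spatially varying features from the full-image view,
        # upsampled to quarter-scale resolution.
        f1 = F.interpolate(self.scale1(img), size=(h // 4, w // 4),
                           mode='bilinear', align_corners=False)
        # Mid-resolution prediction at quarter scale.
        img4 = F.interpolate(img, size=(h // 4, w // 4),
                             mode='bilinear', align_corners=False)
        p2 = self.scale2(torch.cat([img4, f1], dim=1))
        # Higher-resolution refinement at half scale.
        img2 = F.interpolate(img, size=(h // 2, w // 2),
                             mode='bilinear', align_corners=False)
        p2_up = F.interpolate(p2, size=(h // 2, w // 2),
                              mode='bilinear', align_corners=False)
        return self.scale3(torch.cat([img2, p2_up], dim=1))
```

Each finer scale sees the raw image again alongside the upsampled output of the previous scale, which is how the network recovers local detail without superpixels or low-level segmentation.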

Tasks and Loss Functions

  • Depth Prediction: For depth prediction, the loss function compares the predicted and ground-truth log depth maps, incorporating both an ℓ2-norm penalty and image-gradient penalties to encourage structural accuracy.
  • Surface Normal Estimation: The task is structured as a regression problem for predicting the x, y, and z components of normals, and utilizes a dot-product-based loss to directly compare the predictions with the ground truth.
  • Semantic Labeling: Semantic labeling deploys a pixelwise softmax classifier with a cross-entropy loss function to predict class labels for each pixel; minimal sketches of all three losses follow the list. Augmented inputs such as depth and surface normals enhance performance on this task.
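
The three losses can be written compactly, as in the sketch below. It assumes batched tensors of log depth (B, 1, H, W), normals (B, 3, H, W), and class logits (B, C, H, W); the scale-invariant weight of 0.5 follows the paper's depth formulation, but the batch-level averaging is a simplification of its per-image sums.

```python
import torch
import torch.nn.functional as F

def depth_loss(pred_log_d, gt_log_d):
    """Scale-invariant log-depth error plus first-order gradient matching."""
    d = pred_log_d - gt_log_d
    grad_x = d[..., :, 1:] - d[..., :, :-1]   # horizontal log-depth gradients
    grad_y = d[..., 1:, :] - d[..., :-1, :]   # vertical log-depth gradients
    return (d.pow(2).mean()
            - 0.5 * d.mean().pow(2)           # scale-invariant term
            + grad_x.pow(2).mean() + grad_y.pow(2).mean())

def normal_loss(pred_n, gt_n):
    """Negative mean dot product between unit normals (higher dot = better)."""
    pred_n = F.normalize(pred_n, dim=1)
    gt_n = F.normalize(gt_n, dim=1)
    return -(pred_n * gt_n).sum(dim=1).mean()

def semantic_loss(logits, labels):
    """Pixelwise softmax cross-entropy over per-pixel class logits."""
    return F.cross_entropy(logits, labels)
```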

Performance Evaluation

  1. Depth Prediction: The method outperforms existing approaches, as demonstrated by superior results on NYU Depth v2 benchmarks. Metrics such as absolute relative difference and root mean squared error (computed as sketched after this list) show significant improvements, particularly when using a deeper model initialized with VGG network weights.
  2. Surface Normal Estimation: Remarkable improvement is recorded in the mean and median angular errors, surpassing prior methods. The use of multiscale architecture proves beneficial in preserving fine details and capturing global geometric features.
  3. Semantic Labeling: Evaluation on NYU Depth v2 and Sift Flow datasets shows that the proposed method achieves higher pixel accuracy and class accuracy compared to several earlier approaches. Notable improvements are seen even in 40-class segmentation tasks, illustrating the model's scalability.
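
For concreteness, the depth metrics referenced above can be computed as follows. This NumPy sketch reflects common evaluation practice on NYU Depth v2, not the authors' exact evaluation code; the validity mask and dictionary keys are assumptions of this illustration.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular-depth metrics over valid ground-truth pixels."""
    valid = gt > 0                              # mask out missing ground truth
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)   # absolute relative difference
    rmse = np.sqrt(np.mean((pred - gt) ** 2))   # root mean squared error
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)              # threshold accuracy (δ < 1.25)
    return {"abs_rel": abs_rel, "rmse": rmse, "delta<1.25": delta1}
```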

Contributions of Network Scales

Probe experiments underscore the importance of each scale in the network. For depth and surface normal tasks, the largest single contribution comes from the coarse scale which captures the global context. For semantic labeling, finer scales contribute significantly when augmented with inferred depth and normal inputs. This interplay between global and local feature extraction is critical for achieving high accuracy in various tasks.

Practical and Theoretical Implications

The presented architecture simplifies the integration of multiple vision tasks into a single framework, facilitating applications in robotics, augmented reality, and autonomous systems. Shared computation in predicting depth and normals ensures computational efficiency. Moreover, it paves the way towards generalized pixel-map regression models, minimizing the need for task-specific architectures.

Conclusion and Future Directions

The paper demonstrates that a common multiscale convolutional architecture can robustly handle diverse vision tasks. Future research could explore extending this architecture to additional modalities, optimizing computational efficiency further, and incorporating more complex spatial reasoning abilities into the network. The work also suggests potential for real-time applications, with qualitative results aligning closely with state-of-the-art methods.

In summary, the proposed approach effectively leverages deep learning techniques to integrate multiple vision tasks into a unified model, providing significant advancements in depth prediction, surface normal estimation, and semantic labeling. As the field of AI continues to evolve, such integrative models are likely to play a crucial role in developing more sophisticated, multi-functional systems.