End-to-End Learning of Geometry and Context for Deep Stereo Regression (1703.04309v1)

Published 13 Mar 2017 in cs.CV and cs.NE

Abstract: We propose a novel deep learning architecture for regressing disparity from a rectified pair of stereo images. We leverage knowledge of the problem's geometry to form a cost volume using deep feature representations. We learn to incorporate contextual information using 3-D convolutions over this volume. Disparity values are regressed from the cost volume using a proposed differentiable soft argmin operation, which allows us to train our method end-to-end to sub-pixel accuracy without any additional post-processing or regularization. We evaluate our method on the Scene Flow and KITTI datasets and on KITTI we set a new state-of-the-art benchmark, while being significantly faster than competing approaches.

Citations (1,256)

Summary

The paper proposes GC-Net, an end-to-end architecture that combines cost volume formation with 3-D convolutional regularization for improved disparity estimation.
The method introduces a soft argmin operation to achieve sub-pixel accuracy, significantly reducing mean disparity error on benchmark datasets.
GC-Net integrates geometric cues with contextual learning to enhance stereo matching robustness, benefiting applications such as autonomous driving and 3D reconstruction.

Overview of End-to-End Learning of Geometry and Context for Deep Stereo Regression

The paper "End-to-End Learning of Geometry and Context for Deep Stereo Regression" by Alex Kendall and colleagues presents a deep learning approach to stereo vision. The authors propose GC-Net (Geometry and Context Network), an end-to-end architecture for regressing disparity from rectified stereo image pairs. The central problem in stereo vision is finding corresponding pixels in the two images so that depth can be computed from their disparity. Correspondence is hard to establish in textureless areas, on reflective surfaces, around thin structures, and in repetitive patterns.

Key Contributions

Cost Volume Formation

GC-Net exploits the geometry of the stereo problem by forming a cost volume from deep feature representations. For each disparity level, the unary feature of each left-image pixel is concatenated with the right-image feature at the corresponding horizontal offset, which yields a 4D volume (height × width × disparity × feature size). Because the features are concatenated rather than collapsed into a single matching score, the feature dimension is preserved, and later layers can learn semantic context from it.
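As an illustration of the concatenation described above, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `build_cost_volume`, the `[D, H, W, 2F]` axis order, and the choice to leave columns without a valid match as zeros are all assumptions made for this example.

```python
import numpy as np

def build_cost_volume(left_feat, right_feat, max_disp):
    """Concatenate left/right unary features at every disparity level.

    left_feat, right_feat: arrays of shape [H, W, F] (hypothetical layout).
    Returns an array of shape [max_disp + 1, H, W, 2 * F].
    """
    h, w, f = left_feat.shape
    assert right_feat.shape == (h, w, f)
    assert 0 <= max_disp < w
    volume = np.zeros((max_disp + 1, h, w, 2 * f), dtype=left_feat.dtype)
    for d in range(max_disp + 1):
        # In a rectified pair, left column x is compared with right column x - d.
        # Columns x < d have no counterpart and stay zero (an assumption here).
        volume[d, :, d:, :f] = left_feat[:, d:, :]
        volume[d, :, d:, f:] = right_feat[:, : w - d, :]
    return volume

# Tiny example: 2 x 5 feature maps with 3 channels, disparities 0..2.
rng = np.random.default_rng(0)
left = rng.normal(size=(2, 5, 3))
right = rng.normal(size=(2, 5, 3))
volume = build_cost_volume(left, right, max_disp=2)
```

The volume keeps all `2 * F` channels per cell, which is what lets the later 3-D layers learn their own matching function instead of relying on a fixed distance metric.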

3-D Convolutions for Contextual Learning

To incorporate contextual information, GC-Net utilizes 3-D convolutions over the cost volume. The 3-D convolutions allow the network to learn regularization functions that refine disparity estimates by incorporating context from height, width, and disparity dimensions. This is a significant improvement over prior methods that primarily relied on local feature matching.
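To make the idea of filtering jointly over disparity, height, and width concrete, the following is a naive single-layer 3-D convolution in NumPy. It is only a sketch: the function name `conv3d_same`, the `[D, H, W, C]` layout, and the zero padding are assumptions for illustration, and it omits the strides, nonlinearities, and encoder-decoder structure that a real network would use.

```python
import numpy as np

def conv3d_same(volume, kernel):
    """Naive 3-D convolution (cross-correlation, as in deep learning) with zero padding.

    volume: [D, H, W, C_in]; kernel: [kd, kh, kw, C_in, C_out] with odd kd, kh, kw.
    Returns [D, H, W, C_out].
    """
    kd, kh, kw, c_in, c_out = kernel.shape
    d, h, w, c = volume.shape
    assert c == c_in
    padded = np.pad(
        volume,
        ((kd // 2, kd // 2), (kh // 2, kh // 2), (kw // 2, kw // 2), (0, 0)),
    )
    out = np.zeros((d, h, w, c_out), dtype=np.result_type(volume, kernel))
    for i in range(kd):
        for j in range(kh):
            for k in range(kw):
                # Shifted view of the volume, mixed across channels by one kernel tap.
                window = padded[i:i + d, j:j + h, k:k + w, :]
                out += window @ kernel[i, j, k]
    return out

# Example: an all-ones volume and an all-ones 3x3x3 kernel count the cells in each window.
vol = np.ones((4, 5, 6, 2))
ker = np.ones((3, 3, 3, 2, 1))
filtered = conv3d_same(vol, ker)
```

Each output cell depends on its neighbours along all three axes, so a cost at one disparity can be adjusted using evidence from nearby pixels and nearby disparities.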

Soft Argmin for Sub-Pixel Accuracy

The paper introduces a differentiable soft argmin operation to regress disparity values. The operation negates the matching costs and applies a softmax along the disparity dimension to obtain a probability for each disparity, then computes the output as the probability-weighted sum of the disparity indices. The result is a smooth, sub-pixel disparity estimate, whereas a standard argmin is discrete and non-differentiable and so cannot be trained through with backpropagation.
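The soft argmin can be written in a few lines. The sketch below is a NumPy illustration rather than the authors' code; the function name `soft_argmin` and the convention that disparity is the leading axis are assumptions.

```python
import numpy as np

def soft_argmin(cost, axis=0):
    """Expected disparity under softmax(-cost) taken along `axis`."""
    logits = -np.asarray(cost, dtype=float)
    # Subtract the maximum for numerical stability; softmax is unchanged by this shift.
    logits = logits - logits.max(axis=axis, keepdims=True)
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=axis, keepdims=True)
    # Disparity indices 0..D-1, shaped to broadcast along `axis`.
    shape = [1] * prob.ndim
    shape[axis] = -1
    indices = np.arange(prob.shape[axis], dtype=float).reshape(shape)
    return (prob * indices).sum(axis=axis)

# One pixel whose two lowest costs tie at disparities 1 and 2.
tied = soft_argmin(np.array([10.0, 0.0, 0.0, 10.0]))
# One pixel with a single clear minimum at disparity 2.
peaked = soft_argmin(np.array([20.0, 20.0, 0.0, 20.0]))
```

With tied costs the estimate lands halfway between the two candidates, which a hard argmin cannot express. The flip side is that a multi-modal cost distribution pulls the estimate toward a weighted average of its modes, so the output is most meaningful when the regularized costs are sharply peaked.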

Methodology

The architecture is divided into several stages:

Unary Feature Extraction: The model uses 2-D convolutions to learn deep feature representations from raw input images. Both stereo images are passed through shared convolutional layers.
Cost Volume Construction: Unary features from the left and right images are concatenated across disparity levels to form a 4D cost volume.
3-D Convolutional Regularization: These newly formed cost volumes are then processed by a hierarchy of 3-D convolutional layers to learn context-aware regularization. The network's encoder-decoder structure allows for a reduced computational burden while retaining a large field of view.
Soft Argmin Operation: The final regularized cost volume undergoes a soft argmin operation to regress disparity values directly.

Evaluation

GC-Net was evaluated on the synthetic Scene Flow dataset and on the KITTI benchmarks. On KITTI 2012 and 2015 it set a new state of the art at the time of publication, outperforming previous methods both in mean disparity error and in the percentage of pixels with large error.

Key Numerical Results:

KITTI 2012: GC-Net achieved a mean absolute error of 0.6 px.
KITTI 2015: GC-Net achieved a D1-all error of 2.87%, lower than the methods it was compared against.
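For readers unfamiliar with these metrics, the sketch below shows how a mean absolute error and a D1-style outlier rate could be computed. It follows the usual KITTI 2015 convention as we understand it (a pixel counts as an outlier when its error exceeds both 3 px and 5% of the true disparity); the function names and the `valid` mask argument are assumptions for this example, not the official evaluation code.

```python
import numpy as np

def mean_abs_error(pred, gt, valid):
    """Mean absolute disparity error in pixels over the valid ground-truth pixels."""
    return float(np.abs(pred - gt)[valid].mean())

def d1_outlier_rate(pred, gt, valid, abs_thresh=3.0, rel_thresh=0.05):
    """Fraction of valid pixels whose error exceeds both thresholds."""
    err = np.abs(pred - gt)
    bad = (err > abs_thresh) & (err > rel_thresh * np.abs(gt))
    return float(bad[valid].mean())

# Four pixels with errors of 0.5, 4, 0 and 10 px.
gt = np.array([10.0, 100.0, 50.0, 20.0])
pred = np.array([10.5, 104.0, 50.0, 30.0])
valid = np.ones_like(gt, dtype=bool)
mae = mean_abs_error(pred, gt, valid)
d1 = d1_outlier_rate(pred, gt, valid)
```

In this toy case the 4 px error on a true disparity of 100 px is not an outlier, because it is below 5% of the ground truth, so only the last pixel counts.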

Implications and Future Directions

This research is relevant to computer vision applications such as autonomous driving, UAV navigation, and 3D scene reconstruction. The authors report that the method improves accuracy while being significantly faster than competing approaches. Future research could explore integrating more explicit semantic representations and using Bayesian techniques to handle uncertainty in stereo vision tasks.

In summary, GC-Net stands out due to its novel integration of geometric knowledge and deep learning for stereo regression, providing a significant step forward in the accuracy and efficiency of disparity estimation. The use of 3-D convolutions for contextually informed regularization and the innovative soft argmin operation for sub-pixel accuracy are key contributions that have set new benchmarks in stereo vision research.
