Computing the Stereo Matching Cost with a Convolutional Neural Network (1409.4326v2)

Published 15 Sep 2014 in cs.CV, cs.LG, and cs.NE

Abstract: We present a method for extracting depth information from a rectified image pair. We train a convolutional neural network to predict how well two image patches match and use it to compute the stereo matching cost. The cost is refined by cross-based cost aggregation and semiglobal matching, followed by a left-right consistency check to eliminate errors in the occluded regions. Our stereo method achieves an error rate of 2.61 % on the KITTI stereo dataset and is currently (August 2014) the top performing method on this dataset.

Summary

The paper introduces a CNN-based approach that computes stereo matching costs using supervised learning on patch pairs.
It employs cross-based cost aggregation and semiglobal matching to refine disparity estimates, achieving a 2.61% error rate on KITTI.
The method set a new accuracy benchmark for stereo vision at the time of publication. With the network's forward pass alone taking roughly 95 seconds per image pair, real-time use in autonomous driving and 3D reconstruction is left as future work.

Computing the Stereo Matching Cost with a Convolutional Neural Network

The paper "Computing the Stereo Matching Cost with a Convolutional Neural Network" by Jure Žbontar and Yann LeCun presents a method for extracting depth information from a pair of rectified images using convolutional neural networks (CNNs). Depth estimation of this kind matters for applications such as autonomous driving and 3D reconstruction, where precise depth perception is crucial.

Abstract and Contributions

The authors propose and evaluate a CNN-based approach to computing the stereo matching cost. The network is trained on pairs of small image patches to predict how well they match. The resulting matching cost is then refined using cross-based cost aggregation and semiglobal matching, followed by a left-right consistency check to remove errors in occluded regions. The full pipeline achieves a 2.61% error rate on the KITTI stereo dataset, outperforming prior methods.

The primary contributions of this paper are:

Introducing a CNN to compute the stereo matching cost, effectively leveraging supervised learning.
Reducing the best reported error rate on the KITTI stereo dataset from 2.83% (the previous top method) to 2.61%.

Methodology

The stereo matching problem involves determining, for each pixel in the left image, its disparity with respect to the right image. Disparity is the horizontal shift of an object between the two images; given the camera's focal length and the baseline between the two cameras, it can be converted to depth, with larger disparities corresponding to closer objects. The authors follow the usual decomposition of a stereo algorithm into four stages (matching cost computation, cost aggregation, optimization, and disparity refinement) and concentrate on the first.
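The conversion from disparity to depth for a rectified pair follows the standard pinhole relation z = f · B / d, where f is the focal length in pixels, B the baseline, and d the disparity. A minimal sketch of that relation is given here; the function name and the numeric values are illustrative assumptions, not taken from the paper.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Depth in metres for a rectified stereo pair: z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px


# Illustrative values only (assumed, not from the paper).
depth = disparity_to_depth(disparity_px=40.0, focal_length_px=720.0, baseline_m=0.5)
print(depth)  # 9.0
```

Because depth is inversely proportional to disparity, a fixed disparity error costs far more depth accuracy for distant objects than for near ones.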

CNN for Matching Cost Computation

The CNN is trained on 9x9 grayscale image patches, producing a measure of similarity between two patches. Each training example consists of a pair of patches from the left and right images, labeled as a positive or negative match based on their known disparity. The network's architecture includes one convolutional layer followed by seven fully connected layers, culminating in a softmax output that classifies the match quality.
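The construction of labeled patch pairs can be sketched as follows. This is a simplified illustration, not the authors' code: the helper names, the offset ranges (`pos_max`, `neg_min`, `neg_max`), and the synthetic images are all assumptions made for the example. A positive example places the right patch at (or very near) the true disparity; a negative example shifts it away from that position.

```python
import random


def extract_patch(image, cx, cy, size=9):
    """Return a size x size patch centred at (cx, cy) as a list of rows."""
    half = size // 2
    return [row[cx - half:cx + half + 1] for row in image[cy - half:cy + half + 1]]


def make_training_pair(left, right, x, y, true_disp, positive, rng,
                       pos_max=1, neg_min=2, neg_max=6):
    """Build one (left_patch, right_patch, label) example.

    The offset ranges are hypothetical values chosen for this sketch.
    """
    if positive:
        offset = rng.randint(-pos_max, pos_max)
    else:
        offset = rng.choice([-1, 1]) * rng.randint(neg_min, neg_max)
    # A point at column x in the left image appears at x - d in the right image.
    right_x = x - true_disp + offset
    return extract_patch(left, x, y), extract_patch(right, right_x, y), int(positive)


# Synthetic pair: the right image is the left image shifted by a constant disparity.
rng = random.Random(0)
W, H, d = 40, 12, 5
left = [[rng.randint(0, 255) for _ in range(W)] for _ in range(H)]
right = [[left[y][(x + d) % W] for x in range(W)] for y in range(H)]

pos = make_training_pair(left, right, x=20, y=6, true_disp=d, positive=True,
                         rng=rng, pos_max=0)
neg = make_training_pair(left, right, x=20, y=6, true_disp=d, positive=False, rng=rng)
print(pos[0] == pos[1], pos[2], neg[2])  # True 1 0
```

With `pos_max=0` the positive pair is pixel-identical on this synthetic data; real image pairs differ in lighting and viewpoint, which is what the network has to learn to tolerate.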

Cross-Based Cost Aggregation

The computed matching costs are aggregated using cross-based cost aggregation, which adaptively selects local neighborhoods based on image intensity similarities, reducing errors at depth discontinuities. This aggregation iteratively refines the cost by averaging over dynamically selected support regions.
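A minimal sketch of the cross-based idea, for one disparity slice and a single image, is shown here. It is an illustration under stated assumptions rather than the paper's implementation: the function names, the intensity threshold `tau`, and the arm limit `max_len` are hypothetical, and combining support regions from both images is omitted. Each pixel grows arms in four directions while neighbouring intensities stay close to its own; its support region is the union of the horizontal arms of the pixels on its vertical arm.

```python
def arm_length(image, x, y, dx, dy, tau, max_len):
    """Steps from (x, y) along (dx, dy) that stay inside the image, within
    max_len, and within tau of the anchor pixel's intensity."""
    h, w = len(image), len(image[0])
    steps = 0
    while steps < max_len:
        nx, ny = x + (steps + 1) * dx, y + (steps + 1) * dy
        if not (0 <= nx < w and 0 <= ny < h):
            break
        if abs(image[ny][nx] - image[y][x]) >= tau:
            break
        steps += 1
    return steps


def support_region(image, x, y, tau=20, max_len=5):
    """Pixels in the cross-based support region of (x, y)."""
    up = arm_length(image, x, y, 0, -1, tau, max_len)
    down = arm_length(image, x, y, 0, 1, tau, max_len)
    region = []
    for qy in range(y - up, y + down + 1):
        left = arm_length(image, x, qy, -1, 0, tau, max_len)
        right = arm_length(image, x, qy, 1, 0, tau, max_len)
        region.extend((qx, qy) for qx in range(x - left, x + right + 1))
    return region


def aggregate_cost(image, cost, tau=20, max_len=5):
    """Average one disparity slice of the cost over each pixel's support region."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            region = support_region(image, x, y, tau, max_len)
            out[y][x] = sum(cost[qy][qx] for qx, qy in region) / len(region)
    return out


# Two flat regions separated by a strong edge: costs must not mix across it.
image = [[10, 10, 10, 200, 200, 200] for _ in range(4)]
cost = [[1.0, 1.0, 1.0, 5.0, 5.0, 5.0] for _ in range(4)]
print(aggregate_cost(image, cost)[0])  # [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
```

The example shows why this helps at depth discontinuities: the support region stops at the intensity edge, so costs from the two surfaces are never averaged together.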

Semiglobal Matching

Smoothness constraints on the disparity image are enforced using semiglobal matching, which optimizes the matching cost along multiple image directions (horizontal and vertical) and averages the results. This step mitigates streaking artifacts common in simpler dynamic programming approaches.
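The standard semiglobal matching recurrence along a single path can be sketched as below. This is a simplified illustration for one scanline: the function names are hypothetical, and the constant penalties `p1` (disparity change of one) and `p2` (larger jumps) are a simplification assumed here. Averaging a forward and a backward pass stands in for averaging over several directions.

```python
def sgm_path_costs(costs, p1, p2):
    """Aggregate costs along one direction; costs[x][d] is the cost of
    assigning disparity d to pixel x on the path."""
    n_disp = len(costs[0])
    aggregated = [list(costs[0])]
    for x in range(1, len(costs)):
        prev = aggregated[-1]
        prev_min = min(prev)
        row = []
        for d in range(n_disp):
            candidates = [prev[d], prev_min + p2]
            if d > 0:
                candidates.append(prev[d - 1] + p1)
            if d < n_disp - 1:
                candidates.append(prev[d + 1] + p1)
            # Subtracting prev_min keeps the values from growing along the path.
            row.append(costs[x][d] + min(candidates) - prev_min)
        aggregated.append(row)
    return aggregated


def sgm_scanline(costs, p1, p2):
    """Average the left-to-right and right-to-left path costs."""
    forward = sgm_path_costs(costs, p1, p2)
    backward = sgm_path_costs(costs[::-1], p1, p2)[::-1]
    return [[(f + b) / 2 for f, b in zip(fr, br)] for fr, br in zip(forward, backward)]


def argmin(row):
    return min(range(len(row)), key=row.__getitem__)


# The middle pixel's raw cost slightly prefers disparity 1; its neighbours prefer 0.
costs = [[0, 4, 4], [3, 2, 4], [0, 4, 4]]
raw = [argmin(r) for r in costs]
smoothed = [argmin(r) for r in sgm_scanline(costs, p1=2, p2=5)]
print(raw, smoothed)  # [0, 1, 0] [0, 0, 0]
```

The isolated disparity change at the middle pixel is removed because the penalty for deviating from its neighbours outweighs the small difference in raw matching cost.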

Post-Processing

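To refine the initial disparity map, the method applies a left-right consistency check, subpixel enhancement, and bilateral filtering. The consistency check flags mismatches, which occur mostly in occluded regions, while subpixel enhancement and filtering raise the precision of the final disparity map below one pixel and smooth it without blurring edges.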
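Two of these steps are easy to illustrate on a single scanline. The sketch below uses the common parabola fit for subpixel refinement and a tolerance-based left-right check; the function names and the tolerance `tol=1` are assumptions for this example, not a transcription of the authors' code.

```python
def subpixel_disparity(cost_at, d):
    """Refine integer disparity d by fitting a parabola through the costs
    at d - 1, d, and d + 1 (cost_at is indexed by disparity)."""
    if d <= 0 or d >= len(cost_at) - 1:
        return float(d)  # no neighbours on both sides
    c_minus, c, c_plus = cost_at[d - 1], cost_at[d], cost_at[d + 1]
    denom = 2 * (c_plus - 2 * c + c_minus)
    if denom == 0:
        return float(d)  # flat cost curve, nothing to refine
    return d - (c_plus - c_minus) / denom


def left_right_consistent(disp_left, disp_right, x, tol=1):
    """True if the left disparity at x agrees with the right disparity at
    the matching position x - d, up to tol pixels."""
    d = disp_left[x]
    xr = x - d
    if xr < 0 or xr >= len(disp_right):
        return False
    return abs(d - disp_right[xr]) <= tol


# Cost is lower on the d + 1 side, so the minimum moves slightly above 1.
print(subpixel_disparity([4, 1, 2], d=1))  # 1.25

disp_left = [0, 1, 1, 2, 4]
disp_right = [1, 1, 2, 2, 0]
print(left_right_consistent(disp_left, disp_right, x=3))  # True
print(left_right_consistent(disp_left, disp_right, x=4))  # False
```

Pixels that fail the check are typically treated as occlusions or mismatches and filled in from reliable neighbours.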

Results

The method was evaluated on the KITTI stereo dataset, which comprises image pairs captured from a moving car in real-world conditions. At the time of writing (August 2014), it achieved the lowest error rate among the methods on the benchmark.

| Rank | Method | Error Rate |
|------|--------|------------|
| 1 | MC-CNN (this paper) | 2.61% |
| 2 | SPS-StFl | 2.83% |
| 3 | VC-SF | 3.05% |
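An error rate of this kind is the percentage of pixels whose predicted disparity is off by more than a threshold. The sketch below assumes a threshold of three pixels, the customary default for this benchmark, and marks pixels without ground truth as `None`; both choices, along with the function name and sample values, are assumptions for illustration.

```python
def error_rate(predicted, ground_truth, threshold_px=3.0):
    """Percentage of valid pixels whose disparity error exceeds threshold_px.

    Entries of None in ground_truth mark pixels with no measurement.
    """
    valid = [(p, g) for p, g in zip(predicted, ground_truth) if g is not None]
    if not valid:
        raise ValueError("no valid ground-truth pixels")
    bad = sum(1 for p, g in valid if abs(p - g) > threshold_px)
    return 100.0 * bad / len(valid)


# Made-up values: errors of 0, 1, 10, and 4 pixels on the four valid pixels.
predicted = [10, 12, 30, 7, 5]
ground_truth = [10, 11, 20, None, 9]
print(error_rate(predicted, ground_truth))  # 50.0
```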

The runtime analysis revealed that the convolutional neural network's forward pass dominates the computational cost, taking approximately 95 seconds per image pair on an Nvidia GeForce GTX Titan GPU.

Implications and Future Work

The findings of this paper underscore the effectiveness of CNNs in stereo vision tasks, particularly for computing matching costs, a critical component of stereo methods. The strong numerical results indicate potential for further improvements as larger datasets and more sophisticated network architectures become available.

Future developments may focus on real-time implementations, reducing the computational overhead to make the method viable for applications such as robot navigation and autonomous driving. Additionally, integrating supervised learning for other components of the stereo method, such as cost aggregation and disparity refinement, could yield further gains in accuracy and robustness.

Conclusion

This paper demonstrates that a convolutional neural network can be used to compute the stereo matching cost, achieving state-of-the-art results on a challenging benchmark. Its main contribution is empirical: a learned matching cost, combined with established aggregation and optimization steps, sets a new performance standard for depth extraction from stereo images. Further research will likely explore scaling this approach to larger datasets and real-time scenarios.
