Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches (1510.05970v2)

Published 20 Oct 2015 in cs.CV, cs.LG, and cs.NE

Abstract: We present a method for extracting depth information from a rectified image pair. Our approach focuses on the first stage of many stereo algorithms: the matching cost computation. We approach the problem by learning a similarity measure on small image patches using a convolutional neural network. Training is carried out in a supervised manner by constructing a binary classification data set with examples of similar and dissimilar pairs of patches. We examine two network architectures for this task: one tuned for speed, the other for accuracy. The output of the convolutional neural network is used to initialize the stereo matching cost. A series of post-processing steps follow: cross-based cost aggregation, semiglobal matching, a left-right consistency check, subpixel enhancement, a median filter, and a bilateral filter. We evaluate our method on the KITTI 2012, KITTI 2015, and Middlebury stereo data sets and show that it outperforms other approaches on all three data sets.

Summary

The paper introduces a CNN-based approach that learns similarity measures on image patches, replacing traditional hand-crafted matching costs.
It presents both a fast architecture that processes an image pair in under a second and an accurate architecture that achieves lower error rates on the KITTI and Middlebury benchmarks.
The method integrates cost aggregation, semiglobal matching, and filtering techniques to refine disparity maps and enhance depth estimation accuracy.

The paper "Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches" by Jure Žbontar and Yann LeCun presents a method for extracting depth information from a rectified image pair. It focuses on the first stage of most stereo algorithms: computing the matching cost.

A typical stereo pipeline has four stages: computing the matching cost, aggregating the cost, optimizing it, and refining the resulting disparity map. Žbontar and LeCun replace the hand-crafted matching cost of the first stage with a similarity measure on small image patches that is learned by a convolutional neural network (CNN).
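To make the first stage concrete, here is a minimal sketch of a hand-crafted matching cost of the kind the learned measure replaces. It uses the absolute intensity difference on a single scanline and then picks the cheapest disparity per pixel. The function names and the toy data are assumptions for illustration and do not come from the paper.

```python
def absolute_difference_costs(left_row, right_row, max_disparity):
    """cost[x][d] = |left[x] - right[x - d]|, or infinity if x - d is out of range."""
    costs = []
    for x, left_value in enumerate(left_row):
        row = []
        for d in range(max_disparity + 1):
            if x - d >= 0:
                row.append(abs(left_value - right_row[x - d]))
            else:
                row.append(float("inf"))
        costs.append(row)
    return costs


def winner_take_all(costs):
    """For every pixel, pick the disparity with the lowest matching cost."""
    return [min(range(len(row)), key=row.__getitem__) for row in costs]


# Toy scanline: each left pixel x appears in the right row at position x - 2.
left_row = [10, 20, 30, 40, 50, 60, 70, 80]
right_row = [30, 40, 50, 60, 70, 80, 90, 100]

costs = absolute_difference_costs(left_row, right_row, max_disparity=3)
disparities = winner_take_all(costs)
print(disparities)  # [0, 1, 2, 2, 2, 2, 2, 2]
```

The first two pixels cannot reach the true disparity of 2 because the matching position would fall outside the right row. Later pipeline stages exist to handle border and occlusion cases like these.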

Methods

The CNN is trained in a supervised manner on a binary classification data set of image patch pairs labeled as similar or dissimilar. The authors examine two network architectures: one tuned for speed (the fast architecture) and one tuned for accuracy (the accurate architecture). Both use convolutional layers to extract a feature vector from each patch and then compare the two vectors. They differ in how the comparison is made: the fast architecture applies a fixed cosine similarity, while the accurate architecture uses fully connected layers to learn the similarity score.
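The comparison step of the fast architecture can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the feature vectors are made-up stand-ins for CNN outputs, and the cost is assumed to be the negated similarity so that better matches get a lower cost.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero feature vector carries no information
    return dot / (norm_u * norm_v)


def matching_cost(left_features, right_features):
    """Assumed convention: higher similarity means lower cost, so negate the score."""
    return -cosine_similarity(left_features, right_features)


# Hypothetical feature vectors standing in for the CNN outputs of three patches.
left = [0.2, 0.9, 0.1]
right_match = [0.25, 0.85, 0.05]
right_other = [0.9, 0.1, 0.4]

cost_match = matching_cost(left, right_match)
cost_other = matching_cost(left, right_other)
```

Because the comparison is a fixed function, the feature vector of each patch only needs to be computed once and can then be compared across all candidate disparities. A learned comparison has to run its fully connected layers for every candidate pair instead.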

The fast architecture takes about 0.8 seconds to process an image pair, which makes it the practical choice when runtime matters. The accurate architecture is far slower, at 67 seconds per image pair, but delivers lower error rates.

The computed matching cost from the CNN undergoes several post-processing steps:

Cross-based cost aggregation: Combines matching costs of neighboring pixels with similar intensities to refine the cost.
Semiglobal matching: Enforces smoothness constraints and reduces sensitivity to noise.
Left-right consistency check: Identifies and handles occlusions.
Subpixel enhancement: Refines integer disparities to subpixel precision by fitting a quadratic curve through the matching costs at neighboring disparities.
Median and bilateral filters: Smooth the disparity map to reduce noise, with the bilateral filter preserving edges.
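Two of the steps above are compact enough to sketch on a single scanline. The first function locates the vertex of a parabola fitted through the costs at disparities d - 1, d, and d + 1. The second flags left-image pixels whose disparity disagrees with the right-image disparity map. The function names, the tolerance of one pixel, and the toy data are assumptions for illustration; the paper's handling of inconsistent pixels is more detailed than this pass/fail check.

```python
def subpixel_disparity(cost_minus, cost_center, cost_plus, d):
    """Vertex of the parabola through the costs at d - 1, d and d + 1."""
    denominator = 2.0 * (cost_plus - 2.0 * cost_center + cost_minus)
    if denominator <= 0.0:
        return float(d)  # flat or non-convex fit: keep the integer disparity
    return d - (cost_plus - cost_minus) / denominator


def left_right_consistent(left_disp, right_disp, tolerance=1):
    """Flag each left pixel whose disparity agrees with the right disparity map."""
    flags = []
    for x, d in enumerate(left_disp):
        x_right = x - d  # position of the matching pixel in the right image
        in_range = 0 <= x_right < len(right_disp)
        flags.append(in_range and abs(d - right_disp[x_right]) <= tolerance)
    return flags


# Costs sampled from (d - 4.3) ** 2 at d = 3, 4, 5: the true minimum is at 4.3.
refined = subpixel_disparity(1.69, 0.09, 0.49, d=4)

# Toy integer disparity maps for one scanline of the left and right images.
flags = left_right_consistent([0, 1, 2, 2, 2], [2, 2, 2, 0, 0])
```

Here `refined` comes out at approximately 4.3, and only the first pixel fails the consistency check, since its disparity of 0 disagrees with the right-image disparity of 2 at the same position.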

Results

The proposed method outperforms the approaches it is compared against on the KITTI 2012, KITTI 2015, and Middlebury data sets, with the lowest reported error rates at the time of publication. On KITTI 2012, the accurate architecture achieves a 2.43% error rate and the fast architecture 2.82%. On KITTI 2015, the corresponding error rates are 3.89% and 4.62%. On Middlebury, the accurate architecture achieves an 8.29% error rate.
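This summary does not say how the error rate is measured. Stereo benchmarks commonly report the percentage of pixels whose predicted disparity differs from the ground truth by more than a fixed threshold, and the sketch below computes such a rate. The threshold of 3 pixels, the use of `None` for pixels without ground truth, and the sample values are assumptions for illustration, not figures from the paper.

```python
def bad_pixel_rate(predicted, ground_truth, threshold=3.0):
    """Percentage of valid pixels whose disparity error exceeds the threshold."""
    # Assumption: pixels without ground truth are marked None and are skipped.
    valid = [(p, g) for p, g in zip(predicted, ground_truth) if g is not None]
    if not valid:
        return 0.0
    bad = sum(1 for p, g in valid if abs(p - g) > threshold)
    return 100.0 * bad / len(valid)


# Made-up disparities: one of the three valid pixels is off by more than 3.
predicted = [10.0, 12.5, 30.0, 7.0]
ground_truth = [10.5, 20.0, None, 7.0]

rate = bad_pixel_rate(predicted, ground_truth)
```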

Implications and Future Directions

This paper's findings have significant implications for the development of stereo algorithms:

Practical Implications: The fast architecture's sub-second runtime makes it the better fit for latency-sensitive applications such as autonomous driving and robotics, where depth must be estimated both quickly and accurately.
Theoretical Implications: By demonstrating the efficacy of learning-based approaches over traditional hand-crafted methods, this work sets a precedent for adopting similar CNN-based techniques in related computer vision tasks.
Future Developments: The promising results suggest that further advancements can be achieved by increasing the training dataset size, refining network architectures, or incorporating additional contextual information.

Using a convolutional neural network to compute the matching cost shows that a learned similarity measure can improve the accuracy of stereo vision. The results invite follow-up research with more sophisticated network architectures and larger data sets.

GitHub

GitHub - jzbontar/mc-cnn: Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches (719 stars)