Cascade Residual Learning: A Two-stage Convolutional Neural Network for Stereo Matching (1708.09204v2)

Published 30 Aug 2017 in cs.CV

Abstract: Leveraging on the recent developments in convolutional neural networks (CNNs), matching dense correspondence from a stereo pair has been cast as a learning problem, with performance exceeding traditional approaches. However, it remains challenging to generate high-quality disparities for the inherently ill-posed regions. To tackle this problem, we propose a novel cascade CNN architecture composing of two stages. The first stage advances the recently proposed DispNet by equipping it with extra up-convolution modules, leading to disparity images with more details. The second stage explicitly rectifies the disparity initialized by the first stage; it couples with the first-stage and generates residual signals across multiple scales. The summation of the outputs from the two stages gives the final disparity. As opposed to directly learning the disparity at the second stage, we show that residual learning provides more effective refinement. Moreover, it also benefits the training of the overall cascade network. Experimentation shows that our cascade residual learning scheme provides state-of-the-art performance for matching stereo correspondence. By the time of the submission of this paper, our method ranks first in the KITTI 2015 stereo benchmark, surpassing the prior works by a noteworthy margin.



Summary

The paper introduces a two-stage CNN that first generates full-resolution disparity maps and then refines them via residual learning.
It employs multiscale residual signals to correct disparities in challenging regions like occlusions and textureless areas.
The approach outperforms previous methods on benchmarks such as KITTI 2015, demonstrating its potential for improved depth perception.

Cascade Residual Learning Framework for Stereo Matching

The paper "Cascade Residual Learning: A Two-stage Convolutional Neural Network for Stereo Matching" addresses the challenge of generating high-quality disparities from stereo image pairs, particularly in ill-posed regions. It introduces a two-stage cascade convolutional neural network (CNN) architecture designed specifically for stereo matching.

Overview

Stereo matching, a critical task in computer vision, estimates depth by finding corresponding pixels in a stereo image pair; the offset between matched pixels is the disparity. CNN-based methods have overtaken traditional approaches, but producing high-quality disparities remains difficult in inherently ill-posed regions such as occlusions and textureless areas. The authors propose a Cascade Residual Learning (CRL) framework to improve the accuracy of disparity maps in these regions.
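To make the link between disparity and depth concrete, the following is a minimal sketch of the standard pinhole relation for a rectified stereo pair, depth = focal length × baseline / disparity. It is not taken from the paper; the function name `disparity_to_depth` and the parameters `focal_px` and `baseline_m` are illustrative assumptions.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert a disparity map (pixels) to depth (metres) for a rectified pair.

    Assumes the standard relation depth = focal_px * baseline_m / disparity.
    Pixels with non-positive disparity are treated as infinitely far away.
    """
    disparity = np.asarray(disparity, dtype=float)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Hypothetical camera: 700 px focal length, 0.5 m baseline.
depth = disparity_to_depth([[10.0, 0.0]], focal_px=700.0, baseline_m=0.5)
```

Because depth is inversely proportional to disparity, a fixed disparity error costs far more depth accuracy for distant points than for near ones, which is one reason sub-pixel refinement matters.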

Methodology

The proposed CRL framework is composed of two distinct stages:

First Stage (DispFulNet): This stage builds upon the DispNet architecture by integrating additional up-convolution modules, producing full-resolution disparity maps with enhanced details. The network's structure ensures a fine-grained initial disparity estimation, setting a solid foundation for subsequent rectifications.
Second Stage (DispResNet): Instead of directly learning the disparity, this stage focuses on residual learning across multiple scales. It refines the disparity map generated by the first stage using multiscale residual signals, which are easier to learn as they encompass only the necessary corrections. This process not only improves disparity accuracy but also simplifies network training, reducing the risk of overfitting.
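The way the two stages combine can be sketched as follows. This is only an illustration of the summation step under stated assumptions, not the authors' implementation: the residual maps are passed in as plain arrays (in CRL they would come from DispResNet), scale s is assumed to be a factor of 2**s smaller than full resolution, average pooling is an assumed choice of downsampling, and the helper names `downsample` and `refine` are hypothetical.

```python
import numpy as np

def downsample(disp, factor):
    """Average-pool a 2-D map by an integer factor (an assumed, simple choice)."""
    h, w = disp.shape
    h2, w2 = h // factor, w // factor
    cropped = disp[:h2 * factor, :w2 * factor]
    return cropped.reshape(h2, factor, w2, factor).mean(axis=(1, 3))

def refine(initial_disp, residuals):
    """Add a residual to the first-stage disparity at every scale.

    residuals[s] is the residual map at scale s, assumed to be 2**s times
    smaller than the full-resolution first-stage output initial_disp.
    Returns one refined disparity map per scale.
    """
    refined = []
    for s, res in enumerate(residuals):
        d1_s = downsample(initial_disp, 2 ** s)
        if d1_s.shape != res.shape:
            raise ValueError(f"scale {s}: expected {d1_s.shape}, got {res.shape}")
        refined.append(d1_s + res)  # second stage corrects, not replaces
    return refined

# Toy example: a flat first-stage estimate and small hand-made corrections.
initial = np.full((4, 4), 10.0)
residuals = [np.full((4, 4), 0.5), np.full((2, 2), -1.0)]
refined = refine(initial, residuals)
```

The full-resolution entry, `refined[0]`, plays the role of the final disparity. If the first stage is already accurate, the ideal residual is close to zero, which is the intuition behind the claim that residuals are easier to learn than the disparity itself.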

Experimental Results

The authors report state-of-the-art results on datasets including FlyingThings3D and KITTI 2015. At the time of the paper's submission, CRL ranked first on the KITTI 2015 stereo benchmark, ahead of prior work by what the authors describe as a noteworthy margin. The residual learning strategy proved more effective at refinement than learning the disparity directly in the second stage, particularly in complex image regions.
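This summary does not name the evaluation metrics, so the following is only a sketch of two measures commonly used for stereo benchmarks: the mean end-point error, and a KITTI 2015 style outlier rate in which a pixel counts as wrong when its error exceeds both 3 pixels and 5% of the true disparity. The function names and the `valid` mask argument are illustrative assumptions.

```python
import numpy as np

def end_point_error(pred, gt, valid=None):
    """Mean absolute disparity error over valid pixels."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    if valid is None:
        valid = np.ones(gt.shape, dtype=bool)
    return np.abs(pred - gt)[valid].mean()

def outlier_rate(pred, gt, valid=None, abs_thresh=3.0, rel_thresh=0.05):
    """Fraction of valid pixels whose error exceeds both thresholds."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    if valid is None:
        valid = np.ones(gt.shape, dtype=bool)
    err = np.abs(pred - gt)
    bad = (err > abs_thresh) & (err > rel_thresh * np.abs(gt))
    return bad[valid].mean()

# Toy data: errors of 4, 4, 0 and 0.5 pixels.
gt = np.array([[10.0, 100.0, 50.0, 20.0]])
pred = np.array([[14.0, 104.0, 50.0, 20.5]])
epe = end_point_error(pred, gt)
rate = outlier_rate(pred, gt)
```

In the toy data, the 4-pixel error at true disparity 100 is not an outlier because it is below 5% of the true value, while the same error at true disparity 10 is.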

Implications and Future Directions

The CRL framework's success indicates the potential of multistage architectures and residual learning in improving depth perception accuracy in computer vision systems. The findings suggest that similar principles could be applied to other vision tasks, such as optical flow estimation and monocular depth prediction.

Future research could explore the application of unsupervised or semi-supervised learning paradigms to reduce dependency on extensive labeled datasets. Additionally, integrating more robust mechanisms, like left-right consistency checks, might further enhance network reliability and performance in varied environmental conditions.
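A left-right consistency check of the kind mentioned above can be sketched as follows. This is a generic post-processing formulation, not something described in the paper: it assumes two disparity maps, one per view, with a left pixel at column x matching the right pixel at column x - d, and the function name and `max_diff` threshold are hypothetical.

```python
import numpy as np

def left_right_consistent(disp_left, disp_right, max_diff=1.0):
    """Return a boolean mask of left-view pixels that pass the check.

    A left pixel (y, x) with disparity d is looked up at (y, x - d) in the
    right disparity map; it passes if the two disparities agree to within
    max_diff and the lookup falls inside the image.
    """
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    ys = np.arange(h)[:, None].repeat(w, axis=1)
    x_right = np.rint(xs - disp_left).astype(int)  # nearest-pixel lookup
    in_bounds = (x_right >= 0) & (x_right < w)
    x_clamped = np.clip(x_right, 0, w - 1)         # safe index; masked below
    diff = np.abs(disp_left - disp_right[ys, x_clamped])
    return in_bounds & (diff <= max_diff)

# Toy example: constant disparity of 2, with one inconsistent right pixel.
disp_left = np.full((2, 5), 2.0)
disp_right = np.full((2, 5), 2.0)
disp_right[0, 1] = 5.0
mask = left_right_consistent(disp_left, disp_right)
```

Pixels that fail the check are typically occluded or mismatched, so such a mask could be used to flag unreliable disparities or to down-weight them during training.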

In conclusion, the Cascade Residual Learning framework offers a promising avenue for advancing stereo matching technology, highlighting the synergy of multistage processing and residual learning in handling complex visual tasks.
