GA-Net: Guided Aggregation Net for End-to-end Stereo Matching (1904.06587v1)

Published 13 Apr 2019 in cs.CV

Abstract: In the stereo matching task, matching cost aggregation is crucial in both traditional methods and deep neural network models in order to accurately estimate disparities. We propose two novel neural net layers, aimed at capturing local and the whole-image cost dependencies respectively. The first is a semi-global aggregation layer which is a differentiable approximation of the semi-global matching, the second is the local guided aggregation layer which follows a traditional cost filtering strategy to refine thin structures. These two layers can be used to replace the widely used 3D convolutional layer which is computationally costly and memory-consuming as it has cubic computational/memory complexity. In the experiments, we show that nets with a two-layer guided aggregation block easily outperform the state-of-the-art GC-Net which has nineteen 3D convolutional layers. We also train a deep guided aggregation network (GA-Net) which gets better accuracies than state-of-the-art methods on both Scene Flow dataset and KITTI benchmarks.

Summary

The paper introduces GA-Net, built around two guided aggregation layers that replace expensive 3D convolutions in stereo matching.
It employs a Semi-Global Aggregation layer for whole-image cost aggregation and a Local Guided Aggregation layer to refine thin structures and edges.
The deep model outperforms state-of-the-art methods on the Scene Flow dataset and the KITTI benchmarks, and a real-time variant runs at 15-20 fps.

Overview of GA-Net: Guided Aggregation Net for End-to-End Stereo Matching

The paper "GA-Net: Guided Aggregation Net for End-to-End Stereo Matching" by Zhang et al. addresses stereo matching, a core computer vision task with applications such as autonomous driving. The authors propose two neural network layers that improve disparity estimation by capturing local and whole-image cost dependencies at a much lower cost than 3D convolutions.

Problem Context

Stereo matching estimates 3D geometry by computing the disparity between corresponding pixels in a rectified stereo image pair. Long-standing challenges include occlusions, textureless areas, and reflective surfaces. Deep stereo pipelines typically consist of feature extraction, matching cost aggregation, and disparity prediction, and the aggregation stage often relies on computationally intensive 3D convolutions.
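As a rough illustration of the matching step, the sketch below builds a brute-force cost volume from absolute intensity differences, picks the lowest-cost disparity per pixel, and converts disparity to depth with the standard rectified-stereo relation depth = focal length × baseline / disparity. The function names, the absolute-difference cost, and the parameters `focal_px` and `baseline_m` are assumptions made for this example; none of it is taken from the paper, which learns its features and aggregation end to end.

```python
import numpy as np


def cost_volume(left, right, max_disp):
    """Absolute-difference matching cost per pixel and candidate disparity.

    left, right: rectified grayscale images of shape (H, W).
    Returns an array of shape (max_disp, H, W). Positions with no valid
    match (x - d < 0) keep a large placeholder cost.
    """
    h, w = left.shape
    big = np.abs(left).max() + np.abs(right).max() + 1.0
    volume = np.full((max_disp, h, w), big, dtype=np.float64)
    for d in range(max_disp):
        # Left pixel (y, x) is compared with right pixel (y, x - d).
        volume[d, :, d:] = np.abs(left[:, d:] - right[:, : w - d])
    return volume


def winner_take_all(volume):
    """Pick, for every pixel, the disparity with the lowest cost."""
    return volume.argmin(axis=0)


def disparity_to_depth(disparity, focal_px, baseline_m):
    """Depth in metres from disparity in pixels; zero disparity maps to inf."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```

Without any aggregation, a per-pixel minimum like this is unreliable in textureless or occluded regions, which is the gap that cost aggregation is meant to close.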

Proposed Solution

The authors present GA-Net, which includes two novel layers that replace costly 3D convolutional layers:

Semi-Global Aggregation (SGA) Layer: This layer offers a differentiable approximation of the semi-global matching (SGM) technique, enabling effective cost aggregation over the entire image in multiple directions. It aims to improve accuracy in occluded and textureless areas.
Local Guided Aggregation (LGA) Layer: Aimed at preserving thin structures and sharpening detail around edges, this layer refines the aggregated matching costs using a traditional local cost filtering strategy.
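The semi-global idea can be sketched for a single left-to-right path. The first function is the classic SGM recurrence with penalties `p1` and `p2`. The second is a hedged guess at what a differentiable counterpart might look like: the hard minimum is replaced by a normalized weighted sum, and the volume is treated as a matching score where higher is better. Both function names are hypothetical, and the five fixed `weights` are placeholders for guidance weights that a network would predict per pixel and direction; this is not the authors' implementation.

```python
import numpy as np


def sgm_path_left_to_right(cost, p1, p2):
    """Classic SGM aggregation along one horizontal path.

    cost: matching costs of shape (D, H, W), lower is better.
    p1, p2: penalties for a disparity change of one level / more than one.
    """
    d, h, w = cost.shape
    agg = np.empty_like(cost, dtype=np.float64)
    agg[:, :, 0] = cost[:, :, 0]
    pad = np.full((1, h), np.inf)
    for x in range(1, w):
        prev = agg[:, :, x - 1]                      # shape (D, H)
        prev_min = prev.min(axis=0, keepdims=True)   # shape (1, H)
        # Previous costs at d - 1 and d + 1, with +inf outside the range.
        minus = np.concatenate([pad, prev[:-1]], axis=0)
        plus = np.concatenate([prev[1:], pad], axis=0)
        jump = np.broadcast_to(prev_min + p2, prev.shape)
        best = np.minimum.reduce([prev, minus + p1, plus + p1, jump])
        # Subtracting prev_min keeps the values from growing along the path.
        agg[:, :, x] = cost[:, :, x] + best - prev_min
    return agg


def soft_path_left_to_right(score, weights):
    """Weighted-sum variant (assumed form): differentiable in score and weights.

    score: matching scores of shape (D, H, W), higher is better.
    weights: five non-negative numbers, normalized here to sum to 1.
    """
    wts = np.asarray(weights, dtype=np.float64)
    wts = wts / wts.sum()
    d, h, w = score.shape
    agg = np.empty_like(score, dtype=np.float64)
    agg[:, :, 0] = score[:, :, 0]
    for x in range(1, w):
        prev = agg[:, :, x - 1]
        # Neighbouring disparities, replicating the border values.
        minus = np.concatenate([prev[:1], prev[:-1]], axis=0)
        plus = np.concatenate([prev[1:], prev[-1:]], axis=0)
        prev_max = prev.max(axis=0, keepdims=True)
        agg[:, :, x] = (wts[0] * score[:, :, x] + wts[1] * prev
                        + wts[2] * minus + wts[3] * plus + wts[4] * prev_max)
    return agg
```

A full aggregation would run such a pass in several directions and combine the results. Because the weights in the soft variant sum to 1, aggregated values stay within the range of the inputs, which is one plausible reason to normalize them.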

Key Findings and Results

The proposed GA-Net demonstrates superior performance over existing state-of-the-art methods like GC-Net and PSMNet. Specifically:

With only two GA layers and two 3D convolutional layers, GA-Net significantly outperforms GC-Net, which uses nineteen 3D convolutional layers.
GA-Net achieves better accuracy than prior methods on the Scene Flow dataset and the KITTI benchmarks while reducing computation: each GA layer requires only 1/100 of the FLOPs of a 3D convolutional layer.
The real-time model of GA-Net runs at 15-20 fps and outperforms other real-time algorithms in terms of accuracy.
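To give a sense of why 3D convolutions over a cost volume are expensive, the following back-of-the-envelope sketch counts multiply-accumulates for one 3D convolutional layer and the memory of the volume it processes. The helper names and all sizes in the example are hypothetical and chosen only for illustration; they are not the configuration used in the paper.

```python
def conv3d_macs(depth, height, width, c_in, c_out, kernel=3):
    """Multiply-accumulates for one 3D convolution (stride 1, 'same' padding).

    Each of the depth * height * width output positions needs
    kernel**3 * c_in multiply-accumulates per output channel.
    """
    return depth * height * width * c_in * c_out * kernel ** 3


def cost_volume_bytes(channels, depth, height, width, bytes_per_value=4):
    """Memory of a (channels, depth, height, width) volume, float32 by default."""
    return channels * depth * height * width * bytes_per_value


if __name__ == "__main__":
    # Hypothetical sizes: 48 disparity levels at a 96 x 192 feature resolution,
    # 32 channels in and out.
    macs = conv3d_macs(48, 96, 192, 32, 32)
    print("3D conv multiply-accumulates:", macs)
    print("1/100 of that budget:", macs / 100)
    print("volume size in bytes:", cost_volume_bytes(32, 48, 96, 192))
```

The per-position factor `kernel**3 * c_in * c_out` is what an aggregation layer with a small, fixed number of operations per position avoids.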

Theoretical and Practical Implications

On the theoretical side, the GA layers provide a differentiable, trainable counterpart to traditional aggregation methods such as semi-global matching, which are not learned end to end. On the practical side, the reduced computational burden makes GA-Net a feasible option for real-time applications in areas like robotics and autonomous systems, where rapid processing is crucial.

Future Directions

Future work could explore how well GA layers transfer to computer vision tasks beyond stereo matching. Other open directions include scaling the layers efficiently across image resolutions and integrating them into multi-task learning frameworks.

In summary, the paper offers two guided aggregation layers as a replacement for 3D convolutions in stereo cost aggregation, improving both accuracy and computational efficiency.
