Group-wise Correlation Stereo Network (1903.04025v1)

Published 10 Mar 2019 in cs.CV

Abstract: Stereo matching estimates the disparity between a rectified image pair, which is of great importance to depth sensing, autonomous driving, and other related tasks. Previous works built cost volumes with cross-correlation or concatenation of left and right features across all disparity levels, and then a 2D or 3D convolutional neural network is utilized to regress the disparity maps. In this paper, we propose to construct the cost volume by group-wise correlation. The left features and the right features are divided into groups along the channel dimension, and correlation maps are computed among each group to obtain multiple matching cost proposals, which are then packed into a cost volume. Group-wise correlation provides efficient representations for measuring feature similarities and will not lose too much information like full correlation. It also preserves better performance when reducing parameters compared with previous methods. The 3D stacked hourglass network proposed in previous works is improved to boost the performance and decrease the inference computational cost. Experiment results show that our method outperforms previous methods on Scene Flow, KITTI 2012, and KITTI 2015 datasets. The code is available at https://github.com/xy-guo/GwcNet

Summary

The paper presents a group-wise correlation method that partitions features to construct cost volumes with higher accuracy and efficiency in stereo matching.
It reports lower end-point error on Scene Flow and lower error rates on the KITTI 2012 and KITTI 2015 benchmarks than previous methods.
Modifications to the 3D stacked hourglass network improve accuracy and reduce inference cost, which matters for applications such as autonomous driving and robotics.

Analysis of "Group-wise Correlation Stereo Network"

The paper "Group-wise Correlation Stereo Network" addresses stereo matching, a fundamental computer vision task behind depth sensing applications such as autonomous driving and robot navigation. The authors propose building the cost volume with group-wise correlation, which represents feature similarity efficiently while avoiding much of the information loss of full correlation.

Methodology Overview

Stereo matching estimates the disparity between a pair of rectified images; the disparity is then converted into depth. Traditional methods compute matching costs with measures such as Sum of Absolute Differences (SAD), Sum of Squared Differences (SSD), or Normalized Cross-Correlation (NCC), aggregate these costs over a local window, and select disparities either with a winner-takes-all rule or with global optimization techniques such as belief propagation and graph cuts. Neural networks have since replaced much of this pipeline. Among earlier models, DispNetC builds a correlation volume and processes it with 2D convolutions, while PSMNet concatenates left and right features into a volume and processes it with 3D convolutions.
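The classical pipeline described above can be sketched in a few lines. The following is a minimal illustration, not code from the paper: it builds a SAD cost volume over a square window, picks disparities with winner-takes-all, and converts disparity to depth with the standard rectified-stereo relation depth = focal length × baseline / disparity. The function names, the window radius, and the focal length and baseline values are all assumptions chosen for the example.

```python
import numpy as np

def sad_cost_volume(left, right, max_disp, radius=1):
    """SAD matching cost for each disparity in [0, max_disp); shape (D, H, W)."""
    h, w = left.shape
    k = 2 * radius + 1
    cost = np.full((max_disp, h, w), np.inf)  # inf marks pixels with no match
    for d in range(max_disp):
        # Left pixel x is compared with right pixel x - d.
        diff = np.abs(left[:, d:] - right[:, :w - d])
        # Aggregate absolute differences over a k x k window (edge-padded).
        padded = np.pad(diff, radius, mode="edge")
        agg = np.zeros_like(diff)
        for dy in range(k):
            for dx in range(k):
                agg += padded[dy:dy + diff.shape[0], dx:dx + diff.shape[1]]
        cost[d, :, d:] = agg
    return cost

def winner_takes_all(cost):
    """Pick, per pixel, the disparity with the lowest cost."""
    return cost.argmin(axis=0)

def disparity_to_depth(disp, focal_px, baseline_m):
    """depth = focal * baseline / disparity; zero disparity maps to infinity."""
    disp = np.asarray(disp, dtype=float)
    depth = np.full(disp.shape, np.inf)
    valid = disp > 0
    depth[valid] = focal_px * baseline_m / disp[valid]
    return depth

# Synthetic pair: the left view is the right view shifted by 3 pixels.
rng = np.random.default_rng(0)
right_img = rng.random((8, 32))
left_img = np.roll(right_img, 3, axis=1)
disp_map = winner_takes_all(sad_cost_volume(left_img, right_img, max_disp=8))
depth_map = disparity_to_depth(disp_map, focal_px=700.0, baseline_m=0.5)
```

Learned methods keep the same overall structure (cost volume, aggregation, disparity selection) but replace the hand-crafted cost and the window aggregation with trained networks.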

The authors target two shortcomings of previous models: full correlation collapses all feature channels into a single similarity value per disparity and so loses information, while a concatenation volume carries no similarity measure and leaves the following network to learn one from scratch, which demands more parameters. They propose a new operation, group-wise correlation. Features are partitioned into groups along the channel dimension, and a correlation map is computed for each group separately. The operation sits between the efficiency of cross-correlation and the richness of concatenation: each group yields its own matching cost proposal, and the proposals are packed into a single cost volume. According to the abstract, this also preserves accuracy better than previous methods when the number of parameters is reduced.
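To make the operation concrete, here is a small NumPy sketch of a group-wise correlation volume next to a concatenation volume. It is an illustration under stated assumptions rather than the authors' implementation (their code is at the repository linked in the abstract): the function names, the (C, H, W) feature layout, zero-filling of positions with no valid match, and averaging the products within each group are choices made for this example.

```python
import numpy as np

def groupwise_correlation_volume(feat_l, feat_r, max_disp, num_groups):
    """feat_l, feat_r: (C, H, W) features. Returns (num_groups, max_disp, H, W)."""
    c, h, w = feat_l.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    ch_per_group = c // num_groups
    volume = np.zeros((num_groups, max_disp, h, w), dtype=feat_l.dtype)
    for d in range(max_disp):
        # Element-wise product of left pixel x with right pixel x - d.
        prod = feat_l[:, :, d:] * feat_r[:, :, :w - d]          # (C, H, W-d)
        prod = prod.reshape(num_groups, ch_per_group, h, w - d)
        # Mean inner product within each channel group.
        volume[:, d, :, d:] = prod.mean(axis=1)
    return volume

def concat_volume(feat_l, feat_r, max_disp):
    """Concatenation volume for comparison. Returns (2C, max_disp, H, W)."""
    c, h, w = feat_l.shape
    volume = np.zeros((2 * c, max_disp, h, w), dtype=feat_l.dtype)
    for d in range(max_disp):
        volume[:c, d, :, d:] = feat_l[:, :, d:]
        volume[c:, d, :, d:] = feat_r[:, :, :w - d]
    return volume

rng = np.random.default_rng(0)
fl = rng.standard_normal((8, 4, 10))
fr = rng.standard_normal((8, 4, 10))
gwc = groupwise_correlation_volume(fl, fr, max_disp=4, num_groups=4)
full = groupwise_correlation_volume(fl, fr, max_disp=4, num_groups=1)
cat = concat_volume(fl, fr, max_disp=4)
```

With `num_groups=1` the function reduces to full correlation (one similarity value per disparity), and averaging the group-wise volume over its group axis recovers that same value. The group count therefore controls how much per-channel detail reaches the 3D aggregation network.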

Results and Performance

The evaluation compares the proposed Group-wise Correlation Network (GwcNet) against established models on the Scene Flow, KITTI 2012, and KITTI 2015 datasets. On Scene Flow, GwcNet reduced the end-point error (EPE) to 1.188 pixels, an improvement over cost volumes built from correlation alone or concatenation alone. On the KITTI datasets, GwcNet outperformed the previous state-of-the-art models with lower D1-all error percentages, which supports the effectiveness of the cost volume construction.
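For readers unfamiliar with the two metrics, the sketch below computes them on a toy example. EPE is the mean absolute disparity error in pixels. The D1 definition used here follows the KITTI 2015 convention, where a pixel counts as an outlier if its error exceeds both 3 pixels and 5% of the ground-truth disparity. The function names and the sample values are illustrative assumptions, not results from the paper.

```python
import numpy as np

def end_point_error(pred, gt, valid=None):
    """Mean absolute disparity error (pixels) over valid pixels."""
    if valid is None:
        valid = np.ones(gt.shape, dtype=bool)
    return float(np.abs(pred - gt)[valid].mean())

def d1_error(pred, gt, valid=None):
    """Percentage of valid pixels with error > 3 px and > 5% of the true disparity."""
    if valid is None:
        valid = np.ones(gt.shape, dtype=bool)
    err = np.abs(pred - gt)
    outlier = (err > 3.0) & (err > 0.05 * gt)
    return float(100.0 * outlier[valid].mean())

# Toy values chosen so both thresholds matter.
gt_disp = np.array([10.0, 20.0, 100.0, 50.0])
pred_disp = np.array([10.5, 24.0, 104.0, 50.0])
epe = end_point_error(pred_disp, gt_disp)
d1 = d1_error(pred_disp, gt_disp)
```

In the toy data, the third pixel is off by 4 pixels but is not an outlier, because 4 pixels is below 5% of its true disparity of 100. Only the second pixel fails both tests.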

Network Architecture and Efficiency

The second contribution is an improved 3D stacked hourglass network. The authors introduce a set of practical modifications, including added auxiliary output modules, streamlined shortcut connections, and changes to the convolution operations, that both raise prediction accuracy and lower the inference computational cost. The abstract claims reduced inference cost rather than real-time operation, so suitability for resource-constrained deployments depends on the target hardware.
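The section above does not spell out how output modules turn an aggregated volume into a disparity map or how several outputs are trained together. A common approach in this family of networks, assumed here rather than taken from the text, is soft-argmin regression over the disparity axis combined with a weighted sum of smooth L1 losses, one per output module. The weights in the example are placeholders, not values from the paper.

```python
import numpy as np

def soft_argmin(cost):
    """cost: (D, H, W), lower is better. Returns the expected disparity, shape (H, W)."""
    logits = -cost
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)               # softmax over disparity
    disps = np.arange(cost.shape[0], dtype=cost.dtype).reshape(-1, 1, 1)
    return (prob * disps).sum(axis=0)

def smooth_l1(pred, target):
    """Smooth L1 loss: quadratic below 1 px of error, linear above."""
    diff = np.abs(pred - target)
    return float(np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).mean())

def multi_output_loss(preds, target, weights):
    """Weighted sum of per-output losses (weights are hypothetical)."""
    return sum(w * smooth_l1(p, target) for w, p in zip(weights, preds))

# A cost volume with a sharp minimum at disparity 2 for every pixel.
cost = np.full((5, 2, 3), 50.0)
cost[2] = 0.0
disp = soft_argmin(cost)
loss = multi_output_loss([disp, disp + 0.5], np.full((2, 3), 2.0), weights=[0.5, 1.0])
```

Because each output module produces its own disparity map, auxiliary outputs can supervise intermediate stages during training and be skipped at inference, which is one way such a design can reduce inference cost.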

Implications and Future Work

In practical terms, the work is most relevant to depth sensing systems such as those in autonomous vehicles. Group-wise correlation suits applications where both precision and resource efficiency matter. On the theoretical side, the research invites further study of channel-wise partitioning strategies in convolutional layers and may inspire similar designs in tasks beyond stereo matching.

Future developments in this domain could explore dynamic adjustments of group sizes based on scene complexity or iterative refinement approaches that capitalize on the group-wise correlations to self-correct predictions in challenging scenarios. Moreover, there remains potential for integrating additional contextual information into the group-wise correlation, such as semantic segmentation cues, to further leverage the synergy between different visual information layers.

In conclusion, the Group-wise Correlation Stereo Network advances stereo matching through a cost volume construction that balances computational efficiency against prediction accuracy. The approach offers a template for improving depth estimation systems in domains that require precise visual perception.
