
Iterative Geometry Encoding Volume for Stereo Matching (2303.06615v2)

Published 12 Mar 2023 in cs.CV

Abstract: Recurrent All-Pairs Field Transforms (RAFT) has shown great potential in matching tasks. However, all-pairs correlations lack non-local geometry knowledge and have difficulties tackling local ambiguities in ill-posed regions. In this paper, we propose Iterative Geometry Encoding Volume (IGEV-Stereo), a new deep network architecture for stereo matching. The proposed IGEV-Stereo builds a combined geometry encoding volume that encodes geometry and context information as well as local matching details, and iteratively indexes it to update the disparity map. To speed up convergence, we exploit the GEV to regress an accurate starting point for the ConvGRU iterations. Our IGEV-Stereo ranks $1^{st}$ on KITTI 2015 and 2012 (Reflective) among all published methods and is the fastest among the top 10 methods. In addition, IGEV-Stereo has strong cross-dataset generalization as well as high inference efficiency. We also extend our IGEV to multi-view stereo (MVS), i.e. IGEV-MVS, which achieves competitive accuracy on the DTU benchmark. Code is available at https://github.com/gangweiX/IGEV.

Citations (125)

Summary

  • The paper presents IGEV-Stereo, which integrates iterative geometry encoding and lightweight 3D regularization to enhance stereo matching accuracy and speed.
  • It addresses local ambiguities in ill-posed regions by fusing non-local geometric cues with all-pairs correlation features.
  • Key results include first-place KITTI leaderboard rankings and a Scene Flow EPE of 0.47, demonstrating both accuracy and speed.

Iterative Geometry Encoding Volume for Stereo Matching

The paper "Iterative Geometry Encoding Volume for Stereo Matching" by Xu et al. presents a novel approach to enhancing stereo matching performance through Iterative Geometry Encoding Volume (IGEV-Stereo). The authors address limitations in existing stereo methods, particularly those leveraging Recurrent All-Pairs Field Transforms (RAFT), by incorporating non-local geometry information to better resolve local ambiguities in ill-posed regions.

IGEV-Stereo improves upon RAFT-Stereo by constructing a Geometry Encoding Volume (GEV) through lightweight 3D regularization, which filters the raw matching costs and encodes non-local geometry and context. The GEV is then combined with all-pairs correlations, which preserve local matching details, to form a Combined Geometry Encoding Volume (CGEV). The GEV also regresses an accurate initial disparity, and ConvGRU updates then iteratively index the CGEV to refine the disparity map. Starting from this geometry-informed initialization speeds convergence and keeps the computational overhead well below that of traditional methods relying on extensive 3D convolutions.
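A rough sketch of how these pieces fit together is shown below, assuming PyTorch. The module names (`conv_gru`, `disp_head`), the GRU state size, the soft-argmin form of the initial regression, and the nearest-disparity sampling in `lookup` are simplifications of my own rather than the authors' implementation, which, for instance, samples the volume at fractional disparities:

```python
import torch
import torch.nn.functional as F

def soft_argmin(gev):
    """Regress an initial disparity from the Geometry Encoding Volume.

    gev: (B, D, H, W) regularized matching volume; larger values are treated
    here as better matches at that candidate disparity.
    """
    b, d, h, w = gev.shape
    prob = F.softmax(gev, dim=1)                                   # per-pixel disparity distribution
    disps = torch.arange(d, device=gev.device, dtype=gev.dtype).view(1, d, 1, 1)
    return (prob * disps).sum(dim=1, keepdim=True)                 # (B, 1, H, W) expected disparity

def lookup(volume, disp, radius=4):
    """Index a (B, D, H, W) volume at disparities within +/- radius of disp.

    Uses nearest-disparity gathering for simplicity; returns a
    (B, 2*radius+1, H, W) slice of the cost surface around the estimate.
    """
    d = volume.shape[1]
    offsets = torch.arange(-radius, radius + 1, device=volume.device).view(1, -1, 1, 1)
    idx = (disp + offsets).round().clamp(0, d - 1).long()          # (B, 2r+1, H, W)
    return torch.gather(volume, dim=1, index=idx)

def iterative_refinement(cgev, gev, conv_gru, disp_head, iters=16):
    """Illustrative update loop: start from the GEV regression, then repeatedly
    index the combined volume and let a ConvGRU predict disparity residuals."""
    disp = soft_argmin(gev)                                        # geometry-informed starting point
    hidden = disp.new_zeros(disp.shape[0], 64, *disp.shape[2:])    # hypothetical GRU state
    for _ in range(iters):
        geo_feat = lookup(cgev, disp)                              # local slice of the combined volume
        hidden = conv_gru(hidden, torch.cat([geo_feat, disp], dim=1))
        disp = disp + disp_head(hidden)                            # residual disparity update
    return disp
```

The point of the sketch is the control flow: the regularized volume supplies both the starting disparity and the features that each ConvGRU step indexes, so every update already sees non-local geometry rather than raw correlations alone.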

Key results demonstrate that IGEV-Stereo outperforms existing methods, ranking first on the KITTI 2015 and 2012 (Reflective) leaderboards among all published methods while being the fastest of the top ten. On Scene Flow it achieves an end-point error (EPE) of 0.47 pixels, demonstrating both accuracy and efficiency. IGEV-Stereo also generalizes well across datasets, performing strongly on real-world benchmarks such as Middlebury and ETH3D even when trained solely on synthetic data.

The paper also extends the concept to multi-view stereo (IGEV-MVS), where it delivers competitive results on the DTU benchmark. The MVS extension omits context networks, streamlining the iterative update process, and still achieves favorable accuracy and completeness scores.

Conceptually, the work refines the iterative disparity-optimization paradigm by integrating regularized, non-local geometric structure directly into the volume that is indexed at every update. Practically, the architecture's modest computational demands make it a promising option for real-time applications in 3D reconstruction, robotics, and autonomous driving.

While the current implementation handles high-resolution inputs efficiently, further research might explore even lighter architectures or cascaded cost volumes to contain the computational cost of very large-scale scenes. Future work could also refine the 3D regularization to improve robustness in scenes with very large disparities.

Overall, this paper marks a significant step towards more efficient and effective stereo matching, balancing robust disparity estimation with practical inference speed and laying a solid foundation for continued advances in computer vision applications.
