Global Occlusion-Aware Transformer for Robust Stereo Matching (2312.14650v1)
Abstract: Despite the remarkable progress of learning-based stereo-matching algorithms, performance in ill-conditioned regions, such as occluded regions, remains a bottleneck. Because of their limited receptive field, existing CNN-based methods struggle to handle these ill-conditioned regions effectively. To address this issue, this paper introduces a novel attention-based stereo-matching network called Global Occlusion-Aware Transformer (GOAT), which exploits long-range dependencies and occlusion-aware global context for disparity estimation. In the GOAT architecture, a parallel disparity and occlusion estimation module (PDO) is proposed to estimate the initial disparity map and the occlusion mask using a parallel attention mechanism. To further enhance the disparity estimates in the occluded regions, an occlusion-aware global aggregation module (OGA) is proposed. This module refines the disparity in the occluded regions by leveraging restricted global correlation within the focus scope of the occluded areas. Extensive experiments were conducted on several public benchmark datasets, including SceneFlow, KITTI 2015, and Middlebury. The results show that the proposed GOAT achieves outstanding performance across all benchmarks, particularly in the occluded regions.
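The idea behind occlusion-aware aggregation can be illustrated with a small sketch. This is not the paper's OGA implementation: the function name `refine_occluded`, the single-scanline setting, and the scaled dot-product similarity are all assumptions made for illustration. It only shows the general pattern the abstract describes, in which disparities at occluded pixels are re-estimated from visible pixels with similar features, while visible pixels keep their initial estimate.

```python
import math


def refine_occluded(features, disparity, occluded):
    """Hypothetical sketch: re-estimate disparity at occluded pixels only.

    features  -- one feature vector (list of floats) per pixel on a scanline
    disparity -- initial disparity per pixel
    occluded  -- True where the pixel is flagged as occluded
    """
    visible = [j for j, occ in enumerate(occluded) if not occ]
    if not visible:
        return list(disparity)  # no reliable pixels to aggregate from

    scale = math.sqrt(len(features[0]))
    refined = list(disparity)
    for i, occ in enumerate(occluded):
        if not occ:
            continue  # visible pixels keep their initial estimate
        # Scaled dot-product similarity to every visible pixel.
        scores = [
            sum(a * b for a, b in zip(features[i], features[j])) / scale
            for j in visible
        ]
        # Numerically stable softmax over the visible pixels.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Attention-weighted average of the visible disparities.
        refined[i] = sum(w * disparity[j] for w, j in zip(weights, visible)) / total
    return refined


# Toy scanline: pixel 1 is occluded and looks like pixel 0 (assumed values).
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
disparity = [10.0, 0.0, 4.0, 4.0]
occluded = [False, True, False, False]
refined = refine_occluded(features, disparity, occluded)
```

In this toy case the occluded pixel's refined disparity is pulled toward the value at pixel 0, whose features it matches most closely, and the visible pixels are left unchanged.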