Playing to Vision Foundation Model's Strengths in Stereo Matching (2404.06261v1)
Abstract: Stereo matching has become a key technique for 3D environment perception in intelligent vehicles. For a considerable time, convolutional neural networks (CNNs) have remained the mainstream choice for feature extraction in this domain. Nonetheless, there is a growing consensus that the existing paradigm should evolve towards vision foundation models (VFMs), particularly those built on vision Transformers (ViTs) and pre-trained through self-supervision on extensive, unlabeled datasets. While VFMs are adept at extracting informative, general-purpose visual features, especially for dense prediction tasks, they often underperform in geometric vision tasks. This study is the first exploration of a viable approach for adapting VFMs to stereo matching. Our ViT adapter, referred to as ViTAS, is built upon three types of modules: spatial differentiation, patch attention fusion, and cross-attention. The first module initializes feature pyramids, while the latter two respectively aggregate stereo and multi-scale contextual information into fine-grained features. ViTAStereo, which combines ViTAS with a cost volume-based stereo matching back-end, ranks first on the KITTI Stereo 2012 benchmark and outperforms the second-best network, StereoBase, by approximately 7.9% in terms of the percentage of error pixels at a tolerance of 3 pixels. Additional experiments across diverse scenarios further demonstrate its superior generalizability compared with all other state-of-the-art approaches. We believe this new paradigm will pave the way for the next generation of stereo matching networks.
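The abstract names ViTAS's three module types but gives no implementation detail. Below is a minimal PyTorch sketch of what such modules could look like: the class names mirror the paper's terminology, while every concrete choice (channel widths, pyramid depth, patch size, residual wiring) is an assumption made purely for illustration, not the authors' actual design.

```python
# Hypothetical sketch of the three ViTAS module types described in the abstract.
# All shapes, hyperparameters, and wiring below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialDifferentiation(nn.Module):
    """Initializes a multi-scale feature pyramid from the input image."""

    def __init__(self, in_ch: int = 3, base_ch: int = 64, levels: int = 3):
        super().__init__()
        self.stages = nn.ModuleList()
        ch = in_ch
        for i in range(levels):
            out_ch = base_ch * 2 ** i
            self.stages.append(nn.Sequential(
                nn.Conv2d(ch, out_ch, 3, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            ))
            ch = out_ch

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        pyramid = []
        for stage in self.stages:
            x = stage(x)
            pyramid.append(x)  # resolutions 1/2, 1/4, 1/8, ...
        return pyramid


class PatchAttentionFusion(nn.Module):
    """Aggregates stereo context: weights patches of the concatenated
    left/right features and fuses them back into the left view."""

    def __init__(self, dim: int = 256, patch: int = 8):
        super().__init__()
        self.patch = patch
        self.score = nn.Conv2d(2 * dim, 1, kernel_size=1)
        self.proj = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, left: torch.Tensor, right: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([left, right], dim=1)
        # One sigmoid weight per patch, broadcast back to full resolution.
        w = torch.sigmoid(F.avg_pool2d(self.score(pair), self.patch))
        w = F.interpolate(w, size=left.shape[-2:], mode="nearest")
        return left + w * self.proj(pair)  # residual fusion


class CrossAttentionBlock(nn.Module):
    """Aggregates multi-scale context: tokens from one pyramid level
    attend to tokens from another (or to ViT backbone tokens)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(queries, context, context)
        return self.norm(queries + out)  # residual + norm, placement is arbitrary


# Usage: a KITTI-sized left image yields a three-level pyramid.
pyramid = SpatialDifferentiation()(torch.randn(1, 3, 256, 512))
```

The division of labor follows the abstract: the convolutional pyramid supplies the fine-grained features that plain ViTs lack, while the two attention modules inject stereo and multi-scale context into them.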
- A. Kirillov et al., “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 4015–4026.
- M. Oquab et al., “DINOv2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023.
- L. Yang et al., “Depth anything: Unleashing the power of large-scale unlabeled data,” arXiv preprint arXiv:2401.10891, 2024.
- X. Guo et al., “Group-wise correlation stereo network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3273–3282.
- G. Xu et al., “Iterative geometry encoding volume for stereo matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 21919–21928.
- H. Xu et al., “Unifying flow, stereo and depth estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 11, pp. 13941–13958, 2023.
- K. He et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
- M. Sandler et al., “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510–4520.
- Y. Li et al., “Benchmarking detection transfer learning with vision Transformers,” arXiv preprint arXiv:2111.11429, 2021.
- Y. Li et al., “Exploring plain vision Transformer backbones for object detection,” in European Conference on Computer Vision (ECCV). Springer, 2022, pp. 280–296.
- Z. Chen et al., “Vision Transformer adapter for dense predictions,” in International Conference on Learning Representations (ICLR), 2023.
- J. Li et al., “Practical stereo matching via cascaded recurrent network with adaptive correlation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 16263–16272.
- W. Guo et al., “Context-enhanced stereo Transformer,” in European Conference on Computer Vision (ECCV). Springer, 2022, pp. 263–279.
- Z. Liu et al., “Global occlusion-aware Transformer for robust stereo matching,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 3535–3544.
- P. Weinzaepfel et al., “CroCo v2: Improved cross-view completion pre-training for stereo matching and optical flow,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 17969–17980.
- P. Weinzaepfel et al., “CroCo: Self-supervised pre-training for 3D vision tasks by cross-view completion,” Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 3502–3516, 2022.
- R. Ranftl et al., “Vision Transformers for dense prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 12179–12188.
- D. Eigen et al., “Depth map prediction from a single image using a multi-scale deep network,” Advances in Neural Information Processing Systems (NeurIPS), vol. 27, 2014.
- L. Lipson et al., “RAFT-Stereo: Multilevel recurrent field transforms for stereo matching,” in 2021 International Conference on 3D Vision (3DV). IEEE, 2021, pp. 218–227.
- Z. Shen et al., “CFNet: Cascade and fused cost volume for robust stereo matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 13906–13915.
- Z. Zhou and Q. Dong, “Self-distilled feature aggregation for self-supervised monocular depth estimation,” in European Conference on Computer Vision (ECCV). Springer, 2022, pp. 709–726.
- Y. Zhao et al., “Semantic-aligned fusion Transformer for one-shot object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 7601–7611.
- A. Geiger et al., “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012, pp. 3354–3361.
- Z. Xie et al., “Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 16684–16693.
- X. Wang et al., “Dense contrastive learning for self-supervised visual pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 3024–3033.
- J. Sun et al., “LoFTR: Detector-free local feature matching with Transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8922–8931.
- J. Deng et al., “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2009, pp. 248–255.
- W. Wang et al., “PVT v2: Improved baselines with pyramid vision Transformer,” Computational Visual Media, vol. 8, no. 3, pp. 415–424, 2022.
- Y. Quan et al., “Centralized feature pyramid for object detection,” IEEE Transactions on Image Processing, vol. 32, pp. 4341–4354, 2023.
- Z. Ma et al., “Multiview stereo with cascaded epipolar RAFT,” in European Conference on Computer Vision (ECCV). Springer, 2022, pp. 734–750.
- G. Xu et al., “Accurate and efficient stereo matching via attention concatenation volume,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–13, 2023.
- A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations (ICLR), 2021.
- N. Park and S. Kim, “How do vision Transformers work?” in International Conference on Learning Representations (ICLR), 2022.
- Y. Fang et al., “Unleashing vanilla vision Transformer with masked image modeling for object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 6244–6253.
- G. Wang et al., “Cross-level attentive feature aggregation for change detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2023, doi: 10.1109/TCSVT.2023.3344092.
- R. Fan et al., “SNE-RoadSeg: Incorporating surface normal information into semantic segmentation for accurate freespace detection,” in European Conference on Computer Vision (ECCV). Springer, 2020, pp. 340–356.
- J.-R. Chang et al., “Attention-aware feature aggregation for real-time stereo matching on edge devices,” in Proceedings of the Asian Conference on Computer Vision (ACCV), 2020.
- W. Liu et al., “Learning to upsample by learning to sample,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 6027–6037.
- Y. Liu et al., “The devil is in the upsampling: Architectural decisions made simpler for denoising with deep image prior,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 12408–12417.
- J. Hu et al., “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141.
- L. Gao et al., “Doubly-fused ViT: Fuse information from vision Transformer doubly with local representation,” in European Conference on Computer Vision (ECCV). Springer, 2022, pp. 744–761.
- Q. Su and S. Ji, “ChiTransformer: Towards reliable stereo from cues,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 1939–1949.
- N. Mayer et al., “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4040–4048.
- M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3061–3070.
- D. Scharstein and C. Pal, “Learning conditional random fields for stereo,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2007, pp. 1–8.
- H. Hirschmuller and D. Scharstein, “Evaluation of cost functions for stereo matching,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2007, pp. 1–8.
- D. Scharstein et al., “High-resolution stereo datasets with subpixel-accurate ground truth,” in Pattern Recognition: 36th German Conference (GCPR). Springer, 2014, pp. 31–42.
- T. Schops et al., “A multi-view stereo benchmark with high-resolution images and multi-camera videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3260–3269.
- X. Cheng et al., “Hierarchical neural architecture search for deep stereo matching,” Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 22158–22169, 2020.
- V. Tankovich et al., “HITNet: Hierarchical iterative tile refinement network for real-time stereo matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 14362–14372.
- G. Xu et al., “Attention concatenation volume for accurate and efficient stereo matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 12981–12990.
- Z. Shen et al., “PCW-Net: Pyramid combination and warping cost volume for stereo matching,” in European Conference on Computer Vision (ECCV). Springer, 2022, pp. 280–297.
- Z. Shen et al., “Digging into uncertainty-based pseudo-label for robust stereo matching,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp. 14301–14320, 2023.
- O.-H. Kwon and E. Zell, “Image-coupled volume propagation for stereo matching,” in 2023 IEEE International Conference on Image Processing (ICIP). IEEE, 2023, pp. 2510–2514.
- M. Feng et al., “MC-Stereo: Multi-peak lookup and cascade search range for stereo matching,” arXiv preprint arXiv:2311.02340, 2023.
- T. Guan et al., “Neural Markov random field for stereo matching,” arXiv preprint arXiv:2403.11193, 2024.
- X. Guo et al., “OpenStereo: A comprehensive benchmark for stereo matching and strong baseline,” arXiv preprint arXiv:2312.00343, 2023.
Authors: Chuang-Wei Liu, Qijun Chen, Rui Fan