MVSFormer++: Revealing the Devil in Transformer's Details for Multi-View Stereo (2401.11673v1)
Abstract: Recent advances in learning-based Multi-View Stereo (MVS) have prominently featured transformer-based models with attention mechanisms. However, existing approaches have not thoroughly investigated the profound influence of transformers on different MVS modules, resulting in limited depth estimation capabilities. In this paper, we introduce MVSFormer++, a method that prudently exploits the inherent characteristics of attention to enhance various components of the MVS pipeline. Specifically, our approach infuses cross-view information into the pre-trained DINOv2 model to facilitate MVS learning. Furthermore, we employ different attention mechanisms for the feature encoder and cost volume regularization, focusing on feature and spatial aggregation respectively. Additionally, we uncover that certain design details substantially impact the performance of transformer modules in MVS, including normalized 3D positional encoding, adaptive attention scaling, and the position of layer normalization. Comprehensive experiments on DTU, Tanks-and-Temples, BlendedMVS, and ETH3D validate the effectiveness of the proposed method. Notably, MVSFormer++ achieves state-of-the-art performance on the challenging DTU and Tanks-and-Temples benchmarks.
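One of the design details the abstract highlights, adaptive attention scaling, can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes the entropy-invariance view of attention scaling (cf. Su, 2021), in which the usual 1/sqrt(d) factor is rescaled by log(n)/log(train_len) so that attention entropy stays roughly stable when the test-time token count n differs from the training-time count:

```python
import math
import numpy as np

def scaled_attention(q, k, v, train_len=None):
    """Single-head scaled dot-product attention (illustrative sketch).

    q, k, v: arrays of shape (n, d). When `train_len` is given, the standard
    1/sqrt(d) scale is multiplied by log(n)/log(train_len), a hypothetical
    stand-in for the paper's adaptive attention scaling: it keeps the softmax
    entropy roughly invariant as sequence length changes between train and test.
    """
    n, d = q.shape
    scale = 1.0 / math.sqrt(d)
    if train_len is not None:
        scale *= math.log(n) / math.log(train_len)  # entropy-invariant rescaling
    logits = (q @ k.T) * scale
    logits -= logits.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v
```

When `train_len == n` the adaptive factor is exactly 1, so the function reduces to plain scaled dot-product attention; for longer test sequences the factor grows, sharpening the softmax to counteract the entropy increase from attending over more keys.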