MVSFormer++: Revealing the Devil in Transformer's Details for Multi-View Stereo (2401.11673v1)

Published 22 Jan 2024 in cs.CV

Abstract: Recent advancements in learning-based Multi-View Stereo (MVS) methods have prominently featured transformer-based models with attention mechanisms. However, existing approaches have not thoroughly investigated the profound influence of transformers on different MVS modules, resulting in limited depth estimation capabilities. In this paper, we introduce MVSFormer++, a method that prudently maximizes the inherent characteristics of attention to enhance various components of the MVS pipeline. Formally, our approach involves infusing cross-view information into the pre-trained DINOv2 model to facilitate MVS learning. Furthermore, we employ different attention mechanisms for the feature encoder and cost volume regularization, focusing on feature and spatial aggregations respectively. Additionally, we uncover that some design details would substantially impact the performance of transformer modules in MVS, including normalized 3D positional encoding, adaptive attention scaling, and the position of layer normalization. Comprehensive experiments on DTU, Tanks-and-Temples, BlendedMVS, and ETH3D validate the effectiveness of the proposed method. Notably, MVSFormer++ achieves state-of-the-art performance on the challenging DTU and Tanks-and-Temples benchmarks.


Summary

  • The paper introduces MVSFormer++, which refines transformer-based MVS with tailored attention mechanisms and detailed design optimizations.
  • It leverages pre-trained DINOv2 and Side View Attention to enhance feature extraction and cross-view information aggregation.
  • Empirical results on DTU and Tanks-and-Temples benchmarks validate its state-of-the-art performance in depth estimation.

Introduction

The pursuit of robust Multi-View Stereo (MVS) models has long been a focal point in computer vision. Recent transformer-based MVS models, such as MVSFormer, pair pre-trained Vision Transformers (ViTs) for feature extraction with carefully integrated architectures and training strategies, setting new benchmarks in the field. Despite these advances, how transformers should be integrated and fine-tuned across the different MVS modules, such as the feature encoder and cost volume regularization, has remained largely an open question.

Enhancements of the Transformer in MVS

MVSFormer++ enhances these components by addressing nuanced details of transformer design previously unexplored in the MVS context. The approach systematically matches the attention mechanism to the pipeline stage: the feature encoder uses linear attention for feature-level aggregation, while cost volume regularization uses vanilla attention for spatial aggregation. Notably, the work uncovers subtle design choices, such as normalized positional encoding, adaptive attention scaling, and the position of layer normalization, that profoundly influence transformer performance in MVS.
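
To make the distinction concrete, here is a minimal PyTorch sketch contrasting the two aggregation styles: kernel-based linear attention in the style of Katharopoulos et al. (2020) versus vanilla scaled dot-product attention. The elu-based feature map and tensor shapes are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def vanilla_attention(q, k, v):
    """Scaled dot-product attention: O(n^2) in token count n.
    q, k, v: (batch, n, dim)."""
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v

def linear_attention(q, k, v, eps=1e-6):
    """Kernel-based linear attention: the softmax is replaced by a positive
    feature map phi(x) = elu(x) + 1, so (k^T v) can be computed first,
    reducing cost to O(n). q, k, v: (batch, n, dim)."""
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0
    kv = k.transpose(-2, -1) @ v                            # (batch, dim, dim)
    z = q @ k.sum(dim=1, keepdim=True).transpose(-2, -1)    # (batch, n, 1)
    return (q @ kv) / (z + eps)
```

The asymptotic difference motivates the split: linear attention keeps the feature encoder tractable on the long token sequences produced by high-resolution images, while the full softmax of vanilla attention preserves sharp spatial aggregation over the cost volume.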

Design Details and Empirical Results

MVSFormer++ adopts pre-trained DINOv2 as its feature encoder and employs Side View Attention (SVA) to inject cross-view information, substantially improving depth estimation accuracy. Another design advance is 3D Frustoconical Positional Encoding (FPE) for cost volume regularization, which improves the transformer's capacity to handle long 3D token sequences of varying lengths. Adaptive Attention Scaling (AAS) further mitigates the attention-dilution problem, which is critical when processing higher-resolution images. Empirical validation on benchmarks such as DTU and Tanks-and-Temples demonstrates state-of-the-art performance, solidifying the method's standing in MVS research.
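
The attention-dilution problem stems from the fixed 1/sqrt(d) softmax temperature: as the token count grows at higher resolutions, attention weights flatten and lose focus. Below is a hedged sketch of length-adaptive scaling in the spirit of the entropy-invariance argument of Su (2021), which this line of work draws on; the reference length `n_train=512` and the exact log-ratio formula are illustrative assumptions, not the published AAS definition.

```python
import math
import torch

def adaptive_scaled_attention(q, k, v, n_train=512):
    """Dot-product attention whose temperature grows with log(n), keeping
    softmax entropy roughly stable when inference uses longer sequences
    (i.e., higher resolutions) than training. q, k, v: (batch, n, dim)."""
    n, dim = q.shape[-2], q.shape[-1]
    # Equals the standard 1/sqrt(dim) when n == n_train; the log ratio
    # exceeds 1 for longer sequences, sharpening the attention map.
    scale = (math.log(n) / math.log(n_train)) / math.sqrt(dim)
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v
```

In this sketch a model trained at one resolution can be evaluated at a higher one without retuning the temperature, which matches the motivation given for AAS.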

Impact and Future Directions

MVSFormer++ marks a significant step forward in MVS learning. Its tailored attention mechanisms and close attention to transformer design specifics push the boundaries of depth estimation. Future work may further refine attention mechanisms for the different MVS components, potentially yielding increasingly accurate and robust models. Given MVSFormer++'s performance across benchmarks, it is likely to have lasting implications for 3D reconstruction applications and beyond.
