RoMa: Robust Dense Feature Matching (2305.15404v2)
Abstract: Feature matching is an important computer vision task that involves estimating correspondences between two images of a 3D scene, and dense methods estimate all such correspondences. The aim is to learn a robust model, i.e., a model able to match under challenging real-world changes. In this work, we propose such a model, leveraging frozen pretrained features from the foundation model DINOv2. Although these features are significantly more robust than local features trained from scratch, they are inherently coarse. We therefore combine them with specialized ConvNet fine features, creating a precisely localizable feature pyramid. To further improve robustness, we propose a tailored transformer match decoder that predicts anchor probabilities, which enables it to express multimodality. Finally, we propose an improved loss formulation through regression-by-classification with subsequent robust regression. We conduct a comprehensive set of experiments that show that our method, RoMa, achieves significant gains, setting a new state-of-the-art. In particular, we achieve a 36% improvement on the extremely challenging WxBS benchmark. Code is provided at https://github.com/Parskatt/RoMa
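The regression-by-classification idea named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it works in 1D for brevity, and the helper names (`make_anchors`, `classification_loss`, `decode`, `robust_loss`), the anchor count, and the Charbonnier-style penalty with its parameters `c` and `alpha` are all assumptions chosen for illustration. It shows why predicting probabilities over anchors can express a multimodal match distribution, where a single regressed mean would fall between the modes.

```python
import math

def make_anchors(k):
    """Return k evenly spaced anchor centres on [-1, 1] (hypothetical 1D grid)."""
    step = 2.0 / k
    return [-1.0 + step * (i + 0.5) for i in range(k)]

def nearest_anchor(x, anchors):
    """Index of the anchor closest to coordinate x."""
    return min(range(len(anchors)), key=lambda i: abs(anchors[i] - x))

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classification_loss(logits, target, anchors):
    """Cross-entropy against the anchor nearest to the ground-truth coordinate."""
    probs = softmax(logits)
    return -math.log(probs[nearest_anchor(target, anchors)])

def decode(logits, anchors):
    """Pick the most probable anchor (the mode), not the mean."""
    probs = softmax(logits)
    best = max(range(len(anchors)), key=lambda i: probs[i])
    return anchors[best]

def robust_loss(residual, c=0.03, alpha=0.5):
    """Charbonnier-style penalty for a later refinement step.
    The values of c and alpha are illustrative assumptions."""
    return (residual ** 2 + c ** 2) ** alpha

anchors = make_anchors(8)
# Two almost equally plausible matches: a bimodal prediction.
logits = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 2.9, 0.0]
probs = softmax(logits)

coarse = decode(logits, anchors)                      # lands on one mode
mean = sum(p * a for p, a in zip(probs, anchors))     # lands between the modes
ce = classification_loss(logits, -0.4, anchors)       # coarse-stage loss
refine = robust_loss(-0.4 - coarse)                   # penalty on the residual
```

Decoding by the mode keeps the coarse estimate on one of the two plausible matches, while the expectation drifts to a location that neither mode supports. The robust penalty then applies to the small residual left after the coarse estimate.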
Authors: Johan Edstedt, Qiyu Sun, Georg Bökman, Mårten Wadenbäck, Michael Felsberg