
RoMa: Robust Dense Feature Matching (2305.15404v2)

Published 24 May 2023 in cs.CV

Abstract: Feature matching is an important computer vision task that involves estimating correspondences between two images of a 3D scene, and dense methods estimate all such correspondences. The aim is to learn a robust model, i.e., a model able to match under challenging real-world changes. In this work, we propose such a model, leveraging frozen pretrained features from the foundation model DINOv2. Although these features are significantly more robust than local features trained from scratch, they are inherently coarse. We therefore combine them with specialized ConvNet fine features, creating a precisely localizable feature pyramid. To further improve robustness, we propose a tailored transformer match decoder that predicts anchor probabilities, which enables it to express multimodality. Finally, we propose an improved loss formulation through regression-by-classification with subsequent robust regression. We conduct a comprehensive set of experiments that show that our method, RoMa, achieves significant gains, setting a new state-of-the-art. In particular, we achieve a 36% improvement on the extremely challenging WxBS benchmark. Code is provided at https://github.com/Parskatt/RoMa

Authors (5)
  1. Johan Edstedt (19 papers)
  2. Qiyu Sun (71 papers)
  3. Georg Bökman (16 papers)
  4. Mårten Wadenbäck (12 papers)
  5. Michael Felsberg (75 papers)
Citations (36)

Summary

Analysis of "RoMa: Robust Dense Feature Matching"

The paper "RoMa: Robust Dense Feature Matching" presents advances in dense feature matching, a computer vision task crucial for accurate 3D reconstruction and visual localization. The authors propose a model named RoMa that combines several novel components to improve robustness against real-world variations.

Methodology and Contributions

  1. Pre-trained Features from DINOv2: The paper leverages the robust representations from DINOv2, a self-supervised vision model, to enhance the robustness of dense feature matching. By utilizing frozen pre-trained features from DINOv2 for coarse matching, RoMa circumvents the overfitting often seen with models trained from scratch, especially given the limited availability of real-world 3D datasets.
  2. Specialized ConvNet for Fine Features: The authors incorporate a specialized ConvNet to extract the fine features needed for precise local matching. By decoupling coarse and fine feature extraction, RoMa attains precise localization without sacrificing robustness.
  3. Transformer Match Decoder: A pivotal innovation is the use of a Transformer-based match decoder that predicts anchor probabilities. This approach allows modeling the multimodal distributions necessary for effective global matching, which improves the robustness to challenging scenarios such as extreme changes in viewpoint and illumination.
  4. Improved Loss Function: The authors propose a twofold loss function — regression-by-classification for coarse global matching, and robust regression for the refinement stage. This separation aligns the training objectives better with the inherent properties of the data at different stages of processing.
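
The coarse-to-fine design above can be sketched in code. The snippet below is a minimal illustration, not the authors' implementation: the DINOv2 and ConvNet encoders are stubbed out entirely, the match decoder's anchor classification is reduced to a softmax over a hypothetical K×K grid of anchor coordinates, and the refinement offset is supplied by hand rather than regressed.

```python
import numpy as np

def anchor_grid(k):
    """Centers of a k x k grid of anchors over normalized image coords [0, 1]^2."""
    centers = (np.arange(k) + 0.5) / k
    xs, ys = np.meshgrid(centers, centers)           # 'xy' indexing
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)  # shape (k*k, 2)

def coarse_match(logits, anchors):
    """Regression-by-classification: softmax over anchors, then take the mode.

    Committing to the argmax anchor (rather than the probability-weighted
    mean) lets a multimodal distribution pick one plausible match instead
    of averaging two modes into an implausible midpoint.
    """
    probs = np.exp(logits - logits.max())            # numerically stable softmax
    probs /= probs.sum()
    return anchors[np.argmax(probs)], probs

def refine(coarse_xy, offset):
    """Fine stage stand-in: a ConvNet head would regress this residual offset."""
    return coarse_xy + offset

k = 8
anchors = anchor_grid(k)            # 64 candidate coarse positions
logits = np.full(k * k, -10.0)
logits[5] = 4.0                     # two plausible matches: a bimodal
logits[50] = 3.5                    # distribution over the anchors
coarse, probs = coarse_match(logits, anchors)
match = refine(coarse, offset=np.array([0.01, -0.02]))
```

Note how the mode of the categorical distribution lands on one of the two peaked anchors, whereas a probability-weighted mean would blend the two modes into a point supported by neither.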

Numerical Results and Validation

RoMa sets a new state-of-the-art across several benchmarks:

  • Achieved a 36% improvement in mean Average Accuracy (mAA) on the challenging WxBS benchmark.
  • Exhibited improvements in pose estimation tasks on MegaDepth-1500 and ScanNet-1500 datasets, outperforming existing methods.
  • Demonstrated enhanced performance on the InLoc visual localization benchmark.

Theoretical and Practical Implications

The research emphasizes the importance of combining robust global features with specialized local refining networks to tackle dense feature matching tasks. The use of a regression-by-classification approach for coarse matches indicates a shift in handling the multimodality of matching distributions, a necessary improvement for real-world applications. Furthermore, the Transformer-based decoder's success highlights the increasing relevance of Transformers in computer vision.
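
The role of robust regression in the refinement stage can be made concrete. The sketch below is illustrative rather than RoMa's exact formulation: it contrasts a plain L2 penalty with a Charbonnier-style (smooth L1) robust penalty, showing how the robust form grows sub-quadratically and therefore down-weights gross residuals such as outlier correspondences.

```python
import numpy as np

def l2_loss(residual):
    """Plain quadratic penalty: outliers dominate the objective."""
    return 0.5 * residual ** 2

def charbonnier(residual, eps=0.1):
    """Charbonnier penalty: ~L2 near zero, ~L1 for large residuals,
    so gross outliers contribute far less than under plain L2."""
    return np.sqrt(residual ** 2 + eps ** 2) - eps

residuals = np.array([0.05, 0.5, 5.0])   # inlier, borderline, gross outlier
quadratic = l2_loss(residuals)
robust = charbonnier(residuals)
```

Moving from the borderline residual to the gross outlier, the quadratic penalty grows by the square of the ratio while the robust penalty grows roughly linearly, which is exactly the down-weighting behavior that makes robust regression suitable once coarse matches have been committed to.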

Future Directions

The integration of pre-trained features from self-supervised models presents opportunities for expanding RoMa's applicability beyond its current scope. Future research may explore direct training on downstream tasks such as 3D reconstruction, potentially amplifying the model’s utility. Additionally, developing completely unsupervised versions of such models could mitigate the data limitations faced by supervised approaches.

In conclusion, the innovations presented in RoMa enhance the robustness and accuracy of dense feature matching significantly, contributing valuable insights into the construction of more reliable computer vision systems.
