
RoMa: Robust Dense Feature Matching (2305.15404v2)

Published 24 May 2023 in cs.CV

Abstract: Feature matching is an important computer vision task that involves estimating correspondences between two images of a 3D scene, and dense methods estimate all such correspondences. The aim is to learn a robust model, i.e., a model able to match under challenging real-world changes. In this work, we propose such a model, leveraging frozen pretrained features from the foundation model DINOv2. Although these features are significantly more robust than local features trained from scratch, they are inherently coarse. We therefore combine them with specialized ConvNet fine features, creating a precisely localizable feature pyramid. To further improve robustness, we propose a tailored transformer match decoder that predicts anchor probabilities, which enables it to express multimodality. Finally, we propose an improved loss formulation through regression-by-classification with subsequent robust regression. We conduct a comprehensive set of experiments that show that our method, RoMa, achieves significant gains, setting a new state-of-the-art. In particular, we achieve a 36% improvement on the extremely challenging WxBS benchmark. Code is provided at https://github.com/Parskatt/RoMa

Authors (5)
  1. Johan Edstedt (19 papers)
  2. Qiyu Sun (71 papers)
  3. Georg Bökman (16 papers)
  4. Mårten Wadenbäck (12 papers)
  5. Michael Felsberg (75 papers)
Citations (36)

Summary

Analysis of "RoMa: Robust Dense Feature Matching"

The paper "RoMa: Robust Dense Feature Matching" presents advances in dense feature matching, a computer vision task crucial for accurate 3D reconstruction and visual localization. The authors propose a model named RoMa that combines several novel components to improve robustness against real-world variations.

Methodology and Contributions

  1. Pre-trained Features from DINOv2: The paper leverages the robust representations from DINOv2, a self-supervised vision model, to enhance the robustness of dense feature matching. By utilizing frozen pre-trained features from DINOv2 for coarse matching, RoMa circumvents the overfitting often seen with models trained from scratch, especially given the limited availability of real-world 3D datasets.
  2. Specialized ConvNet for Fine Features: The authors incorporate a specialized ConvNet to extract the fine features needed for precise local matching. By decoupling coarse and fine feature extraction, RoMa attains precise localization without sacrificing robustness.
  3. Transformer Match Decoder: A pivotal innovation is the use of a Transformer-based match decoder that predicts anchor probabilities. This approach allows modeling the multimodal distributions necessary for effective global matching, which improves the robustness to challenging scenarios such as extreme changes in viewpoint and illumination.
  4. Improved Loss Function: The authors propose a twofold loss function — regression-by-classification for coarse global matching, and robust regression for the refinement stage. This separation aligns the training objectives better with the inherent properties of the data at different stages of processing.
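
The coarse-to-fine design above can be sketched in code. The snippet below is a minimal illustration, not the authors' implementation: the DINOv2 and ConvNet encoders are stubbed out entirely, the match decoder's anchor classification is reduced to a softmax over a hypothetical K×K grid of anchor coordinates, and the refinement offset is supplied by hand rather than regressed.

```python
import numpy as np

def anchor_grid(k):
    """Centers of a k x k grid of anchors over normalized image coords [0, 1]^2."""
    centers = (np.arange(k) + 0.5) / k
    xs, ys = np.meshgrid(centers, centers)           # 'xy' indexing
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)  # shape (k*k, 2)

def coarse_match(logits, anchors):
    """Regression-by-classification: softmax over anchors, then take the mode.

    Committing to the argmax anchor (rather than the probability-weighted
    mean) lets a multimodal distribution pick one plausible match instead
    of averaging two modes into an implausible midpoint.
    """
    probs = np.exp(logits - logits.max())            # numerically stable softmax
    probs /= probs.sum()
    return anchors[np.argmax(probs)], probs

def refine(coarse_xy, offset):
    """Fine stage stand-in: a ConvNet head would regress this residual offset."""
    return coarse_xy + offset

k = 8
anchors = anchor_grid(k)            # 64 candidate coarse positions
logits = np.full(k * k, -10.0)
logits[5] = 4.0                     # two plausible matches: a bimodal
logits[50] = 3.5                    # distribution over the anchors
coarse, probs = coarse_match(logits, anchors)
match = refine(coarse, offset=np.array([0.01, -0.02]))
```

Note how the mode of the categorical distribution lands on one of the two peaked anchors, whereas a probability-weighted mean would blend the two modes into a point supported by neither.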

Numerical Results and Validation

RoMa sets a new state-of-the-art across several benchmarks:

  • Achieved a 36% improvement in mean Average Accuracy (mAA) on the challenging WxBS benchmark.
  • Exhibited improvements in pose estimation tasks on MegaDepth-1500 and ScanNet-1500 datasets, outperforming existing methods.
  • Demonstrated enhanced performance on the InLoc visual localization benchmark.

Theoretical and Practical Implications

The research emphasizes the importance of combining robust global features with specialized local refining networks to tackle dense feature matching tasks. The use of a regression-by-classification approach for coarse matches indicates a shift in handling the multimodality of matching distributions, a necessary improvement for real-world applications. Furthermore, the Transformer-based decoder's success highlights the increasing relevance of Transformers in computer vision.
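
The role of robust regression in the refinement stage can be made concrete. The sketch below is illustrative rather than RoMa's exact formulation: it contrasts a plain L2 penalty with a Charbonnier-style (smooth L1) robust penalty, showing how the robust form grows sub-quadratically and therefore down-weights gross residuals such as outlier correspondences.

```python
import numpy as np

def l2_loss(residual):
    """Plain quadratic penalty: outliers dominate the objective."""
    return 0.5 * residual ** 2

def charbonnier(residual, eps=0.1):
    """Charbonnier penalty: ~L2 near zero, ~L1 for large residuals,
    so gross outliers contribute far less than under plain L2."""
    return np.sqrt(residual ** 2 + eps ** 2) - eps

residuals = np.array([0.05, 0.5, 5.0])   # inlier, borderline, gross outlier
quadratic = l2_loss(residuals)
robust = charbonnier(residuals)
```

Moving from the borderline residual to the gross outlier, the quadratic penalty grows by the square of the ratio while the robust penalty grows roughly linearly, which is exactly the down-weighting behavior that makes robust regression suitable once coarse matches have been committed to.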

Future Directions

The integration of pre-trained features from self-supervised models presents opportunities for expanding RoMa's applicability beyond its current scope. Future research may explore direct training on downstream tasks such as 3D reconstruction, potentially amplifying the model’s utility. Additionally, developing completely unsupervised versions of such models could mitigate the data limitations faced by supervised approaches.

In conclusion, the innovations presented in RoMa enhance the robustness and accuracy of dense feature matching significantly, contributing valuable insights into the construction of more reliable computer vision systems.
