- The paper presents MASt3R, a novel 3D image matching method that improves accuracy and robustness in camera pose estimation and scene reconstruction.
- It extends the DUSt3R framework with a dense local feature head trained with an InfoNCE matching loss, plus a fast reciprocal matching strategy.
- Experimental results show a 30% absolute improvement in VCRE AUC and a median translation error reduced to 36 cm on Map-free localization, advancing visual localization and 3D reconstruction.
Grounding Image Matching in 3D with MASt3R
The paper "Grounding Image Matching in 3D with MASt3R" by Leroy, Cabon, and Revaud, introduces a significant advancement in the field of 3D vision by proposing a novel approach to image matching through the MASt3R framework. This framework augments the existing DUSt3R framework to improve the accuracy and robustness of image matching tasks, crucial for applications such as camera pose estimation and 3D scene reconstruction.
Overview of MASt3R
The core idea of MASt3R is to treat image matching as an inherently 3D problem, in contrast with traditional methods that operate purely in 2D image space. The paper argues convincingly that, because matches must be consistent with the scene's 3D geometry and the cameras' relative pose, an approach that reasons in 3D should naturally be more robust and accurate.
MASt3R extends DUSt3R by introducing a new network head that outputs dense local features, trained with an additional matching loss. This design allows the framework to yield highly accurate and robust matches. The paper further addresses the quadratic complexity issue of dense matching by introducing a fast reciprocal matching scheme, which accelerates the process significantly while improving results.
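To make the reciprocal matching idea concrete, below is a minimal NumPy sketch of an iterated nearest-neighbour search that collects reciprocal (fixed-point) matches from dense descriptors. The function name, the random seeding of starting pixels, and the iteration budget are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def fast_reciprocal_matching(desc1, desc2, num_seeds=512, num_iters=5, rng_seed=0):
    """Iterated nearest-neighbour search collecting reciprocal matches.

    desc1: (N, D) and desc2: (M, D) L2-normalised descriptors, one row per
    pixel of a flattened dense feature map. Returns (i, j) index pairs that
    are fixed points of the NN mapping between the two images.
    """
    rng = np.random.default_rng(rng_seed)
    # start from a sparse subset of pixels instead of all N pixels
    idx1 = rng.choice(len(desc1), size=min(num_seeds, len(desc1)), replace=False)

    matches = set()
    for _ in range(num_iters):
        # map the current image-1 pixels to their nearest neighbours in image 2
        idx2 = (desc1[idx1] @ desc2.T).argmax(axis=1)
        # map those image-2 pixels back to image 1
        idx1_back = (desc2[idx2] @ desc1.T).argmax(axis=1)
        # pixels that map back onto themselves form reciprocal matches
        converged = idx1_back == idx1
        matches.update(zip(idx1[converged].tolist(), idx2[converged].tolist()))
        # continue the search only from the pixels that have not converged yet
        idx1 = idx1_back[~converged]
        if len(idx1) == 0:
            break
    return sorted(matches)
```

Starting from a sparse set of seed pixels and dropping converged ones at each round keeps the cost well below an exhaustive all-pairs comparison, which is the intuition behind the speed-up the paper reports.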
Methodology
DUSt3R Framework
The authors build on DUSt3R, a 3D reconstruction framework that uses transformers to regress pointmaps, i.e., dense per-pixel 3D points expressed in a common coordinate frame. DUSt3R's main strength is its robustness to extreme viewpoint changes, though its matches are only moderately precise.
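Since the two pointmaps live in a shared coordinate frame, correspondences can be read off by mutual nearest-neighbour search in 3D, which is essentially how DUSt3R-style matching works. The NumPy sketch below illustrates the idea; the function name and the distance threshold are illustrative assumptions.

```python
import numpy as np

def match_from_pointmaps(pts1, pts2, max_dist=0.05):
    """Derive pixel correspondences from two pointmaps in a shared frame.

    pts1: (N, 3) and pts2: (M, 3) per-pixel 3D points (flattened pointmaps).
    A pixel pair is kept when each point is the other's nearest 3D neighbour
    and the two points are closer than max_dist.
    """
    # all-pairs squared Euclidean distances; fine for a small example,
    # a real system would use a k-d tree or a GPU search instead
    d2 = ((pts1[:, None, :] - pts2[None, :, :]) ** 2).sum(-1)
    nn12 = d2.argmin(axis=1)   # best image-2 pixel for every image-1 pixel
    nn21 = d2.argmin(axis=0)   # best image-1 pixel for every image-2 pixel
    i = np.arange(len(pts1))
    mutual = nn21[nn12] == i               # keep reciprocal nearest neighbours
    close = d2[i, nn12] <= max_dist ** 2   # reject pairs that are too far apart
    keep = mutual & close
    return np.stack([i[keep], nn12[keep]], axis=1)
```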
Enhancements with MASt3R
MASt3R builds upon DUSt3R by incorporating a secondary head to regress dense local feature maps, optimized with an InfoNCE loss that encourages pixel-level accuracy in matches. A coarse-to-fine matching scheme, combined with a fast algorithm for finding reciprocal matches, further refines this process.
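A standard way to realize such a matching loss is an InfoNCE-style contrastive objective over descriptors sampled at ground-truth corresponding pixels, where each true pair acts as a positive and all other pairs in the batch serve as negatives. The PyTorch sketch below shows the shape of this objective; the variable names and the temperature value are assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def infonce_matching_loss(feat1, feat2, temperature=0.07):
    """InfoNCE loss over descriptors at known corresponding pixels.

    feat1, feat2: (B, D) local features sampled at B ground-truth
    corresponding pixel locations in the two images. Row k of feat1
    should match row k of feat2 and nothing else in the batch.
    """
    feat1 = F.normalize(feat1, dim=-1)
    feat2 = F.normalize(feat2, dim=-1)
    logits = feat1 @ feat2.T / temperature   # (B, B) similarity matrix
    target = torch.arange(len(feat1), device=feat1.device)
    # symmetric cross-entropy: diagonal entries are the positive pairs,
    # every other pair in the batch acts as a negative
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.T, target))
```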
Experimental Results
The experimental evaluation of MASt3R demonstrates its superior performance across multiple benchmarks:
- Map-free Localization:
  - MASt3R significantly outperforms the state of the art, achieving a 30% absolute improvement in VCRE AUC on the challenging Map-free localization dataset.
  - The median translation error is reduced to 36 cm.
- Relative Pose Estimation:
  - On the CO3Dv2 and RealEstate10k datasets, MASt3R outperforms recent data-driven methods in relative rotation accuracy (RRA) and relative translation accuracy (RTA).
  - It improves mean Average Accuracy (mAA) by at least 8.7 points over the best multi-view methods on RealEstate10k.
- Visual Localization:
  - On the Aachen Day-Night and InLoc datasets, MASt3R delivers robust performance, with especially large gains on InLoc, where it significantly outperforms previous methods.
  - The system remains competitive even with a single retrieved image, underscoring its robustness.
- Multiview 3D Reconstruction:
  - Without any domain-specific training, MASt3R achieves competitive performance on the DTU dataset, outperforming DUSt3R and approaching the best specialized methods.
Discussion
MASt3R's approach to grounding image matching in 3D space presents several notable advantages:
- Robustness: The framework remains reliable under extreme viewpoint changes, repetitive patterns, and varying lighting conditions.
- Accuracy: By optimizing for pixel-accurate matches with the InfoNCE loss, the method achieves higher precision than traditional keypoint-based or dense matching approaches.
- Efficiency: The fast reciprocal matching algorithm greatly reduces computational costs without sacrificing match quality.
Implications and Future Directions
Practically, MASt3R opens new possibilities in visual localization, navigation, robotics, and photogrammetry thanks to its robustness and accuracy. Theoretically, it marks a shift toward more holistic, geometry-aware approaches to computer vision tasks. Future work could further improve computational efficiency, scale the method to higher resolutions, and extend it to more diverse scene types and environmental conditions.
In conclusion, the MASt3R framework sets a new state of the art in image matching by integrating 3D geometry directly into the matching process, delivering significant gains over existing methods in both accuracy and robustness. This approach offers a promising direction for future research and applications in 3D vision.