MASt3R-SfM: a Fully-Integrated Solution for Unconstrained Structure-from-Motion (2409.19152v1)

Published 27 Sep 2024 in cs.CV

Abstract: Structure-from-Motion (SfM), a task aiming at jointly recovering camera poses and 3D geometry of a scene given a set of images, remains a hard problem with still many open challenges despite decades of significant progress. The traditional solution for SfM consists of a complex pipeline of minimal solvers which tends to propagate errors and fails when images do not sufficiently overlap, have too little motion, etc. Recent methods have attempted to revisit this paradigm, but we empirically show that they fall short of fixing these core issues. In this paper, we propose instead to build upon a recently released foundation model for 3D vision that can robustly produce local 3D reconstructions and accurate matches. We introduce a low-memory approach to accurately align these local reconstructions in a global coordinate system. We further show that such foundation models can serve as efficient image retrievers without any overhead, reducing the overall complexity from quadratic to linear. Overall, our novel SfM pipeline is simple, scalable, fast and truly unconstrained, i.e. it can handle any collection of images, ordered or not. Extensive experiments on multiple benchmarks show that our method provides steady performance across diverse settings, especially outperforming existing methods in small- and medium-scale settings.

Authors (6)

Bardienus Duisterhof
Philippe Weinzaepfel
Vincent Leroy
Yohann Cabon
Jerome Revaud
Lojze Zust

Summary

Overview of MASt3R-SfM: A Fully-Integrated Solution for Unconstrained Structure-from-Motion

The paper "MASt3R-SfM: a Fully-Integrated Solution for Unconstrained Structure-from-Motion" presents a novel pipeline for Structure-from-Motion (SfM) tasks leveraging the recently introduced MASt3R model. This work addresses several notable deficiencies in traditional SfM pipelines, particularly those involving error propagation, high computational complexity, and failure in scenarios with minimal image overlap or insufficient camera motion.

Problem Statement

SfM is a long-standing problem in computer vision: recover both the 3D geometry of a scene and the camera parameters from a set of 2D images. Classical approaches decompose it into a sequence of smaller tasks such as keypoint extraction, matching, relative pose estimation, and incremental reconstruction. These steps typically rely on robust outlier rejection such as RANSAC, and because they run sequentially, errors made early propagate downstream. The pipeline can then fail outright, most notably when images overlap too little or the camera barely moves.

Contributions

The authors introduce MASt3R-SfM, which builds on MASt3R, a foundation model for 3D vision that robustly produces local 3D reconstructions and accurate matches. The novel aspects of MASt3R-SfM include:

Fully-Integrated Pipeline: The proposed solution eschews the traditional multi-stage approach for a unified, simplified, and scalable pipeline.
Low-Memory, Scalable Reconstruction: A low-memory optimization aligns the local reconstructions in a global coordinate system. Separately, reusing MASt3R's encoder as an image retriever, at no extra cost, reduces the overall complexity from quadratic to linear in the number of images.
Flexibility: It handles unconstrained image collections, ordered or not, from a single image to large-scale scenes, and remains robust in purely rotational captures where the camera does not translate.
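
To make the quadratic-versus-linear claim concrete, here is a minimal sketch, not the authors' implementation. It assumes every image already has a global descriptor (the toy 2D vectors below are made up) and links each image to its `k` most similar images, so the expensive pairwise reconstruction runs on at most `n * k` pairs instead of all `n * (n - 1) / 2`. The names `cosine`, `knn_scene_graph`, `exhaustive_pair_count`, and the parameter `k` are hypothetical, chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_scene_graph(descriptors, k):
    """Return undirected edges (i, j), i < j, linking each image to its k nearest images.

    Note: this toy ranking still compares all descriptor pairs, which is cheap;
    the saving is that costly pairwise inference is only needed on the returned
    edges, of which there are at most n * k.
    """
    n = len(descriptors)
    edges = set()
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: cosine(descriptors[i], descriptors[j]), reverse=True)
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def exhaustive_pair_count(n):
    """Number of image pairs when every image is matched against every other."""
    return n * (n - 1) // 2


# Toy descriptors (assumed values): images 0/1 look alike, as do images 2/3.
descriptors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
edges = knn_scene_graph(descriptors, k=1)
```

For 1,000 images, exhaustive pairing yields 499,500 pairs, whereas a graph with `k = 10` neighbours per image is capped at 10,000 edges.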

Methodology

Sparse Scene Graph Construction: The pipeline starts by creating a scene graph using efficient and scalable image retrieval techniques based on the MASt3R encoder’s features, significantly reducing the number of necessary image pairs for computation.
Local Reconstruction: For each edge in the scene graph, pairwise local 3D reconstructions and matches are generated using the MASt3R model.
Coarse Alignment: The local pointmaps are first aligned in a common frame by gradient descent, minimizing the 3D distance between points linked by the extracted sparse correspondences.
Refinement: This coarse alignment is further refined by minimizing 2D reprojection losses, employing a strategy involving anchor points to optimize depth and intrinsic parameters effectively.
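
The coarse alignment idea can be illustrated with a deliberately reduced sketch: fit a scale `s` and translation `t` that map one local point set onto its matched points in the global frame, using plain gradient descent on the mean squared 3D distance. This is a toy under stated assumptions, not the paper's optimizer: rotation is omitted, only a single local reconstruction is aligned, and the function name `align_scale_translation`, the learning rate, and the step count are illustrative choices.

```python
def align_scale_translation(local_pts, global_pts, lr=0.1, steps=1000):
    """Fit s, t minimizing mean ||s * x + t - y||^2 over matched 3D points."""
    n = len(local_pts)
    s = 1.0
    t = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad_s = 0.0
        grad_t = [0.0, 0.0, 0.0]
        for x, y in zip(local_pts, global_pts):
            # Residual of the current alignment for this correspondence.
            r = [s * x[d] + t[d] - y[d] for d in range(3)]
            # Analytic gradients of the mean squared residual.
            grad_s += 2.0 * sum(r[d] * x[d] for d in range(3)) / n
            for d in range(3):
                grad_t[d] += 2.0 * r[d] / n
        s -= lr * grad_s
        t = [t[d] - lr * grad_t[d] for d in range(3)]
    return s, t


# Synthetic correspondences (assumed data): global = 2 * local + (1, -1, 0.5).
local_pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
offset = (1.0, -1.0, 0.5)
global_pts = [tuple(2.0 * c + o for c, o in zip(p, offset)) for p in local_pts]
scale, translation = align_scale_translation(local_pts, global_pts)
```

On this synthetic data the recovered `scale` and `translation` approach the values used to generate `global_pts`. A full alignment would also estimate rotation and would optimize all views jointly.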

Experimental Results

The performance of MASt3R-SfM is evaluated on several benchmarks, including Tanks and Temples, ETH3D, CO3Dv2, and RealEstate10K. The results indicate robustness and efficiency:

Consistency Across Scales: Performance stays nearly constant as the number of input views varies. The advantage over traditional methods (e.g., COLMAP) and recent alternatives (e.g., VGGSfM, FlowMap, ACE-Zero) is largest in small- and medium-scale settings.
Pose Estimation Accuracy: In multi-view settings, especially with sparse or randomly sampled frames, MASt3R-SfM achieves superior relative rotation and translation accuracies compared to other methods.
Unordered Image Collections: On the ETH3D dataset, which features unordered image collections, MASt3R-SfM consistently surpasses other state-of-the-art approaches, showcasing its ability to handle truly unconstrained image collections.
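
As a rough illustration of how a relative rotation accuracy can be scored, the sketch below computes, for every image pair, the angle between the predicted and ground-truth relative rotations and reports the fraction of pairs under a threshold. The exact protocol and thresholds used in the paper are not given in this summary, so the 5-degree threshold, the rotation convention, and all function names here are assumptions.

```python
import math


def rot_z(deg):
    """3x3 rotation about the z-axis, used only to build toy poses."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def rotation_angle_deg(r_a, r_b):
    """Geodesic angle, in degrees, between two rotation matrices."""
    rel = matmul(transpose(r_a), r_b)
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def relative_rotation_accuracy(pred, gt, threshold_deg):
    """Fraction of image pairs whose relative rotation error is below the threshold."""
    hits = 0
    total = 0
    for i in range(len(pred)):
        for j in range(i + 1, len(pred)):
            rel_pred = matmul(pred[j], transpose(pred[i]))
            rel_gt = matmul(gt[j], transpose(gt[i]))
            if rotation_angle_deg(rel_pred, rel_gt) < threshold_deg:
                hits += 1
            total += 1
    return hits / total


# Toy poses (assumed data): the third predicted camera is off by 20 degrees.
gt = [rot_z(0), rot_z(30), rot_z(60)]
pred = [rot_z(0), rot_z(32), rot_z(80)]
acc = relative_rotation_accuracy(pred, gt, threshold_deg=5.0)
```

Because the score depends only on relative rotations between pairs, it is unaffected by the choice of global coordinate frame, which is what makes it suitable for comparing SfM outputs.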

Implications and Future Work

MASt3R-SfM is a substantial step toward a more robust and scalable SfM solution, addressing several longstanding failure modes in the field. It has practical implications for applications in navigation, dense multi-view stereo reconstruction, visual localization, and even fields like archaeology.

Future research could explore further optimizations in efficiency and robustness, particularly focusing on handling outliers in purely rotational settings, improving the performance for extremely large scale scenes, and integrating more complex camera models to accommodate various real-world scenarios.

Conclusion

MASt3R-SfM offers a significantly more robust, scalable, and versatile approach to Structure-from-Motion than previous methods. By eliminating the need for complex pipelines and RANSAC, and leveraging powerful foundation models for 3D vision, this work paves the way for more reliable and efficient SfM solutions applicable to diverse and unconstrained image collections.
