MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion (2504.20040v1)

Published 28 Apr 2025 in cs.CV and cs.RO

Abstract: While Structure-from-Motion (SfM) has seen much progress over the years, state-of-the-art systems are prone to failure when facing extreme viewpoint changes in low-overlap, low-parallax or high-symmetry scenarios. Because capturing images that avoid these pitfalls is challenging, this severely limits the wider use of SfM, especially by non-expert users. We overcome these limitations by augmenting the classical SfM paradigm with monocular depth and normal priors inferred by deep neural networks. Thanks to a tight integration of monocular and multi-view constraints, our approach significantly outperforms existing ones under extreme viewpoint changes, while maintaining strong performance in standard conditions. We also show that monocular priors can help reject faulty associations due to symmetries, which is a long-standing problem for SfM. This makes our approach the first capable of reliably reconstructing challenging indoor environments from few images. Through principled uncertainty propagation, it is robust to errors in the priors, can handle priors inferred by different models with little tuning, and will thus easily benefit from future progress in monocular depth and normal estimation. Our code is publicly available at https://github.com/cvg/mpsfm.

The paper "MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion" (Pataki et al., 28 Apr 2025) introduces a Structure-from-Motion (SfM) pipeline that augments traditional methods with monocular depth and surface normal priors estimated by deep neural networks. The primary goal is to make SfM more robust in low-overlap, low-parallax, and high-symmetry scenes, which frequently cause state-of-the-art systems to fail.

Problem: Traditional incremental SfM pipelines, such as COLMAP, require three-view tracks (3D points observed in at least three images) with sufficient parallax to keep the scale consistent across the reconstruction and to triangulate accurately. This limits their performance and can lead to failure on data captured by non-experts, where image overlap may be minimal, viewpoint changes extreme, or structures repetitive. Advances in learned matching have improved correspondence finding, but downstream reconstruction algorithms still face these fundamental limitations.

Proposed Solution: The authors propose augmenting the classical incremental SfM paradigm with monocular depth and normal priors. By tightly integrating these single-view constraints with multi-view constraints (feature correspondences), their method, MP-SfM, can perform accurate and robust 3D reconstruction even from only two-view tracks, overcoming the need for three-view overlap for scale initialization and structure extension. The method is designed to be robust to noise in the monocular priors by using principled uncertainty propagation and joint optimization.
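To make the two-view idea concrete, the sketch below shows how a single pixel with a depth value already defines a 3D point, so a second observation is enough to form a usable 2D-3D correspondence. It is a minimal illustration, not the authors' implementation: the function names, the `(fx, fy, cx, cy)` intrinsics tuple, and the median-ratio scale alignment are assumptions made here (the summary only states that depth maps are scaled to agree with the initial 3D points).

```python
from statistics import median


def lift_pixel(K, uv, depth):
    """Back-project pixel `uv` with z-depth `depth` into camera coordinates.

    K is a hypothetical (fx, fy, cx, cy) pinhole intrinsics tuple.
    """
    fx, fy, cx, cy = K
    u, v = uv
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def project(K, point):
    """Project a 3D point in camera coordinates back to pixel coordinates."""
    fx, fy, cx, cy = K
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def median_scale(mono_depths, sfm_depths):
    """Scale factor mapping monocular depths onto SfM depths.

    Assumption: a median of per-point ratios, chosen here because it
    tolerates a few outliers. The actual alignment may differ.
    """
    ratios = [s / m for m, s in zip(mono_depths, sfm_depths) if m > 0 and s > 0]
    return median(ratios)


K = (500.0, 500.0, 320.0, 240.0)
point = lift_pixel(K, (420.0, 240.0), 2.0)      # 3D point from one view only
scale = median_scale([1.0, 2.0, 4.0], [2.0, 4.0, 40.0])  # third pair is an outlier
print(point, project(K, point), scale)
```

A lifted point of this kind can stand in for a triangulated one when a new image is registered, which is what removes the need for three-view overlap.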

Methodology:

The MP-SfM pipeline is built upon the COLMAP framework, modifying several key stages:

Inputs: The system takes unordered images, camera intrinsics, monocular depth maps, normal maps (with associated confidence/uncertainty maps), and sparse or dense feature correspondences as input. Off-the-shelf deep networks are used for monocular depth and normal estimation (e.g., Metric3D-v2, DSINE).
Two-View Initialization: The system first attempts standard relative pose estimation, which needs sufficient parallax. If that fails (e.g., due to low parallax), it uses the monocular depth of one image to lift its 2D feature points to 3D, then initializes the relative pose with 2D-3D pose estimation (PnP) against the correspondences in the second image. Initial 3D points are created by lifting low-parallax inliers with depth and triangulating the others. Monocular depth maps are scaled to be consistent with the initial 3D points.
Next View Registration: Images not yet registered are ranked based on correspondences to currently registered images (using either sparse or dense matches). Registration is performed via robust absolute pose estimation using 2D-3D correspondences. Crucially, these 2D-3D correspondences include both triangulated points and points previously lifted from a single view using monocular depth. This allows registration without requiring the new view to have three-way overlap with already triangulated points, enabling reconstruction in low-overlap settings.
Local and Global Refinement: This is the core optimization step. Instead of standard Bundle Adjustment (BA), MP-SfM jointly optimizes camera poses ( $\mathcal{P}$ ), scene points ( $\mathcal{X}$ ), and refined, globally consistent depth maps ( $\mathcal{D}^*$ ). The objective function combines:
- A standard BA term ( $C_{\mathrm{BA}}$ ) minimizing reprojection errors for multi-view points.
- A depth regularization term ( $C_{\mathrm{reg}}$ ) penalizing deviations between the projected depth of 3D points ( $\hat{D}_i(X_k)$ ) and the refined depth map values ( $D_i^*(x_j)$ ).
- A depth integration term ( $C_{\mathrm{int}}$ ) that enforces consistency between the refined depth map ( $D_i^*$ ) and the monocular depth ( $D_i$ ) and normal ( $N_i$ ) priors. This term uses bilateral normal integration with uncertainty weighting.

The optimization is solved with alternating block coordinate descent: depth maps are optimized independently per image, then poses and 3D points are optimized jointly. Uncertainties are propagated throughout so that the result stays robust to noisy priors.
Depth Consistency Check: After registering a new view, the refined depth map is used for a dense consistency check with overlapping registered views. This involves reprojecting depth maps between views and identifying inconsistent pixels based on a confidence threshold. If a view exhibits too many inconsistencies, it is de-registered. This mechanism helps reject faulty registrations, especially those caused by symmetries, which sparse point checks often miss.
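The depth consistency check can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: it uses a shared pinhole camera, nearest-pixel lookup, a relative depth tolerance `rel_tol` in place of the confidence-based threshold mentioned above, and a hypothetical `max_frac` cutoff for de-registration. It also ignores occlusions, which a real check would have to handle.

```python
def mat_vec(R, p):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) for i in range(3))


def inconsistent_fraction(depth_a, depth_b, K, R_ba, t_ba, rel_tol=0.05):
    """Fraction of pixels of view A whose depth disagrees with view B.

    depth_a, depth_b: depth maps as lists of rows (values <= 0 mean invalid).
    K: (fx, fy, cx, cy), assumed identical for both views.
    R_ba, t_ba: rigid transform taking points from A's frame to B's frame.
    rel_tol: assumed relative tolerance standing in for a confidence test.
    """
    fx, fy, cx, cy = K
    h_b, w_b = len(depth_b), len(depth_b[0])
    checked = bad = 0
    for v, row in enumerate(depth_a):
        for u, d in enumerate(row):
            if d <= 0:
                continue
            # Back-project the pixel of A, then move it into B's frame.
            p_a = ((u - cx) / fx * d, (v - cy) / fy * d, d)
            rotated = mat_vec(R_ba, p_a)
            p_b = tuple(rotated[i] + t_ba[i] for i in range(3))
            if p_b[2] <= 0:
                continue  # behind camera B
            u_b = round(fx * p_b[0] / p_b[2] + cx)
            v_b = round(fy * p_b[1] / p_b[2] + cy)
            if not (0 <= u_b < w_b and 0 <= v_b < h_b):
                continue  # falls outside image B
            d_b = depth_b[v_b][u_b]
            if d_b <= 0:
                continue
            checked += 1
            if abs(p_b[2] - d_b) > rel_tol * d_b:
                bad += 1
    return bad / checked if checked else 0.0


def should_deregister(fraction, max_frac=0.3):
    """Hypothetical decision rule: reject a view with too many inconsistencies."""
    return fraction > max_frac


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (4.0, 4.0, 1.5, 1.5)
plane = [[2.0] * 4 for _ in range(4)]
print(inconsistent_fraction(plane, plane, K, identity, (0.0, 0.0, 0.0)))
```

Because every valid pixel contributes, a registration that merely aligns a handful of sparse points on a repeated structure still produces large dense disagreement, which is the intuition behind rejecting symmetric mismatches this way.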

Implementation Details: The system leverages modern feature extractors (SuperPoint) and matchers (LightGlue, RoMA, MASt3R). It uses state-of-the-art monocular depth and normal models (Metric3D-v2, DSINE, Depth Anything v2, Depth Pro), focusing on those that provide uncertainty estimates. Depth refinement and integration run on the GPU, while the BA-like step is handled by Ceres Solver on the CPU. Uncertainty propagation from the monocular priors is critical, and the paper details how normal uncertainties are handled and how depth uncertainties are calibrated or approximated.
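As a rough illustration of what propagating a depth uncertainty to a lifted 3D point involves, consider a pixel with homogeneous coordinates $\tilde{x}_j$, depth $d$ with variance $\sigma_d^2$, and intrinsics $K$. The symbols $K$, $d$, and $\sigma_d$ are introduced here for illustration, and the expression ignores keypoint localization noise; it is a sketch under these assumptions, not the paper's derivation.

$$
X = d\,K^{-1}\tilde{x}_j, \qquad \Sigma_X = \sigma_d^2\,\bigl(K^{-1}\tilde{x}_j\bigr)\bigl(K^{-1}\tilde{x}_j\bigr)^{\top}
$$

Under these assumptions the covariance is rank one and aligned with the viewing ray, so an unreliable depth prior loosens the point only along that ray and leaves the reprojection constraint in the source image intact.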

Experimental Evaluation: The paper evaluates MP-SfM extensively on challenging datasets, including ETH3D [schops2017multi], SMERF [duckworth2023smerf], and Tanks and Temples [knapitsch2017tanks] for low-overlap reconstruction, and RealEstate10k [zhou2018stereo] for low-parallax scenarios.

Low-overlap: Experiments show significant improvements over traditional (COLMAP, GLOMAP, SLR) and recent learned (VGG-SfM, MASt3R-SfM) SfM methods, particularly in minimal and low-overlap settings. The ability to use two-view tracks (facilitated by depth priors) is crucial. Performance is consistent whether using sparse (SuperPoint+LightGlue) or dense (RoMA, MASt3R) correspondences.
Low-parallax: MP-SfM largely closes the performance gap between incremental and global SfM methods on the RealEstate10k dataset, outperforming COLMAP and StudioSfM by effectively handling the low-parallax challenge.
Robustness to symmetry: The depth consistency check is shown to be effective in rejecting incorrect registrations in scenes with symmetries (e.g., SMERF dataset), outperforming learning-based methods like Doppelgangers in some cases.
Ablation Studies: Analysis demonstrates the importance of depth refinement (especially using normal priors) and depth regularization. Lifting points using monocular depth is key for low-overlap scenarios but can introduce noise in well-posed scenes. The quality and uncertainty calibration of monocular priors heavily influence the results.
Efficiency: MP-SfM adds overhead compared to standard COLMAP, primarily from depth refinement and 3D point covariance computation. The depth consistency check itself is computationally inexpensive, and optimization stays efficient through the alternating steps and by skipping redundant refinements.

Limitations: The method's performance depends on the reliability of monocular prior uncertainties, which are not always well-calibrated by existing models. Estimating accurate normals for certain surfaces (like vegetation) remains challenging and affects performance. The pipeline still relies on standard components (image retrieval, matching) that can fail in extreme conditions, although future improvements in these areas will directly benefit MP-SfM. The approach increases overall processing time compared to traditional SfM.

Conclusion: MP-SfM successfully integrates monocular depth and surface normal priors into an incremental SfM pipeline, addressing critical failure cases in low-overlap, low-parallax, and high-symmetry scenes. By lifting the requirement for three-view tracks and incorporating dense consistency checks, it achieves significantly improved robustness and accuracy, making SfM more accessible and reliable for non-expert users and challenging real-world data.