Multi-view Reconstruction via SfM-guided Monocular Depth Estimation
The paper introduces "Murre," a framework that enhances multi-view 3D reconstruction through SfM-guided monocular depth estimation. This approach departs from traditional multi-view stereo (MVS) methods, which suffer from high memory demands and degraded performance in sparse-view scenarios. By integrating Structure from Motion (SfM) priors into diffusion-based depth estimation, Murre addresses both issues while maintaining high reconstruction quality and generalization capability.
Methodology and Contributions
The proposed method operates as a multi-stage pipeline that integrates SfM with diffusion-based monocular depth estimation to reconstruct 3D scenes. First, the method extracts a sparse SfM point cloud from the input images, capturing the global scene structure. This point cloud then conditions a diffusion model to predict multi-view-consistent depth maps, allowing the pipeline to bypass the per-pixel multi-view matching used in conventional MVS methods.
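The first stage, turning the SfM point cloud into a per-view sparse depth map, can be sketched as below. This is an illustrative reconstruction, not the paper's code: the function name `project_to_sparse_depth` and the simple z-buffering strategy are assumptions; only the geometry (pinhole projection of triangulated points into each view) follows the described pipeline.

```python
import numpy as np

def project_to_sparse_depth(points_w, K, R, t, h, w):
    """Project world-space SfM points into a sparse depth map for one view.

    points_w: (N, 3) world coordinates; K: (3, 3) intrinsics;
    R, t: world-to-camera rotation and translation.
    Returns an (h, w) map with 0 where no point projects; when several
    points land on one pixel, the nearest depth wins (z-buffering).
    """
    cam = points_w @ R.T + t              # world -> camera coordinates
    z = cam[:, 2]
    valid = z > 1e-6                      # keep points in front of the camera
    cam, z = cam[valid], z[valid]
    uv = cam @ K.T                        # pinhole projection (homogeneous)
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[inside], v[inside], z[inside]
    depth = np.zeros((h, w))
    order = np.argsort(-z)                # write far-to-near: nearest overwrites
    depth[v[order], u[order]] = z[order]
    return depth
```

Because SfM typically triangulates only a few thousand points per scene, the resulting map is extremely sparse, which is why the densification step described next is needed before it can condition the diffusion model.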
The notable design innovation in Murre is the use of the SfM point cloud as an intermediate explicit representation that injects multi-view information into the depth estimation task. Concretely, the point cloud is converted into a sparse depth map, which is densified and then used to condition the monocular depth estimator. Conditioning on this signal yields depth predictions that are globally scale-accurate and consistent across views.
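The densification step can be illustrated with a minimal sketch. Nearest-neighbor interpolation is used here purely as a simple stand-in; the paper's actual densification scheme may differ, and the function name `densify_sparse_depth` is an assumption.

```python
import numpy as np
from scipy.interpolate import griddata

def densify_sparse_depth(sparse):
    """Fill a sparse depth map (0 = missing) so it can serve as a dense
    conditioning signal for the depth estimator.

    Nearest-neighbor interpolation is an illustrative choice, not the
    paper's method.
    """
    h, w = sparse.shape
    vs, us = np.nonzero(sparse)           # pixels with a known SfM depth
    if len(vs) == 0:
        return sparse.copy()              # nothing to interpolate from
    grid_v, grid_u = np.mgrid[0:h, 0:w]
    dense = griddata(
        np.stack([vs, us], axis=1),       # coordinates of known depths
        sparse[vs, us],                   # known depth values
        (grid_v, grid_u),
        method="nearest",
    )
    return dense
```

In the full pipeline, this densified map (often alongside a validity mask marking where SfM depths actually exist) is fed to the diffusion model as a conditioning input, anchoring the predicted depths to the metric scale established by SfM.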
Experimentally, Murre significantly outperforms state-of-the-art MVS and implicit neural reconstruction models across diverse datasets, including DTU, ScanNet, Replica, Waymo, and UrbanScene3D, establishing its effectiveness across various real-world scenarios. The model achieves superior performance in complex environments and demonstrates resilience to low-texture regions and sparse viewpoints.
Implications and Future Prospects
This research illustrates remarkable progress in resolving inherent challenges in image-based 3D reconstruction, such as memory inefficiency and scale ambiguity in depth estimation. By presenting a pipeline that combines diffusion models fine-tuned on synthetic data with established SfM techniques, this work opens new avenues for efficient, scalable, and robust 3D scene reconstructions.
From a practical standpoint, the integration of diffusion models with SfM broadens the application potential, from VR and AR to autonomous systems, wherever dense, accurate scene reconstructions from limited data are crucial. The model's ability to function effectively with minimal training data adds to its appeal, suggesting pathways toward models that generalize across vastly different environments.
The paper hints at future exploration in areas with extremely sparse view setups where current methods might still falter. Additionally, extending the model's capacity to handle dynamic elements in scenes could be essential for comprehensive real-time applications. As the landscape of large-scale synthetic data and foundational models continues to grow, the synthesis of such techniques with data-centric paradigms will likely propel further advancements in 3D computer vision.
Overall, Murre represents a significant methodological stride in multi-view 3D reconstruction, setting a precedent for combining robust traditional techniques with innovative deep learning models to overcome existing limitations in scene reconstruction quality and generalization.