DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion
(2505.05473v1)
Published 8 May 2025 in cs.CV
Abstract: Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame and employs a transformer-based denoising diffusion model to predict them from multi-view inputs. To address practical challenges in training diffusion models with missing data and unbounded scene coordinates, we introduce specialized mechanisms that ensure robust learning. We empirically validate DiffusionSfM on both synthetic and real datasets, demonstrating that it outperforms classical and learning-based approaches while naturally modeling uncertainty.
Summary
The paper introduces a unified end-to-end framework that directly infers 3D scene geometry and camera poses from multi-view images using a diffusion model.
It employs a transformer-based denoising diffusion approach to predict pixel-wise ray origins and endpoints, effectively handling unbounded coordinates and incomplete depth data.
Experimental results on CO3D, Habitat, and RealEstate10K demonstrate improved camera pose accuracy and competitive geometry reconstruction versus traditional and learning-based methods.
DiffusionSfM (2505.05473) proposes an end-to-end, data-driven approach to Structure-from-Motion (SfM) that directly infers 3D scene geometry and camera poses from a set of multi-view images. Unlike traditional SfM pipelines and many recent learning-based methods that rely on a two-stage process involving pairwise reasoning (like feature matching or pairwise 3D pointmap prediction) followed by global optimization (like bundle adjustment), DiffusionSfM unifies these steps into a single multi-view reasoning framework.
The core idea is to represent scene geometry and cameras using pixel-wise ray origins and endpoints in a global coordinate frame. For each pixel in an image, the ray origin is the camera center, and the ray endpoint is the 3D point on the scene surface observed by that pixel. The task then becomes predicting these dense ray origins and endpoints for all pixels across all input images.
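To make the parameterization concrete, the sketch below derives per-pixel ray origins and endpoints from a pinhole camera and a depth map. The function name, the conventions (camera-to-world rotation R, camera center C, intrinsics K), and the use of NumPy are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rays_from_camera(K, R, C, depth):
    """Per-pixel ray origins and endpoints in a global (world) frame.

    Assumes a pinhole camera: K is the 3x3 intrinsics matrix, R the 3x3
    camera-to-world rotation, C the camera center in world coordinates,
    and depth an (H, W) map of depths along the camera z-axis.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3) homogeneous pixel coords
    dirs_cam = pix @ np.linalg.inv(K).T                     # ray directions in the camera frame
    dirs_world = dirs_cam @ R.T                             # rotate directions into the world frame
    origins = np.broadcast_to(C, (H, W, 3))                 # every pixel shares the camera center
    endpoints = origins + depth[..., None] * dirs_world     # 3D surface point seen by each pixel
    return origins, endpoints
```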
DiffusionSfM employs a transformer-based denoising diffusion model [peebles2023scalable] to predict these ray origins and endpoints. The model takes a set of multi-view images as input. It utilizes features extracted by a powerful vision backbone like DINOv2 [oquab2023dinov2]. Noisy versions of the ground truth ray origins and endpoints are embedded and concatenated with the image features to serve as input to a Diffusion Transformer (DiT) architecture. The DiT processes this multi-view information using self-attention, reasoning jointly across different image patches and views. A Dense Prediction Transformer (DPT) decoder then upsamples the DiT's low-resolution features to produce dense, pixel-aligned ray origins and endpoints.
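The following minimal PyTorch sketch shows the overall data flow: image tokens and noisy-ray tokens are concatenated per patch, all views' tokens attend to each other jointly, and the result is decoded back to dense outputs. Every layer here is a placeholder stand-in with assumed sizes; the actual model uses DINOv2 features, DiT blocks with timestep conditioning, and a DPT decoder.

```python
import torch
import torch.nn as nn

class DenoiserSketch(nn.Module):
    """Schematic of the denoiser's data flow with placeholder layers (not the released model)."""

    def __init__(self, feat_dim=384, dit_dim=768, n_layers=4, patch_pixels=14 * 14):
        super().__init__()
        self.image_enc = nn.Linear(3 * patch_pixels, feat_dim)   # stand-in for DINOv2 patch features
        self.ray_embed = nn.Linear(6 * patch_pixels, feat_dim)   # embeds noisy origins + endpoints
        self.proj = nn.Linear(2 * feat_dim, dit_dim)
        layer = nn.TransformerEncoderLayer(dit_dim, nhead=8, batch_first=True)
        self.dit = nn.TransformerEncoder(layer, num_layers=n_layers)  # joint attention over all views
        self.decode = nn.Linear(dit_dim, 6 * patch_pixels)       # stand-in for the DPT decoder

    def forward(self, img_patches, noisy_ray_patches, t):
        # img_patches: (B, V, P, 3*patch_pixels); noisy_ray_patches: (B, V, P, 6*patch_pixels)
        B, V, P, _ = img_patches.shape
        tok = torch.cat([self.image_enc(img_patches),
                         self.ray_embed(noisy_ray_patches)], dim=-1)
        tok = self.proj(tok).view(B, V * P, -1)    # flatten views so attention reasons across views
        tok = self.dit(tok)                        # (timestep conditioning on t omitted for brevity)
        return self.decode(tok).view(B, V, P, -1)  # per-patch ray origins + endpoints
```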
The authors highlight several practical challenges in training this diffusion model for SfM and introduce specific mechanisms to address them:
Unbounded Scene Coordinates: 3D coordinates can vary significantly in scale across and within scenes, which is problematic for diffusion models that typically work with normalized data. To handle this, ray origins and endpoints are represented in homogeneous coordinates (x, y, z, w) and normalized to unit norm, x² + y² + z² + w² = 1. This provides a bounded representation that includes points at infinity and stabilizes training.
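A minimal sketch of this normalization, assuming NumPy, a default homogeneous weight of w = 1, and a positive-w convention (the helper names are hypothetical):

```python
import numpy as np

def to_unit_homogeneous(points, w=1.0):
    """Map 3D points (..., 3) to unit-norm homogeneous coordinates (..., 4).

    The 4-vector (x, y, z, w) is rescaled so that x^2 + y^2 + z^2 + w^2 = 1,
    giving a bounded representation; very distant points approach w -> 0.
    """
    h = np.concatenate([points, np.full(points.shape[:-1] + (1,), w)], axis=-1)
    return h / np.linalg.norm(h, axis=-1, keepdims=True)

def from_unit_homogeneous(h, eps=1e-8):
    """Recover Euclidean 3D points by dividing out the w component (assumed positive)."""
    return h[..., :3] / np.clip(h[..., 3:4], eps, None)
```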
Incomplete Ground Truth: Real-world datasets often have missing depth information (e.g., sparse point clouds). Since diffusion models require noisy ground truth as input during training, pixels with missing data pose a challenge. DiffusionSfM addresses this by conditioning the model on ground truth masks that indicate valid pixels. During training, noisy rays are multiplied element-wise by these masks, and the diffusion loss is only computed for unmasked pixels. During inference, the masks are set to all ones, allowing the model to predict for all pixels.
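A hedged sketch of how such a masked loss could look, assuming an x0-prediction objective and a model that accepts the validity mask as conditioning (the interfaces and tensor shapes are assumptions, not the released code):

```python
import torch

def masked_diffusion_loss(model, clean_rays, valid_mask, t, alpha_bar):
    """Masked x0-prediction diffusion loss for pixels with known geometry.

    clean_rays: (B, V, C, H, W) ground-truth ray origins/endpoints
    valid_mask: (B, V, 1, H, W), 1 where geometry is known, 0 where it is missing
    alpha_bar : (B,) cumulative noise-schedule terms for the sampled timesteps t
    """
    noise = torch.randn_like(clean_rays)
    a = alpha_bar.view(-1, 1, 1, 1, 1)
    noisy = a.sqrt() * clean_rays + (1 - a).sqrt() * noise   # forward-process corruption
    noisy = noisy * valid_mask                               # zero out pixels with missing GT
    pred = model(noisy, valid_mask, t)                       # the mask is also given as conditioning
    err = (pred - clean_rays) ** 2
    return (err * valid_mask).sum() / valid_mask.sum().clamp(min=1)  # loss on valid pixels only

# At inference the mask is set to all ones, so the model predicts every pixel:
# pred = model(noisy, torch.ones_like(valid_mask), t)
```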
Training Efficiency: Training the high-resolution (dense) model directly can be slow. The authors propose a sparse-to-dense training strategy (see the sketch after these steps):
First, train a sparse model that outputs patch-wise ray origins and endpoints at the same resolution as the DINOv2 features, without the DPT decoder.
Initialize the dense model's DiT weights from the trained sparse model.
Warm up the dense model by training only the embedding layer and the DPT decoder while freezing the DiT.
Finally, fine-tune the entire dense model, including the DINOv2 encoder. This approach improves convergence and performance.
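A rough PyTorch sketch of this staging, assuming the dense model exposes dit, dinov2, ray_embed, and dpt submodules (attribute names and learning rates are illustrative, not taken from the paper):

```python
import torch

def build_stage_optimizers(dense_model, sparse_dit_state):
    """Stage the sparse-to-dense schedule: warm up new layers, then fine-tune everything.

    Stage 2 (warm-up): load the DiT weights trained in the sparse stage, freeze the DiT
    and the DINOv2 encoder, and optimize only the ray-embedding layer and DPT decoder.
    Stage 3 (fine-tune): unfreeze the whole model, including DINOv2, at a lower LR.
    """
    dense_model.dit.load_state_dict(sparse_dit_state)         # init DiT from the sparse model
    for module in (dense_model.dit, dense_model.dinov2):
        for p in module.parameters():
            p.requires_grad = False
    warmup_params = list(dense_model.ray_embed.parameters()) + list(dense_model.dpt.parameters())
    warmup_opt = torch.optim.AdamW(warmup_params, lr=1e-4)

    def make_finetune_opt():
        for p in dense_model.parameters():
            p.requires_grad = True                             # unfreeze everything for stage 3
        return torch.optim.AdamW(dense_model.parameters(), lr=1e-5)

    return warmup_opt, make_finetune_opt
```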
In experiments on datasets like CO3D [reizenstein2021common], Habitat [savva2019habitat], and RealEstate10K [zhou2018stereo], DiffusionSfM demonstrates strong performance compared to classical and learning-based baselines, including DUSt3R [wang2023DUSt3R] and RayDiffusion [zhang2024cameras]. It achieves higher accuracy for camera pose estimation, particularly for camera centers, and provides competitive results for geometry reconstruction (measured by Chamfer Distance). The explicit prediction of ray origins (camera centers) is credited for the improved camera center accuracy.
A significant advantage of using a diffusion model is its ability to model uncertainty and generate multiple plausible interpretations of the scene when the input views are ambiguous, as demonstrated qualitatively with examples where different valid 3D layouts are possible.
For inference, although diffusion models are iterative, DiffusionSfM achieves high accuracy with only a small number of denoising steps (e.g., 10): the model's clean-sample prediction at an intermediate timestep (e.g., stopping at T=90 rather than denoising all the way to T=0) already outperforms the output of the full reverse process. This makes its inference speed competitive with, or even faster than, traditional pipelines like DUSt3R's pairwise inference followed by global alignment.
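A hedged sketch of early-stopped sampling under a generic linear-beta DDPM-style schedule (the schedule, tensor shapes, and model interface are assumptions; the paper's exact sampler may differ):

```python
import torch

@torch.no_grad()
def early_stopped_sample(model, images, masks, T=100, stop_at=90, ray_shape=(8, 32, 32)):
    """Run only (T - stop_at) reverse steps and return the clean-sample (x0) prediction.

    `model` is assumed to predict x0 given the images, noisy rays, validity masks,
    and the current timestep; `ray_shape` is a placeholder output shape.
    """
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    B, V = images.shape[:2]
    x = torch.randn(B, V, *ray_shape)                      # start from pure noise
    for t in range(T - 1, stop_at - 1, -1):
        x0_pred = model(images, x, masks, t)               # model predicts the clean rays (x0)
        if t == stop_at:
            return x0_pred                                 # stop early and keep the x0 prediction
        ab_prev = alpha_bar[t - 1]
        noise = torch.randn_like(x)
        x = ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * noise  # re-noise x0 to level t-1
    return x0_pred
```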
The paper acknowledges limitations, such as the computational cost scaling quadratically with the number of input images for the Transformer architecture (suggesting masked attention for scaling to large sets), and potential benefits from using a latent diffusion model instead of pixel-space diffusion for efficiency.
Overall, DiffusionSfM presents a compelling end-to-end approach for SfM, unifying structure and motion prediction using a diffusion model and introducing practical techniques for training with real-world data challenges like missing ground truth and unbounded coordinates.