- The paper introduces Light3R-SfM, a fully learnable feed-forward Structure-from-Motion framework achieving efficient and scalable scene reconstruction from large image collections.
- Its latent global alignment module uses a scalable attention mechanism to implicitly capture multi-view constraints and share global information efficiently.
- It builds a sparse scene graph via a shortest path tree and accumulates pairwise 3D pointmaps globally, reducing memory and computation.
The paper introduces Light3R-SfM, a novel feed-forward framework for Structure-from-Motion (SfM) designed for efficiency and scalability on large, unconstrained image collections. It addresses the limitations of traditional SfM pipelines, which rely on costly matching and global optimization, by replacing optimization-based global alignment with a learnable latent global alignment module built on attention. The method further constructs a sparse scene graph via a retrieval-score-guided shortest path tree (SPT) to reduce memory usage and computational overhead.
The key contributions of Light3R-SfM are:
- A fully learnable feed-forward SfM model that directly estimates globally aligned camera poses from unordered image collections, thereby eliminating expensive optimization-based global alignment.
- A latent global alignment module with a scalable attention mechanism that implicitly captures multi-view constraints, enabling global information sharing between features prior to pairwise 3D reconstruction.
The Light3R-SfM pipeline consists of four main stages:
- Encoding: An image encoder extracts per-image feature tokens $F_i^{(0)} = \mathrm{Enc}(I_i) \in \mathbb{R}^{\frac{HW}{p^2} \times d}$, where $I_i \in \mathbb{R}^{H \times W \times 3}$ is the input image, $H$ and $W$ are its height and width, $p$ is the patch size of the encoder, and $d$ is the token dimensionality.
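As a shape check, the tokenization in this step can be sketched in numpy; the linear patch projection below is only a stand-in for the paper's full transformer encoder, and all sizes ($H = W = 224$, $p = 16$, $d = 256$) are illustrative assumptions:

```python
import numpy as np

# Hypothetical sketch: split an H x W image into non-overlapping p x p
# patches and project each flattened patch to a d-dimensional token.
H, W, p, d = 224, 224, 16, 256
rng = np.random.default_rng(0)

image = rng.standard_normal((H, W, 3))       # I_i in R^{H x W x 3}
proj = rng.standard_normal((p * p * 3, d))   # stand-in patch embedding

# Rearrange into HW/p^2 patches of p*p*3 values, then project to tokens.
patches = image.reshape(H // p, p, W // p, p, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, p * p * 3)     # (HW/p^2, p^2 * 3)
tokens = patches @ proj                      # F_i^(0) in R^{(HW/p^2) x d}
```

For these sizes the encoder produces $224 \cdot 224 / 16^2 = 196$ tokens of dimensionality 256 per image.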
- Latent Global Alignment: This module performs implicit global alignment in latent space, using a scalable attention mechanism to align the image tokens of all views in feature space.
- It computes a global token $g_i^{(0)} \in \mathbb{R}^d$ for each set of image tokens $F_i^{(0)}$ by averaging along the spatial dimensions.
- It applies $L$ latent global alignment blocks to achieve global information sharing across all image tokens.
- For each level $l \in \{0, \dots, L-1\}$, it shares information across all global tokens via self-attention: $\tilde{g}_i^{(l)} = \mathrm{SelfAttn}(g_1^{(l)}, \dots, g_N^{(l)})_i$, where $N$ is the number of images.
- It propagates the updated global information to the dense image tokens of each image independently via cross-attention: $\tilde{F}_i^{(l)} = \mathrm{CrossAttn}(F_i^{(l)}, \tilde{g}_i^{(l)})$.
- Finally, it obtains the globally aligned image tokens via a residual connection, $F_i^{(l+1)} = F_i^{(l)} + \tilde{F}_i^{(l)}$.
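The sub-steps above can be sketched as a minimal numpy block; learned Q/K/V projections, multi-head structure, and normalization layers are omitted, and the exact composition of the paper's block (e.g., which tokens serve as keys in the cross-attention) is an assumption here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention; learned projections omitted for brevity.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def latent_global_alignment_block(F):
    """One alignment block over a list of per-image token arrays (T, d)."""
    # 1) One global token per image: average pool over the token dimension.
    g = np.stack([f.mean(axis=0) for f in F])          # (N, d)
    # 2) Self-attention across global tokens shares multi-view information.
    g_tilde = attention(g, g, g)                       # (N, d)
    # 3) Cross-attention propagates the updated global information back to
    #    each image's dense tokens, followed by 4) a residual connection.
    return [f + attention(f, g_tilde, g_tilde) for f in F]

rng = np.random.default_rng(0)
F = [rng.standard_normal((196, 64)) for _ in range(4)]  # 4 images
F1 = latent_global_alignment_block(F)
```

Because only $N$ global tokens attend to each other, the cross-image cost grows with the number of images rather than the total number of dense tokens, which is what makes this alignment scalable.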
- Scene Graph Construction: It constructs a scene graph that maximizes pairwise image similarities using the shortest path tree (SPT) algorithm. The matrix $S \in \mathbb{R}^{N \times N}$ containing all pairwise cosine similarities is computed as $S_{ij} = \frac{e_i^\top e_j}{\lVert e_i \rVert \lVert e_j \rVert}$, where $e_i \in \mathbb{R}^d$ is a one-dimensional embedding obtained by average pooling the tokens of image $I_i$.
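A minimal sketch of this stage, assuming edge cost $1 - S_{ij}$ over the complete image graph and an arbitrary root (the paper selects the root using retrieval scores):

```python
import heapq
import numpy as np

def cosine_similarity_matrix(E):
    """S[i, j] = cos(e_i, e_j) for per-image embeddings E of shape (N, d)."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E @ E.T

def shortest_path_tree(S, root=0):
    """Dijkstra-style SPT over the complete graph with edge cost 1 - S[i, j],
    so paths through highly similar image pairs are preferred."""
    N = len(S)
    dist = np.full(N, np.inf)
    parent = [-1] * N
    dist[root] = 0.0
    heap = [(0.0, root)]
    done = [False] * N
    while heap:
        d_u, u = heapq.heappop(heap)
        if done[u]:
            continue
        done[u] = True
        for v in range(N):
            if v == u or done[v]:
                continue
            nd = d_u + (1.0 - S[u, v])
            if nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    # The tree edges (parent, child) select exactly N - 1 pairs to decode.
    return [(parent[v], v) for v in range(N) if v != root]

rng = np.random.default_rng(1)
E = rng.standard_normal((5, 32))
S = cosine_similarity_matrix(E)
edges = shortest_path_tree(S, root=0)
```

Decoding only the $N - 1$ tree edges instead of all $O(N^2)$ pairs is what reduces the memory and compute footprint of the pairwise stage.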
- Decoding and Global Accumulation: The decoding step converts image pairs connected by an edge into pointmaps using a stereo reconstruction decoder. The global reconstruction accumulates pairwise pointmaps by traversing the scene graph to obtain globally aligned pointmaps, yielding per-image camera extrinsics $P_i \in \mathbb{R}^{3 \times 4}$, intrinsics $K_i \in \mathbb{R}^{3 \times 3}$, and a dense 3D pointmap $X_i \in \mathbb{R}^{H \times W \times 3}$ at image resolution.
- For every edge $(i, j)$ in the scene graph, the decoder outputs two pointmaps and associated confidence maps, $X^{i,i}, C^{i,i}, X^{j,i}, C^{j,i} = \mathrm{Dec}(F_i^{(L)}, F_j^{(L)})$, both expressed in the coordinate frame of camera $i$.
- Per-edge local pointmap predictions are merged into a global one.
- The global point cloud is initialized from the root node $r$ of the SPT, i.e., $X_r^{\mathrm{glob}} = X^{r,r}$, fixing the root camera as the global reference frame.
- Procrustes alignment is used to estimate the optimal scaled rigid-body transformation between the two pointmaps: $\sigma^*, R^*, t^* = \arg\min_{\sigma, R, t} \sum_p C_p^{i,i} \lVert \sigma (R X_p^{i,i} + t) - X_{i,p}^{\mathrm{glob}} \rVert^2$.
- The pointmap of node $j$ is then transformed into the global coordinate frame: $X_j^{\mathrm{glob}} = \sigma^* (R^* X^{j,i} + t^*)$.
- This is repeated for all edges of the tree, registering every image in the shared global frame.
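The alignment step inside this loop is a standard similarity-transform (Umeyama-style) Procrustes fit. A minimal numpy sketch, with optional per-point weights standing in for the confidence maps:

```python
import numpy as np

def procrustes_align(X, Y, w=None):
    """Estimate (s, R, t) minimizing sum_p w_p ||s * R @ X_p + t - Y_p||^2
    via Umeyama's closed-form solution, for X, Y of shape (P, 3)."""
    if w is None:
        w = np.ones(len(X))
    w = w / w.sum()
    mu_x = (w[:, None] * X).sum(0)
    mu_y = (w[:, None] * Y).sum(0)
    Xc, Yc = X - mu_x, Y - mu_y
    cov = (w[:, None] * Yc).T @ Xc                  # weighted 3x3 covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid mirror
    R = U @ S @ Vt
    var_x = (w * (Xc ** 2).sum(1)).sum()
    s = np.trace(np.diag(D) @ S) / var_x
    t = mu_y - s * R @ mu_x
    return s, R, t

# Sanity check: recover a known similarity transform from exact data.
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 3))
R_true, _ = np.linalg.qr(rng.standard_normal((3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1                              # force a proper rotation
Y = 2.5 * X @ R_true.T + np.array([1.0, -2.0, 0.5])
s, R, t = procrustes_align(X, Y)
```

In the accumulation loop, the transform fitted by aligning $X^{i,i}$ to the already-registered $X_i^{\mathrm{glob}}$ would then be applied to $X^{j,i}$ to register image $j$.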
The model is supervised by both pairwise and global losses. The pairwise loss, $\mathcal{L}_{\mathrm{pair}}$, supervises the pairwise local pointmaps per edge:
$\mathcal{L}_{\mathrm{pair}} = \sum_{v \in \{i, j\}} \sum_{p \in \mathcal{D}^v} C_p^{v} \, \ell_{\mathrm{regr}}(v, p) - \alpha \log C_p^{v}$, where $\ell_{\mathrm{regr}}(v, p) = \lVert \tfrac{1}{z} X_p^{v} - \tfrac{1}{\bar{z}} \bar{X}_p^{v} \rVert$.
$X$, $C$, and $\bar{X}$ are the predicted pointmap, confidence map, and ground-truth pointmap, $z$ and $\bar{z}$ are scale normalizers, $\mathcal{D}^v$ defines the valid pixels with ground truth, and the $-\alpha \log C$ term regularizes the confidences to not be pushed to $0$. The global loss, $\mathcal{L}_{\mathrm{glob}}$, supervises the transformed global pointmap prediction $X_i^{\mathrm{glob}}$ of each image in the same confidence-weighted form. The total loss is optimized as $\mathcal{L} = \mathcal{L}_{\mathrm{pair}} + \lambda \mathcal{L}_{\mathrm{glob}}$, with weighting factor $\lambda$.
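A minimal numpy sketch of the confidence-weighted regression term; using the mean point norm as the scale normalizer and $\alpha = 0.2$ are assumptions here, not values from the paper:

```python
import numpy as np

def confidence_weighted_loss(X, C, X_gt, valid, alpha=0.2):
    """Confidence-weighted pointmap regression loss (DUSt3R-style):
    mean over valid pixels of C * ||X/z - X_gt/z_gt|| - alpha * log C.
    The -alpha * log C term keeps confidences from collapsing to 0."""
    z = np.linalg.norm(X[valid], axis=-1).mean()       # assumed normalizer
    z_gt = np.linalg.norm(X_gt[valid], axis=-1).mean()
    err = np.linalg.norm(X[valid] / z - X_gt[valid] / z_gt, axis=-1)
    return (C[valid] * err - alpha * np.log(C[valid])).mean()

rng = np.random.default_rng(3)
X_gt = rng.standard_normal((32, 32, 3)) + np.array([0.0, 0.0, 5.0])
valid = np.ones((32, 32), dtype=bool)
C = np.ones((32, 32))
loss = confidence_weighted_loss(X_gt, C, X_gt, valid)
```

Note the trade-off built into the loss: a pixel can down-weight its regression error only by paying the $-\alpha \log C$ penalty, so confidences settle at finite values rather than 0.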
The method was evaluated on Tanks&Temples, CO3Dv2, and the Waymo Open Dataset. The evaluation metrics include relative rotation accuracy (RRA), relative translation accuracy (RTA), average translation error (ATE), and registration rate (Reg.). Results on Tanks&Temples show that Light3R-SfM achieves competitive accuracy compared to other learning-based methods and rivals state-of-the-art optimization-based SfM techniques while offering significant improvements in efficiency and scalability. For instance, Light3R-SfM reconstructs a scene of 200 images in 33 seconds, whereas MASt3R-SfM takes approximately 27 minutes. Comparisons with Spann3R demonstrate the benefit of the latent global alignment module, which yields clearly higher average RRA and RTA scores.
On the Waymo Open Dataset, Light3R-SfM achieves accuracy comparable to MASt3R-SfM at a lower runtime, and outperforms Spann3R with better accuracy (e.g., in RTA@5) while also running faster.
Ablation studies validate the impact of each component, including backbone initialization, global supervision, latent alignment, and graph construction strategies.