- The paper introduces Light3R-SfM, a fully learnable feed-forward Structure-from-Motion framework achieving efficient and scalable scene reconstruction from large image collections.
- Its latent global alignment module uses a scalable attention mechanism to implicitly capture multi-view constraints and share global information efficiently.
- It builds a sparse scene graph via a shortest path tree and accumulates pairwise 3D pointmaps globally, reducing memory and computation.
The paper introduces Light3R-SfM, a novel feed-forward framework for Structure-from-Motion (SfM) designed for efficiency and scalability on large, unconstrained image collections. It addresses the limitations of traditional SfM pipelines, which rely on costly matching and global optimization, by replacing optimization-based global alignment with a learnable latent global alignment module built on attention. The method further constructs a sparse scene graph via a retrieval-score-guided shortest path tree (SPT) to reduce memory usage and computational overhead.
The key contributions of Light3R-SfM are:
- A fully learnable feed-forward SfM model that directly estimates globally aligned camera poses from unordered image collections, thereby eliminating expensive optimization-based global alignment.
- A latent global alignment module with a scalable attention mechanism that implicitly captures multi-view constraints, enabling global information sharing between features prior to pairwise 3D reconstruction.
The Light3R-SfM pipeline consists of four main stages:
- Encoding: An image encoder extracts per-image feature tokens $F_i^{(0)} = \mathrm{Enc}(I_i) \in \mathbb{R}^{\frac{HW}{p^2} \times d}$, where $I_i \in \mathbb{R}^{H \times W \times 3}$ is the input image, $H$ and $W$ are its height and width, $p$ is the patch size of the encoder, and $d$ is the token dimensionality.
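As a shape check, the tokenization in this step can be sketched in numpy; the linear patch projection below is only a stand-in for the paper's full transformer encoder, and all sizes ($H = W = 224$, $p = 16$, $d = 256$) are illustrative assumptions:

```python
import numpy as np

# Hypothetical sketch: split an H x W image into non-overlapping p x p
# patches and project each flattened patch to a d-dimensional token.
H, W, p, d = 224, 224, 16, 256
rng = np.random.default_rng(0)

image = rng.standard_normal((H, W, 3))       # I_i in R^{H x W x 3}
proj = rng.standard_normal((p * p * 3, d))   # stand-in patch embedding

# Rearrange into HW/p^2 patches of p*p*3 values, then project to tokens.
patches = image.reshape(H // p, p, W // p, p, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, p * p * 3)     # (HW/p^2, p^2 * 3)
tokens = patches @ proj                      # F_i^(0) in R^{(HW/p^2) x d}
```

For these sizes the encoder produces $224 \cdot 224 / 16^2 = 196$ tokens of dimensionality 256 per image.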
- Latent Global Alignment: This module performs implicit global alignment in latent space, using a scalable attention mechanism to align the image tokens of all views in feature space.
- It computes a global token $g_i^{(0)} \in \mathbb{R}^d$ for each set of image tokens $F_i^{(0)}$ by averaging along the spatial dimensions.
- It applies $L$ latent global alignment blocks to achieve global information sharing across all image tokens.
- For each level $l \in \{0, \dots, L-1\}$, it shares information across all global tokens via self-attention: $\tilde{g}_i^{(l)} = \mathrm{SelfAttn}(g_1^{(l)}, \dots, g_N^{(l)})_i$, where $N$ is the number of images.
- It propagates the updated global information to the dense image tokens of each image independently via cross-attention: $\tilde{F}_i^{(l)} = \mathrm{CrossAttn}(F_i^{(l)}, \tilde{g}_i^{(l)})$.
- Finally, it obtains the globally aligned image tokens via a residual connection, $F_i^{(l+1)} = F_i^{(l)} + \tilde{F}_i^{(l)}$.
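The sub-steps above can be sketched as a minimal numpy block; learned Q/K/V projections, multi-head structure, and normalization layers are omitted, and the exact composition of the paper's block (e.g., which tokens serve as keys in the cross-attention) is an assumption here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention; learned projections omitted for brevity.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def latent_global_alignment_block(F):
    """One alignment block over a list of per-image token arrays (T, d)."""
    # 1) One global token per image: average pool over the token dimension.
    g = np.stack([f.mean(axis=0) for f in F])          # (N, d)
    # 2) Self-attention across global tokens shares multi-view information.
    g_tilde = attention(g, g, g)                       # (N, d)
    # 3) Cross-attention propagates the updated global information back to
    #    each image's dense tokens, followed by 4) a residual connection.
    return [f + attention(f, g_tilde, g_tilde) for f in F]

rng = np.random.default_rng(0)
F = [rng.standard_normal((196, 64)) for _ in range(4)]  # 4 images
F1 = latent_global_alignment_block(F)
```

Because only $N$ global tokens attend to each other, the cross-image cost grows with the number of images rather than the total number of dense tokens, which is what makes this alignment scalable.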
- Scene Graph Construction: It constructs a scene graph that maximizes pairwise image similarities using the shortest path tree (SPT) algorithm. The matrix $S \in \mathbb{R}^{N \times N}$ containing all pairwise cosine similarities is computed as $S_{ij} = \frac{e_i^\top e_j}{\lVert e_i \rVert \lVert e_j \rVert}$, where $e_i \in \mathbb{R}^d$ is a one-dimensional embedding obtained by average pooling the tokens of image $I_i$.
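A minimal sketch of this stage, assuming edge cost $1 - S_{ij}$ over the complete image graph and an arbitrary root (the paper selects the root using retrieval scores):

```python
import heapq
import numpy as np

def cosine_similarity_matrix(E):
    """S[i, j] = cos(e_i, e_j) for per-image embeddings E of shape (N, d)."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E @ E.T

def shortest_path_tree(S, root=0):
    """Dijkstra-style SPT over the complete graph with edge cost 1 - S[i, j],
    so paths through highly similar image pairs are preferred."""
    N = len(S)
    dist = np.full(N, np.inf)
    parent = [-1] * N
    dist[root] = 0.0
    heap = [(0.0, root)]
    done = [False] * N
    while heap:
        d_u, u = heapq.heappop(heap)
        if done[u]:
            continue
        done[u] = True
        for v in range(N):
            if v == u or done[v]:
                continue
            nd = d_u + (1.0 - S[u, v])
            if nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    # The tree edges (parent, child) select exactly N - 1 pairs to decode.
    return [(parent[v], v) for v in range(N) if v != root]

rng = np.random.default_rng(1)
E = rng.standard_normal((5, 32))
S = cosine_similarity_matrix(E)
edges = shortest_path_tree(S, root=0)
```

Decoding only the $N - 1$ tree edges instead of all $O(N^2)$ pairs is what reduces the memory and compute footprint of the pairwise stage.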
- Decoding and Global Accumulation: The decoding step converts image pairs connected by an edge into pointmaps using a stereo reconstruction decoder. The global reconstruction accumulates pairwise pointmaps by traversing the scene graph to obtain globally aligned pointmaps, yielding per-image camera extrinsics $P_i \in \mathbb{R}^{3 \times 4}$, intrinsics $K_i \in \mathbb{R}^{3 \times 3}$, and a dense 3D pointmap $X_i \in \mathbb{R}^{H \times W \times 3}$ at image resolution.
- For every edge $(i, j)$ in the scene graph, the decoder outputs two pointmaps and associated confidence maps, $X^{i,i}, C^{i,i}, X^{j,i}, C^{j,i} = \mathrm{Dec}(F_i^{(L)}, F_j^{(L)})$, both expressed in the coordinate frame of camera $i$.
- Per-edge local pointmap predictions are merged into a global one.
- The global point cloud is initialized from the root node $r$ of the SPT, i.e., $X_r^{\mathrm{glob}} = X^{r,r}$, fixing the root camera as the global reference frame.
- Procrustes alignment is used to estimate the optimal scaled rigid-body transformation between the two pointmaps: $\sigma^*, R^*, t^* = \arg\min_{\sigma, R, t} \sum_p C_p^{i,i} \lVert \sigma (R X_p^{i,i} + t) - X_{i,p}^{\mathrm{glob}} \rVert^2$.
- The pointmap of node $j$ is then transformed into the global coordinate frame: $X_j^{\mathrm{glob}} = \sigma^* (R^* X^{j,i} + t^*)$.
- This is repeated for all edges of the tree, registering every image in the shared global frame.
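The alignment step inside this loop is a standard similarity-transform (Umeyama-style) Procrustes fit. A minimal numpy sketch, with optional per-point weights standing in for the confidence maps:

```python
import numpy as np

def procrustes_align(X, Y, w=None):
    """Estimate (s, R, t) minimizing sum_p w_p ||s * R @ X_p + t - Y_p||^2
    via Umeyama's closed-form solution, for X, Y of shape (P, 3)."""
    if w is None:
        w = np.ones(len(X))
    w = w / w.sum()
    mu_x = (w[:, None] * X).sum(0)
    mu_y = (w[:, None] * Y).sum(0)
    Xc, Yc = X - mu_x, Y - mu_y
    cov = (w[:, None] * Yc).T @ Xc                  # weighted 3x3 covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid mirror
    R = U @ S @ Vt
    var_x = (w * (Xc ** 2).sum(1)).sum()
    s = np.trace(np.diag(D) @ S) / var_x
    t = mu_y - s * R @ mu_x
    return s, R, t

# Sanity check: recover a known similarity transform from exact data.
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 3))
R_true, _ = np.linalg.qr(rng.standard_normal((3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1                              # force a proper rotation
Y = 2.5 * X @ R_true.T + np.array([1.0, -2.0, 0.5])
s, R, t = procrustes_align(X, Y)
```

In the accumulation loop, the transform fitted by aligning $X^{i,i}$ to the already-registered $X_i^{\mathrm{glob}}$ would then be applied to $X^{j,i}$ to register image $j$.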
The model is supervised by both pairwise and global losses. The pairwise loss, $\mathcal{L}_{\mathrm{pair}}$, supervises the pairwise local pointmaps per edge:
$\mathcal{L}_{\mathrm{pair}} = \sum_{v \in \{i, j\}} \sum_{p \in \mathcal{D}^v} C_p^{v} \, \ell_{\mathrm{regr}}(v, p) - \alpha \log C_p^{v}$, where $\ell_{\mathrm{regr}}(v, p) = \lVert \tfrac{1}{z} X_p^{v} - \tfrac{1}{\bar{z}} \bar{X}_p^{v} \rVert$.
$X$, $C$, and $\bar{X}$ are the predicted pointmap, confidence map, and ground-truth pointmap, $z$ and $\bar{z}$ are scale normalizers, $\mathcal{D}^v$ defines the valid pixels with ground truth, and the $-\alpha \log C$ term regularizes the confidences to not be pushed to $0$. The global loss, $\mathcal{L}_{\mathrm{glob}}$, supervises the transformed global pointmap prediction $X_i^{\mathrm{glob}}$ of each image in the same confidence-weighted form. The total loss is optimized as $\mathcal{L} = \mathcal{L}_{\mathrm{pair}} + \lambda \mathcal{L}_{\mathrm{glob}}$, with weighting factor $\lambda$.
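A minimal numpy sketch of the confidence-weighted regression term; using the mean point norm as the scale normalizer and $\alpha = 0.2$ are assumptions here, not values from the paper:

```python
import numpy as np

def confidence_weighted_loss(X, C, X_gt, valid, alpha=0.2):
    """Confidence-weighted pointmap regression loss (DUSt3R-style):
    mean over valid pixels of C * ||X/z - X_gt/z_gt|| - alpha * log C.
    The -alpha * log C term keeps confidences from collapsing to 0."""
    z = np.linalg.norm(X[valid], axis=-1).mean()       # assumed normalizer
    z_gt = np.linalg.norm(X_gt[valid], axis=-1).mean()
    err = np.linalg.norm(X[valid] / z - X_gt[valid] / z_gt, axis=-1)
    return (C[valid] * err - alpha * np.log(C[valid])).mean()

rng = np.random.default_rng(3)
X_gt = rng.standard_normal((32, 32, 3)) + np.array([0.0, 0.0, 5.0])
valid = np.ones((32, 32), dtype=bool)
C = np.ones((32, 32))
loss = confidence_weighted_loss(X_gt, C, X_gt, valid)
```

Note the trade-off built into the loss: a pixel can down-weight its regression error only by paying the $-\alpha \log C$ penalty, so confidences settle at finite values rather than 0.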
The method was evaluated on Tanks&Temples, CO3Dv2, and the Waymo Open Dataset. The evaluation metrics include relative rotation accuracy (RRA), relative translation accuracy (RTA), average translation error (ATE), and registration rate (Reg.). Results on Tanks&Temples show that Light3R-SfM achieves competitive accuracy compared to other learning-based methods and rivals state-of-the-art optimization-based SfM techniques while offering significant improvements in efficiency and scalability. For instance, Light3R-SfM reconstructs a scene of 200 images in 33 seconds, whereas MASt3R-SfM takes approximately 27 minutes. Comparisons with Spann3R demonstrate the benefit of the latent global alignment module, which yields clearly higher average RRA and RTA scores.
On the Waymo Open Dataset, Light3R-SfM achieves accuracy comparable to MASt3R-SfM at a lower runtime, and outperforms Spann3R with better accuracy (e.g., in RTA@5) while also running faster.
Ablation studies validate the impact of each component, including backbone initialization, global supervision, latent alignment, and graph construction strategies.