FastMap is a novel Structure from Motion (SfM) method designed to achieve significant speed and scalability improvements for dense 3D reconstruction compared to state-of-the-art methods like COLMAP and GLOMAP. The core innovation lies in developing algorithms that are entirely GPU-friendly and whose computational complexity per iteration is linear in the number of image pairs, rather than the number of keypoint pairs or 3D points. This contrasts with traditional methods that often rely on computationally expensive optimization techniques like bundle adjustment, which scales poorly with scene complexity and is typically CPU-bound.
The FastMap pipeline consists of several sequential stages after initial feature extraction and matching (which uses standard methods identical to COLMAP/GLOMAP):
- Intrinsics Estimation: FastMap estimates camera distortion and focal lengths.
- Distortion: It uses a one-parameter division model and a brute-force interval search approach. For a set of images from the same camera, candidate distortion parameters are sampled from an interval. For each candidate, keypoints are undistorted, fundamental matrices are re-estimated, and the average epipolar error is calculated. The parameter minimizing this error is chosen. This process is highly parallelizable on a GPU. A hierarchical search strategy further accelerates this. For multiple cameras, distortion is estimated iteratively for each camera based on image pairs where the distortion of the other camera is known or being estimated. The practical importance of accurate distortion estimation is highlighted by ablation studies showing catastrophic performance drops without it (Table 3, Figure 2).
- Focal Length: After undistorting keypoints, the focal length f is estimated using the property that for a correct essential matrix E = K^T F K, where K is the intrinsics matrix derived from the focal length f and F is the fundamental matrix, the ratio of its largest two singular values should be close to 1. FastMap uses an interval search over possible focal lengths, evaluating candidates with a score that rewards singular-value ratios close to 1 across multiple image pairs (Eq. 2). This is also GPU-parallelizable.
- Global Rotation Estimation: With estimated intrinsics, relative rotations and translations are decomposed from fundamental/homography matrices. Global world-to-camera rotations are estimated by minimizing the geodesic distance between predicted relative rotations and the measured relative rotations over all image pairs (Eq. 3). This is formulated as a nonlinear optimization problem using a differentiable 6D rotation representation and solved via gradient descent. A key practical aspect is the initialization strategy, which modifies a prior method [Martinec and Pajdla 2007] by decomposing the matrix optimization into column-wise least squares problems that can be solved using SVD, followed by enforcing orthogonality. Image pairs with insufficient inliers are filtered out, using an adaptive threshold to maintain connectivity.
- Track Completion: FastMap enhances the input data by performing track completion. Tracks are connected components in the keypoint matching graph, representing projected 2D points of the same 3D scene point across multiple images. Instead of using tracks for bundle adjustment, FastMap converts them into additional pairwise matches between all keypoints within each track. This increases the density of image connections and provides more constraints for subsequent steps without introducing 3D point variables. Ablation studies show that track completion improves performance (Table 5).
- Global Translation Estimation: Camera center locations in the world frame are estimated. FastMap first re-estimates more accurate relative translations using the now-available global rotations. It then minimizes, over all image pairs, the L1 error between the normalized difference of the two camera centers and the target normalized relative translation vector (Eq. 5). Unlike methods that jointly optimize poses and 3D points, FastMap optimizes camera locations directly. To mitigate local minima, multiple random initializations are used, and the resulting solutions are merged based on their loss (Table 4).
- Epipolar Adjustment: This is the final pose refinement step, replacing traditional bundle adjustment. It minimizes the absolute epipolar error (L1 loss) over all point pairs across all selected image pairs (Eq. 6). Directly optimizing the L1 loss over millions of points is slow, so FastMap uses Iteratively Reweighted Least Squares (IRLS), approximating the L1 loss with a weighted L2 loss whose weights are based on current epipolar errors. The key insight is that this weighted L2 objective can be reformulated as a sum of quadratic forms in the (vectorized) essential matrix of each image pair, with coefficient matrices that are pre-computed from that pair's point pairs (Eq. 8). Each optimization iteration is therefore linear in the number of image pairs and independent of the number of points. Focal lengths can also be jointly refined. Outlier point pairs are periodically pruned based on epipolar error thresholds.
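The brute-force interval search used for distortion estimation can be sketched as follows. This is a NumPy illustration under assumed conventions (keypoints in normalized coordinates, principal point at the origin); the function names are illustrative, and the score function here is a stand-in for FastMap's actual criterion of re-estimating fundamental matrices and averaging epipolar error. The real implementation runs these independent candidate evaluations batched on the GPU in PyTorch.

```python
import numpy as np

def undistort(kp, lam):
    """One-parameter division model: x_u = x_d / (1 + lam * r^2),
    with keypoints in normalized coordinates (principal point at origin)."""
    r2 = (kp ** 2).sum(axis=-1, keepdims=True)
    return kp / (1.0 + lam * r2)

def hierarchical_search(score_fn, lo, hi, n_samples=16, n_levels=4):
    """Coarse-to-fine interval search: score a grid of candidates (each
    evaluation is independent, hence trivially parallelizable), then
    shrink the interval around the best candidate and repeat."""
    for _ in range(n_levels):
        cands = np.linspace(lo, hi, n_samples)
        scores = np.array([score_fn(c) for c in cands])
        best = int(scores.argmin())
        lo = cands[max(best - 1, 0)]
        hi = cands[min(best + 1, n_samples - 1)]
    return cands[best]
```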
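The focal-length search follows the same interval-search pattern, scored by the singular-value criterion described above. The sketch below assumes a single shared focal length, principal point at the origin, and fundamental matrices already estimated; function names are illustrative, not FastMap's API.

```python
import numpy as np

def focal_score(F_list, f):
    """Score a focal-length candidate: for the correct f, the essential
    matrix E = K^T F K has its two largest singular values nearly equal,
    so sigma_2 / sigma_1 is close to 1."""
    K = np.diag([f, f, 1.0])  # principal point assumed at the origin
    total = 0.0
    for F in F_list:
        s = np.linalg.svd(K.T @ F @ K, compute_uv=False)  # descending order
        total += s[1] / s[0]
    return total / len(F_list)

def search_focal(F_list, f_lo, f_hi, n=64):
    """Interval search: all candidates (and all pairs) can be scored in
    parallel on the GPU; here a plain loop for clarity."""
    cands = np.linspace(f_lo, f_hi, n)
    return cands[int(np.argmax([focal_score(F_list, f) for f in cands]))]
```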
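The linear initialization for global rotation estimation can be sketched as below, following the column-wise least-squares idea of [Martinec and Pajdla 2007]: stack the constraints R_j ≈ R_ij R_i into a homogeneous linear system, take the right singular vectors with smallest singular values, and project each 3×3 block back onto a rotation. The gauge fix by camera 0's block is an implementation convenience of this sketch, not necessarily FastMap's exact procedure, and the subsequent gradient-descent refinement with the 6D rotation representation is omitted.

```python
import numpy as np

def init_rotations(pairs, n_cams):
    """Linear rotation initialization: solve A X = 0 where X vertically
    stacks the unknown world-to-camera rotations, and each measured pair
    (i, j, R_ij) contributes the constraint X_j - R_ij @ X_i = 0. Each
    column of X satisfies the same linear system, so the three right
    singular vectors of A with smallest singular values give X; each 3x3
    block is then projected onto SO(3) to enforce orthogonality."""
    A = np.zeros((3 * len(pairs), 3 * n_cams))
    for p, (i, j, R_ij) in enumerate(pairs):
        A[3 * p:3 * p + 3, 3 * j:3 * j + 3] = np.eye(3)
        A[3 * p:3 * p + 3, 3 * i:3 * i + 3] = -R_ij
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-3:].T                  # (3 * n_cams, 3): near-null-space basis
    X = X @ np.linalg.inv(X[:3])   # fix the global gauge by camera 0
    rots = []
    for c in range(n_cams):
        U, _, Wt = np.linalg.svd(X[3 * c:3 * c + 3])
        # nearest rotation, with sign correction to stay in SO(3)
        rots.append(U @ np.diag([1.0, 1.0, np.linalg.det(U @ Wt)]) @ Wt)
    return rots
```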
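Track completion itself is simple graph processing. A minimal CPU sketch with union-find is shown below, keying keypoints by (image_id, keypoint_id) tuples; FastMap's GPU implementation and any practical filtering (e.g. of tracks touching the same image twice) are not reproduced here.

```python
def complete_tracks(matches):
    """Track completion: keypoints are nodes, matches are edges. Find the
    connected components (tracks) with union-find, then emit every pair
    of keypoints inside a track as a pairwise match. Keypoints are keyed
    by (image_id, keypoint_id) tuples."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)

    tracks = {}
    for node in list(parent):
        tracks.setdefault(find(node), []).append(node)

    completed = set()
    for members in tracks.values():
        members.sort()
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                completed.add((members[i], members[j]))
    return completed
```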
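The shape of the global translation objective can be sketched as follows: an L1 loss over normalized camera-center differences, minimized by subgradient descent from multiple random starts, with the lowest-loss solution kept. This is a NumPy illustration with assumed conventions (target directions already expressed in the world frame via the known global rotations); FastMap additionally merges the restart solutions, which this sketch omits, and the hyperparameters here are arbitrary.

```python
import numpy as np

def translation_loss_and_grad(C, pairs):
    """L1 objective over camera centers C (n, 3): for each pair
    (i, j, t_ij) with unit target direction t_ij, penalize
    |u_ij - t_ij|_1 where u_ij = (C_j - C_i) / ||C_j - C_i||."""
    loss, grad = 0.0, np.zeros_like(C)
    for i, j, t in pairs:
        d = C[j] - C[i]
        n = np.linalg.norm(d)
        u = d / n
        r = u - t
        loss += np.abs(r).sum()
        # chain rule through the normalization of d
        g = (np.eye(3) - np.outer(u, u)) / n @ np.sign(r)
        grad[j] += g
        grad[i] -= g
    return loss, grad

def estimate_centers(pairs, n_cams, n_inits=8, steps=1000, lr=1e-2, seed=0):
    """Subgradient descent with several random initializations; keep the
    solution with the lowest loss to mitigate local minima."""
    rng = np.random.default_rng(seed)
    best, best_loss = None, np.inf
    for _ in range(n_inits):
        C = rng.standard_normal((n_cams, 3))
        for _ in range(steps):
            _, grad = translation_loss_and_grad(C, pairs)
            C -= lr * grad
        loss, _ = translation_loss_and_grad(C, pairs)
        if loss < best_loss:
            best, best_loss = C, loss
    return best, best_loss
```

Note that the solution is only determined up to a global translation and scale, which is why only relative directions appear in the loss.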
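The reformulation at the heart of epipolar adjustment can be verified in a few lines. Writing the epipolar residual as p2^T E p1 = vec(E) · (p2 ⊗ p1), the weighted L2 objective of one image pair collapses into a single 9×9 quadratic form that is computed once per reweighting step, so subsequent pose-update iterations touch only one small matrix per image pair. A NumPy sketch under assumed conventions (row-major vectorization, homogeneous keypoints); the symbol names are illustrative:

```python
import numpy as np

def build_weight_matrix(p1, p2, w):
    """Precompute the 9x9 matrix M of one image pair so that the weighted
    L2 epipolar objective  sum_k w_k * (p2_k^T E p1_k)^2  equals
    vec(E)^T M vec(E).  p1, p2: (N, 3) homogeneous keypoints; w: (N,)
    IRLS weights, e.g. 1 / max(|residual_k|, eps) to approximate L1."""
    A = np.einsum('ni,nj->nij', p2, p1).reshape(-1, 9)  # row k = kron(p2_k, p1_k)
    return (A * w[:, None]).T @ A  # paid once per reweighting, O(N)

def pair_objective(E, M):
    """Per-iteration cost: one 9-vector / 9x9 product per image pair,
    independent of the number of point pairs."""
    e = E.reshape(9)  # row-major vec(E)
    return e @ M @ e
```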
Implementation Considerations and Performance:
- FastMap is implemented entirely in PyTorch, leveraging GPU acceleration for dense tensor operations across all stages.
- The speed improvements are significant, achieving one to two orders of magnitude faster processing times than COLMAP and GLOMAP on large-scale scenes with thousands of images (Table 1, Figure 1).
- Pose accuracy is shown to be comparable to GLOMAP and COLMAP on most datasets, especially under less strict metrics like RTA@3. Performance on stricter metrics (RTA@1, AUC@1/3) is slightly lower, but the resulting poses often yield comparable downstream novel view synthesis quality (Table 2).
- FastMap shows stronger robustness than COLMAP and GLOMAP on some challenging scenes (e.g., Mill-19 building), but is more sensitive to challenging cases with repetitive patterns or symmetric structures (e.g., Tanks and Temples Advanced split) and scenes dominated by planar structures where fundamental matrices are unreliable for focal length estimation.
- The method is designed for dense image coverage and may struggle with sparse or challenging motion patterns (e.g., straight line motions common in SLAM-like datasets), where bundle adjustment's use of 3D points helps resolve ambiguities (Appendix D).
After pose estimation, a sparse 3D point cloud can be reconstructed via triangulation of the matched and completed keypoint tracks, with standard outlier filtering (Appendix B). This sparse point cloud can be used for downstream tasks like initializing Gaussian Splatting.
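Triangulation of a completed track admits a standard linear (DLT) sketch; the minimal version below assumes 3×4 world-to-image projection matrices and is not FastMap's batched GPU implementation (outlier filtering is also omitted):

```python
import numpy as np

def triangulate(projs, obs):
    """Linear (DLT) triangulation of one track: each observation x in
    image k contributes two rows of a homogeneous system built from the
    3x4 projection matrix P_k; the 3D point is the right singular vector
    with smallest singular value, dehomogenized."""
    rows = []
    for P, x in zip(projs, obs):
        rows.append(x[0] * P[2] - P[0])
        rows.append(x[1] * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    X = Vt[-1]
    return X[:3] / X[3]
```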
In summary, FastMap provides a highly efficient alternative for large-scale dense SfM by replacing computationally expensive components with GPU-friendly optimization techniques whose per-iteration complexity is linear in the number of image pairs, such as iteratively reweighted epipolar adjustment. It demonstrates competitive accuracy and speed in practical applications like camera pose estimation for neural rendering, and the planned open-source release aims to make these advancements available to the research community.