Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
(2501.13928v2)
Published 23 Jan 2025 in cs.CV, cs.AI, cs.GR, and cs.RO
Abstract: Multi-view 3D reconstruction remains a core challenge in computer vision, particularly in applications requiring accurate and scalable representations across diverse perspectives. Current leading methods such as DUSt3R employ a fundamentally pairwise approach, processing images in pairs and necessitating costly global alignment procedures to reconstruct from multiple views. In this work, we propose Fast 3D Reconstruction (Fast3R), a novel multi-view generalization to DUSt3R that achieves efficient and scalable 3D reconstruction by processing many views in parallel. Fast3R's Transformer-based architecture forwards N images in a single forward pass, bypassing the need for iterative alignment. Through extensive experiments on camera pose estimation and 3D reconstruction, Fast3R demonstrates state-of-the-art performance, with significant improvements in inference speed and reduced error accumulation. These results establish Fast3R as a robust alternative for multi-view applications, offering enhanced scalability without compromising reconstruction accuracy.
Summary
The paper introduces Fast3R, a Transformer-based model that performs 3D reconstruction from large-scale unordered image sets in a single forward pass.
It reduces computational complexity from quadratic to linear, enabling efficient processing of up to 1500 views while addressing memory limitations.
Empirical results demonstrate near-perfect camera pose accuracy (99.7% within 15°) and a 14× error reduction compared to previous methods.
The paper introduces Fast3R (Fast 3D Reconstruction), a novel Transformer-based model for efficient multi-view 3D reconstruction from a set of unordered, unposed RGB images $I \in \mathbb{R}^{N \times H \times W \times 3}$. Fast3R predicts a corresponding pointmap $X \in \mathbb{R}^{N \times H \times W \times 3}$, where N is the number of input images, H is the height, and W is the width.
Fast3R addresses the limitations of pairwise approaches like DUSt3R, which require $O(N^2)$ operations and global alignment for multi-view reconstruction, leading to computational bottlenecks and memory issues. Fast3R processes N images in a single forward pass, leveraging a Transformer architecture to enable each frame to attend to all other frames simultaneously. This approach reduces error accumulation and enhances scalability.
The contributions of this work are:
The introduction of Fast3R, a Transformer-based model that obviates the need for global post-processing in multi-view pointmap estimation.
Empirical evidence that model performance improves with scaling along the view axis.
Demonstration of state-of-the-art performance in camera pose estimation, achieving 99.7% accuracy within 15 degrees on CO3Dv2, a 14× error reduction compared to DUSt3R with global alignment.
Fast3R predicts both local and global pointmaps, $\mathbf{X}_\text{L}$ and $\mathbf{X}_\text{G}$, along with corresponding confidence maps $\Sigma_\text{L}$ and $\Sigma_\text{G}$ of shape $\mathbb{R}^{N \times H \times W}$. The model maps the N RGB images to $(\mathbf{X}_\text{L}, \Sigma_\text{L}, \mathbf{X}_\text{G}, \Sigma_\text{G})$. The global pointmap $\mathbf{X}_\text{G}$ is expressed in the coordinate frame of the first camera, while $\mathbf{X}_\text{L}$ is expressed in each viewing camera's own coordinate frame.
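For concreteness, this prediction interface can be summarized as below; the container and field names are illustrative placeholders, not the authors' code.

```python
from dataclasses import dataclass
import torch

@dataclass
class Fast3ROutput:
    """Illustrative container for Fast3R's per-view predictions (N views of size H x W)."""
    local_points: torch.Tensor   # (N, H, W, 3)  X_L, in each viewing camera's own frame
    local_conf: torch.Tensor     # (N, H, W)     Sigma_L
    global_points: torch.Tensor  # (N, H, W, 3)  X_G, in the first camera's frame
    global_conf: torch.Tensor    # (N, H, W)     Sigma_G
```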
The total loss function $\mathcal{L}_{\text{total}}$ is the sum of the pointmap losses for the local and global pointmaps:
$\mathcal{L}_{\text{total}} = \mathcal{L}_{\mathbf{X}_\text{G}} + \mathcal{L}_{\mathbf{X}_\text{L}}$, which are confidence-weighted versions of the normalized 3D pointwise regression loss.
The normalized regression loss for $\mathbf{X}$ is:
$\ell_{\text{regr}}(\hat{\mathbf{X}}, \mathbf{X}) = \left\lVert \frac{1}{\hat{z}}\hat{\mathbf{X}} - \frac{1}{z}\mathbf{X} \right\rVert_2$, where $z = \frac{1}{|\mathbf{X}|} \sum_{x \in \mathbf{X}} \lVert x \rVert_2$ and $\hat{z}$ is computed analogously from $\hat{\mathbf{X}}$.
$\hat{\mathbf{X}}$ is the predicted pointmap.
$\mathbf{X}$ is the target pointmap.
$\hat{z}$ is the normalization factor for the predicted pointmap.
$z$ is the normalization factor for the target pointmap.
$x$ denotes an individual point in the pointmap $\mathbf{X}$.
$|\mathbf{X}|$ is the number of points in the pointmap $\mathbf{X}$.
The pointmap loss is given by (a code sketch of both losses follows the term definitions below):
$\mathcal{L}_{\mathbf{X}}(\hat{\Sigma}, \hat{\mathbf{X}}, \mathbf{X}) = \frac{1}{|\mathbf{X}|} \sum \hat{\Sigma}^{+} \cdot \ell_{\text{regr}}(\hat{\mathbf{X}}, \mathbf{X}) + \alpha \log(\hat{\Sigma}^{+})$, where $\hat{\Sigma}^{+} = 1 + \exp(\hat{\Sigma})$.
$\mathcal{L}_{\mathbf{X}}$ is the total loss for a single pointmap.
$\hat{\Sigma}$ is the confidence map predicted by the model.
$\alpha$ is a weighting factor.
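A minimal PyTorch sketch of the two pieces above, assuming pointmaps of shape (N, H, W, 3) and a raw confidence tensor of shape (N, H, W); the function names, the value of alpha, and the flattening convention are assumptions for illustration, not the paper's implementation.

```python
import torch

def normalized_regression_loss(pred_pts: torch.Tensor, gt_pts: torch.Tensor) -> torch.Tensor:
    """l_regr: per-point L2 distance between pointmaps, each rescaled by its mean point norm."""
    z_pred = pred_pts.flatten(0, -2).norm(dim=-1).mean()      # z-hat: mean norm of predicted points
    z_gt = gt_pts.flatten(0, -2).norm(dim=-1).mean()          # z: mean norm of target points
    return (pred_pts / z_pred - gt_pts / z_gt).norm(dim=-1)   # per-point loss, shape (N, H, W)

def pointmap_loss(conf_logits: torch.Tensor, pred_pts: torch.Tensor,
                  gt_pts: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """Confidence-weighted pointmap loss; the sign of the log-confidence term follows the
    equation above, and alpha=0.2 is an illustrative choice."""
    conf = 1.0 + torch.exp(conf_logits)                        # Sigma^+ = 1 + exp(Sigma-hat)
    per_point = normalized_regression_loss(pred_pts, gt_pts)
    return (conf * per_point + alpha * torch.log(conf)).mean()

# Total loss: sum of the global- and local-pointmap terms, e.g.
# loss = pointmap_loss(conf_g, pred_g, gt_g) + pointmap_loss(conf_l, pred_l, gt_l)
```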
The model architecture consists of three main components, sketched in code after the component descriptions below: an image encoder, a fusion transformer, and a pointmap decoder.
Image Encoder: Each image $I_i \in I$ is encoded into a set of patch features $H_i$ by a feature extractor $\mathcal{F}$: $H_i = \mathcal{F}(I_i)$ for $i = 1, \dots, N$. A CroCo ViT is used as the encoder, and one-dimensional image-index positional embeddings are added to the patch tokens of each view.
Fusion Transformer: A 12-layer Transformer, similar to ViT-B or BERT, performs all-to-all self-attention on the concatenated encoded image patches.
Pointmap Head: Separate DPT-L decoder heads map the tokens to local and global pointmaps $(\mathbf{X}_\text{L}, \mathbf{X}_\text{G})$ and confidence maps $(\Sigma_\text{L}, \Sigma_\text{G})$.
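A schematic PyTorch skeleton of this three-stage design; the class name, layer sizes, and the simple linear heads are stand-ins (the paper uses a CroCo ViT encoder and DPT-L heads), so this is a structural sketch rather than the authors' model.

```python
import torch
import torch.nn as nn

class Fast3RSketch(nn.Module):
    """Illustrative skeleton: per-image patch encoder -> all-to-all fusion -> two decoder heads."""
    def __init__(self, dim=768, depth=12, heads=12, patch=16, max_views=1000):
        super().__init__()
        # Stand-in for the CroCo ViT patch encoder (shared across all views).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # One-dimensional image-index positional embedding, added to every patch of view i.
        self.view_embed = nn.Embedding(max_views, dim)
        # 12-layer all-to-all fusion transformer (roughly ViT-B / BERT-base scale).
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=depth)
        # Stand-ins for the separate DPT-L heads: 3 coordinates + 1 confidence per pixel.
        self.local_head = nn.Linear(dim, patch * patch * 4)
        self.global_head = nn.Linear(dim, patch * patch * 4)

    def forward(self, images, view_ids):
        # images: (N, 3, H, W); view_ids: (N,) image indices into the positional pool.
        tokens = self.patch_embed(images).flatten(2).transpose(1, 2)   # (N, P, dim)
        tokens = tokens + self.view_embed(view_ids)[:, None, :]
        # Concatenate all views' tokens so every patch attends to every other patch.
        fused = self.fusion(tokens.reshape(1, -1, tokens.shape[-1]))
        fused = fused.reshape(tokens.shape)
        return self.local_head(fused), self.global_head(fused)          # raw per-patch outputs
```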
To handle more views at inference than are seen during training, Fast3R adapts Position Interpolation: during training, the N image-index positional embeddings are drawn randomly from a larger pool of N′ possible indices. This strategy allows Fast3R to handle N = 1000 images at inference even when trained with N = 20 images.
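A minimal sketch of this randomized index sampling, assuming a learnable embedding table over the larger pool; the pool size, function name, and embedding dimension are illustrative.

```python
import torch
import torch.nn as nn

def sample_view_index_embeddings(embed_table: nn.Embedding, n_views: int, pool_size: int = 1000):
    """Training: draw n_views distinct image indices at random from a pool of pool_size,
    so index embeddings beyond the training view count still receive gradient updates.
    Inference: indices 0..N-1 can be used directly for any N <= pool_size."""
    idx = torch.randperm(pool_size)[:n_views]   # random subset of the index pool
    return idx, embed_table(idx)                # (n_views,), (n_views, dim)

# Illustrative usage: train with 20 views sampled from a pool of 1000 indices.
embed_table = nn.Embedding(1000, 768)
view_ids, view_pos_embeds = sample_view_index_embeddings(embed_table, n_views=20)
```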
The architecture of Fast3R is designed to leverage recent advances in scalability, such as model and data parallelism, FlashAttention, and optimizer sharding.
The model is trained on a mix of real-world object-centric and scene scan data, including CO3D, ScanNet++, ARKitScenes, and Habitat, using a subset of the datasets in DUSt3R. The models are trained on 512×512 images with AdamW for 6.5K steps, using a learning rate of 0.0001 and a cosine annealing schedule. DeepSpeed ZeRO stage 2 training is used to enable training with up to N = 28 views per data sample.
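The reported optimizer settings map onto a standard AdamW plus cosine-annealing setup, sketched below with a dummy model and loss; the DeepSpeed ZeRO-2 sharding is configured separately and omitted here, and everything other than the step count, learning rate, and schedule is a placeholder.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)            # placeholder standing in for the Fast3R network
total_steps = 6_500                  # 6.5K training steps, as reported
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    # Dummy loss in place of the local + global pointmap losses on a <=28-view sample.
    loss = model(torch.randn(4, 16)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    scheduler.step()
```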
At inference time, memory bottlenecks in the DPT heads are addressed through tensor parallelism, where the model is placed on GPU 0, and the DPT heads are copied to other GPUs for parallel inference. Fast3R can process up to 1500 views in a single pass, whereas DUSt3R runs out of memory past 32 views.
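A hedged sketch of that idea: fused tokens stay on GPU 0 while per-view head computation is spread over replicated heads on other GPUs; the chunking scheme and function name are illustrative, not the paper's implementation.

```python
import torch

def run_heads_sharded(fused_tokens: torch.Tensor, head_replicas: list) -> torch.Tensor:
    """Shard the N views across GPUs, run an identical copy of a decoder head on each shard,
    and gather the results back on the device holding the fused tokens (e.g. GPU 0).

    fused_tokens: (N, P, dim) tokens from the fusion transformer.
    head_replicas: identical head modules, head_replicas[i] already moved to its own GPU.
    """
    chunks = fused_tokens.chunk(len(head_replicas), dim=0)
    outputs = []
    for head, chunk in zip(head_replicas, chunks):
        device = next(head.parameters()).device
        outputs.append(head(chunk.to(device)).to(fused_tokens.device))
    return torch.cat(outputs, dim=0)
```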
In camera pose estimation on CO3D, Fast3R surpasses other methods in Relative Rotation Accuracy (RRA) and mean Average Accuracy (mAA), while remaining competitive in Relative Translation Accuracy (RTA). It achieves near-perfect RRA and is significantly faster than DUSt3R and MASt3R. Specifically, Fast3R reaches 99.7% pose accuracy within 15 degrees on CO3Dv2.
For 3D reconstruction on scene-level benchmarks (7-Scenes and Neural RGB-D) and object-level benchmarks (DTU), Fast3R is competitive with other pointmap reconstruction methods while being significantly faster, reaching a peak throughput of 251.1 FPS with 108 views at 224×224 resolution. Local pointmaps are used for fine detail and global pointmaps for high-level structure, with each image's local pointmap aligned to the global pointmap using Iterative Closest Point (ICP).
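A minimal sketch of that per-view alignment step using point-to-point ICP; Open3D is an assumed dependency (the paper does not name an ICP implementation), and the correspondence threshold is an illustrative value.

```python
import numpy as np
import open3d as o3d  # assumed dependency; any rigid ICP implementation would do

def align_local_to_global(local_pts: np.ndarray, global_pts: np.ndarray, threshold: float = 0.05):
    """Estimate the rigid transform registering one view's local pointmap (H, W, 3)
    onto its global pointmap (H, W, 3) via point-to-point ICP."""
    src = o3d.geometry.PointCloud()
    src.points = o3d.utility.Vector3dVector(local_pts.reshape(-1, 3).astype(np.float64))
    tgt = o3d.geometry.PointCloud()
    tgt.points = o3d.utility.Vector3dVector(global_pts.reshape(-1, 3).astype(np.float64))
    result = o3d.pipelines.registration.registration_icp(
        src, tgt, threshold, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation  # 4x4 transform mapping the local frame into the global frame
```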
Qualitative results on 4D reconstruction demonstrate Fast3R's ability to handle dynamic scenes when fine-tuned on the PointOdyssey and TartanAir datasets, producing reasonable reconstructions with minimal changes to the model.
Ablation studies demonstrate that:
Training with more views consistently improves RRA and RTA for visual odometry, as well as reconstruction accuracy.
The randomized version of Position Interpolation enables inference on more views than seen during training.
Removing the local head degrades the global head's ability to learn finer details.
Larger model sizes consistently benefit 3D tasks, including camera pose estimation and 3D reconstruction.
Fast3R also benefits consistently from more training data, suggesting that it could achieve better results as more data becomes available.