Surf3R: Fast 3D Surface Reconstruction

Updated 3 July 2026

Surf3R is an end-to-end feed-forward framework for rapid 3D surface reconstruction from sparse, unposed RGB images, eliminating the need for camera calibration.
It employs a multi-branch decoding architecture with cross-view attention and inter-branch fusion to aggregate geometric cues, achieving state-of-the-art performance in under 10 seconds.
The framework integrates a geometry-aware D-Normal regularizer with an explicit 3D Gaussian representation, ensuring high precision in novel-view synthesis and surface consistency.

Surf3R is an end-to-end, feed-forward framework for rapid 3D surface reconstruction from sparse, unposed RGB images. Unlike prior methods that require camera calibration or pose estimation, Surf3R performs scene-level 3D geometry prediction in a single pass, completing a typical scene in under 10 seconds. The approach centers on a multi-branch, multi-view decoding architecture with cross-view attention and inter-branch feature fusion, and introduces a geometry-aware D-Normal regularizer leveraging an explicit 3D Gaussian representation for differentiable surface learning. Surf3R delivers state-of-the-art surface reconstruction metrics and enables novel-view synthesis with high consistency, precision, and speed, even under sparse and noisy visual input (Zhu et al., 6 Aug 2025).

1. Architecture and Input Processing

Surf3R accepts a set of $N$ unposed RGB images $\{I_i\}_{i=1}^N$ resized to $224 \times 224$ as input. Each image is processed by a shared, weight-tied Vision Transformer (ViT) encoder, extracting multi-scale tokens $F_0^i = \mathrm{ViT}(I_i) \in \mathbb R^{h \times w \times d}$.

A set of $M$ reference views $\{r_m\}_{m=1}^M$ is chosen for multi-branch decoding. For each branch $m$, decoding is centered on the reference view $I_{r_m}$ and proceeds through $D$ cascaded Feature-Refine Blocks (FRBlocks), each followed by a Cross-Reference Fusion Block (CRFBlock). The output tokens $F_D^{v,m}$ represent per-view, per-branch geometry descriptors. These are consumed by specialized heads to regress per-pixel 3D Gaussian primitives, a point-map, and a confidence map. This design aggregates complementary geometric cues from all input views, supporting 3D reasoning in the absence of camera or pose priors.

2. Multi-Branch Decoding, Cross-View Attention, and Fusion

For each layer $l \in \{1, \dots, D\}$, tokens for view $v$ in branch $m$ are denoted $F_l^{v,m}$. FRBlock processing distinguishes between reference and source views:

$$F_l^{v,m} = \begin{cases} \mathrm{FRBlock}_l^{\mathrm{ref}}\big(F_{l-1}^{v,m}, F_{l-1}^{-v,m}\big), & v = r_m \\ \mathrm{FRBlock}_l^{\mathrm{src}}\big(F_{l-1}^{v,m}, F_{l-1}^{-v,m}\big), & v \neq r_m \end{cases}$$

with $F_{l-1}^{-v,m}$ the set of tokens from all other views. Each FRBlock applies multi-head cross-attention with query $Q$ taken from $F_{l-1}^{v,m}$ and key $K$ and value $V$ taken from $F_{l-1}^{-v,m}$, forming:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where $d_k$ is the per-head key dimension. After each FRBlock, tokens are fused across branches (same view, different reference) via CRFBlock. A typical instantiation is:

$$F_l^{v,m} \leftarrow \mathrm{CRFBlock}_l\big(F_l^{v,m}, \{F_l^{v,m'}\}_{m' \neq m}\big)$$

Cascading $D$ layers of FR+CRF delivers fused tokens for each branch and view, incorporating global and local geometric context across all images.
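The attend-then-fuse pattern described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Surf3R implementation: it uses a single attention head with no learned projections, and `fuse_branches` is a hypothetical stand-in that simply averages a view's tokens across branches, whereas the actual CRFBlock is a learned module.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.
    queries: tokens of the current view; keys/values: tokens from other views."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

def fuse_branches(tokens_per_branch):
    """Hypothetical CRFBlock stand-in: average one view's tokens across branches."""
    n_branches = len(tokens_per_branch)
    n_tokens = len(tokens_per_branch[0])
    dim = len(tokens_per_branch[0][0])
    return [[sum(tokens_per_branch[b][t][c] for b in range(n_branches)) / n_branches
             for c in range(dim)] for t in range(n_tokens)]

# Toy data: 2 tokens for the current view, 3 tokens gathered from other views.
view_tokens = [[1.0, 0.0], [0.0, 1.0]]
other_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = cross_attention(view_tokens, other_tokens, other_tokens)
fused = fuse_branches([attended, view_tokens])  # two "branches" of the same view
```

Each attended token is a convex combination of the other views' value tokens, which is what lets a view absorb geometric evidence from the rest of the set before the cross-branch step mixes the per-reference results.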

3. 3D Gaussian Representation and D-Normal Regularizer

Each output pixel is parameterized as an anisotropic 3D Gaussian (center $\mu$, scales $s = (s_1, s_2, s_3)$, rotation quaternion $q$, opacity $\alpha$). The D-Normal regularizer couples surface normals, depth, and Gaussian geometry for improved detail and consistency:

Flattening loss enforces local planarity:

$$\mathcal{L}_s = \big\| \min(s_1, s_2, s_3) \big\|_1$$

Surface normal is defined as $n = R_k$, the axis of $R$ along the smallest scale, where $R$ comes from $q$ and $k = \arg\min(s_1, s_2, s_3)$ (the collapsed direction).
The rendered normal map $\hat{N}$ combines per-pixel normals using alpha compositing:

$$\hat{N} = \sum_{i} n_i \, \alpha_i \prod_{j < i} (1 - \alpha_j)$$
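The flattening loss, the Gaussian normal, and the composited normal map can be illustrated with a small pure-Python sketch. It rests on assumptions not stated in the source: quaternions are ordered `(w, x, y, z)`, the covariance follows the common convention $R\,\mathrm{diag}(s)^2 R^\top$ (so the Gaussian axes are the columns of $R$), and the flattening loss is averaged over Gaussians. All function names are illustrative.

```python
import math

def quat_to_rotmat(q):
    """Quaternion (w, x, y, z), normalized here, to a 3x3 rotation matrix."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def gaussian_normal(q, scales):
    """Axis along the smallest scale (assumed: axes are the columns of R)."""
    R = quat_to_rotmat(q)
    k = min(range(3), key=lambda i: scales[i])
    return [R[0][k], R[1][k], R[2][k]]

def flattening_loss(scales_list):
    """Mean over Gaussians of the smallest absolute scale."""
    return sum(min(abs(s) for s in scales) for scales in scales_list) / len(scales_list)

def composite_normals(normals, alphas):
    """Front-to-back alpha compositing of normals along one ray."""
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for n, a in zip(normals, alphas):
        weight = a * transmittance
        out = [o + weight * c for o, c in zip(out, n)]
        transmittance *= 1.0 - a
    return out

# A Gaussian flattened along its local z axis, unrotated and rotated 90 degrees about x.
n_identity = gaussian_normal((1.0, 0.0, 0.0, 0.0), (0.5, 0.3, 0.01))
half = math.sqrt(0.5)
n_rotated = gaussian_normal((half, half, 0.0, 0.0), (0.5, 0.3, 0.01))
loss_s = flattening_loss([(0.5, 0.3, 0.01), (0.2, 0.02, 0.4)])
pixel_normal = composite_normals([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]], [0.5, 1.0])
```

Driving the smallest scale toward zero collapses each Gaussian into a thin disc, which is what makes "the axis along the smallest scale" a meaningful surface normal.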

Depth is rendered differentiably via plane-ray intersection, and the D-Normal loss is:

$$\mathcal{L}_{dn} = \big\| \bar{N}_d - N \big\|_1 + \big(1 - \bar{N}_d \cdot N\big)$$

where $\bar{N}_d$ denotes normals estimated from local depth gradient and $N$ is the ground-truth normal. Supervision aligns normal and depth geometry, supporting accurate surface recovery.
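To make "normals estimated from local depth gradient" concrete, the sketch below back-projects a depth map to 3D points, takes forward differences, and normalizes their cross product. It then scores the result with an L1-plus-cosine penalty. The pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the forward-difference stencil, the camera-facing orientation, and the exact loss form are assumptions made for illustration, not details confirmed by the source.

```python
import math

def backproject(depth, fx, fy, cx, cy):
    """Depth map (H x W nested lists) to per-pixel 3D points in camera coordinates."""
    return [[((u - cx) * d / fx, (v - cy) * d / fy, d) for u, d in enumerate(row)]
            for v, row in enumerate(depth)]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v) if n > 0 else (0.0, 0.0, 0.0)

def depth_normals(depth, fx, fy, cx, cy):
    """Normals from forward differences; the output grid is (H-1) x (W-1)."""
    pts = backproject(depth, fx, fy, cx, cy)
    normals = []
    for v in range(len(depth) - 1):
        row = []
        for u in range(len(depth[0]) - 1):
            du = tuple(a - b for a, b in zip(pts[v][u + 1], pts[v][u]))  # horizontal step
            dv = tuple(a - b for a, b in zip(pts[v + 1][u], pts[v][u]))  # vertical step
            row.append(normalize(cross(dv, du)))  # points back toward the camera (-z)
        normals.append(row)
    return normals

def normal_loss(pred, target):
    """L1 distance plus cosine distance, averaged over pixels."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            l1 = sum(abs(a - b) for a, b in zip(p, t))
            cosine = sum(a * b for a, b in zip(p, t))
            total += l1 + (1.0 - cosine)
            count += 1
    return total / count

# A fronto-parallel plane at depth 2: every normal should face the camera.
flat_depth = [[2.0] * 3 for _ in range(3)]
d_normals = depth_normals(flat_depth, fx=1.0, fy=1.0, cx=1.0, cy=1.0)
facing = [[(0.0, 0.0, -1.0)] * 2 for _ in range(2)]
away = [[(0.0, 0.0, 1.0)] * 2 for _ in range(2)]
loss_match = normal_loss(d_normals, facing)
loss_flip = normal_loss(d_normals, away)
```

Because the normal here is a function of rendered depth, a gradient on this loss moves Gaussian positions as well as orientations, which is the coupling between depth and normal geometry that the regularizer is meant to provide.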

4. Training Objectives and Loss Functions

Surf3R is supervised through a composite loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{conf}} + \lambda_{\mathrm{rgb}} \mathcal{L}_{\mathrm{rgb}} + \lambda_{s} \mathcal{L}_{s} + \lambda_{n} \mathcal{L}_{n} + \lambda_{dn} \mathcal{L}_{dn}$$

$\mathcal{L}_{\mathrm{conf}}$ is a confidence-weighted pointmap regression loss.
$\mathcal{L}_{\mathrm{rgb}}$ is the L1 photometric loss between rendered and ground-truth RGB images.
$\mathcal{L}_{s}$ is the scale/flattening loss.
$\mathcal{L}_{n}$ is the normal-map loss, combining L1 and cosine terms.
$\mathcal{L}_{dn}$ is the D-Normal loss.

These objectives ensure predictions are geometrically self-consistent and align with ground-truth surface/normal data at all levels.
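As a rough illustration of how such terms might be combined, the sketch below pairs a confidence-weighted pointmap loss, written in the style popularized by DUSt3R (error scaled by confidence, minus a log-confidence term), with a generic weighted sum. The function names, the `alpha=0.2` default, and the example weights are hypothetical; the source does not give Surf3R's actual weights or the exact form of its confidence loss.

```python
import math

def confidence_weighted_loss(pred_pts, gt_pts, conf, alpha=0.2):
    """Per-point Euclidean error scaled by confidence, with a log penalty that
    discourages the trivial solution of low confidence everywhere.
    alpha is an illustrative value, not one taken from the source."""
    total = 0.0
    for p, g, c in zip(pred_pts, gt_pts, conf):
        total += c * math.dist(p, g) - alpha * math.log(c)
    return total / len(conf)

def composite_loss(terms, weights):
    """Weighted sum of named loss terms; a missing weight defaults to 1."""
    return sum(weights.get(name, 1.0) * value for name, value in terms.items())

pts = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0)]
l_conf = confidence_weighted_loss(pts, pts, conf=[1.0, 1.0])
total = composite_loss({"conf": 1.0, "rgb": 2.0}, {"rgb": 0.5})
```

The log term is what keeps the confidence map informative: lowering confidence reduces the weighted error but is charged a penalty, so the network is pushed to reserve low confidence for genuinely hard regions.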

5. Experimental Setup and Evaluation

Surf3R is benchmarked primarily on ScanNet++ and Replica datasets:

Training: 50 indoor scenes (ScanNet++), with input views sampled at 30–70% point-cloud overlap.
Inference: flexible, from 4 to 100 views, with $M$ reference branches.
Metrics: vertex-level surface precision, recall, F1-score (within 2 cm GT tolerance); novel-view synthesis (PSNR, SSIM, LPIPS).
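The surface metrics can be sketched as a brute-force nearest-neighbor test between predicted and ground-truth points. This is an illustrative version under stated assumptions: coordinates are in meters (so the 2 cm tolerance is `0.02`), both surfaces are already sampled as point lists, and the function names are hypothetical. A practical evaluation would use a spatial index rather than the quadratic search shown here.

```python
import math

def fraction_within(src, dst, tol):
    """Fraction of src points whose nearest dst point lies within tol."""
    hits = sum(1 for p in src if min(math.dist(p, q) for q in dst) <= tol)
    return hits / len(src)

def surface_f1(pred, gt, tol=0.02):
    """Precision (pred -> GT), recall (GT -> pred), and their harmonic mean."""
    precision = fraction_within(pred, gt, tol)
    recall = fraction_within(gt, pred, tol)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred_points = [(0.0, 0.0, 0.01), (0.5, 0.0, 0.0)]
precision, recall, f1 = surface_f1(pred_points, gt_points)
```

Precision penalizes predicted geometry that floats away from the true surface, while recall penalizes parts of the true surface that were never reconstructed, so F1 is high only when the surface is both accurate and complete.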

Results summary (ScanNet++ 50-scene average):

Method	Precision↑	Recall↑	F1↑	Time
NeuS	29.42	22.14	25.13	>30 min
SuGaR	38.30	34.92	36.12	>30 min
DUSt3R	4.62	4.84	4.06	>1 min
Surf3R-GD	80.24	77.55	78.71	<10 s

On Replica (zero-shot), Surf3R-GD achieves F1 = 41.92, outperforming NeuralRecon and DUSt3R. On novel-view synthesis, Surf3R-GD delivers PSNR 15.06 (4 views) vs. DUSt3R's 11.66. Ablations show a substantial F1 impact for the multi-branch design (15.39 points) and the D-Normal regularizer (10.96 points), as well as for the normal and scale losses.

6. Methodological Advances and Limitations

Key advances include:

Elimination of camera calibration and pose estimation: Surf3R reconstructs surfaces from unposed RGB alone.
Multi-branch decoder with transformer-based cross-view/branch attention, aggregating geometric signals from all views.
D-Normal regularizer enforces geometric coupling and consistency between depth, normal, and local patch structure in the Gaussian domain.

Identified limitations: degradation with too many wide-baseline views (noisy overlap statistics), scaling challenges for very large or dynamic environments, and opportunity for further speedup with lighter encoder backbones or adaptive branch strategies.

7. Context, Significance, and Outlook

Surf3R marks a departure from sequential or optimization-based multi-view 3D reconstruction pipelines. It demonstrates that accurate, surface-level geometric reasoning is feasible via purely feed-forward, transformer-driven fusion, without explicit camera or pose modeling. The unified Gaussian representation and D-Normal regularizer together yield both fine detail and global consistency in the reconstructed surfaces. A plausible implication is that future real-time SLAM and AR applications could integrate Surf3R-style pose-free geometry modules, provided robustness to extreme viewpoint disparity and dynamic content can be maintained (Zhu et al., 6 Aug 2025).


References (1)

Surf3R: Rapid Surface Reconstruction from Sparse RGB Views in Seconds (2025)
