
Surf3R: Fast 3D Surface Reconstruction

Updated 3 July 2026
  • Surf3R is an end-to-end feed-forward framework for rapid 3D surface reconstruction from sparse, unposed RGB images, eliminating the need for camera calibration.
  • It employs a multi-branch decoding architecture with cross-view attention and inter-branch fusion to aggregate geometric cues, achieving state-of-the-art performance in under 10 seconds.
  • The framework integrates a geometry-aware D-Normal regularizer with an explicit 3D Gaussian representation, ensuring high precision in novel-view synthesis and surface consistency.

Surf3R is an end-to-end, feed-forward framework for rapid 3D surface reconstruction from sparse, unposed RGB images. Unlike prior methods that require camera calibration or pose estimation, Surf3R performs scene-level 3D geometry prediction in a single pass, completing a typical scene in under 10 seconds. The approach centers on a multi-branch, multi-view decoding architecture with cross-view attention and inter-branch feature fusion, and introduces a geometry-aware D-Normal regularizer leveraging an explicit 3D Gaussian representation for differentiable surface learning. Surf3R delivers state-of-the-art surface reconstruction metrics and enables novel-view synthesis with high consistency, precision, and speed, even under sparse and noisy visual input (Zhu et al., 6 Aug 2025).

1. Architecture and Input Processing

Surf3R accepts a set of $N$ unposed RGB images $\{I_i\}_{i=1}^N$, resized to $224 \times 224$, as input. Each image is processed by a shared, weight-tied Vision Transformer (ViT) encoder, extracting multi-scale tokens $F_0^i = \mathrm{ViT}(I_i) \in \mathbb{R}^{h \times w \times d}$.

A set of $M$ reference views $\{r_m\}_{m=1}^M$ is chosen for multi-branch decoding. For each branch $m$, decoding is centered on the reference view $I_{r_m}$ and proceeds through $D$ cascaded Feature-Refine Blocks (FRBlocks), each followed by a Cross-Reference Fusion Block (CRFBlock). The output tokens $F_D^{v,m}$ represent per-view, per-branch geometry descriptors. These are consumed by specialized heads to regress per-pixel 3D Gaussian primitives, a point-map, and a confidence map. This design aggregates complementary geometric cues from all input views, supporting 3D reasoning in the absence of camera or pose priors.
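The bookkeeping behind this layout can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the patch size of 16 is an assumption (the source gives only the 224 × 224 input size), and `token_grid` and `branch_layout` are hypothetical helper names.

```python
def token_grid(image_size=224, patch_size=16):
    """Tokens per side for a ViT that splits a square image into square patches.

    patch_size=16 is an assumed value; the source does not state it.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side, side


def branch_layout(num_views, reference_views):
    """For each branch, list (view index, role) pairs.

    The role is "ref" for the branch's reference view and "src" otherwise.
    """
    layout = []
    for ref in reference_views:
        if not 0 <= ref < num_views:
            raise ValueError("reference view index out of range")
        layout.append(
            [(v, "ref" if v == ref else "src") for v in range(num_views)]
        )
    return layout


if __name__ == "__main__":
    h, w = token_grid()
    print(h, w)  # 14 14
    # Four input views, two branches centred on views 0 and 2.
    for m, branch in enumerate(branch_layout(4, [0, 2])):
        print(m, branch)
```

Every branch sees all views; only the choice of reference view changes from branch to branch.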

2. Multi-Branch Decoding, Cross-View Attention, and Fusion

For each layer $\ell$, the tokens for view $v$ in branch $m$ are denoted $F_\ell^{v,m}$. FRBlock processing distinguishes between reference and source views: the block applied to view $v$ depends on whether $v$ is the branch's reference view $r_m$, and it takes as input the view's own tokens from the previous layer together with the set of tokens from all other views. Each FRBlock applies multi-head cross-attention, with queries drawn from the view's own tokens and keys and values drawn from the other views' tokens. After each FRBlock, tokens are fused across branches (same view, different reference) via a CRFBlock. Cascading $D$ layers of FR+CRF delivers fused tokens for each branch and view, incorporating global and local geometric context across all images.
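The two operations in this section can be illustrated with a small sketch. It is not the paper's FRBlock or CRFBlock: the attention is single-head and omits learned projections, residual connections, and normalization, and the cross-branch fusion is replaced by a plain element-wise mean, which is purely an assumption made for illustration. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: tokens of the current view (list of d-dim vectors).
    keys, values: tokens gathered from the other views.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        out.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return out


def fuse_branches(tokens_per_branch):
    """Assumed stand-in for cross-branch fusion: mean over branches.

    tokens_per_branch[m][t] is token t of one fixed view in branch m.
    """
    num_branches = len(tokens_per_branch)
    num_tokens = len(tokens_per_branch[0])
    dim = len(tokens_per_branch[0][0])
    return [
        [
            sum(branch[t][j] for branch in tokens_per_branch) / num_branches
            for j in range(dim)
        ]
        for t in range(num_tokens)
    ]
```

In a layer of this sketch, `cross_attention` would run once per view and branch, and `fuse_branches` would then mix the results that different branches produced for the same view.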

3. 3D Gaussian Representation and D-Normal Regularizer

Each output pixel is parameterized as an anisotropic 3D Gaussian with center $\mu$, scales $s = (s_1, s_2, s_3)$, rotation quaternion $q$, and opacity $\alpha$. The D-Normal regularizer couples surface normals, depth, and Gaussian geometry for improved detail and consistency:

  • A flattening loss enforces local planarity by collapsing each Gaussian along one of its three scale axes.
  • The surface normal of a Gaussian is defined as $n = R\,e_k$, where the rotation matrix $R$ comes from $q$ and $k = \arg\min_j s_j$ (the collapsed direction).
  • The rendered normal map $\hat{N}$ combines per-pixel normals using alpha compositing.
  • Depth is rendered differentiably via plane-ray intersection, and the D-Normal loss compares $\hat{N}_d$ against $N_{gt}$, where $\hat{N}_d$ denotes normals estimated from the local depth gradient and $N_{gt}$ is the ground-truth normal.

Supervision aligns normal and depth geometry, supporting accurate surface recovery.
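The geometric quantities in this section can be sketched as follows, under stated assumptions. The quaternion order `(w, x, y, z)`, taking the normal as the rotation axis with the smallest scale, and the finite-difference normal from three neighbouring 3D points are illustrative choices, not details confirmed by the source. All function names are hypothetical.

```python
import math


def quat_to_rotation(q):
    """3x3 rotation matrix (list of rows) from a quaternion (w, x, y, z)."""
    w, x, y, z = q
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / norm, x / norm, y / norm, z / norm
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def gaussian_normal(q, scales):
    """Normal of a flattened Gaussian: the rotation axis with the smallest scale."""
    rot = quat_to_rotation(q)
    k = min(range(3), key=lambda i: scales[i])
    return [rot[0][k], rot[1][k], rot[2][k]]


def composite_normals(normals, alphas):
    """Front-to-back alpha compositing of normals sorted from near to far."""
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for n, a in zip(normals, alphas):
        weight = a * transmittance
        out = [o + weight * c for o, c in zip(out, n)]
        transmittance *= 1.0 - a
    return out


def normal_from_points(p, p_right, p_down):
    """Unit normal from finite differences of neighbouring 3D points.

    The sign of the result depends on the pixel-axis convention.
    """
    dx = [b - a for a, b in zip(p, p_right)]
    dy = [b - a for a, b in zip(p, p_down)]
    n = [
        dx[1] * dy[2] - dx[2] * dy[1],
        dx[2] * dy[0] - dx[0] * dy[2],
        dx[0] * dy[1] - dx[1] * dy[0],
    ]
    length = math.sqrt(sum(c * c for c in n))
    return [c / length for c in n]
```

A D-Normal-style penalty would then compare the output of `normal_from_points`, computed on rendered depth, with a reference normal at the same pixel.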

4. Training Objectives and Loss Functions

Surf3R is supervised through a composite loss that combines five terms:

  • A confidence-weighted pointmap regression loss.
  • An L1 photometric loss between rendered and ground-truth RGB images.
  • The scale/flattening loss.
  • A normal-map loss, combining L1 and cosine terms.
  • The D-Normal loss.

These objectives ensure predictions are geometrically self-consistent and align with ground-truth surface/normal data at all levels.
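A weighted sum of such terms, together with one plausible form of a normal-map loss that combines L1 and cosine terms, can be sketched as below. The term names, the weights, and the exact way the L1 and cosine parts are added are assumptions for illustration, since the source does not give them.

```python
import math


def normal_map_loss(pred, target):
    """Mean over pixels of L1 distance plus (1 - cosine similarity).

    pred and target are lists of 3D normal vectors, one per pixel.
    """
    total = 0.0
    for p, t in zip(pred, target):
        l1 = sum(abs(a - b) for a, b in zip(p, t))
        dot = sum(a * b for a, b in zip(p, t))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
        cosine = dot / norm if norm > 0 else 0.0
        total += l1 + (1.0 - cosine)
    return total / len(pred)


def composite_loss(terms, weights):
    """Weighted sum of named loss terms; every term needs a matching weight."""
    return sum(weights[name] * value for name, value in terms.items())


if __name__ == "__main__":
    # Hypothetical term values and weights, chosen only to show the call shape.
    terms = {
        "pointmap": 0.8,
        "rgb": 0.3,
        "scale": 0.05,
        "normal": normal_map_loss([[0.0, 0.0, 1.0]], [[0.0, 0.0, 1.0]]),
        "d_normal": 0.1,
    }
    weights = {"pointmap": 1.0, "rgb": 1.0, "scale": 1.0, "normal": 1.0, "d_normal": 1.0}
    print(composite_loss(terms, weights))
```

Keeping the terms in a dictionary makes it easy to switch individual losses off by setting a weight to zero.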

5. Experimental Setup and Evaluation

Surf3R is benchmarked primarily on ScanNet++ and Replica datasets:

  • Training: 50 indoor scenes (ScanNet++), sampled with 30–70% point-cloud overlap between views.
  • Inference: flexible, from 4 to 100 views, decoded through $M$ reference branches.
  • Metrics: vertex-level surface precision, recall, F1-score (within 2 cm GT tolerance); novel-view synthesis (PSNR, SSIM, LPIPS).
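The surface metrics follow the usual distance-threshold definition, which can be sketched with a brute-force nearest-neighbour search. This assumes coordinates in metres, so the 2 cm tolerance becomes `0.02`; the function names are hypothetical, and a real evaluation would use a spatial index instead of the quadratic loop.

```python
import math


def nearest_distance(point, cloud):
    """Euclidean distance from point to its nearest neighbour in cloud."""
    return min(math.dist(point, other) for other in cloud)


def surface_scores(pred, gt, tol=0.02):
    """Precision, recall and F1 of predicted vertices against ground truth.

    precision: fraction of predicted vertices within tol of the ground truth.
    recall: fraction of ground-truth vertices within tol of the prediction.
    """
    precision = sum(nearest_distance(p, gt) <= tol for p in pred) / len(pred)
    recall = sum(nearest_distance(g, pred) <= tol for g in gt) / len(gt)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    gt = [(0.0, 0.0, 0.01), (0.5, 0.0, 0.0)]
    print(surface_scores(pred, gt))  # (0.5, 0.5, 0.5)
```

Precision penalizes spurious geometry and recall penalizes missing geometry, so F1 is high only when the reconstruction is both accurate and complete.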

Results summary (ScanNet++ 50-scene average):

Method      Precision↑   Recall↑   F1↑      Time
NeuS        29.42        22.14     25.13    >30 min
SuGaR       38.30        34.92     36.12    >30 min
DUSt3R      4.62         4.84      4.06     >1 min
Surf3R-GD   80.24        77.55     78.71    <10 s

On Replica (zero-shot), Surf3R-GD achieves an F1 of 41.92, outperforming NeuralRecon and DUSt3R. On novel-view synthesis with 4 views, Surf3R-GD reaches a PSNR of 15.06 versus 11.66 for DUSt3R. Ablations show a significant impact for the multi-branch design (a 15.39-point change in F1), the D-Normal regularizer (a 10.96-point change in F1), and the normal and scale losses.
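To put the PSNR figures in perspective, the sketch below uses the standard PSNR definition for images with values in [0, 1] and converts a PSNR gap into a ratio of mean squared errors. For a single image pair, the 3.40 dB gap between 15.06 and 11.66 corresponds to an MSE roughly 2.2 times lower; averages over many images need not follow this relation exactly. The function names are hypothetical.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio(psnr_gap_db):
    """Factor by which MSE shrinks when PSNR rises by psnr_gap_db."""
    return 10.0 ** (psnr_gap_db / 10.0)


if __name__ == "__main__":
    print(round(mse_ratio(15.06 - 11.66), 2))
```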

6. Methodological Advances and Limitations

Key advances include:

  • Elimination of camera calibration/pose estimation—Surf3R reconstructs surfaces from unposed RGB alone.
  • Multi-branch decoder with transformer-based cross-view/branch attention, aggregating geometric signals from all views.
  • D-Normal regularizer enforces geometric coupling and consistency between depth, normal, and local patch structure in the Gaussian domain.

Identified limitations: degradation with too many wide-baseline views (noisy overlap statistics), scaling challenges for very large or dynamic environments, and opportunity for further speedup with lighter encoder backbones or adaptive branch strategies.

7. Context, Significance, and Outlook

Surf3R marks a departure from sequential or optimization-based multi-view 3D reconstruction pipelines. It demonstrates that accurate, surface-level geometric reasoning is feasible via purely feed-forward, transformer-driven fusion, without explicit camera or pose modeling. The unified Gaussian representation and D-Normal regularizer together yield both fine detail and global consistency in the reconstructed surfaces. A plausible implication is that future real-time SLAM and AR applications could integrate Surf3R-style pose-free geometry modules, provided robustness to extreme viewpoint disparity and dynamic content can be maintained (Zhu et al., 6 Aug 2025).
