
Visual Geometry Transformer in the Wild

Updated 23 June 2026
  • The paper demonstrates that VGTW achieves state-of-the-art 3D reconstruction by robustly suppressing transient distractors without relying on 3D ground truth.
  • It integrates LoRA-enabled self-attention, auxiliary mask supervision, and a unified loss function to maintain geometric consistency across diverse scenes.
  • Results on benchmarks like NeRF-on-the-go and RobustNeRF show significant improvements in accuracy, completeness, and normal consistency metrics.

Visual Geometry Transformer in the Wild (VGTW) is an end-to-end, feed-forward framework designed for multi-view 3D reconstruction from unstructured image collections containing transient distractors, such as moving people or vehicles. Unlike preceding transformer-based approaches that assume perfectly static, distractor-free scenes, VGTW introduces mechanisms to robustly suppress distractor effects and maintain geometric consistency without relying on 3D ground truth, thereby achieving state-of-the-art reconstruction performance in diverse, real-world conditions (Pan et al., 22 Jun 2026).

1. Problem Motivation and Distinctive Features

Traditional end-to-end multi-view 3D reconstruction frameworks achieve high fidelity under the assumption of static, clean environments. This assumption fails in practical, in-the-wild scenarios where transient distractors and occlusions are prevalent. Such conditions cause even state-of-the-art methods, including transformer-based networks like VGGT and π³, to generate spurious 3D artifacts (“ghost points”) or lose geometric fidelity.

VGTW directly addresses this limitation by introducing a distractor-suppressive pipeline, centered on a distractor-aware training (DAT) paradigm and auxiliary mask supervision. Key novel components are: (1) fine-tuned Low-Rank Adaptation (LoRA) adapters within all self-attention layers to control feature attribution, (2) an auxiliary mask-prediction head supervised with pixel-perfect distractor masks from the newly introduced RobustNeRF-Mask dataset, and (3) a unified objective that combines mask, suppression, and cross-view consistency losses within feature space. The result is a system that outputs clean, distractor-filtered point clouds efficiently and consistently, without any 3D ground truth supervision or per-scene optimization.

2. System Architecture and Data Flow

VGTW processes $N$ unstructured RGB views $\{I_1, \ldots, I_N\}$ (each of size $3 \times H \times W$) through the following stages:

  • Patch Embedding: Each image is tokenized using a pretrained DINOv2 ViT, producing patchwise tokens.
  • Transformer Backbone with LoRA: Tokens are processed by interleaved view-wise and global self-attention modules, maintaining the architecture of VGGT/π³ but supplemented with LoRA adapters in every attention layer. Only LoRA parameters are updated during DAT, preserving the stability of the pretrained backbone.
  • Multi-Head Decoding: The final features $H_i$ are fed into three DPT-style heads predicting (1) camera pose $g_i$, (2) depth map $D_i$, and (3) point map $P_i$. In the case of VGGT, an additional tracking head $T_i$ is present.
  • Mask Head: A small convolutional-MLP head $\mathrm{Head}_{mask}(H) \rightarrow M_i \in \{0,1\}^{H \times W}$ predicts a per-pixel binary distractor mask.
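The LoRA step in the backbone can be illustrated with a minimal sketch. It assumes the standard LoRA formulation (a frozen weight plus a scaled low-rank update with a zero-initialized factor); the class name `LoRALinear`, the `rank` and `alpha` parameters, and the initialization scale are illustrative choices, not details taken from the paper.

```python
import random


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply matrices a (n x k) and b (k x m), both lists of rows."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (hypothetical helper)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # pretrained weight (d_out x d_in), never updated
        # A gets a small random init, B starts at zero, so the adapted layer
        # initially reproduces the pretrained layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        """Compute x W^T + scale * x A^T B^T for a batch x of shape (n x d_in)."""
        base = matmul(x, transpose(self.weight))
        low_rank = matmul(matmul(x, transpose(self.A)), transpose(self.B))
        return [[b + self.scale * l for b, l in zip(rb, rl)]
                for rb, rl in zip(base, low_rank)]

    def num_trainable(self):
        """Only A and B would receive gradients during fine-tuning."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

With a zero-initialized `B`, the layer starts out identical to the pretrained one, which is one reason adapter-only updates tend to leave the backbone stable.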

At inference, the predicted depth $D_i$ and pose $g_i$ are used to generate a point cloud with per-point confidence, which is then filtered using the predicted mask $M_i$ and the confidence values, so that a point is removed if it fails either test. Optionally, the $N$ cleaned point clouds are fused into a single 3D reconstruction. The full forward pass for five views incurs only a minor computational overhead (about 0.38 s vs. 0.36 s for plain VGGT).
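The inference-time filtering can be sketched as follows. This is an illustration under stated assumptions: a pinhole camera with intrinsics `(fx, fy, cx, cy)`, a camera-to-world pose given as a rotation matrix and translation, and a confidence cutoff `conf_min`. None of these conventions or values are specified in the text above, and the function name is hypothetical.

```python
def unproject_and_filter(depth, intrinsics, rotation, translation,
                         distractor_mask, confidence, conf_min=0.5):
    """Lift one depth map to world-space points, skipping rejected pixels.

    depth, distractor_mask, confidence: H x W nested lists.
    intrinsics: (fx, fy, cx, cy) of an assumed pinhole camera.
    rotation (3x3), translation (3,): assumed camera-to-world pose.
    conf_min: assumed confidence cutoff, not a value from the paper.
    """
    fx, fy, cx, cy = intrinsics
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            # Drop pixels flagged as distractors or with low confidence.
            if distractor_mask[v][u] == 1 or confidence[v][u] < conf_min:
                continue
            # Back-project pixel (u, v) with depth d into camera coordinates.
            cam = ((u - cx) / fx * d, (v - cy) / fy * d, d)
            # Transform into world coordinates: R @ cam + t.
            world = tuple(
                sum(r * c for r, c in zip(rot_row, cam)) + t
                for rot_row, t in zip(rotation, translation)
            )
            points.append(world)
    return points
```

Fusing views then amounts to concatenating the per-view lists returned by this function, since all points already share the world frame.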

3. Distractor-Aware Training (DAT) Paradigm

DAT is central to VGTW’s ability to distinguish between static and distractor-contaminated regions. DAT uses exclusively 2D mask annotations, requiring no depth or 3D point supervision. The fine-tuning process and loss structure are as follows:

  • Distractor Suppression Loss: For each patch feature in a given view, identify the most similar feature among all other views via cosine similarity. For features marked as distractors by the ground-truth mask, this best-match similarity is penalized, so that distractor features stop finding confident cross-view matches.

  • Cross-View Consistency Loss: For non-distractor features, maximize cross-view similarity across best-matching pairs within a soft margin.

  • Mask Supervision: The mask head is trained with binary cross-entropy loss against ground-truth pixel masks from RobustNeRF-Mask.

  • Full Objective: The total loss is a weighted sum of the three terms above, controlled by per-term weights and the margin parameters.
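For orientation, one plausible way to write these terms is given below. This is a hedged reconstruction from the verbal description, not the paper's own formulas: the hinge form, the symbols $f_i^p$ (patch feature $p$ of view $i$), $s_i^p$, $\mathcal{D}$ and $\mathcal{S}$ (distractor and static patch sets), $M_i^{*}$ and $\hat{M}_i$ (ground-truth and predicted masks), the margins $m_{sup}$, $m_{cons}$, and the weights $\lambda$ are all assumptions introduced here.

```latex
% Best-match cosine similarity of patch p in view i against all other views
s_i^p = \max_{j \neq i,\; q} \cos\!\left(f_i^p, f_j^q\right)

% Suppression: penalize distractor patches whose best match exceeds a margin
\mathcal{L}_{sup} = \frac{1}{|\mathcal{D}|} \sum_{(i,p) \in \mathcal{D}} \max\!\left(0,\; s_i^p - m_{sup}\right)

% Consistency: penalize static patches whose best match falls below a margin
\mathcal{L}_{cons} = \frac{1}{|\mathcal{S}|} \sum_{(i,p) \in \mathcal{S}} \max\!\left(0,\; m_{cons} - s_i^p\right)

% Mask supervision: per-pixel binary cross-entropy
\mathcal{L}_{mask} = -\frac{1}{NHW} \sum_{i=1}^{N} \sum_{x}
  \left[ M_i^{*}(x) \log \hat{M}_i(x) + \left(1 - M_i^{*}(x)\right) \log\!\left(1 - \hat{M}_i(x)\right) \right]

% Weighted sum
\mathcal{L} = \lambda_{mask}\,\mathcal{L}_{mask} + \lambda_{sup}\,\mathcal{L}_{sup} + \lambda_{cons}\,\mathcal{L}_{cons}
```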

All loss gradients are restricted to LoRA adapters in the attention pathways, ensuring stable adaptation from the pretrained DINOv2 ViT.
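The two feature-space losses described above can be sketched in plain Python. The hinge form and the margin values `m_sup` and `m_cons` are assumptions made for illustration; the function names are hypothetical and the real implementation would operate on batched tensors with gradients flowing only into the LoRA parameters.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def best_match_similarity(feats, view, patch):
    """Highest cosine similarity of one patch against all patches in other views.

    Requires at least two views, otherwise there is nothing to match against.
    """
    query = feats[view][patch]
    return max(
        cosine(query, other)
        for j, view_feats in enumerate(feats) if j != view
        for other in view_feats
    )


def mean(values):
    return sum(values) / len(values) if values else 0.0


def dat_feature_losses(feats, is_distractor, m_sup=0.2, m_cons=0.8):
    """Return (suppression, consistency) losses under an assumed hinge form.

    feats[i][p]: feature vector of patch p in view i.
    is_distractor[i][p]: True if the ground-truth mask marks the patch.
    m_sup, m_cons: illustrative margins, not values from the paper.
    """
    sup_terms, cons_terms = [], []
    for i, view_feats in enumerate(feats):
        for p in range(len(view_feats)):
            s = best_match_similarity(feats, i, p)
            if is_distractor[i][p]:
                # Distractor patches are pushed toward low best-match similarity.
                sup_terms.append(max(0.0, s - m_sup))
            else:
                # Static patches are pulled toward high best-match similarity.
                cons_terms.append(max(0.0, m_cons - s))
    return mean(sup_terms), mean(cons_terms)
```

Because both terms are computed from 2D masks and features alone, the sketch needs no depth or point supervision, which matches the 2D-only setting described above.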

4. Evaluation Benchmarks and Results

Empirical performance is reported on two rigorous benchmarks designed to reflect transient distractors:

  • NeRF-on-the-go: Scenes with synthetic occlusion splits (Low/Medium/High) that are unseen during training.
  • RobustNeRF: Static scenes with embedded synthetic and real distractors.

The following metrics are used:

  • Accuracy (Acc ↓)
  • Completeness (Comp ↓)
  • Normal Consistency (NC ↑)
  • Absolute Relative Depth Error (Abs Rel ↓)
  • $\delta$ threshold for depth accuracy ($\delta$ ↑, higher is better)
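The point-cloud metrics in this list are commonly computed from nearest-neighbor distances. The sketch below uses those common definitions (accuracy from predicted to ground-truth points, completeness in the reverse direction, normal consistency as the mean absolute dot product of unit normals at nearest neighbors). Whether the paper uses exactly these variants is an assumption, and the brute-force search is for illustration only.

```python
import math


def nearest(point, cloud):
    """Return (index, distance) of the closest point in cloud (brute force)."""
    return min(((i, math.dist(point, q)) for i, q in enumerate(cloud)),
               key=lambda pair: pair[1])


def accuracy(pred, gt):
    """Mean distance from each predicted point to its nearest ground-truth point."""
    return sum(nearest(p, gt)[1] for p in pred) / len(pred)


def completeness(pred, gt):
    """Mean distance from each ground-truth point to its nearest predicted point."""
    return sum(nearest(g, pred)[1] for g in gt) / len(gt)


def normal_consistency(pred, pred_normals, gt, gt_normals):
    """Mean |n_pred . n_gt| over predicted points and their nearest GT points.

    Normals are assumed to be unit length.
    """
    total = 0.0
    for p, n in zip(pred, pred_normals):
        j, _ = nearest(p, gt)
        total += abs(sum(a * b for a, b in zip(n, gt_normals[j])))
    return total / len(pred)
```

Ghost points from distractors mainly hurt accuracy, since they sit far from any ground-truth surface, while over-aggressive filtering would show up as worse completeness.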

Tables below summarize results averaged across disturbance levels:

Table 1. NeRF-on-the-go (point-map reconstruction):

| Method | Acc ↓ | Comp ↓ | NC ↑ |
|---|---|---|---|
| DUSt3R | 0.037 | 0.080 | 0.747 |
| MaSt3R | 0.045 | 0.157 | 0.692 |
| Fast3R | 0.041 | 0.069 | 0.680 |
| VGGT | 0.041 | 0.146 | 0.640 |
| + VGTW(VGGT) | 0.033 | 0.117 | 0.704 |
| π³ | 0.051 | 0.074 | 0.709 |
| + VGTW(π³) | 0.027 | 0.060 | 0.692 |

Table 2. RobustNeRF (point-map reconstruction):

| Method | Acc ↓ | Comp ↓ | NC ↑ |
|---|---|---|---|
| DUSt3R | 0.031 | 0.068 | 0.696 |
| MaSt3R | 0.038 | 0.388 | 0.616 |
| Fast3R | 0.034 | 0.105 | 0.660 |
| VGGT | 0.021 | 0.045 | 0.684 |
| + VGTW(VGGT) | 0.011 | 0.025 | 0.740 |
| π³ | 0.017 | 0.016 | 0.754 |
| + VGTW(π³) | 0.010 | 0.010 | 0.718 |

Depth-metric improvements include an increase in RobustNeRF’s mean $\delta$ accuracy from 81.1% (VGGT) to 94.5% (VGTW(VGGT)); on NeRF-on-the-go (medium occlusion), Abs Rel drops from 0.246 to 0.125 while $\delta$ accuracy increases from 58.7% to 82.2%.
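The two depth metrics quoted here follow widely used definitions, sketched below. The threshold value 1.25 is the conventional default and is an assumption, since the threshold used in the evaluation is not stated above; inputs are assumed to contain only valid pixels with positive depth.

```python
def abs_rel(pred, gt):
    """Absolute relative error: mean of |pred - gt| / gt over valid pixels."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels whose depth ratio max(pred/gt, gt/pred) is below threshold.

    threshold=1.25 is a common convention, assumed here for illustration.
    """
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)
```

Abs Rel is lower-is-better and the threshold accuracy is higher-is-better, which is why the reported drop in one and rise in the other both count as improvements.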

5. Component Ablations and Qualitative Assessment

Ablation studies on NeRF-on-the-go establish the impact of DAT elements:

| Configuration | Overall (Acc+Comp)/2 ↓ | NC ↑ |
|---|---|---|
| Baseline (no DAT) | 0.048 | 0.663 |
| + one feature-space loss only | 0.040 | 0.653 |
| + both feature-space losses | 0.031 | 0.693 |
| + mask head too | 0.031 | 0.695 |
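To put the ablation numbers in relative terms, the short calculation below expresses each configuration's overall error as a fractional reduction against the baseline. The values are copied from the table; the row labels and the helper name are illustrative.

```python
def relative_reduction(baseline, value):
    """Fractional reduction of an error metric relative to the baseline."""
    return (baseline - value) / baseline


# Overall (Acc+Comp)/2 values from the ablation table, in row order.
overall = [
    ("baseline", 0.048),
    ("second row", 0.040),
    ("third row", 0.031),
    ("with mask head", 0.031),
]

baseline_value = overall[0][1]
for label, value in overall[1:]:
    print(f"{label}: {relative_reduction(baseline_value, value):.1%} lower overall error")
```

Read this way, most of the gain in overall error arrives by the third row, while the mask head leaves the overall error unchanged and slightly raises NC.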

Qualitative results demonstrate that VGTW eliminates ghost artifacts from transient distractors (e.g., moving people, animals) while retaining geometric detail in static areas. For instance, in the crab scene, the static shell is cleanly reconstructed, while person/dog artifacts are fully suppressed. Comparable ghosting suppression is observed in wild DAVIS video tests. In distractor-free scenes (DTU benchmark), VGTW matches the baseline VGGT, signifying no performance degradation when distractors are absent.

6. Auxiliary Mask Supervision and Dataset

The auxiliary mask head is supplied with pixel-level distractor masks from the proprietary RobustNeRF-Mask dataset, annotated with per-pixel precision. Ground-truth masks supervise mask-head training via binary cross-entropy, and the predicted masks $M_i$ are thresholded during inference to identify and filter distractor points. This 2D-only supervision strategy eliminates the need for any form of 3D ground truth or per-instance optimization.
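A minimal sketch of the mask supervision and the inference-time thresholding follows. It assumes the mask head outputs per-pixel probabilities; the cutoff `tau=0.5` and the clamping constant `eps` are illustrative assumptions, not values from the paper.

```python
import math


def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    total, count = 0.0, 0
    for prob_row, target_row in zip(probs, targets):
        for p, t in zip(prob_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def threshold_mask(probs, tau=0.5):
    """Binarize predicted probabilities; 1 marks a distractor pixel.

    tau is an assumed cutoff chosen for illustration.
    """
    return [[1 if p > tau else 0 for p in row] for row in probs]
```

For example, `threshold_mask([[0.2, 0.7]])` marks only the second pixel as a distractor, and points unprojected from that pixel would be dropped from the reconstruction.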

7. Integration, Computational Efficiency, and Compatibility

VGTW’s pipeline remains computationally efficient, with only minor latency overhead relative to VGGT, and is compatible with established transformer backbones. It requires no additional 3D supervision and can be incorporated into upstream or downstream 3D vision pipelines without structural modification. LoRA-based adaptation also reuses the pretrained weights and keeps task-specific fine-tuning parameter-efficient.

In summary, VGTW establishes a new state of the art for robust, distractor-free multi-view 3D reconstruction in unconstrained settings by integrating transformer-based geometric reasoning, pixel-level mask supervision, and efficient attention weight adaptation within a unified, lightweight training and inference architecture (Pan et al., 22 Jun 2026).
