
Wild3R: Feed-Forward 3D Gaussian Splatting

Updated 14 June 2026
  • Wild3R Model is a feed-forward 3D Gaussian Splatting network that robustly reconstructs transient-free 3D scenes from sparse, unconstrained photo collections.
  • It leverages a modified AnySplat backbone with a Vision-Geometry Grounded Transformer and advanced augmentation to handle lighting variations and transient objects.
  • Wild3R runs approximately 30x faster than optimization-based methods, outperforms other camera-free feed-forward methods in perceptual and pixel-wise metrics, and approaches the quality of per-scene optimization.

Wild3R is a feed-forward 3D Gaussian Splatting (3DGS) network designed for robust scene reconstruction from unconstrained, sparse photo collections that exhibit substantial variation in lighting and the presence of transient objects. The model achieves high-quality, transient-free 3D representations without per-scene optimization, advancing the applicability of 3DGS methods to practical, real-world photo sets characterized by low viewpoint overlap and inconsistent scene content (Furutani et al., 10 Jun 2026).

1. Architecture and Data Flow

Wild3R is built upon the AnySplat backbone, with modifications for reference-based conditioning and transient robustness. The primary architectural components are:

  • Input Pipeline: Receives $N$ unposed RGB images $\{I_n\}_{n=1}^N$, each up to $448 \times 448$ pixels. One view $\hat{I}_1$ is randomly designated as the reference.
  • Augmentation Strategies: Each input undergoes per-image spatial (scaling, cropping) and appearance augmentation (JPEG compression, Gaussian noise, color jitter, optional grayscale). Half of the views are randomly augmented with transient objects or varying illumination; 12.5% of images are further edited by a generative model (Gemini) to insert realistic transients (e.g., people, vehicles, signage).
  • Backbone: A Vision-Geometry Grounded Transformer (VGGT) extracts a global scene feature representation.
  • Heads:
    • Gaussian Head ($h_g$): Predicts $G$ anisotropic 3D Gaussian primitives $\mathcal{G} = \{g_i\}_{i=1}^G$.
    • Depth Head: Outputs per-view depth maps; parameters are frozen after pretraining.
    • Camera Head: Predicts camera poses; also frozen at fine-tuning.
  • Inference: Only the backbone and Gaussian head are active (≈940 million parameters, the same set that is updated during fine-tuning).

2. 3D Gaussian Splatting Formulation

The scene representation consists of a set of $G$ anisotropic Gaussians, each parameterized by:

$$g_i = \left(\mu_i \in \mathbb{R}^3,\; \Sigma_i \in \mathbb{R}^{3 \times 3},\; \alpha_i \in \mathbb{R},\; c_i(\omega) \in \mathbb{R}^3\right)$$

where:

  • $\mu_i$: 3D center position,
  • $\Sigma_i$: anisotropic covariance, expressed via a rotation quaternion $q_i$ and scale vector $s_i$,
  • $\alpha_i$: opacity,
  • $c_i(\omega)$: view-dependent color, modeled by a fixed-degree spherical harmonics expansion in view direction $\omega$.
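
The quaternion-plus-scale parameterization of the covariance can be illustrated with a minimal NumPy sketch. It assumes the factorization $\Sigma = R S S^\top R^\top$ that is standard in 3DGS, with $R$ the rotation given by the quaternion and $S$ the diagonal scale matrix; the source does not spell this out, and the function names `quat_to_rotmat` and `covariance` are hypothetical.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize so R is a proper rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def covariance(q, s):
    """Assumed 3DGS factorization: Sigma = R S S^T R^T with S = diag(s)."""
    R = quat_to_rotmat(np.asarray(q, dtype=float))
    S = np.diag(np.asarray(s, dtype=float))
    M = R @ S
    return M @ M.T  # symmetric positive semi-definite by construction

# Identity rotation: the covariance is just the squared scales on the diagonal.
sigma = covariance(q=[1.0, 0.0, 0.0, 0.0], s=[0.1, 0.2, 0.3])
```

Predicting a quaternion and a scale vector rather than the nine entries of the matrix keeps every output a valid covariance, since the product above is symmetric and positive semi-definite for any input.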

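Rendering along a ray $r(t) = o + t\,\omega$ takes the standard volume-rendering form, with $\sigma$ the density induced by the Gaussians and $T$ the accumulated transmittance:

$$C(r) = \int_{0}^{\infty} T(t)\,\sigma(r(t))\,c(r(t), \omega)\,dt$$

$$T(t) = \exp\!\left(-\int_{0}^{t} \sigma(r(s))\,ds\right)$$

Discrete approximations allocate each Gaussian to its mean depth $t_i$; with the Gaussians ordered by increasing $t_i$, the overall color is composited per ray using alpha weights $w_i$ computed as:

$$w_i = \alpha_i \prod_{j<i} (1 - \alpha_j)$$

For 2D splat rendering, each Gaussian projects to an elliptical footprint, composited front-to-back using alpha blending:

$$C = \sum_{i=1}^{G} w_i\, c_i(\omega)$$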
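
Front-to-back alpha blending can be sketched in a few lines of NumPy. The sketch assumes the standard 3DGS compositing rule, in which each weight is a Gaussian's alpha multiplied by the product of (1 − alpha) over all nearer Gaussians. The function name `composite` is hypothetical, and the alphas are taken to be the per-pixel values after the footprint has been evaluated.

```python
import numpy as np

def composite(depths, alphas, colors):
    """Blend per-ray Gaussian contributions front-to-back.

    Returns the weights (in depth-sorted order) and the composited RGB color.
    """
    order = np.argsort(depths)                      # nearest Gaussian first
    a = np.asarray(alphas, dtype=float)[order]
    c = np.asarray(colors, dtype=float)[order]
    # Transmittance reaching Gaussian i: product of (1 - a_j) for all j < i.
    trans = np.concatenate(([1.0], np.cumprod(1.0 - a)[:-1]))
    w = a * trans
    return w, (w[:, None] * c).sum(axis=0)

# Two half-opaque Gaussians: a red one at depth 1 in front of a blue one at depth 2.
weights, color = composite(
    depths=[2.0, 1.0],
    alphas=[0.5, 0.5],
    colors=[[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]],
)
```

Here the nearer red Gaussian receives weight 0.5 and the blue one behind it 0.25, so the composited color is (0.5, 0, 0.25); the remaining 0.25 of the transmittance is left for the background.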

3. Reference Conditioning, Losses, and Transient Handling

Wild3R employs a reference-conditioning strategy:

  • Each training batch selects one reference view $\hat{I}_1$, to which all reconstructions are appearance-matched.
  • During training, half of the images are randomly augmented with transient object insertions and illumination changes, with the reference view controlling the expected output style.
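
The batch construction described above might be sketched as follows. This is only an illustration: the function name `build_batch`, the dictionary layout, and the choice of eight views are hypothetical, and whether the reference itself may be among the augmented views is not stated in the source (here it is allowed).

```python
import random

def build_batch(view_ids, rng):
    """Pick a random reference and mark half of the views for augmentation."""
    views = list(view_ids)
    rng.shuffle(views)
    reference = views[0]                 # randomly designated reference view
    n_aug = len(views) // 2              # half of the views get augmented
    # These views would receive transient insertion / illumination changes.
    augmented = set(rng.sample(views, n_aug))
    return {"reference": reference, "views": views, "augmented": augmented}

rng = random.Random(0)
batch = build_batch(range(8), rng)       # 8 views is an arbitrary example size
```

The supervision target for every view would then be rendered in the reference's style, regardless of which views were augmented.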

Supervision is enforced by several loss terms:

  • Appearance Consistency Loss: compares each rendered view with the clean, reference-style image for that view (matching the reference's illumination, no transients).
  • Depth Loss: a supervision term on the predicted depth.
  • Additional Regularization (adopted from AnySplat):
    • Geometric consistency: enforces bi-directional consistency between depths.
    • Camera pose supervision.
  • Total Loss: a weighted sum of the terms above, with scalar weighting hyperparameters.

Transient removal and lighting unification do not require specialized modules; the training regime, built on reference conditioning and strong 2D/appearance augmentation, suffices to strip view-specific occluders and harmonize appearance.

4. WildCity Dataset

Wild3R is trained on WildCity, a large-scale, synthetic dataset engineered for high diversity in appearance and scene content:

  • Scenes: 200 distinct urban environments created via Blender, SceneCity addon, and 130+ Sketchfab assets.
  • Illumination: 170 outdoor HDRI maps; each scene rendered under 30 lighting settings.
  • Viewpoints: 50 random camera positions per scene, arranged in a fan shape with the field-of-view drawn from a fixed range, all rendered at a single resolution.
  • Transients: 12.5% of the images (37,500 total) edited by Gemini for inserted objects.
  • Total Images: 337,500.
  • Mini-batch Strategy: Each batch samples a single scene, selects $N$ views with a designated reference, and applies transient and photometric augmentations.
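
The image counts above can be checked with a short calculation. It assumes the Gemini-edited images are added on top of the rendered set rather than replacing rendered images, which is the reading consistent with the stated total.

```python
scenes, lightings, viewpoints = 200, 30, 50

rendered = scenes * lightings * viewpoints   # 200 * 30 * 50 = 300,000 renders
edited = int(rendered * 0.125)               # 12.5% of the renders = 37,500
total = rendered + edited                    # 300,000 + 37,500 = 337,500
```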

5. Training Protocol and Hyperparameters

  • Initialization: Fine-tuned from AnySplat pretrained weights; depth/camera heads are frozen.
  • Training Details:
    • Parameters: ≈940M (backbone + Gaussian head).
    • Stage: 30,000 iterations on 1×A100 GPU (~1 day).
    • Optimizer: AdamW.
    • Learning Rate: Cosine decay from the peak learning rate, with 1K warmup steps.
    • Batch Sampling: One scene per iteration, $N$ views per batch.
  • Spatial Augmentation: Random anisotropic rescale (0.9–1.1), resize (0.7–1.2), random/center crop to a fixed long-edge size in pixels.
  • Appearance Augmentation: JPEG compression over a range of quality levels, Gaussian noise, and color jitter or grayscale. Any color transform applied to the reference view is applied identically to targets to preserve supervision consistency.
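
Two of the settings above lend themselves to a short sketch: the learning-rate schedule and the rule that a color transform is shared between the reference and its targets. The code below is a sketch under stated assumptions, not the authors' implementation. The warmup is assumed linear and the decay is assumed to end at zero, `peak_lr` is left as a parameter, and the per-channel gain jitter and its `max_gain` range are hypothetical stand-ins for the actual color transform.

```python
import math
import numpy as np

def lr_at(step, peak_lr, total_steps=30_000, warmup_steps=1_000):
    """Linear warmup to peak_lr (assumed), then cosine decay to zero (assumed)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

def shared_color_jitter(images, rng, max_gain=0.2):
    """Draw ONE per-channel gain and apply it to every image in the list.

    images: list of float arrays of shape (H, W, 3) with values in [0, 1].
    Sharing the gain keeps the reference and its targets in the same color space.
    """
    gain = 1.0 + rng.uniform(-max_gain, max_gain, size=3)
    return [np.clip(img * gain, 0.0, 1.0) for img in images], gain

rng = np.random.default_rng(0)
reference = np.full((2, 2, 3), 0.5)
target = np.full((2, 2, 3), 0.5)
(jit_reference, jit_target), gain = shared_color_jitter([reference, target], rng)
```

Drawing the random parameters once per batch, rather than once per image, is what keeps the supervision consistent: if the reference were jittered independently, the clean targets would no longer match the style the network is asked to reproduce.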

6. Quantitative Results and Comparison

Wild3R demonstrates superior performance to all camera-free feed-forward 3DGS methods, achieving results close to those of per-scene optimization-based approaches on the Photo Tourism benchmark (16 context views):

Method                Unseen Camera  Point  Time     PSNR↑  SSIM↑  LPIPS↓
NeRF-W [2021]         ✗              ✗      30 h     17.29  0.530  0.570
3DGS [Kerbl '23]      ✗              ✗      7.7 min  13.56  0.437  0.560
GS-W [Zhang '24]      ✗              ✗      24 min   15.17  0.501  0.504
AsymGS [Li '25]       ✗              ✗      30 min   18.37  0.607  0.463
Long-LRM [Chen '25]   ✗              ✓      0.18 s   15.26  0.486  0.569
AnySplat [Jiang '25]  ✓              ✓      0.95 s   13.72  0.377  0.546
YoNoSplat [Ye '26]    ✓              ✓      0.38 s   13.04  0.403  0.640
DA3 [Lin '26]         ✓              ✓      1.6 s    14.03  0.420  0.586
Wild3R (Ours)         ✓              ✓      0.95 s   15.87  0.435  0.506

Key findings:

  • Wild3R is approximately $30\times$ faster than optimization-based methods.
  • It exceeds vanilla 3DGS in PSNR (15.87 vs. 13.56) and LPIPS (0.506 vs. 0.560) with near-identical SSIM (0.435 vs. 0.437), but still trails NeRF-W and AsymGS in PSNR and SSIM.
  • It outperforms all other camera-free feed-forward methods in perceptual (LPIPS) and pixel-wise (PSNR, SSIM) metrics.
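
The last finding can be checked directly against the table. The sketch below treats the four rows marked ✓ in both columns as the camera-free feed-forward group, which is an assumption about what the table's columns mean; the numbers are copied from the table.

```python
# (PSNR, SSIM, LPIPS) for the rows assumed to be camera-free feed-forward methods.
rows = {
    "AnySplat": (13.72, 0.377, 0.546),
    "YoNoSplat": (13.04, 0.403, 0.640),
    "DA3": (14.03, 0.420, 0.586),
    "Wild3R": (15.87, 0.435, 0.506),
}

best_psnr = max(rows, key=lambda m: rows[m][0])    # higher is better
best_ssim = max(rows, key=lambda m: rows[m][1])    # higher is better
best_lpips = min(rows, key=lambda m: rows[m][2])   # lower is better
```

All three selections come out as Wild3R, which matches the claim for this group. Long-LRM is excluded here because it is marked ✗ in the first column; it has a higher SSIM (0.486) than Wild3R.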

7. Significance and Outlook

Wild3R's primary contributions are twofold. First, it demonstrates that feed-forward 3DGS can approach, and on some metrics rival, per-scene optimization methods for unconstrained, sparse photo collections. Second, it shows that robust transient- and illumination-invariant scene representations can be learned using a reference-conditioned loss and carefully structured training data. The WildCity dataset provides a training resource and reference point for future work on large-scale, robust 3D scene reconstruction in uncontrolled conditions (Furutani et al., 10 Jun 2026).
