Wild3R: Feed-Forward 3D Gaussian Splatting

Updated 14 June 2026

Wild3R is a feed-forward 3D Gaussian Splatting network that robustly reconstructs transient-free 3D scenes from sparse, unconstrained photo collections.
It leverages a modified AnySplat backbone with a Vision-Geometry Grounded Transformer and advanced augmentation to handle lighting variations and transient objects.
Wild3R achieves high-quality outputs approximately 30x faster than optimization-based methods, matching classical techniques in perceptual and pixel-wise metrics.

Wild3R is a feed-forward 3D Gaussian Splatting (3DGS) network designed for robust scene reconstruction from unconstrained, sparse photo collections that exhibit substantial variation in lighting and the presence of transient objects. The model achieves high-quality, transient-free 3D representations without per-scene optimization, advancing the applicability of 3DGS methods to practical, real-world photo sets characterized by low viewpoint overlap and inconsistent scene content (Furutani et al., 10 Jun 2026).

1. Architecture and Data Flow

Wild3R is built upon the AnySplat backbone, with modifications for reference-based conditioning and transient robustness. The primary architectural components are:

Input Pipeline: Receives $N$ unposed RGB images $\{I_n\}_{n=1}^N$, each up to $448 \times 448$ pixels. One view $\hat{I}_1$ is randomly designated as the reference.
Augmentation Strategies: Each input undergoes per-image spatial (scaling, cropping) and appearance augmentation (JPEG compression, Gaussian noise, color jitter, optional grayscale). Half of the views are randomly augmented with transient objects or varying illumination; 12.5% of images are further edited by a generative model (Gemini) to insert realistic transients (e.g., people, vehicles, signage).
Backbone: A Vision-Geometry Grounded Transformer (VGGT) extracts a global scene feature representation.
Heads:
- Gaussian Head ($h_g$): Predicts $G$ anisotropic 3D Gaussian primitives $\mathcal{G} = \{g_i\}_{i=1}^G$.
- Depth Head: Outputs per-view depth maps; parameters are frozen after pretraining.
- Camera Head: Predicts camera poses; also frozen at fine-tuning.
Inference: Only the backbone and Gaussian head are active (≈940 million parameters, the same parameters that are trained during fine-tuning).
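The component roles listed above can be summarized in a small sketch. The `Component` class, the component names, and the `names` helper are hypothetical labels chosen for illustration; only the trainable/frozen and inference/no-inference assignments come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    trainable: bool          # updated during Wild3R fine-tuning
    used_at_inference: bool  # part of the forward path at test time


# Role assignments follow the text; the identifiers are illustrative only.
COMPONENTS = [
    Component("vggt_backbone", trainable=True, used_at_inference=True),
    Component("gaussian_head", trainable=True, used_at_inference=True),
    Component("depth_head", trainable=False, used_at_inference=False),
    Component("camera_head", trainable=False, used_at_inference=False),
]


def names(components, **flags):
    """Return the names of components whose attributes match every given flag."""
    return [
        c.name
        for c in components
        if all(getattr(c, key) == value for key, value in flags.items())
    ]


inference_path = names(COMPONENTS, used_at_inference=True)
frozen = names(COMPONENTS, trainable=False)
```

Under this reading, the inference path and the set of trained components coincide, which is why the parameter count quoted for inference and for training is the same.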

2. 3D Gaussian Splatting Formulation

The scene representation consists of a set of $G$ anisotropic Gaussians, each parameterized by:

$g_i = (\mu_i \in \mathbb{R}^3,\, \Sigma_i \in \mathbb{R}^{3 \times 3},\, \alpha_i \in \mathbb{R},\, c_i(\omega) \in \mathbb{R}^3)$

where:

$\mu_i$ : 3D center position,
$\Sigma_i$ : anisotropic covariance, expressed via quaternion $q_i$ and scale vector $s_i$,
$\alpha_i$ : opacity,
$c_i(\omega)$ : view-dependent color, modeled by a fixed-degree spherical harmonics expansion in view direction $\omega$.
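As an illustration of how a covariance can be expressed through a quaternion and a scale vector, the sketch below assumes the factorization commonly used in 3D Gaussian Splatting, $\Sigma = R S S^\top R^\top$, with $R$ the rotation from the quaternion and $S$ the diagonal scale matrix. The function names and the `(w, x, y, z)` quaternion ordering are assumptions, not details taken from Wild3R.

```python
import numpy as np


def quat_to_rotmat(q):
    """Convert a quaternion in (w, x, y, z) order to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize so the result is a proper rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])


def covariance(q, s):
    """Build Sigma = R S S^T R^T from a quaternion q and per-axis scales s."""
    R = quat_to_rotmat(np.asarray(q, dtype=float))
    S = np.diag(np.asarray(s, dtype=float))
    M = R @ S
    return M @ M.T  # symmetric positive semi-definite by construction


# Identity rotation: the covariance is diagonal with the squared scales.
sigma = covariance([1.0, 0.0, 0.0, 0.0], [0.1, 0.2, 0.3])
```

Predicting a quaternion and scales rather than the nine entries of $\Sigma_i$ directly guarantees a valid (symmetric, positive semi-definite) covariance for any network output.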

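Rendering along a ray $r(t) = o + t\,\omega$ is defined by:

$C(r) = \int_{0}^{\infty} T(t)\,\sigma(r(t))\,c(r(t), \omega)\,dt$

$T(t) = \exp\left(-\int_{0}^{t} \sigma(r(s))\,ds\right)$

where $\sigma$ is the volume density and $T(t)$ is the accumulated transmittance.

Discrete approximations allocate each Gaussian to its mean depth $t_i$; the overall color is composited per ray using alpha weights $w_i$ computed as:

$w_i = \alpha_i \prod_{j<i} (1 - \alpha_j)$

For 2D splat rendering, each Gaussian projects to an elliptical footprint, composited front-to-back using alpha blending, with Gaussians indexed in order of increasing depth:

$C = \sum_{i=1}^{G} c_i(\omega)\,\alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$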
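Front-to-back alpha blending for a single pixel can be sketched as follows. This is a minimal NumPy illustration of the standard compositing rule, in which each Gaussian contributes its color weighted by its alpha times the transmittance left over by all nearer Gaussians; it is not the rasterizer used by Wild3R, and the function name and array layout are assumptions.

```python
import numpy as np


def composite_front_to_back(colors, alphas, depths):
    """Blend per-Gaussian colors for one pixel.

    colors: (G, 3) RGB values, alphas: (G,) effective opacities at this pixel,
    depths: (G,) mean depths used for ordering.
    Returns the pixel color and the blending weights in near-to-far order.
    """
    order = np.argsort(depths)  # nearest Gaussian first
    c = np.asarray(colors, dtype=float)[order]
    a = np.asarray(alphas, dtype=float)[order]
    # Transmittance in front of each Gaussian: product of (1 - alpha) over nearer ones.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - a)[:-1]))
    weights = a * transmittance
    return weights @ c, weights


colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
pixel, weights = composite_front_to_back(colors, alphas=[0.5, 0.5], depths=[1.0, 2.0])
# weights -> [0.5, 0.25]; pixel -> [0.5, 0.25, 0.0]
```

The nearer red Gaussian absorbs half of the light, so the green Gaussian behind it can contribute at most a quarter of the final color.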

3. Reference Conditioning, Losses, and Transient Handling

Wild3R employs a reference-conditioning strategy:

Each training batch selects one reference view $\hat{I}_1$, to which all reconstructions are appearance-matched.
During training, half of the images are randomly augmented with transient object insertions and illumination changes, with the reference view controlling the expected output style.
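The role assignment described above can be sketched as a small sampling routine. The function name, the uniform choice of reference, and the decision not to exclude the reference from augmentation are assumptions; the source states only that one view is the reference and that half of the images are augmented. The view count of 16 is used purely as an example.

```python
import numpy as np


def sample_view_roles(num_views, rng, augmented_fraction=0.5):
    """Pick a reference view and flag the views that receive transient/illumination edits.

    Hypothetical helper: whether the reference itself may be augmented is not
    specified in the source, so it is not excluded here.
    """
    reference = int(rng.integers(num_views))  # index of the reference view
    num_augmented = int(round(augmented_fraction * num_views))
    augmented = np.zeros(num_views, dtype=bool)
    augmented[rng.choice(num_views, size=num_augmented, replace=False)] = True
    return reference, augmented


rng = np.random.default_rng(0)
reference, augmented = sample_view_roles(num_views=16, rng=rng)
```

Every view, augmented or not, is then supervised against a clean image in the reference's style, which is what pushes the network to ignore transients and lighting differences in its inputs.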

Supervision is enforced by several loss terms:

Appearance Consistency Loss ( $\mathcal{L}_{\text{app}}$ ): penalizes the difference between each rendered view $n$ and its clean, reference-style target image (matching the reference's illumination, no transients).

Depth Loss ( $\mathcal{L}_{\text{depth}}$ ): supervises the predicted depth.

Additional Regularization (adopted from AnySplat):
- Geometric consistency ( $\mathcal{L}_{\text{geo}}$ ): enforces bi-directional consistency between depths.
- Camera pose supervision ( $\mathcal{L}_{\text{cam}}$ ).
Total Loss: a combination of the terms above, controlled by four hyperparameters.

Transient removal and lighting unification do not require specialized modules; the training regime, built on reference conditioning and strong 2D/appearance augmentation, suffices to strip view-specific occluders and harmonize appearance.

4. WildCity Dataset

Wild3R is trained on WildCity, a large-scale, synthetic dataset engineered for high diversity in appearance and scene content:

Scenes: 200 distinct urban environments created via Blender, SceneCity addon, and 130+ Sketchfab assets.
Illumination: 170 outdoor HDRI maps; each scene rendered under 30 lighting settings.
Viewpoints: 50 random camera positions per scene, arranged in a fan shape with field of view drawn from a fixed range, all rendered at the same resolution.
Transients: 12.5% of the rendered images (37,500 in total) are edited by Gemini to insert transient objects.
Total Images: 337,500, counting the 37,500 edited images alongside the 300,000 renders.
Mini-batch Strategy: Each batch samples a single scene, selects $N$ views with a designated reference, and applies transient and photometric augmentations.
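The dataset figures above are mutually consistent under one reading, sketched below: every scene is rendered under each of its 30 lighting settings from each of its 50 viewpoints, and the Gemini-edited images are counted in addition to the renders. That reading is an assumption; the variable names are illustrative.

```python
scenes = 200
lightings_per_scene = 30
viewpoints_per_scene = 50
edited_fraction = 0.125  # share of renders that also get a Gemini-edited copy

rendered = scenes * lightings_per_scene * viewpoints_per_scene  # 300,000 renders
edited = int(edited_fraction * rendered)                        # 37,500 edited images
total = rendered + edited                                       # 337,500 images overall
```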

5. Training Protocol and Hyperparameters

Initialization: Fine-tuned from AnySplat pretrained weights; depth/camera heads are frozen.
Training Details:
- Parameters: ≈940M (backbone + Gaussian head).
- Stage: 30,000 iterations on 1×A100 GPU (~1 day).
- Optimizer: AdamW.
- Learning Rate: Cosine decay from the peak learning rate, with 1K warmup steps.
- Batch Sampling: One scene per iteration, $N$ views per batch.
Spatial Augmentation: Random anisotropic rescale (0.9–1.1), resize (0.7–1.2), and random or center crop to a fixed long-edge length in pixels.
Appearance Augmentation: JPEG compression, Gaussian noise, and color jitter or grayscale. Any color transform applied to the reference view is applied identically to targets to preserve supervision consistency.
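A cosine schedule with warmup over the 30,000 training iterations can be sketched as below. The linear shape of the warmup, the decay to zero, and the function name are assumptions; the peak learning rate is left as a required argument because its value is not given here.

```python
import math


def learning_rate(step, peak_lr, warmup_steps=1_000, total_steps=30_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (both shapes assumed)."""
    if step < warmup_steps:
        # Ramp up linearly so the first steps do not use the full learning rate.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # clamp in case training runs past total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Relative schedule (peak_lr = 1.0) sampled at a few steps.
schedule = [learning_rate(step, peak_lr=1.0) for step in (0, 999, 1_000, 15_500, 30_000)]
```

With `peak_lr=1.0` the returned values act as multipliers, so the same function can scale whatever peak rate is actually used.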

6. Quantitative Results and Comparison

Wild3R demonstrates superior performance to all camera-free feed-forward 3DGS methods, achieving results close to those of per-scene optimization-based approaches on the Photo Tourism benchmark (16 context views):

Method	Unseen Camera	Unseen Point	Time	PSNR↑	SSIM↑	LPIPS↓
NeRF-W [2021]	✗	✗	30 h	17.29	0.530	0.570
3DGS [Kerbl '23]	✗	✗	7.7 min	13.56	0.437	0.560
GS-W [Zhang '24]	✗	✗	24 min	15.17	0.501	0.504
AsymGS [Li '25]	✗	✗	30 min	18.37	0.607	0.463
Long-LRM [Chen '25]	✗	✓	0.18 s	15.26	0.486	0.569
AnySplat [Jiang '25]	✓	✓	0.95 s	13.72	0.377	0.546
YoNoSplat [Ye '26]	✓	✓	0.38 s	13.04	0.403	0.640
DA3 [Lin '26]	✓	✓	1.6 s	14.03	0.420	0.586
Wild3R (Ours)	✓	✓	0.95 s	15.87	0.435	0.506

Key findings:

Wild3R is approximately $30\times$ faster than optimization-based methods.
Closes much of the PSNR/SSIM gap compared to "classical" 3DGS and NeRF-W.
Outperforms all other camera-free feed-forward methods in perceptual (LPIPS) and pixel-wise (PSNR, SSIM) metrics.
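For readers unfamiliar with the pixel-wise metric, PSNR is conventionally derived from the mean squared error between a rendered image and its ground-truth target. The sketch below uses that conventional definition with images scaled to [0, 1]; the exact evaluation protocol of the benchmark (masking, resolution, color space) is not described here, so this is not a reproduction of it.

```python
import numpy as np


def psnr(rendered, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_value]."""
    diff = np.asarray(rendered, dtype=float) - np.asarray(target, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


target = np.zeros((4, 4, 3))
rendered = np.full((4, 4, 3), 0.1)
value = psnr(rendered, target)  # MSE = 0.01, i.e. about 20 dB
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the mean squared error.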

7. Significance and Outlook

Wild3R's primary contributions are the demonstration that feed-forward 3DGS can match, or rival, per-scene optimization methods for unconstrained, sparse photo collections and that robust transient- and illumination-invariant scene representations can be learned using a reference-conditioned loss and carefully structured training data. The WildCity dataset represents a benchmark for future work on large-scale, robust 3D scene reconstruction in uncontrolled conditions (Furutani et al., 10 Jun 2026).

References (1)

Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection (2026)
