Wild3R: Feed-Forward 3D Gaussian Splatting
- Wild3R is a feed-forward 3D Gaussian Splatting network that robustly reconstructs transient-free 3D scenes from sparse, unconstrained photo collections.
- It leverages a modified AnySplat backbone with a Vision-Geometry Grounded Transformer and advanced augmentation to handle lighting variations and transient objects.
- Wild3R produces its outputs approximately 30x faster than optimization-based methods while approaching their quality in perceptual and pixel-wise metrics.
Wild3R is a feed-forward 3D Gaussian Splatting (3DGS) network designed for robust scene reconstruction from unconstrained, sparse photo collections that exhibit substantial variation in lighting and the presence of transient objects. The model achieves high-quality, transient-free 3D representations without per-scene optimization, advancing the applicability of 3DGS methods to practical, real-world photo sets characterized by low viewpoint overlap and inconsistent scene content (Furutani et al., 10 Jun 2026).
1. Architecture and Data Flow
Wild3R is built upon the AnySplat backbone, with modifications for reference-based conditioning and transient robustness. The primary architectural components are:
- Input Pipeline: Receives a set of unposed RGB images, each capped at a maximum resolution. One view is randomly designated as the reference.
- Augmentation Strategies: Each input undergoes per-image spatial (scaling, cropping) and appearance augmentation (JPEG compression, Gaussian noise, color jitter, optional grayscale). Half of the views are randomly augmented with transient objects or varying illumination; 12.5% of images are further edited by a generative model (Gemini) to insert realistic transients (e.g., people, vehicles, signage).
- Backbone: A Vision-Geometry Grounded Transformer (VGGT) extracts a global scene feature representation.
- Heads:
  - Gaussian Head: Predicts the anisotropic 3D Gaussian primitives that represent the scene.
  - Depth Head: Outputs per-view depth maps; parameters are frozen after pretraining.
  - Camera Head: Predicts camera poses; also frozen during fine-tuning.
- Inference: Only the backbone and Gaussian head are active. These are also the only components updated during fine-tuning (≈940 million parameters).
2. 3D Gaussian Splatting Formulation
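The scene representation consists of a set of anisotropic Gaussians, each parameterized by:
$$G_k = \left(\boldsymbol{\mu}_k,\ \boldsymbol{\Sigma}_k,\ \alpha_k,\ \mathbf{c}_k\right)$$
where:
- $\boldsymbol{\mu}_k \in \mathbb{R}^3$: 3D center position,
- $\boldsymbol{\Sigma}_k = \mathbf{R}(\mathbf{q}_k)\,\mathrm{diag}(\mathbf{s}_k)^2\,\mathbf{R}(\mathbf{q}_k)^\top$: anisotropic covariance, expressed via quaternion $\mathbf{q}_k$ and scale vector $\mathbf{s}_k$,
- $\alpha_k \in [0, 1]$: opacity,
- $\mathbf{c}_k(\mathbf{d})$: view-dependent color, modeled by a spherical harmonics expansion of degree $L$ in view direction $\mathbf{d}$.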
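As a minimal sketch of the covariance parameterization described above, the NumPy snippet below builds a covariance matrix from a quaternion and a scale vector using the standard 3DGS factorization $\Sigma = R S S^\top R^\top$. The function names and the (w, x, y, z) quaternion ordering are assumptions made for illustration; they are not taken from the Wild3R code.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion in (w, x, y, z) order to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize so the result is a valid rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance_from_quat_scale(q, s):
    """Build Sigma = R S S^T R^T from quaternion q and per-axis scales s."""
    R = quat_to_rotmat(np.asarray(q, dtype=float))
    S = np.diag(np.asarray(s, dtype=float))
    M = R @ S
    return M @ M.T

# Identity rotation: the covariance is just the squared scales on the diagonal.
sigma_identity = covariance_from_quat_scale([1.0, 0.0, 0.0, 0.0], [0.5, 1.0, 2.0])

# 90-degree rotation about z: the x and y variances swap places.
half = np.sqrt(0.5)
sigma_rotated = covariance_from_quat_scale([half, 0.0, 0.0, half], [0.5, 1.0, 2.0])
```

Because the matrix is formed as $M M^\top$, it is symmetric and positive semi-definite by construction, which is the usual motivation for predicting a quaternion and scales rather than raw covariance entries.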
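Rendering along a ray $\mathbf{r}(t) = \mathbf{o} + t\,\mathbf{d}$ is defined by:
$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt$$
$$T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$$
where $\sigma$ is the volume density and $T$ the accumulated transmittance. Discrete approximations allocate each Gaussian to its mean depth $t_k$; the overall color is composited per ray using alpha weights $w_k$ computed as:
$$w_k = \alpha_k \prod_{j<k} \left(1 - \alpha_j\right)$$
where $\alpha_k$ is the opacity of the $k$-th Gaussian in depth order. For 2D splat rendering, each Gaussian projects to an elliptical footprint, composited front-to-back using alpha blending:
$$C(\mathbf{p}) = \sum_{k} \mathbf{c}_k\,\alpha'_k(\mathbf{p}) \prod_{j<k} \left(1 - \alpha'_j(\mathbf{p})\right)$$
where $\mathbf{c}_k$ is the color of the $k$-th Gaussian and $\alpha'_k(\mathbf{p})$ is its opacity modulated by the projected 2D footprint at pixel $\mathbf{p}$.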
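The front-to-back compositing described above can be illustrated with a short NumPy sketch for a single ray. It uses the standard alpha-blending rule in which each contribution is weighted by its own opacity times the transmittance left by everything in front of it. The function name and array layout are hypothetical, and a real rasterizer would evaluate the footprint-modulated opacity per pixel on the GPU.

```python
import numpy as np

def composite_front_to_back(colors, alphas, depths):
    """Blend per-Gaussian colors along one ray, nearest Gaussian first.

    colors: (K, 3) RGB values, alphas: (K,) opacities, depths: (K,) mean depths.
    Returns (pixel_color, weights), with weights given in the input order.
    """
    colors = np.asarray(colors, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    order = np.argsort(depths)  # nearest first
    a = alphas[order]
    # Transmittance reaching each Gaussian: product of (1 - alpha) of all closer ones.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - a)[:-1]))
    w_sorted = a * transmittance
    weights = np.empty_like(w_sorted)
    weights[order] = w_sorted  # scatter back to the caller's ordering
    return weights @ colors, weights

# Red at depth 2, green at depth 1, opaque blue at depth 3.
pixel, weights = composite_front_to_back(
    colors=[[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    alphas=[0.5, 0.5, 1.0],
    depths=[2.0, 1.0, 3.0],
)
```

In this example green is closest and receives weight 0.5, red receives 0.5 × 0.5 = 0.25, and the opaque blue surface absorbs the remaining 0.25, so the weights sum to one.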
3. Reference Conditioning, Losses, and Transient Handling
Wild3R employs a reference-conditioning strategy:
- Each training batch selects one reference view $I_{\mathrm{ref}}$, to which all reconstructions are appearance-matched.
- During training, half of the images are randomly augmented with transient object insertions and illumination changes, with the reference view controlling the expected output style.
Supervision is enforced by several loss terms:
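- Appearance Consistency Loss ($\mathcal{L}_{\mathrm{app}}$):
$$\mathcal{L}_{\mathrm{app}} = \sum_{i} \ell\!\left(\hat{I}_i,\ \tilde{I}_i\right)$$
where $\hat{I}_i$ is the rendering of view $i$, $\tilde{I}_i$ is the clean, reference-style image for view $i$ (matching the reference's illumination, no transients), and $\ell$ denotes a per-image photometric distance.
- Depth Loss ($\mathcal{L}_{\mathrm{depth}}$):
$$\mathcal{L}_{\mathrm{depth}} = \sum_{i} \ell_d\!\left(\hat{D}_i,\ D_i\right)$$
where $\hat{D}_i$ and $D_i$ are the predicted and target depth maps for view $i$ and $\ell_d$ denotes a per-pixel depth distance.
- Additional Regularization (adopted from AnySplat):
  - Geometric consistency ($\mathcal{L}_{\mathrm{geo}}$): enforces bi-directional consistency between depths.
  - Camera pose supervision ($\mathcal{L}_{\mathrm{cam}}$).
- Total Loss:
$$\mathcal{L} = \lambda_{\mathrm{app}}\,\mathcal{L}_{\mathrm{app}} + \lambda_{\mathrm{depth}}\,\mathcal{L}_{\mathrm{depth}} + \lambda_{\mathrm{geo}}\,\mathcal{L}_{\mathrm{geo}} + \lambda_{\mathrm{cam}}\,\mathcal{L}_{\mathrm{cam}}$$
with weighting hyperparameters $\lambda_{\mathrm{app}}$, $\lambda_{\mathrm{depth}}$, $\lambda_{\mathrm{geo}}$, $\lambda_{\mathrm{cam}}$.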
Transient removal and lighting unification do not require specialized modules; the training regime, built on reference conditioning and strong 2D/appearance augmentation, suffices to strip view-specific occluders and harmonize appearance.
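The supervision scheme in this section can be sketched in a few lines of NumPy. The key point is that the appearance term compares each rendering with a clean, reference-style target rather than with the (possibly perturbed) input image. Everything specific here is an assumption for illustration: mean squared error stands in for the unspecified photometric distance, the loss names are shorthand, and the weights are placeholder values, not the paper's hyperparameters.

```python
import numpy as np

def appearance_loss(rendered, clean_targets):
    """Mean squared error between renderings and clean reference-style targets.

    MSE is a stand-in; the source does not specify the distance used.
    """
    rendered = np.asarray(rendered, dtype=float)
    clean_targets = np.asarray(clean_targets, dtype=float)
    return float(np.mean((rendered - clean_targets) ** 2))

def total_loss(terms, weights):
    """Weighted sum of named loss terms; both dicts must share the same keys."""
    if terms.keys() != weights.keys():
        raise ValueError("loss terms and weights must have matching names")
    return sum(weights[name] * terms[name] for name in terms)

# Two 4x4 RGB views: renderings are uniformly 0.25 brighter than the clean targets.
rendered = np.full((2, 4, 4, 3), 0.50)
clean = np.full((2, 4, 4, 3), 0.25)

terms = {
    "app": appearance_loss(rendered, clean),
    "depth": 0.2,  # dummy values standing in for the other loss terms
    "geo": 0.1,
    "cam": 0.05,
}
weights = {"app": 1.0, "depth": 0.5, "geo": 0.5, "cam": 2.0}  # hypothetical weights
loss = total_loss(terms, weights)
```

Since the targets already carry the reference illumination and contain no transients, minimizing the appearance term pushes the network to ignore occluders and lighting differences in its inputs without any dedicated masking module.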
4. WildCity Dataset
Wild3R is trained on WildCity, a large-scale, synthetic dataset engineered for high diversity in appearance and scene content:
- Scenes: 200 distinct urban environments created via Blender, SceneCity addon, and 130+ Sketchfab assets.
- Illumination: 170 outdoor HDRI maps; each scene rendered under 30 lighting settings.
- Viewpoints: 50 random camera positions per scene, arranged in a fan shape with the field of view drawn from a fixed range, all rendered at a single fixed resolution.
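- Transients: 37,500 images edited by Gemini to insert objects, equal to 12.5% of the 300,000 rendered images (200 scenes × 30 lighting settings × 50 viewpoints).
- Total Images: 337,500 (300,000 renders plus 37,500 edited images).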
- Mini-batch Strategy: Each batch samples a single scene, selects a fixed number of views with one designated as the reference, and applies transient and photometric augmentations.
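The dataset figures listed above are mutually consistent if every viewpoint is rendered under every lighting setting and the edited images are added on top of the renders. That reading is an assumption, but the short check below shows that it reproduces the stated totals. The function name is hypothetical.

```python
def wildcity_image_counts(scenes=200, lightings=30, views=50, edited_fraction=0.125):
    """Return (rendered, edited, total) image counts for the stated dataset layout."""
    rendered = scenes * lightings * views        # every view under every lighting (assumed)
    edited = round(rendered * edited_fraction)   # edited copies added on top of the renders
    return rendered, edited, rendered + edited

rendered, edited, total = wildcity_image_counts()
print(rendered, edited, total)  # 300000 37500 337500
```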
5. Training Protocol and Hyperparameters
- Initialization: Fine-tuned from AnySplat pretrained weights; depth/camera heads are frozen.
- Training Details:
  - Parameters: ≈940M (backbone + Gaussian head).
  - Stage: 30,000 iterations on 1×A100 GPU (~1 day).
  - Optimizer: AdamW.
  - Learning Rate: Cosine decay from the peak learning rate, with 1K warmup steps.
  - Batch Sampling: One scene per iteration, with a fixed number of views per batch.
  - Spatial Augmentation: Random anisotropic rescale (0.9–1.1), resize (0.7–1.2), and random or center crop to a fixed long-edge size.
  - Appearance Augmentation: JPEG compression, Gaussian noise, and color jitter or grayscale conversion. Any color transform applied to the reference view is applied identically to targets to preserve supervision consistency.
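The learning-rate schedule listed above (1K warmup steps followed by cosine decay over 30,000 iterations) could look like the following sketch. The linear shape of the warmup, the decay floor of zero, and the peak value used in the example are assumptions; the source states only the warmup length, the decay type, and the iteration count.

```python
import math

def lr_at_step(step, peak_lr, warmup_steps=1_000, total_steps=30_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up so that the last warmup step reaches the peak value.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr if training runs past total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

PEAK = 1e-4  # placeholder peak learning rate, not the value used in the paper
schedule = [lr_at_step(s, PEAK) for s in (0, 999, 1_000, 15_500, 30_000)]
```

With these settings the rate climbs from `PEAK / 1000` at step 0 to `PEAK` at the end of warmup, passes through half of `PEAK` at the midpoint of the decay phase, and reaches zero at step 30,000.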
6. Quantitative Results and Comparison
Wild3R demonstrates superior performance to all camera-free feed-forward 3DGS methods, achieving results close to those of per-scene optimization-based approaches on the Photo Tourism benchmark (16 context views):
| Method | Unseen camera | Unseen point | Time | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|---|---|
| NeRF-W [2021] | ✗ | ✗ | 30 h | 17.29 | 0.530 | 0.570 |
| 3DGS [Kerbl '23] | ✗ | ✗ | 7.7 min | 13.56 | 0.437 | 0.560 |
| GS-W [Zhang '24] | ✗ | ✗ | 24 min | 15.17 | 0.501 | 0.504 |
| AsymGS [Li '25] | ✗ | ✗ | 30 min | 18.37 | 0.607 | 0.463 |
| Long-LRM [Chen '25] | ✗ | ✓ | 0.18 s | 15.26 | 0.486 | 0.569 |
| AnySplat [Jiang '25] | ✓ | ✓ | 0.95 s | 13.72 | 0.377 | 0.546 |
| YoNoSplat [Ye '26] | ✓ | ✓ | 0.38 s | 13.04 | 0.403 | 0.640 |
| DA3 [Lin '26] | ✓ | ✓ | 1.6 s | 14.03 | 0.420 | 0.586 |
| Wild3R (Ours) | ✓ | ✓ | 0.95 s | 15.87 | 0.435 | 0.506 |
Key findings:
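- Wild3R is approximately 30× faster than optimization-based methods.
- Narrows the gap to optimization-based methods: it exceeds vanilla 3DGS in PSNR (15.87 vs. 13.56) and LPIPS (0.506 vs. 0.560) and nearly ties it in SSIM (0.435 vs. 0.437), but still trails NeRF-W and AsymGS in PSNR and SSIM.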
- Outperforms all other camera-free feed-forward methods in perceptual (LPIPS) and pixel-wise (PSNR, SSIM) metrics.
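The ranking claim above can be checked directly against the table. The sketch below copies the metric values for the four rows that carry two check marks, which is taken here as the set of camera-free feed-forward methods (an assumption about how the table's flags map onto that phrase). The helper name is hypothetical.

```python
# (PSNR, SSIM, LPIPS) copied from the Photo Tourism table above.
camera_free_feed_forward = {
    "AnySplat": (13.72, 0.377, 0.546),
    "YoNoSplat": (13.04, 0.403, 0.640),
    "DA3": (14.03, 0.420, 0.586),
    "Wild3R": (15.87, 0.435, 0.506),
}

def best_per_metric(rows):
    """Return the winning method per metric (higher PSNR/SSIM, lower LPIPS)."""
    return {
        "PSNR": max(rows, key=lambda m: rows[m][0]),
        "SSIM": max(rows, key=lambda m: rows[m][1]),
        "LPIPS": min(rows, key=lambda m: rows[m][2]),
    }

winners = best_per_metric(camera_free_feed_forward)

# PSNR gain over the AnySplat model that Wild3R is fine-tuned from.
psnr_gain = camera_free_feed_forward["Wild3R"][0] - camera_free_feed_forward["AnySplat"][0]
```

Wild3R wins all three metrics within this group, and its PSNR is about 2.15 dB above AnySplat at the same 0.95 s runtime. Long-LRM is excluded because the table does not mark it in both flag columns; it scores a higher SSIM (0.486) than Wild3R.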
7. Significance and Outlook
Wild3R's primary contributions are the demonstration that feed-forward 3DGS can approach per-scene optimization methods on unconstrained, sparse photo collections, and that scene representations robust to transients and illumination changes can be learned using a reference-conditioned loss and carefully structured training data. The WildCity dataset represents a benchmark for future work on large-scale, robust 3D scene reconstruction in uncontrolled conditions (Furutani et al., 10 Jun 2026).