
WildCity Dataset: Multi-Illumination Urban Scenes

Updated 14 June 2026
  • WildCity is a synthetic dataset of 337,500 urban-scene images, with each scene rendered under 30 lighting conditions and 50 views per condition.
  • It employs Blender Cycles for photorealistic renders and uses text-guided 2D editing to introduce transient objects like pedestrians and vehicles.
  • The dataset decouples illumination, geometry, and transient effects, enabling robust evaluation of feed-forward 3D neural rendering methods.

The WildCity dataset is a large-scale synthetic image dataset designed for learning robust scene representations from unconstrained sparse photo collections under diverse lighting conditions and transient scene variations. Developed in conjunction with the Wild³R feed-forward 3D Gaussian Splatting (3DGS) framework, WildCity addresses the need for training data exhibiting multi-view, multi-illumination, and transient object diversity, supporting the development of models capable of handling real-world photographic collections (Furutani et al., 10 Jun 2026).

1. Dataset Scope, Structure, and Content

WildCity comprises 337,500 images in total. The base set covers 200 unique urban scenes, each rendered under 30 HDRI-based lighting conditions with 50 camera viewpoints per illumination, yielding 300,000 base renderings. A further 37,500 images carry view-specific transient objects, generated through text-driven 2D edits (Gemini) applied to 12.5% ($f = 0.125$) of the base images.

Transient object categories include:

  • Pedestrians (individuals, groups, cyclists, delivery workers)
  • Vehicles (cars, taxis, vans, buses; both static and dynamic)
  • Construction features (traffic cones, barriers, sign stands, warning signs)
  • Building-attached graphics (banners, signboards, posters, surface cracks)

Lighting diversity is driven by 170 unique equirectangular HDRI maps ("LightCity" collection), with 30 sampled per scene without replacement, so each HDRI is expected to appear in approximately $E[\#\text{appearances}] = (200 \times 30)/170 \approx 35.3$ scenes. Each camera is placed within a $120^\circ$ fan around the scene center, at 1–2 m height and 10–25 m radial distance, with a field of view varying between $40^\circ$ and $100^\circ$. Renders use Blender Cycles (PBR) at $512 \times 512$ px and 512 spp, with full simulation of scattering phenomena: diffuse, glossy, transmissive, background, and emission components.
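The sampling ranges above can be illustrated with a minimal Python sketch. Only the numeric ranges (170 HDRIs, 30 per scene, 120° fan, 1–2 m height, 10–25 m radius, 40°–100° field of view) come from the text. The uniform distributions, the fan orientation, the coordinate convention, and the function names `sample_hdris` and `sample_camera` are assumptions made for illustration, not the authors' pipeline.

```python
import math
import random

NUM_HDRIS = 170        # size of the HDRI pool (from the text)
HDRIS_PER_SCENE = 30   # lighting conditions per scene (from the text)


def sample_hdris(rng):
    """Pick 30 distinct HDRI indices for one scene (without replacement)."""
    return rng.sample(range(NUM_HDRIS), HDRIS_PER_SCENE)


def sample_camera(rng):
    """Draw one camera placement within the stated ranges.

    Assumptions (not from the source): uniform sampling, a fan centred on
    the +x axis, the scene centre at the origin, and z as the up axis.
    """
    azimuth = math.radians(rng.uniform(-60.0, 60.0))  # 120-degree fan
    radius = rng.uniform(10.0, 25.0)                  # metres from the centre
    height = rng.uniform(1.0, 2.0)                    # metres above ground
    fov_deg = rng.uniform(40.0, 100.0)                # field of view, degrees
    position = (radius * math.cos(azimuth), radius * math.sin(azimuth), height)
    return {"position": position, "fov_deg": fov_deg}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hdri_ids = sample_hdris(rng)
cameras = [sample_camera(rng) for _ in range(50)]

# Expected number of scenes in which a given HDRI appears.
expected_uses_per_hdri = 200 * HDRIS_PER_SCENE / NUM_HDRIS
print(len(hdri_ids), len(cameras), round(expected_uses_per_hdri, 1))
```

Camera orientation (for example, pointing each camera at the scene center) is left out because the text does not specify it.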

2. Data Generation Protocol

Scene geometry combines a base set of 11 building types (SceneCity) with over 130 supplementary Sketchfab models under permissive licenses. Nine procedural city templates were generated, each assigned a unique Sketchfab model subset, and 20–25 locations per template were manually selected for maximal street exposure, yielding 200 scenes. For each (scene, illumination) pair, 50 viewpoint samples are drawn and rendered.

Transients are introduced via post-process text-guided 2D editing without enforcing 3D spatial-consistency across views. Each transient is instance-specific at the image level, supporting tests of robustness to occlusion and photometric variation.

3. File Organization and Metadata

While explicit filenames are not mandated, the recommended directory structure is as follows:

Path/Directory                               Contents or Role
WildCity/scenes.json                         Scene IDs and asset splits (200 scenes)
WildCity/hdris.json                          Metadata for 170 HDRI maps
WildCity/cameras/scene_{000...199}/...       Camera intrinsics/extrinsics per view
WildCity/images/scene_{...}/illum_{...}/     Rendered RGB images (per illumination, view)
WildCity/depth/scene_{...}/illum_{...}/      EXR depth maps
WildCity/mask/scene_{...}/illum_{...}/       Sky masks per image
WildCity/transients.txt                      Index of images with transient augmentations
WildCity/metadata.csv                        scene_id, illum_id, view_id, filepaths, transient flag

Transients are flagged at the image level. No official train/validation/test splits are provided; the dataset was used in its entirety for supervised model training, with cross-scene splits (e.g., 160 train / 20 validation / 20 test) recommended for controlled evaluation.
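A cross-scene split of the kind recommended above can be sketched as follows. The 160/20/20 sizes come from the text. The seeded shuffle, the integer scene IDs 0–199, and the function name `cross_scene_split` are illustrative assumptions, since no official split is provided.

```python
import random


def cross_scene_split(num_scenes=200, n_val=20, n_test=20, seed=0):
    """Partition scene IDs into disjoint train/val/test sets.

    Splitting by scene (rather than by image) keeps every view and
    illumination of a held-out scene out of the training set.
    """
    scene_ids = list(range(num_scenes))
    random.Random(seed).shuffle(scene_ids)  # seeded for reproducibility
    test = sorted(scene_ids[:n_test])
    val = sorted(scene_ids[n_test:n_test + n_val])
    train = sorted(scene_ids[n_test + n_val:])
    return train, val, test


train_ids, val_ids, test_ids = cross_scene_split()
print(len(train_ids), len(val_ids), len(test_ids))
```

Images would then be assigned to a split by looking up their scene ID, for instance through the `scene_id` column of `metadata.csv`.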

4. Statistical Summary

Let $S = 200$ (scenes), $I_s = 30$ (illuminations per scene), $V = 50$ (views per illumination), and $f = 0.125$ (fraction with transients):

  • Total base images: $N_{\text{base}} = S \times I_s \times V = 200 \times 30 \times 50 = 300{,}000$
  • Transient images: $N_{\text{trans}} = f \times N_{\text{base}} = 0.125 \times 300{,}000 = 37{,}500$
  • Total images: $N_{\text{total}} = N_{\text{base}} + N_{\text{trans}} = 337{,}500$

Transient augmentation is binomially distributed per scene, with an expected count of $f \times I_s \times V = 0.125 \times 1500 = 187.5$ transient images per scene. The uniform distribution of HDRI maps minimizes illumination bias across the dataset.
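The counts in this section can be checked with a few lines of Python. The inputs (200 scenes, 30 illuminations, 50 views, a 0.125 transient fraction, 170 HDRIs) come from the text. The standard deviation is derived here under the stated binomial assumption and is not a figure reported by the source.

```python
import math

S, I_S, V, F = 200, 30, 50, 0.125  # scenes, illuminations, views, fraction
NUM_HDRIS = 170

n_base = S * I_S * V               # base renderings
n_transient = round(F * n_base)    # transient-augmented images
n_total = n_base + n_transient

# Per-scene transient count, treating each of the I_S * V images of a scene
# as an independent trial with success probability F (binomial model).
images_per_scene = I_S * V
mean_transient_per_scene = F * images_per_scene
std_transient_per_scene = math.sqrt(images_per_scene * F * (1 - F))

# Expected number of scenes in which a given HDRI appears.
expected_scenes_per_hdri = S * I_S / NUM_HDRIS

print(n_base, n_transient, n_total)
print(mean_transient_per_scene, round(std_transient_per_scene, 2))
print(round(expected_scenes_per_hdri, 1))
```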

5. Licensing, Access, and Usage Constraints

WildCity is publicly released at https://furuschool.github.io/wild3r-page. The dataset incorporates:

  • Sketchfab models (various permissive licenses)
  • LightCity HDRIs
  • SceneCity building assets (licensed separately as required)

The overall distribution is described as a "permissive, academic-use license" (CC BY 4.0 or equivalent), with attribution required and use limited to non-commercial research. Because CC BY 4.0 by itself does not prohibit commercial use, users must consult the official download page for the precise legal terms.

6. Downstream Tasks, Evaluation Metrics, and Benchmarks

WildCity provides training support for feed-forward 3DGS systems, most notably Wild³R. Its efficacy is assessed on:

  • Photo Tourism (Brandenburg Gate, Sacre Coeur, Trevi Fountain), under several input-view settings, including 16 and 64 views
  • NeRF-OSR (4 scenes, 16 views each)

Performance metrics are:

  • PSNR (peak signal-to-noise ratio, $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$; higher is better)
  • SSIM (structural similarity index)
  • LPIPS (learned perceptual image patch similarity; lower is better)
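As a concrete reference for the first metric, the following is a minimal pure-Python sketch of the standard PSNR definition over flattened pixel values. It assumes intensities in the range 0 to `max_value` and is not the evaluation code used by the authors. SSIM and LPIPS are omitted because they require windowed statistics and a pretrained network, respectively.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Two pixels, each off by 0.1: MSE = 0.01, which corresponds to about 20 dB.
example_db = psnr([0.5, 0.5], [0.6, 0.4])
print(round(example_db, 2))
```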

Wild³R, trained on WildCity, achieves state-of-the-art results among feed-forward, camera-agnostic approaches, approaching the output quality of optimization-based pipelines while enabling sub-second scene reconstruction. On Photo Tourism, for example, Wild³R attains PSNR/SSIM/LPIPS of 15.87/0.435/0.506 with 16 views and 16.29/0.458/0.477 with 64 views, substantially improving on previous feed-forward or camera-based methods in the trade-off between reconstruction quality and inference speed (Furutani et al., 10 Jun 2026).

7. Context, Significance, and Limitations

WildCity is the first dataset to provide multi-view, multi-illumination, transient-augmented urban imagery at city scale with explicit, high-fidelity ground truth for camera parameters, depth, and scene segmentation. Its structured design allows for systematic evaluation of robustness to lighting variation and transient occlusion, alongside rigorous control over geometry and material composition. The decoupling of illumination and transients from geometry enables targeted analysis of invariance and disentanglement in neural rendering pipelines.

A key limitation is that 2D transient augmentations lack cross-view 3D consistency, reflecting known challenges in real-world Internet photo collections; a plausible implication is that future extensions may require consistent 3D transient object insertions for enhanced realism and generalizability.

WildCity has established a new standard in synthetic benchmarking for scene representation learning under unconstrained and photometrically complex real-world conditions, forming the backbone for advancements in generalizable feed-forward neural rendering.
