Novel View Synthesis Benchmark
- A novel view synthesis benchmark is a structured evaluation framework that uses diverse datasets, precise calibration, and standardized protocols to assess synthetic view generation.
- It emphasizes photorealism, generalization, and robustness by integrating metrics such as PSNR, SSIM, and LPIPS to capture visual fidelity and perceptual quality.
- The framework drives innovation in computer vision and graphics by enabling reproducible comparisons and guiding the development of more robust and accurate view synthesis methods.
A novel view synthesis benchmark refers to the rigorous evaluation infrastructure, datasets, and protocols devised to assess and compare algorithms that generate images of a scene from camera viewpoints not present in the original acquisition. These benchmarks underpin progress in computer vision and computational graphics by enabling standardized, reproducible, and meaningful assessment of synthetic view generation, with particular focus on photorealism, generalization, and robustness to real-world conditions.
1. Foundations and Evolution of Novel View Synthesis Benchmarks
Benchmarks for novel view synthesis (NVS) originally concentrated on controlled, small-scale, and often object-centric settings. Early influential datasets included ShapeNet for synthetic object rotations and KITTI for urban driving scenes with limited viewpoint diversity. Initial evaluation protocols measured pixel-level error on masked object regions and relied on basic perceptual judgments. As view synthesis algorithms evolved from direct RGB generation to geometry-based approaches and then to advanced neural rendering, benchmarks expanded in scale and complexity to address new challenges: large baselines, real-world scene diversity, and the need for perceptual fidelity.
Recent years have witnessed the emergence of large-scale, richly annotated, and carefully designed benchmarks targeting increasingly realistic and challenging scenarios:
- RTMV introduced a synthetic dataset with 300,000 high-resolution (1600×1600) rendered images from nearly 2,000 complex scenes, leveraging high-quality ray tracing for photorealism. It provided varied lighting, camera trajectories, and shape/material inputs, enabling large-scale training and evaluation with consistent ground truth for geometry, normals, and albedo.
- OMMO established a real-world outdoor multi-modal benchmark, including 33 scenes, diverse urban and natural environments, calibrated images, dense point clouds, and text prompts. This multi-modality supports research in classic NVS, implicit scene reconstruction, and emerging multimodal NeRF approaches.
- WayveScenes101 was designed for autonomous driving, providing 101 driving scenes each with 5 synchronized RGB cameras, extensive metadata (weather, time, traffic, occlusion), and explicit evaluation for off-axis, large-baseline extrapolation, thus stressing generalization in dynamic, in-the-wild settings.
- EUVS ("Extrapolated Urban View Synthesis Benchmark") focused on the most challenging scenario: novel views positioned far outside the training distribution ("extrapolation"), aggregating real-world multi-traversal, multi-agent, multi-camera data from NuPlan, Argoverse 2, and MARS for over 90,000 frames across 104 urban scenarios.
This progression reflects the community’s drive toward practical, large-scale, and semantically complex applications beyond toy examples.
2. Benchmark Datasets and Evaluation Protocols
Benchmarks for NVS typically provide:
- Large and diverse image sets: Spanning controlled object rotations (ShapeNet), urban driving sequences (KITTI, OMMO, WayveScenes101), photorealistic syntheses (RTMV), and multi-modal formats (images, depth, text).
- Accurate calibration: Camera intrinsics and extrinsics, essential for methods leveraging explicit or implicit geometry.
- Ground-truth geometry: Dense point clouds (COLMAP or LiDAR), depth, or mesh proxies, facilitating geometry-based regularization and evaluation.
- Challenging acquisition settings: Varied environments (urban, natural), scales (from meters to kilometers), lighting, moving occluders, and real-world artifacts such as glare and exposure changes.
- Standard splits and protocols: Carefully defined train/validation/test splits, often holding out viewpoints with significant offset for off-axis generalization assessment. For example, WayveScenes101 trains on four peripheral cameras and tests on the off-axis frontal camera; EUVS explicitly benchmarks extrapolated settings.
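To make such a protocol concrete, the following is a minimal sketch of how a calibrated scene record and an off-axis hold-out split might be represented in code. The field names and split logic are illustrative assumptions, not the actual data format of WayveScenes101 or any other benchmark discussed here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple
import numpy as np

@dataclass
class CameraFrame:
    image_path: str
    K: np.ndarray            # 3x3 camera intrinsics
    T_world_cam: np.ndarray  # 4x4 camera-to-world extrinsics
    camera_id: str           # e.g. "front", "left_front" (hypothetical labels)

@dataclass
class SceneRecord:
    scene_id: str
    frames: List[CameraFrame]
    metadata: Dict[str, str] = field(default_factory=dict)  # weather, time of day, ...

def off_axis_split(scene: SceneRecord,
                   held_out_camera: str = "front") -> Tuple[List[CameraFrame], List[CameraFrame]]:
    """Train on peripheral cameras, test on one held-out off-axis camera,
    mirroring the WayveScenes101-style protocol described above."""
    train = [f for f in scene.frames if f.camera_id != held_out_camera]
    test = [f for f in scene.frames if f.camera_id == held_out_camera]
    return train, test
```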
Performance metrics are a central pillar, with standard choices such as:
- PSNR (Peak Signal-to-Noise Ratio): Measures photometric fidelity, sensitive to pixel-level errors.
- SSIM (Structural Similarity): Captures structural and perceptual similarity.
- LPIPS (Learned Perceptual Image Patch Similarity): Uses deep feature embeddings to measure perceptual discrepancy.
- FID (Fréchet Inception Distance): Represents distributional similarity for generative models.
- Task-specific and high-level metrics: DreamSim, foreground/background sensitivity (for perceptual bias), and segmentation/classification-focused metrics are recommended when utility is aligned with downstream tasks.
Recent investigations (2506.12563) highlight that classical metrics (SSIM, PSNR) can be problematic for practical NVS assessment, especially in real-world scenarios where minor alignment and lighting differences do not affect human-perceived usefulness. DreamSim and similar perceptual-level metrics are preferred for their robustness to minor artifacts and their alignment with downstream tasks.
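For reference, PSNR is 10·log10(MAX²/MSE), while SSIM and LPIPS rely on local structure statistics and deep feature distances, respectively. The snippet below is a minimal sketch of per-view metric computation using the widely used scikit-image and lpips packages; it is not the official evaluation code of any benchmark discussed here.

```python
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_view(pred: np.ndarray, gt: np.ndarray, lpips_fn=None) -> dict:
    """Compute PSNR, SSIM, and LPIPS for one rendered view.
    pred, gt: HxWx3 float arrays with values in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    if lpips_fn is None:
        lpips_fn = lpips.LPIPS(net="alex")  # AlexNet-based perceptual distance
    # LPIPS expects NCHW tensors scaled to [-1, 1]
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    with torch.no_grad():
        lp = lpips_fn(to_tensor(pred), to_tensor(gt)).item()
    return {"psnr": psnr, "ssim": ssim, "lpips": lp}
```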
3. Key Principles and Methodological Impact
A well-designed benchmark guides both method development and system design in NVS:
- Controlling for Generalization: By evaluating not only interpolated (close-to-training) views, but also challenging extrapolated and off-axis scenes, benchmarks force methods to avoid overfitting and instead develop robust scene representations (2412.05256).
- Task-Specific Design: Benchmarks are now tailored to practical application domains—urban robotics, autonomous driving, VR—that involve dynamic content, occlusions, and lighting variability (2407.08280, 2412.05256).
- Multi-modal reasoning: Including auxiliary data (point clouds, text, metadata) supports new research in multimodal and semantically-informed NVS (2301.06782).
- Synthetic Data Utility: High-quality rendered datasets (e.g., RTMV) enable both cost-effective large-scale training and precise control of ground truth for model calibration, especially when real-world annotation is labor-intensive (2205.07058).
A related development is the introduction of augmentation pipelines (e.g., Aug3D (2501.06431)), addressing low overlap and coverage in real-world scenes by sampling synthetic or semantic-based views using Structure-from-Motion-provided geometry.
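As a rough illustration of this style of view augmentation, the sketch below interpolates between two SfM-recovered camera poses and adds small translational jitter to generate extra pseudo-views. It is a simplified stand-in under stated assumptions, not the published Aug3D pipeline.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def augment_poses(T_a: np.ndarray, T_b: np.ndarray, n: int = 4,
                  jitter_m: float = 0.05) -> list:
    """Generate n camera-to-world poses (4x4 matrices) between two SfM poses,
    with small translational jitter; a simplified stand-in for Aug3D-style
    view augmentation, not the published algorithm."""
    slerp = Slerp([0.0, 1.0], Rotation.from_matrix([T_a[:3, :3], T_b[:3, :3]]))
    poses = []
    for t in np.linspace(0.0, 1.0, n + 2)[1:-1]:  # skip endpoints (existing views)
        T = np.eye(4)
        T[:3, :3] = slerp(t).as_matrix()                      # interpolated rotation
        T[:3, 3] = ((1 - t) * T_a[:3, 3] + t * T_b[:3, 3]     # interpolated position
                    + np.random.uniform(-jitter_m, jitter_m, 3))
        poses.append(T)
    return poses
```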
4. Benchmark Results and Comparative Findings
Benchmarks yield comprehensive comparative results, providing insights into both current method limitations and directions for improvement:
| Benchmark/Dataset | Methods Evaluated | Focus / Challenge | Key Performance Metrics | Robustness/Finding |
|---|---|---|---|---|
| RTMV | SVLF, NeRF, pixelNeRF | Fast training/inference on large synthetic data | PSNR/SSIM/LPIPS | SVLF: comparable quality, 10–80× faster than NeRF |
| OMMO | NeRF, NeRF++, Mip-NeRF | Outdoor, multi-modal, calibrated scenes | PSNR/SSIM/LPIPS | Mip-NeRF 360: best mean PSNR/SSIM on large scenes |
| WayveScenes101 | NVS methods | Off-axis, dynamic, real-world scenes | PSNR/SSIM/LPIPS/FID | Benchmarking focus: generalization on held-out camera |
| EUVS | 3DGS, NeRF variants | Large viewpoint extrapolation | PSNR/SSIM/LPIPS/feature similarity | All SOTA methods degrade sharply under extrapolation |
| GigaNVS (XScale-NVS) | Surface hash encoding | Cross-scale real-world scenes, mesh robustness | LPIPS/PSNR/SSIM | ≈40% lower LPIPS than prior baselines; avoids mesh-resolution and UV-distortion limitations |
A repeated finding (2412.05256, 2407.08280) is that even state-of-the-art models suffer severe performance degradation, with drops of up to 30% in PSNR and relative worsening of up to 70% in LPIPS, when evaluated on extrapolated or dynamic scenes outside the training distribution.
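Degradation figures of this kind are typically reported as relative changes between an interpolated (near-training) split and an extrapolated split. The helper below is illustrative only, not any benchmark's official tooling; it simply makes the convention explicit, including the fact that lower LPIPS is better.

```python
def relative_degradation(interp: float, extrap: float, higher_is_better: bool = True) -> float:
    """Percent degradation from the interpolated to the extrapolated split.
    For PSNR/SSIM (higher is better) a drop yields a positive percentage;
    for LPIPS (lower is better) pass higher_is_better=False."""
    if higher_is_better:
        return 100.0 * (interp - extrap) / interp
    return 100.0 * (extrap - interp) / interp
```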
5. Applications and Domain-Specific Benchmarks
NVS benchmarks now underpin a wide array of applications:
- Autonomous Driving: WayveScenes101 and EUVS directly target robustness to dynamic objects, glare, and large-baseline extrapolation, vital for safe planning and simulation in real-world traffic.
- Virtual and Augmented Reality: Methods evaluated on GigaNVS enable seamless exploration and inspection of real-world scenes from arbitrary perspectives in VR settings.
- Thermal and Multi-modal Fusion: Veta-GS (2505.19138) introduces tasks and benchmarks for thermal infrared (TIR) novel-view synthesis, dealing with complex transmission and emissivity variations.
- Cross-scale and Dense Urban Mapping: XScale-NVS and GigaNVS focus on high-fidelity, multi-scale generalization without reliance on mesh resolution or UV distortions, facilitating digital twin construction and city-scale modeling.
- Artifact-Robustness: Datasets such as ACC-NVS1 (2503.18711) and new close-up view benchmarks (2503.15908) target robustness to transient occlusions, motion, and close-range extrapolation.
6. Open Challenges, Limitations, and Research Directions
Despite the progress, several challenges remain central to the benchmarking and development of NVS:
- Extrapolation and Overfitting: Even advanced representations such as 3D Gaussian Splatting and NeRF variants overfit to training views and degrade substantially when synthesizing highly novel or close-up views (2503.15908, 2412.05256).
- Metric Limitations: The community is re-evaluating reliance on pixel-level metrics (PSNR, SSIM), as these do not adequately reflect human perceptual utility in real-world tasks. Robust perceptual metrics such as DreamSim are advocated (2506.12563).
- Benchmark Scale and Coverage: Ensuring sufficient coverage of scene diversity, including nighttime, adverse weather, and extreme dynamic content, is an ongoing effort.
- Standardization: Varied protocols and splits across datasets impede direct comparison. Emerging benchmarks encourage standardized protocols—fixed train/test splits, metadata-rich evaluation, and meaningful challenge subsets.
- Synthetic/Real Bridging: While synthetic datasets enable scale and control, closing the domain gap with real imagery remains crucial for robust deployment.
Future directions likely include:
- Larger, more diverse, and richer benchmarks for particular applications (autonomous vehicles, robotics, AR/VR).
- Specialized benchmarks for modality-extended NVS (thermal, LiDAR, multimodal).
- Automated, perceptual-quality-aligned metric development and potential integration of human rating pipelines.
- Emphasis on extrapolation, semantic consistency, and practical artifact-resilience in both evaluation and model design.
7. Summary Table: Major Public NVS Benchmarks
| Benchmark | Scene Type | Scale | Viewpoint Diversity | Special Features |
|---|---|---|---|---|
| ShapeNet, KITTI | Object / driving | 3K–10K images | Small | Synthetic objects / urban driving |
| RTMV | Synthetic 3D | 300K images | Large | Ray tracing, materials, textures |
| OMMO | Real-world outdoor | 14K images | Large | Multi-modal, text, urban/natural |
| WayveScenes101 | Driving (real) | 101K images | 5 cameras, off-axis | Rich metadata, dynamic elements |
| EUVS | Urban / driving | 90K+ frames | Extrapolated | Multi-agent / multi-traversal / multi-camera |
| GigaNVS | Large-scale real | 7 scenes | Cross-scale | 5K–8K resolution, mesh-free hash features |
| ACC-NVS1 | Air & ground | 148K images | Multi-sensor | Occlusion-rich, paired coverage |
| Close-up NVS (2503.15908) | Indoor / outdoor | 14K images | Close-up | Artifact-sensitive benchmarking |
Benchmarks such as OMMO, WayveScenes101, and EUVS combine broad scene coverage, challenging acquisition conditions, standardized protocols, and robust metric frameworks to anchor state-of-the-art progress in novel view synthesis.
Novel view synthesis benchmarking now demands large, diverse datasets, precise calibration, perceptually meaningful metrics, and protocols exposing both generalization and robustness. As new applications and modalities emerge, these benchmarks will remain central to the advancement and deployment of realistic, generalizable NVS algorithms.