Geo-EVS: Geometry-Conditioned Extrapolative View Synthesis for Autonomous Driving

Published 8 Apr 2026 in cs.CV | (2604.07250v1)

Abstract: Extrapolative novel view synthesis can reduce camera-rig dependency in autonomous driving by generating standardized virtual views from heterogeneous sensors. Existing methods degrade outside recorded trajectories because extrapolated poses provide weak geometric support and no dense target-view supervision. The key is to explicitly expose the model to out-of-trajectory condition defects during training. We propose Geo-EVS, a geometry-conditioned framework under sparse supervision. Geo-EVS has two components. Geometry-Aware Reprojection (GAR) uses fine-tuned VGGT to reconstruct colored point clouds and reproject them to observed and virtual target poses, producing geometric condition maps. This design unifies the reprojection path between training and inference. Artifact-Guided Latent Diffusion (AGLD) injects reprojection-derived artifact masks during training so the model learns to recover structure under missing support. For evaluation, we use a LiDAR-Projected Sparse-Reference (LPSR) protocol when dense extrapolated-view ground truth is unavailable. On Waymo, Geo-EVS improves sparse-view synthesis quality and geometric accuracy, especially in high-angle and low-coverage settings. It also improves downstream 3D detection.



Summary

The paper introduces Geo-EVS, a geometry-conditioned framework that combines deterministic reprojection and artifact-guided latent diffusion to extrapolate novel views from sparse supervision.
The methodology utilizes Geometry-Aware Reprojection for dense 3D point extraction and Artifact-Aware Training with simulated masks, achieving enhanced sparse-PSNR and sparse-SSIM metrics.
Geo-EVS proves its practical value in autonomous driving by significantly boosting downstream 3D detection performance through augmented view synthesis.


Introduction and Motivation

This paper addresses the problem of extrapolative novel view synthesis (NVS) for autonomous driving, particularly the challenge of generating virtual camera views at poses outside the recorded distribution. Heterogeneous camera rigs in autonomous vehicles necessitate camera-agnostic data reuse, yet most approaches require dense or trajectory-constrained supervision. Existing methods—both reconstruction-based (e.g., NeRF, Gaussian splatting) and generative diffusion models—struggle with extrapolation where geometric support is weak and ground truth is typically unavailable for novel poses.

The authors introduce Geo-EVS, a geometry-conditioned framework for extrapolative NVS that explicitly tackles sparse supervision and train–test distribution mismatch. By combining deterministic reprojection-based conditioning with artifact-aware generative modeling, the approach aims to unify the projection pathway across training and inference and to improve robustness in regions that lack geometric support. Geo-EVS is evaluated on the Waymo Open Dataset under a protocol that reflects the lack of dense labels for novel views, and it is further assessed downstream via 3D detection.

Methodology

Geometry-Aware Reprojection (GAR)

GAR is the data preprocessing backbone. It uses a fine-tuned Visual Geometry Grounded Transformer (VGGT) to extract dense 3D point clouds from monocular images, then projects these colored points into target poses to construct geometric condition maps. During training, the maps are generated for observed poses and paired with the corresponding RGB image. At inference, the identical reprojection interface is used for unseen, extrapolated poses.

The reprojection process enforces strong constraints (positive depth, in-frame, and z-buffer visibility) to ensure geometric consistency. Invalid pixels are set to zero, resulting in sparse, structure-aligned condition maps that maintain calibration consistency across the data pipeline.
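The three validity constraints can be illustrated with a minimal NumPy sketch. It assumes a pinhole camera and one pixel per point; the function name `reproject_points`, its parameters, and the pinhole model are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def reproject_points(points_world, colors, K, T_cam_from_world, height, width):
    """Render colored 3D points into a target camera as a sparse condition map.

    points_world: (N, 3) points; colors: (N, 3) RGB values.
    K: (3, 3) pinhole intrinsics; T_cam_from_world: (4, 4) extrinsics.
    Returns (cond, valid): an (H, W, 3) map that is zero at unsupported
    pixels, and an (H, W) boolean support mask.
    """
    # Move points into the target camera frame.
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    pts_cam = (T_cam_from_world @ pts_h.T).T[:, :3]
    z = pts_cam[:, 2]

    # Constraint 1: positive depth (point lies in front of the camera).
    front = z > 1e-6
    pts_cam, z, colors = pts_cam[front], z[front], colors[front]

    # Perspective projection to integer pixel coordinates.
    uvw = (K @ pts_cam.T).T
    u = np.floor(uvw[:, 0] / z).astype(int)
    v = np.floor(uvw[:, 1] / z).astype(int)

    # Constraint 2: the projection falls inside the image.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, colors = u[inside], v[inside], z[inside], colors[inside]

    # Constraint 3: z-buffer visibility. Sort by pixel, then by depth,
    # and keep only the nearest point that lands on each pixel.
    flat = v * width + u
    order = np.lexsort((z, flat))
    _, first = np.unique(flat[order], return_index=True)
    keep = order[first]

    cond = np.zeros((height * width, 3), dtype=np.float64)
    valid = np.zeros(height * width, dtype=bool)
    cond[flat[keep]] = colors[keep]
    valid[flat[keep]] = True
    return cond.reshape(height, width, 3), valid.reshape(height, width)
```

Because the same function would serve observed and virtual poses alike, only the extrinsics change between training and inference, which is the sense in which the reprojection path is unified.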

Artifact-Guided Latent Diffusion (AGLD)

For geometry-to-image synthesis, Geo-EVS builds on a conditional latent diffusion model. The model takes as input the noisy latent of the target image together with the encoded geometric condition; the two are concatenated at the input layer to enforce pixel-wise spatial alignment. The loss is the standard denoising objective, and the condition is randomly dropped during training to enable classifier-free guidance at inference.

Notably, missing regions in the geometric condition leave the diffusion model unconstrained at those pixels, requiring the network to rely on priors learned from the data. This property motivates explicit exposure to extrapolation-like artifacts during training.
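The conditioning scheme in this subsection can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: noise-prediction parameterization, dropout implemented by zeroing the condition, and a drop probability of 0.1 are all hypothetical choices, as are the function names.

```python
import numpy as np

def build_denoiser_input(noisy_latent, cond_latent, rng, p_drop=0.1):
    """Channel-wise concatenation of the noisy target latent and the condition.

    noisy_latent: (C, H, W); cond_latent: (Cc, H, W) with the same H and W,
    so that each condition pixel lines up with the latent pixel it constrains.
    With probability p_drop the condition is zeroed (assumed dropout form).
    """
    if noisy_latent.shape[1:] != cond_latent.shape[1:]:
        raise ValueError("latent and condition must share spatial size")
    if rng.random() < p_drop:
        cond_latent = np.zeros_like(cond_latent)
    return np.concatenate([noisy_latent, cond_latent], axis=0)

def denoising_loss(eps_pred, eps_true):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps_true) ** 2))

def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by a guidance scale."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

With `scale = 1` the guided prediction reduces to the conditional one; larger values push the sample harder toward the geometric condition.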

Artifact-Aware Training

To simulate the artifacts and missing support inherent to extrapolative settings, the method employs an artifact-mask library. These masks, derived from virtual-pose reprojected views, are stochastically injected into geometric condition maps during training to create inputs with morphology-matched holes and discontinuities.

This augmentation aligns the train-time input distribution with the defects observed during extrapolative inference, enhancing robustness. The approach outperforms generic random masking strategies, as shown in ablation studies.
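A minimal sketch of the mask injection step follows, assuming the library is a list of boolean hole masks already extracted from virtual-pose reprojections. The function name `inject_artifact_mask`, the injection probability, and the zero-fill convention are hypothetical and chosen only for illustration.

```python
import numpy as np

def inject_artifact_mask(cond_map, mask_library, rng, p_inject=0.5):
    """Stochastically punch reprojection-shaped holes into a condition map.

    cond_map: (H, W, 3) geometric condition for an observed pose.
    mask_library: list of (H, W) boolean arrays, True where a virtual-pose
    reprojection had no geometric support.
    Returns a new array; the input is left unmodified.
    """
    out = cond_map.copy()
    if len(mask_library) == 0 or rng.random() >= p_inject:
        return out
    # Draw one mask uniformly at random and zero the pixels it marks,
    # matching the zero-fill used for invalid pixels in the condition map.
    hole = mask_library[rng.integers(len(mask_library))]
    out[hole] = 0.0
    return out
```

The contrast with random masking lies entirely in where the masks come from: here the hole shapes are taken from real reprojection failures rather than sampled as arbitrary rectangles or noise.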

Evaluation: LiDAR-Projected Sparse-Reference Protocol

Given the unavailability of dense ground-truth images at novel poses, the authors adopt a LiDAR-Projected Sparse-Reference (LPSR) protocol. LiDAR point clouds are projected onto the extrapolated view, serving as a sparse but accurate photometric reference. All reconstruction metrics (e.g., sparse-PSNR, sparse-SSIM) are computed only over valid, LiDAR-supported pixels. This evaluation is more indicative of real-world applicability in automotive vision.
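The masked metric computation can be sketched as follows for PSNR. This is an illustrative version assuming images scaled to [0, 1] and a boolean validity mask marking LiDAR-supported pixels; the function name `sparse_psnr` and its signature are hypothetical.

```python
import numpy as np

def sparse_psnr(pred, ref, valid, max_val=1.0):
    """PSNR restricted to pixels that have a sparse reference value.

    pred, ref: (H, W, 3) images; valid: (H, W) boolean mask of supported pixels.
    Pixels outside the mask do not contribute to the error.
    """
    if not valid.any():
        raise ValueError("no valid reference pixels")
    diff = pred[valid] - ref[valid]
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Restricting the mean to supported pixels keeps regions without any reference from inflating or deflating the score.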

Experimental Results

In-Manifold and Extrapolative Performance

Geo-EVS is compared against several baselines including 3DGS, Street Gaussians, EmerNeRF, and FreeVS, spanning both reconstruction and generative paradigms. In standard in-manifold (trajectory-consistent) settings, Geo-EVS achieves state-of-the-art FID (3.9), substantially ahead of both generative and reconstruction-oriented competitors.

For extrapolated novel views not aligned with training trajectories, Geo-EVS achieves the highest sparse-PSNR and sparse-SSIM among all assessed baselines: 23.65 and 0.941, respectively. Quality degrades gracefully as pose offset and geometric sparsity increase, consistent with the goal of its artifact-aware training.

Downstream 3D Detection

Geo-EVS-generated intermediate views significantly enhance the performance of a BEVFormer 3D detector: inclusion of 40 generated views alongside 5 native cameras yields +1.0 LET-mAP$_\mathrm{L}$, +0.8 LET-mAPH, and +1.3 L1/mAPH improvements. This demonstrates practical utility in augmenting datasets for downstream perception tasks, validating the semantic and geometric fidelity of the synthesized images.

Ablation Studies and Limitations

Ablation analysis shows that reprojection-derived artifact masks confer a clear benefit (+1.92 dB sparse-PSNR over the no-masking baseline) and outperform random occlusion masking. The morphology of missing support, that is, the structure of the defects, directly affects generalization to extrapolative settings.

Remaining failure cases include inconsistent dynamic object generation, significant hallucinations under severe geometric sparsity, and artifacts near occlusion boundaries associated with limited point cloud detail. These limitations point toward necessary future work in multi-frame fusion and improved temporal/semantic consistency.

Implications and Future Directions

This research reframes the extrapolative NVS challenge for autonomous driving as a geometry-conditioned generative modeling problem under sparse supervision. The pipeline's explicit modeling of projection-induced defects and its well-structured evaluation protocol set a methodological precedent for dataset-lean view synthesis. By demonstrating utility for both image-level synthesis and downstream tasks, Geo-EVS highlights the importance of robust geometry-aware generative models in cross-platform vision for autonomous systems.

Theoretical extensions may include leveraging temporal geometric cues, multi-frame aggregation, or 4D space-time representations, as well as exploring alternative conditioning mechanisms beyond masking. Practically, such systems could expedite dataset harmonization for fleet-wide perception, reducing the cost of sensor platform upgrades and enabling more standardized 3D scene understanding.

Conclusion

Geo-EVS presents a cohesive, geometry-aware framework for extrapolative view synthesis with direct applicability to real-world autonomous driving datasets. By aligning training and inference via unified geometric projection and explicitly addressing the challenge of missing-support artifacts, the approach sets new benchmarks for both visual fidelity and geometric consistency in sparse-reference regimes. The method's demonstrated utility for downstream 3D detection solidifies its relevance for practical deployment. Future work should concentrate on improving performance under extreme sparsity and handling dynamic scene content through richer temporal modeling and multi-frame conditioning.

Reference: "Geo-EVS: Geometry-Conditioned Extrapolative View Synthesis for Autonomous Driving" (2604.07250)
