Vision-Only Dynamic NeRF for Urban Scenes

Updated 16 November 2025
  • The paper introduces a novel framework that jointly models static infrastructure and dynamic objects using self-supervised decomposition and explicit 3D scene flow.
  • The model employs a static branch and a dynamic branch whose outputs are fused via learned weights to enhance novel view synthesis, achieving improvements of up to +4.20 dB PSNR.
  • The approach robustly estimates monocular camera poses and separates dynamic regions, setting new state-of-the-art metrics on large-scale urban datasets.

Vision-only Dynamic Neural Radiance Field (VDNeRF) refers to a class of neural scene representations and learning frameworks that reconstruct and render photorealistic views of dynamic, real-world environments from monocular RGB imagery, entirely without external pose/sensor data or explicit multi-view geometric priors. VDNeRF approaches jointly estimate camera trajectories, leverage explicit 3D scene flow for dynamic regions, and utilize rigorous self-supervised decompositions to robustly separate static backgrounds and independently moving objects. Contemporary VDNeRF models achieve state-of-the-art synthesis and localization accuracy in large-scale urban scenarios, establishing a standard for vision-only dynamic scene understanding (Zou et al., 9 Nov 2025).

1. Fundamental Methodology and Rationale

The core principle underlying VDNeRF architectures is the joint modeling of static and dynamic components via two explicitly separated neural radiance fields:

  • Static NeRF $\mathcal{F}^s_\Theta$: Encodes static infrastructure (buildings, roads, background) as a volumetric radiance field indexed by spatial coordinates and viewing direction. This branch also learns camera poses $\{P_i\}_{i=1}^N$, initialized randomly and optimized via photometric and auxiliary geometric losses, without using any extrinsic pose priors or Structure-from-Motion outputs.
  • Dynamic NeRF $\mathcal{F}^d_\Theta$: Represents dynamic objects and regions, accounting for independent motion using spatial coordinates, a time index $t$, and a learned scene flow field $f_\Theta^{\text{flow}}(x,t)$ that predicts local forward/backward 3D displacements.

This decomposition mitigates the common ambiguity encountered in dynamic scene reconstruction—namely, the difficulty in distinguishing between camera movement and object motion from monocular input. VDNeRF resolves this by explicitly optimizing poses on static regions (which are reliably matched across frames), and using scene flow to capture true object deformation and displacement.
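
The pose-free aspect can be made concrete with a short sketch: per-frame camera poses are ordinary learnable parameters optimized alongside the static field from photometric error on static pixels. The PyTorch snippet below is a minimal illustration under assumed names (`StaticField`, `pose_params`) and an assumed axis-angle-plus-translation pose parameterization; it is not the paper's implementation.

```python
import torch
import torch.nn as nn

class StaticField(nn.Module):
    """Stand-in for the hash-grid + MLP static branch: (x, d) -> (sigma, rgb)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4))

    def forward(self, x, d):
        out = self.mlp(torch.cat([x, d], dim=-1))
        sigma = torch.relu(out[..., :1])      # non-negative density
        rgb = torch.sigmoid(out[..., 1:])     # color in [0, 1]
        return sigma, rgb

num_frames = 50
field = StaticField()
# One learnable 6-DoF pose per frame (axis-angle rotation + translation),
# initialized near identity instead of coming from SfM or external sensors.
pose_params = nn.Parameter(torch.zeros(num_frames, 6))

optimizer = torch.optim.Adam([
    {"params": field.parameters(), "lr": 1e-3},
    {"params": [pose_params], "lr": 1e-4},
])
# In Stage I, rays are generated from pose_params[i], rendered through `field`,
# and the photometric / depth / flow losses are backpropagated into both the
# field weights and the poses, with dynamic pixels masked out.
```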

2. Model Architecture and Scene Representation

  • Static Model: Input $(x, d)$ is encoded (commonly using multi-resolution hash grids as in Instant-NGP) and mapped via MLPs to feature vectors and density $\sigma_s(x)$; color is predicted by a shared MLP $\text{ColorMLP}(F_s(x), d)$. Camera pose $P_i$ is a learnable parameter for each image frame.
  • Dynamic Model: For every query $(x, d, t)$, flow MLPs output both features $F_d(x, t)$ and densities $\sigma_d(x, t)$. To ensure temporal consistency and fidelity, dynamic features/densities are aggregated from temporal neighbors using scene flow:
    • $F_d^{t-1} \gets F_d(x + v_b, t-1)$, $F_d^{t+1} \gets F_d(x + v_f, t+1)$
    • The aggregated feature $\hat F_d^t$ and density $\hat \sigma_d^t$ are constructed as convex combinations of the center sample and its temporal neighbors.
  • Fusion Mechanism: The final rendered pixel blends static and dynamic predictions using a learned shadow weight $\rho(x,t) \in [0,1]$, yielding:

$$\sigma = \sigma_s + \sigma_d, \qquad c = (1-\rho)\,\frac{\sigma_s}{\sigma}\,c_s + \frac{\sigma_d}{\sigma}\,c_d$$

where $c_s$ and $c_d$ are the static and dynamic colors.
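
The temporal aggregation of the dynamic branch and the shadow-weighted fusion above reduce to plain tensor arithmetic. The sketch below is a hedged illustration: the aggregation weights and function names are assumptions, while `fuse` follows the fusion equation directly.

```python
def aggregate_temporal(field_d, x, t, v_f, v_b, w=(0.25, 0.5, 0.25)):
    """Convex combination of dynamic features/densities warped from t-1 and t+1
    via the predicted backward/forward scene flow (the weights w are an assumption)."""
    f_prev, s_prev = field_d(x + v_b, t - 1)
    f_curr, s_curr = field_d(x, t)
    f_next, s_next = field_d(x + v_f, t + 1)
    f_hat = w[0] * f_prev + w[1] * f_curr + w[2] * f_next
    s_hat = w[0] * s_prev + w[1] * s_curr + w[2] * s_next
    return f_hat, s_hat

def fuse(sigma_s, rgb_s, sigma_d, rgb_d, rho, eps=1e-8):
    """Blend static and dynamic outputs with the shadow weight rho in [0, 1]:
    sigma = sigma_s + sigma_d, color = shadow-attenuated static term + dynamic term."""
    sigma = sigma_s + sigma_d
    c = (1.0 - rho) * (sigma_s / (sigma + eps)) * rgb_s \
        + (sigma_d / (sigma + eps)) * rgb_d
    return sigma, c
```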

  • Volumetric Rendering:

$$C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(x(t))\,c(x(t),d)\,dt, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(x(s))\,ds\right)$$

where $x(t) = o + t\,d$ along ray $r$ cast from the pose $P_i$.
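
For completeness, a numerical-quadrature version of this rendering integral (standard NeRF alpha compositing, nothing VDNeRF-specific) might look like the following minimal sketch.

```python
import torch

def render_ray(sigmas, rgbs, t_vals):
    """Discretize C(r): sigmas (S, 1), rgbs (S, 3), t_vals (S,) sample depths along one ray."""
    deltas = t_vals[1:] - t_vals[:-1]
    deltas = torch.cat([deltas, deltas[-1:]])               # pad the last interval
    alpha = 1.0 - torch.exp(-sigmas.squeeze(-1) * deltas)   # per-sample opacity
    # Accumulated transmittance T(t), shifted so each sample only attenuates by earlier ones.
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = trans * alpha
    color = (weights.unsqueeze(-1) * rgbs).sum(dim=0)       # expected color C(r)
    return color, weights
```

Given per-sample `sigmas` and `rgbs` produced by the fused static/dynamic branches at depths `t_vals`, `render_ray` returns the composited pixel color and the per-sample weights (also usable for rendering depth).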

3. Self-supervised Decomposition and Pose Estimation

VDNeRF employs a staged progressive training pipeline with sub-scene partitioning to ensure high-fidelity decomposition:

  • Stage I: Static-only optimization on small sub-scene windows, jointly fitting $\mathcal{F}^s_\Theta$ and the camera poses via photometric, depth, and optical-flow regularization losses. Motion masks (e.g., derived from RoDynRF) exclude dynamic pixels from these losses to prevent entanglement.
  • Stage II: The dynamic branch and flow field are activated once static convergence is reached. Poses are frozen; the optimization now includes scene-flow regularization (cycle consistency $L_{\text{cycle}}$), dynamic sparsity $L_{\text{dynamic}}$, and shadow regularization $L_{\text{shadow}}$.
  • Stage III: The window slides forward, with the final frames of the previous window and their poses retained as overlap to ensure trajectory continuity and global consistency.

Auxiliary cues such as monocular depth priors (e.g., DPT) and optical flow (e.g., RAFT) are incorporated to address scale ambiguities and constrain motion consistency. All parameters are optimized end-to-end; flow and shadow weights are annealed to avoid over-regularization on noisy initial priors.
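
As a rough illustration of this schedule, the driver below walks a sliding window over the sequence and alternates the two optimization phases. The callback names, window length, and overlap are placeholders rather than values reported in the paper.

```python
from typing import Callable, Dict, List

def train_sequence(frames: List[int],
                   optimize_static: Callable[[List[int], Dict[int, object]], None],
                   optimize_dynamic: Callable[[List[int], Dict[int, object]], None],
                   window: int = 20, overlap: int = 4) -> Dict[int, object]:
    """Staged, sliding-window training driver (illustrative, not the paper's API)."""
    poses: Dict[int, object] = {}
    start = 0
    while start < len(frames):
        window_frames = frames[start:start + window]
        # Stage I: fit the static field and the new poses on this sub-scene,
        # masking dynamic pixels; poses of overlapping frames are reused.
        optimize_static(window_frames, poses)
        # Stage II: poses frozen; dynamic branch and scene flow trained with
        # cycle-consistency, sparsity, and shadow regularizers.
        optimize_dynamic(window_frames, poses)
        # Stage III: slide the window, keeping `overlap` frames (and poses) shared.
        start += window - overlap
    return poses
```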

4. Learning Objectives and Regularization

Loss terms are explicitly constructed for decomposed optimization:

  • Photometric loss (color reconstruction): $L_{\text{color}} = \sum_r \| \hat C(r) - C_{\text{gt}}(r)\|_2^2$
  • Depth regularization: $L_{\text{depth}} = \sum_r \|\hat D^*(r) - D^*(r)\|_2^2$ (using scale-normalized rendered and pretrained depths)
  • Optical flow matching: $L_{\text{flow2D}}$ matches NeRF-projected 2D flows to pretrained flow estimates.
  • Dynamic-specific losses:
    • Cycle consistency: $L_{\text{cycle}}$ penalizes discrepancies in forward-backward scene-flow cycles.
    • Dynamic sparsity: $L_{\text{dynamic}}$ encourages minimal density in the dynamic branch, preventing over-segmentation.
    • Shadow regularization: $L_{\text{shadow}}$ penalizes overuse of the dynamic channel.

Weights are annealed through training stages, with masking and freezing schedules ensuring disentanglement of camera and object motion.
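
A compact way to picture the weighting and annealing is a stage-aware total loss such as the sketch below; the coefficients and the linear annealing schedule are illustrative assumptions, not the paper's settings.

```python
def total_loss(losses, step, stage, anneal_steps=5000):
    """losses: dict of scalar loss tensors keyed by
    'color', 'depth', 'flow2d', 'cycle', 'dynamic', 'shadow'."""
    # Prior-based terms (monocular depth, 2D flow) are annealed so that noisy
    # pretrained priors do not dominate once the fields are well fit.
    anneal = max(0.0, 1.0 - step / anneal_steps)
    total = losses["color"] + anneal * (0.1 * losses["depth"] + 0.1 * losses["flow2d"])
    if stage >= 2:  # dynamic branch active (Stage II onward)
        total = total + 0.05 * losses["cycle"] \
                      + 0.01 * losses["dynamic"] \
                      + 0.01 * losses["shadow"]
    return total
```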

5. Experimental Results and Benchmarking

VDNeRF is evaluated on large-scale urban driving datasets such as NOTR (Waymo-derived) and Pandaset, with up to 200 frames per sequence.

  • Novel View Synthesis (Zou et al., 9 Nov 2025):
    • On NOTR: VDNeRF yields a PSNR increase of +2.12 dB and an LPIPS reduction of 0.057 over the next-best method.
    • On Pandaset: Gains of +4.20 dB PSNR and +0.088 SSIM.
  • Camera Pose Estimation:
    • Absolute Trajectory Error (ATE) on NOTR: VDNeRF achieves ~0.33 m vs. ~1.28 m for LocalRF.
    • EmerNeRF, despite using oracle ground-truth poses, is outperformed by VDNeRF in dynamic-region fidelity even though VDNeRF receives no pose input.
  • Ablation: Masking dynamics during pose optimization, freezing poses when activating the dynamic branch, and sub-scene partitioning are all critical for high final accuracy.

6. Strengths, Limitations, and Extensions

  • Strengths:
    • Fully vision-only operation—no LiDAR, GPS, or pose sensors.
    • Robust pose estimation even in long, dynamic urban sequences.
    • High-fidelity separation and rendering of moving objects via explicit 3D scene flow.
    • Scalability with sub-scene overlapping windows for arbitrarily long environments.
  • Limitations:
    • Failure modes under extreme rotational camera motion (in-place spins) due to lack of view overlap.
    • Coarse granularity of the shadow weight $\rho$ can lead to over-segmentation in clustered dynamic objects.
    • Runtime: Approximately 12 hours per 200 frames on a single RTX 3090 GPU, limiting real-time application.
  • Potential Extensions:
    • Incorporation of continual-learning regimes for cross-segment adaptation.
    • Replacement of heuristic motion masks with learned dynamic attention mechanisms.
    • Multi-scale/Block-NeRF-style architectures for vastly expanded urban coverage.

Taken together, these results suggest that explicit static/dynamic decomposition, self-supervised scene flow, and multi-stage training are central to overcoming monocular ambiguities and achieving state-of-the-art performance in vision-only dynamic NeRFs for complex urban environments.
