ExpanDyNeRF: Monocular Dynamic Scene Reconstruction

Updated 18 December 2025
  • The paper introduces a dual-branch NeRF architecture with a pseudo–ground-truth loss derived from 3D Gaussian splatting priors to achieve novel view synthesis despite extreme viewpoint deviations.
  • It employs a rigorous training pipeline combining temporal continuity, photometric, and super-resolution losses to robustly separate and reconstruct static and dynamic scene elements.
  • Empirical results show significant improvements over previous methods, with superior PSNR, FID, and LPIPS metrics on both synthetic and real-world dynamic scenes.

Expanded Dynamic NeRF (ExpanDyNeRF) is a monocular dynamic Neural Radiance Field (NeRF) framework designed for high-fidelity novel view synthesis of dynamic scenes under significant viewpoint deviations. The system advances monocular-only dynamic NeRFs by leveraging 3D Gaussian splatting priors and a pseudo–ground-truth generation strategy, achieving reliable scene reconstruction and photo-realistic renderings from a single, predominantly forward-facing camera even at viewpoint shifts of ±45° and beyond (Jiang et al., 16 Dec 2025).

1. Architecture and System Components

ExpanDyNeRF is predicated on a dual-branch NeRF backbone augmented with a novel pseudo–ground-truth loss from Gaussian splatting priors to address the failure modes of prior systems under large view deviations.

  • NeRF Backbone: The architecture separates static and dynamic scene elements (a minimal composition sketch follows this list).
    • Static Branch ($\mathcal{N}_b$): Processes all frames $I_t$, learning a shared time-invariant density $\sigma_b(x)$ and color field $c_b(x, d)$.
    • Dynamic Branch ($\mathcal{N}_f$): Implements a 3-frame sliding-window NeRF; it takes $(x, d, t)$ and predicts time-varying density $\sigma_f(x, d, t)$ and appearance $c_f(x, d, t)$. Temporal coherence is enforced through an $L_2$ loss on adjacent densities.
  • Super-Resolution Module: Low-resolution patches $\hat{Q}_k$ from volume rendering are upsampled with a pretrained SR model and compared with high-resolution ground-truth patches $Q_k$.
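
The following is a minimal PyTorch-style sketch of how the two branches might be queried and composited along a single ray. The additive density blend, the `static_nerf`/`dynamic_nerf` interfaces, and the compositing details are illustrative assumptions; the summary above only states that the final render combines $\mathcal{N}_b$ and $\mathcal{N}_f$.

```python
import torch

def composite_ray(static_nerf, dynamic_nerf, x, d, t, deltas):
    """Render one ray by blending static and dynamic branch outputs.

    x:      (S, 3) sample positions along the ray
    d:      (3,)   viewing direction
    t:      scalar frame time
    deltas: (S,)   distances between adjacent samples

    The additive density blend below is an assumption made for this sketch.
    """
    sigma_b, c_b = static_nerf(x, d)        # time-invariant density / color
    sigma_f, c_f = dynamic_nerf(x, d, t)    # time-varying density / color

    sigma = sigma_b + sigma_f               # assumed blend of the two fields
    # density-weighted color mix (a common choice for two-field NeRFs)
    c = (sigma_b[..., None] * c_b + sigma_f[..., None] * c_f) / (sigma[..., None] + 1e-8)

    alpha = 1.0 - torch.exp(-sigma * deltas)                          # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10], dim=0)[:-1], dim=0)
    weights = alpha * trans                                           # volume-rendering weights
    return (weights[..., None] * c).sum(dim=0)                        # final pixel color
```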

The training process alternates or jointly optimizes backbone NeRF losses (on monocular views) with losses informed by the novel-view Gaussian priors.

2. Gaussian Splatting Priors and Alignment

Central to ExpanDyNeRF is the use of per-frame 3D Gaussian splatting priors, generated via FreeSplatter. Each foreground object is converted into a set of anisotropic Gaussians, yielding a volumetric proxy for its appearance and geometry.

  • Novel View Dome Sampling: For each frame, the system samples a dome of camera poses $P_{nv}^G(e, \phi)$ spanning $e \in \{0^\circ, 15^\circ, 30^\circ\}$ (elevation) and $\phi \in [-45^\circ, +45^\circ]$ (azimuth) in 5° steps, at a fixed camera–object distance.
  • Pseudo–Ground-Truth Generation: The Gaussian splat scene is rendered from each novel pose to produce a color $c^*(r)$ and density $\sigma^*(r)$ for each ray.
  • Coordinate Alignment: Each novel pose is mapped into the NeRF world frame through a rigid transformation $T = P_f^{NeRF} \cdot (P^G)^{-1}$, yielding aligned rays for cross-domain supervision (a geometric sketch follows this list).
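
The dome sampling and pose alignment can be sketched as follows. The look-at camera parameterization, the fixed radius, the axis conventions, and the interpretation of $P^G$ as the splat-space pose of frame $f$ are assumptions not specified in the summary above.

```python
import numpy as np

def dome_poses(radius, elevations=(0, 15, 30), az_range=(-45, 45), az_step=5):
    """Sample camera-to-world poses on a dome around the object origin."""
    poses = []
    for e in elevations:
        for a in np.arange(az_range[0], az_range[1] + az_step, az_step):
            el, az = np.radians(e), np.radians(a)
            # camera center on a sphere of fixed radius (assumed parameterization)
            cam = radius * np.array([np.cos(el) * np.sin(az),
                                     np.sin(el),
                                     np.cos(el) * np.cos(az)])
            forward = -cam / np.linalg.norm(cam)                 # look at the origin
            right = np.cross(forward, np.array([0.0, 1.0, 0.0]))
            right /= np.linalg.norm(right)
            up = np.cross(right, forward)
            pose = np.eye(4)
            # camera-to-world rotation; the (right, up, forward) column convention is assumed
            pose[:3, :3] = np.stack([right, up, forward], axis=1)
            pose[:3, 3] = cam
            poses.append(pose)
    return poses  # 3 elevations x 19 azimuths = 57 poses per frame

def align_to_nerf(pose_nv_G, pose_f_nerf, pose_f_G):
    """Map a novel splat-space pose into the NeRF world frame via
    T = P_f^NeRF (P_f^G)^-1, then apply T to the novel pose."""
    T = pose_f_nerf @ np.linalg.inv(pose_f_G)
    return T @ pose_nv_G
```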

3. Loss Functions and Training Objective

The total objective is a composite of temporal, photometric, perceptual super-resolution, and novel-view supervision losses (a sketch of the individual terms follows the list below):

$$L = L_{cont} + L_{rec} + L_{sr} + L_{nv}$$

  • Temporal Continuity: $L_{cont} = \mathbb{E}_t \|\sigma_f(x, d, t+1) - \sigma_f(x, d, t)\|_2^2$
  • Primary-View Reconstruction: $L_{rec} = \sum_{t=1}^N \sum_{r \in R_t} \|I_t(r) - \hat{I}_t(r)\|_2^2$, using the output of $\mathcal{N}_b \oplus \mathcal{N}_f$
  • Super-Resolution Perceptual Loss: $L_{sr} = \sum_{k=1}^K \sum_{l \in \text{layers}} \|\phi_l(\hat{Q}_k) - \phi_l(Q_k)\|_2^2 / |\phi_l|$, using a VGG-19 feature extractor
  • Novel-View Pseudo–Ground-Truth Loss:
    • $L_{nv} = L_c^{nv} + L_\sigma^{nv}$, with
    • $L_c^{nv} = \sum_{r \in R_{nv}} \|c_f(r) - c^*(r)\|_2^2$
    • $L_\sigma^{nv} = \sum_{r \in R_{nv}} \|\sigma_f(r) - \sigma^*(r)\|_2^2$
    • This loss is omitted for the initial $E_0$ epochs to avoid instability, then fully incorporated.
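
A PyTorch-style sketch of the individual terms as defined above; tensor shapes, batching, and the `vgg_features` interface (assumed to return a list of VGG-19 feature maps) are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def temporal_continuity_loss(sigma_f_t, sigma_f_t1):
    """L_cont: L2 between dynamic densities of adjacent frames at shared samples."""
    return ((sigma_f_t1 - sigma_f_t) ** 2).mean()

def reconstruction_loss(pred_rgb, gt_rgb):
    """L_rec: photometric error of composited renders against the monocular frames."""
    return ((pred_rgb - gt_rgb) ** 2).sum(dim=-1).mean()

def sr_perceptual_loss(vgg_features, sr_patch, gt_patch):
    """L_sr: VGG-19 feature distance between upsampled and ground-truth patches,
    normalized by the number of elements in each feature map."""
    loss = 0.0
    for feat_hat, feat_gt in zip(vgg_features(sr_patch), vgg_features(gt_patch)):
        loss = loss + F.mse_loss(feat_hat, feat_gt, reduction="sum") / feat_hat.numel()
    return loss

def novel_view_loss(c_f, sigma_f, c_star, sigma_star):
    """L_nv: color and density supervision from Gaussian-splat pseudo ground truth
    evaluated on the aligned novel-view rays."""
    l_c = ((c_f - c_star) ** 2).sum(dim=-1).mean()
    l_sigma = ((sigma_f - sigma_star) ** 2).mean()
    return l_c, l_sigma
```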

Hyperparameters are typically $\lambda_{cont}=1.0$, $\lambda_{rec}=1.0$, $\lambda_{sr}=0.5$, and $\lambda_c^{nv}=\lambda_\sigma^{nv}=0.1$.
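
Combining the terms with the reported weights and the warm-up on $L_{nv}$ might look like the following; the function signature and the epoch-based gate are assumptions consistent with the values stated here and in Section 6.

```python
def total_loss(terms, epoch, e0=50,
               w_cont=1.0, w_rec=1.0, w_sr=0.5, w_c_nv=0.1, w_sigma_nv=0.1):
    """Weighted composite objective; the pseudo-GT terms are disabled for the
    first e0 warm-up epochs (assumed gating of L_nv)."""
    loss = (w_cont * terms["cont"]
            + w_rec * terms["rec"]
            + w_sr * terms["sr"])
    if epoch >= e0:  # pseudo-ground-truth supervision only after warm-up
        loss = loss + w_c_nv * terms["c_nv"] + w_sigma_nv * terms["sigma_nv"]
    return loss
```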

4. Synthetic Dynamic Multiview (SynDM) Dataset

ExpanDyNeRF’s novel-view supervision is enabled by SynDM, a synthetic multi-view dataset purpose-built using a GTA V-based rendering pipeline.

Key properties of SynDM:

Attribute             Details
Multi-view sync       22 cameras (±45° azimuth at 5° steps, 3 at ±45° elevation)
Temporal consistency  All cameras render in a single engine frame (0.2 ms latency)
Scene variety         Dynamic scenes (humans, vehicles, animals, unconstrained)
Camera motion         Main camera exhibits natural motion
Resolution / FOV      1920×1080 pixels, 90°×59° FOV
Supervision splits    Train: first 24 frames; Test: 12 novel cams (±5°…±30°)

This dataset facilitates held-out, large-deviation ground truth for rigorous evaluation beyond small baseline shifts.
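
The camera layout and splits can be reproduced approximately from the table above; the exact indexing of the 22 cameras and of the 12 held-out test views is an assumption inferred from the stated ranges.

```python
import numpy as np

# 19 synchronized cameras on the ±45° azimuth arc at 5° steps,
# plus 3 additional elevated cameras (layout assumed from the table above).
azimuth_ring = list(np.arange(-45, 46, 5))   # 19 azimuth angles
assert len(azimuth_ring) == 19

# Test cameras: 12 held-out views between ±5° and ±30° azimuth (assumed selection)
test_azimuths = [a for a in azimuth_ring if 5 <= abs(a) <= 30]
assert len(test_azimuths) == 12

train_frames = range(24)                     # first 24 frames used for training
```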

5. Empirical Performance and Ablations

ExpanDyNeRF demonstrates superior performance to state-of-the-art dynamic NeRF and splatting methods on both SynDM and real-world datasets, particularly for large azimuth deviations.

Quantitative summary (SynDM, avg. over 9 scenes, ±30° test):

Method        FID ↓   PSNR ↑   LPIPS ↓
D3DGS         223.6   16.74    0.373
DecNeRF       225.7   15.78    0.516
RoDynRF       238.2   18.99    0.338
DetRF         239.0   20.34    0.436
ExpanDyNeRF   142.7   20.86    0.209

On the real-world DyNeRF dataset:

  • Scene “Coffee”: FID=132.4, PSNR=30.32, LPIPS=0.189 (a 2–6 dB PSNR gain over the best prior method)
  • Scene “Beef”: FID=135.8, PSNR=34.92, LPIPS=0.195

On the NVIDIA dynamic scenes dataset, ExpanDyNeRF is a close second (PSNR ≈29–30, LPIPS ≈0.03–0.08), indicating robust generalization.

Qualitatively, baseline models exhibit severe artifacts (fragmented geometry, “cardboard” flattening, misplaced limbs, or collapse of thin structures) when tested at ±30° deviations. ExpanDyNeRF uniquely maintains sharp geometry, correct occlusion, and temporal color stability under these conditions.

6. Implementation Considerations

  • Training is performed scene-wise for 300k iterations on dual NVIDIA A100 GPUs (approx. 15 h/scene).
  • Gaussian splatting priors are generated once per frame with negligible overhead.
  • The pseudo–ground-truth loss $L_{nv}$ is introduced only after $E_0 = 50$ warm-up epochs to allow the backbone to stabilize.

This schedule avoids the gradient pathologies caused by noisy or unstable priors early in optimization.
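
A sketch of the one-time prior generation described above; `fit_gaussians` (standing in for FreeSplatter) and `render_splats` are hypothetical interfaces used only to illustrate the per-frame caching pattern.

```python
def precompute_pseudo_gt(frames, dome, fit_gaussians, render_splats):
    """Generate Gaussian splatting priors once per frame and cache the
    pseudo-ground-truth renders for all novel dome poses."""
    cache = {}
    for t, frame in enumerate(frames):
        splats = fit_gaussians(frame)                 # per-frame Gaussian prior
        cache[t] = [render_splats(splats, pose)       # (c*, sigma*) per novel pose
                    for pose in dome]
    return cache
```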

7. Significance and Limitations

ExpanDyNeRF extends the applicability of dynamic NeRF frameworks to unconstrained, real-world video where only monocular, forward-facing camera input is available, and surpasses prior work in both metrics and qualitative stability under extreme viewpoint changes. The explicit use of 3D Gaussian splatting as a regularizing proxy, combined with the newly introduced SynDM dataset for large-deviation ground truth, establishes novel methodological and empirical standards in the field (Jiang et al., 16 Dec 2025).

A plausible implication is that while ExpanDyNeRF robustly outperforms competitors in wide-baseline evaluation, its reliance on synthetically generated pseudo–ground-truth supervision may limit effectiveness where no such proxy can be generated or where the alignment between the proxy and the real scene is poor. Nonetheless, its ability to preserve geometry and appearance under challenging view-synthesis scenarios suggests that its core techniques may inspire further advances in monocular dynamic scene reconstruction.
