ExpanDyNeRF: Monocular Dynamic Scene Reconstruction

Updated 18 December 2025
  • The paper introduces a dual-branch NeRF architecture with a pseudo–ground-truth loss derived from 3D Gaussian splatting priors to achieve novel view synthesis despite extreme viewpoint deviations.
  • It employs a rigorous training pipeline combining temporal continuity, photometric, and super-resolution losses to robustly separate and reconstruct static and dynamic scene elements.
  • Empirical results show significant improvements over previous methods, with superior PSNR, FID, and LPIPS metrics on both synthetic and real-world dynamic scenes.

Expanded Dynamic NeRF (ExpanDyNeRF) is a monocular dynamic Neural Radiance Field (NeRF) framework designed for high-fidelity novel view synthesis of dynamic scenes under significant viewpoint deviations. The system advances monocular-only dynamic NeRFs by leveraging 3D Gaussian splatting priors and a pseudo–ground-truth generation strategy, achieving reliable scene reconstruction and photo-realistic renderings from a single, predominantly forward-facing camera even at viewpoint shifts of ±45° and beyond (Jiang et al., 16 Dec 2025).

1. Architecture and System Components

ExpanDyNeRF is predicated on a dual-branch NeRF backbone augmented with a novel pseudo–ground-truth loss from Gaussian splatting priors to address the failure modes of prior systems under large view deviations.

  • NeRF Backbone: The architecture separates static and dynamic scene elements (a minimal composition sketch follows this list).
    • Static Branch ($\mathcal{N}_b$): Processes all frames $I_t$, learning a shared time-invariant density $\sigma_b(x)$ and color field $c_b(x, d)$.
    • Dynamic Branch ($\mathcal{N}_f$): Implements a 3-frame sliding-window NeRF; it takes $(x, d, t)$ and predicts time-varying density $\sigma_f(x, d, t)$ and appearance $c_f(x, d, t)$. Temporal coherence is enforced through an $L_2$ loss on adjacent densities.
  • Super-Resolution Module: Low-resolution patches $\hat{Q}_k$ from volume rendering are upsampled with a pretrained SR model and compared with high-resolution ground-truth patches $Q_k$.
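
The following is a minimal PyTorch-style sketch of how the two branches might be queried and composited along a single ray. The additive density blend, the `static_nerf`/`dynamic_nerf` interfaces, and the compositing details are illustrative assumptions; the summary above only states that the final render combines $\mathcal{N}_b$ and $\mathcal{N}_f$.

```python
import torch

def composite_ray(static_nerf, dynamic_nerf, x, d, t, deltas):
    """Render one ray by blending static and dynamic branch outputs.

    x:      (S, 3) sample positions along the ray
    d:      (3,)   viewing direction
    t:      scalar frame time
    deltas: (S,)   distances between adjacent samples

    The additive density blend below is an assumption made for this sketch.
    """
    sigma_b, c_b = static_nerf(x, d)        # time-invariant density / color
    sigma_f, c_f = dynamic_nerf(x, d, t)    # time-varying density / color

    sigma = sigma_b + sigma_f               # assumed blend of the two fields
    # density-weighted color mix (a common choice for two-field NeRFs)
    c = (sigma_b[..., None] * c_b + sigma_f[..., None] * c_f) / (sigma[..., None] + 1e-8)

    alpha = 1.0 - torch.exp(-sigma * deltas)                          # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10], dim=0)[:-1], dim=0)
    weights = alpha * trans                                           # volume-rendering weights
    return (weights[..., None] * c).sum(dim=0)                        # final pixel color
```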

The training process alternates or jointly optimizes backbone NeRF losses (on monocular views) with losses informed by the novel-view Gaussian priors.

2. Gaussian Splatting Priors and Alignment

Central to ExpanDyNeRF is the use of per-frame 3D Gaussian splatting priors, generated via FreeSplatter. Each foreground object is converted into a set of anisotropic Gaussians, yielding a volumetric proxy for its appearance and geometry.

  • Novel View Dome Sampling: For each frame, the system samples a dome of camera poses $P_{nv}^G(e, \phi)$ spanning $e \in \{0^\circ, 15^\circ, 30^\circ\}$ (elevation) and $\phi \in [-45^\circ, +45^\circ]$ (azimuth) in 5° steps, at a fixed camera–object distance.
  • Pseudo–Ground-Truth Generation: The Gaussian splat scene is rendered from each novel pose to produce a color $c^*(r)$ and density $\sigma^*(r)$ for each ray.
  • Coordinate Alignment: Each novel pose is mapped into the NeRF world frame through a rigid transformation $T = P_f^{NeRF} \cdot (P^G)^{-1}$, yielding aligned rays for cross-domain supervision (a geometric sketch follows this list).
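
The dome sampling and pose alignment can be sketched as follows. The look-at camera parameterization, the fixed radius, the axis conventions, and the interpretation of $P^G$ as the splat-space pose of frame $f$ are assumptions not specified in the summary above.

```python
import numpy as np

def dome_poses(radius, elevations=(0, 15, 30), az_range=(-45, 45), az_step=5):
    """Sample camera-to-world poses on a dome around the object origin."""
    poses = []
    for e in elevations:
        for a in np.arange(az_range[0], az_range[1] + az_step, az_step):
            el, az = np.radians(e), np.radians(a)
            # camera center on a sphere of fixed radius (assumed parameterization)
            cam = radius * np.array([np.cos(el) * np.sin(az),
                                     np.sin(el),
                                     np.cos(el) * np.cos(az)])
            forward = -cam / np.linalg.norm(cam)                 # look at the origin
            right = np.cross(forward, np.array([0.0, 1.0, 0.0]))
            right /= np.linalg.norm(right)
            up = np.cross(right, forward)
            pose = np.eye(4)
            # camera-to-world rotation; the (right, up, forward) column convention is assumed
            pose[:3, :3] = np.stack([right, up, forward], axis=1)
            pose[:3, 3] = cam
            poses.append(pose)
    return poses  # 3 elevations x 19 azimuths = 57 poses per frame

def align_to_nerf(pose_nv_G, pose_f_nerf, pose_f_G):
    """Map a novel splat-space pose into the NeRF world frame via
    T = P_f^NeRF (P_f^G)^-1, then apply T to the novel pose."""
    T = pose_f_nerf @ np.linalg.inv(pose_f_G)
    return T @ pose_nv_G
```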

3. Loss Functions and Training Objective

The total objective is a composite of temporal, photometric, perceptual super-resolution, and novel-view supervision losses (a sketch of the individual terms follows the list below):

$$L = L_{cont} + L_{rec} + L_{sr} + L_{nv}$$

  • Temporal Continuity: $L_{cont} = \mathbb{E}_t \|\sigma_f(x, d, t+1) - \sigma_f(x, d, t)\|_2^2$
  • Primary-View Reconstruction: $L_{rec} = \sum_{t=1}^N \sum_{r \in R_t} \|I_t(r) - \hat{I}_t(r)\|_2^2$, using the output of $\mathcal{N}_b \oplus \mathcal{N}_f$
  • Super-Resolution Perceptual Loss: $L_{sr} = \sum_{k=1}^K \sum_{l \in \text{layers}} \|\phi_l(\hat{Q}_k) - \phi_l(Q_k)\|_2^2 / |\phi_l|$, using a VGG-19 feature extractor
  • Novel-View Pseudo–Ground-Truth Loss:
    • $L_{nv} = L_c^{nv} + L_\sigma^{nv}$, with
    • $L_c^{nv} = \sum_{r \in R_{nv}} \|c_f(r) - c^*(r)\|_2^2$
    • $L_\sigma^{nv} = \sum_{r \in R_{nv}} \|\sigma_f(r) - \sigma^*(r)\|_2^2$
    • This loss is omitted for the initial $E_0$ epochs to avoid instability, then fully incorporated.
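
A PyTorch-style sketch of the individual terms as defined above; tensor shapes, batching, and the `vgg_features` interface (assumed to return a list of VGG-19 feature maps) are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def temporal_continuity_loss(sigma_f_t, sigma_f_t1):
    """L_cont: L2 between dynamic densities of adjacent frames at shared samples."""
    return ((sigma_f_t1 - sigma_f_t) ** 2).mean()

def reconstruction_loss(pred_rgb, gt_rgb):
    """L_rec: photometric error of composited renders against the monocular frames."""
    return ((pred_rgb - gt_rgb) ** 2).sum(dim=-1).mean()

def sr_perceptual_loss(vgg_features, sr_patch, gt_patch):
    """L_sr: VGG-19 feature distance between upsampled and ground-truth patches,
    normalized by the number of elements in each feature map."""
    loss = 0.0
    for feat_hat, feat_gt in zip(vgg_features(sr_patch), vgg_features(gt_patch)):
        loss = loss + F.mse_loss(feat_hat, feat_gt, reduction="sum") / feat_hat.numel()
    return loss

def novel_view_loss(c_f, sigma_f, c_star, sigma_star):
    """L_nv: color and density supervision from Gaussian-splat pseudo ground truth
    evaluated on the aligned novel-view rays."""
    l_c = ((c_f - c_star) ** 2).sum(dim=-1).mean()
    l_sigma = ((sigma_f - sigma_star) ** 2).mean()
    return l_c, l_sigma
```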

Hyperparameters are typically $\lambda_{cont}=1.0$, $\lambda_{rec}=1.0$, $\lambda_{sr}=0.5$, and $\lambda_c^{nv}=\lambda_\sigma^{nv}=0.1$.
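
Combining the terms with the reported weights and the warm-up on $L_{nv}$ might look like the following; the function signature and the epoch-based gate are assumptions consistent with the values stated here and in Section 6.

```python
def total_loss(terms, epoch, e0=50,
               w_cont=1.0, w_rec=1.0, w_sr=0.5, w_c_nv=0.1, w_sigma_nv=0.1):
    """Weighted composite objective; the pseudo-GT terms are disabled for the
    first e0 warm-up epochs (assumed gating of L_nv)."""
    loss = (w_cont * terms["cont"]
            + w_rec * terms["rec"]
            + w_sr * terms["sr"])
    if epoch >= e0:  # pseudo-ground-truth supervision only after warm-up
        loss = loss + w_c_nv * terms["c_nv"] + w_sigma_nv * terms["sigma_nv"]
    return loss
```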

4. Synthetic Dynamic Multiview (SynDM) Dataset

ExpanDyNeRF’s novel-view supervision is enabled by SynDM, a synthetic multi-view dataset purpose-built using a GTA V-based rendering pipeline.

Key properties of SynDM:

Attribute             Details
Multi-view sync       22 cameras (±45° azimuth at 5° steps, 3 at ±45° elevation)
Temporal consistency  All cameras render in a single engine frame (0.2 ms latency)
Scene variety         Dynamic scenes (humans, vehicles, animals, unconstrained)
Camera motion         Main camera exhibits natural motion
Resolution / FOV      1920×1080 pixels, 90°×59° FOV
Supervision splits    Train: first 24 frames; Test: 12 novel cams (±5°…±30°)

This dataset facilitates held-out, large-deviation ground truth for rigorous evaluation beyond small baseline shifts.
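
The camera layout and splits can be reproduced approximately from the table above; the exact indexing of the 22 cameras and of the 12 held-out test views is an assumption inferred from the stated ranges.

```python
import numpy as np

# 19 synchronized cameras on the ±45° azimuth arc at 5° steps,
# plus 3 additional elevated cameras (layout assumed from the table above).
azimuth_ring = list(np.arange(-45, 46, 5))   # 19 azimuth angles
assert len(azimuth_ring) == 19

# Test cameras: 12 held-out views between ±5° and ±30° azimuth (assumed selection)
test_azimuths = [a for a in azimuth_ring if 5 <= abs(a) <= 30]
assert len(test_azimuths) == 12

train_frames = range(24)                     # first 24 frames used for training
```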

5. Empirical Performance and Ablations

ExpanDyNeRF demonstrates superior performance to state-of-the-art dynamic NeRF and splatting methods on both SynDM and real-world datasets, particularly for large azimuth deviations.

Quantitative summary (SynDM, avg. over 9 scenes, ±30° test):

Method        FID ↓   PSNR ↑   LPIPS ↓
D3DGS         223.6   16.74    0.373
DecNeRF       225.7   15.78    0.516
RoDynRF       238.2   18.99    0.338
DetRF         239.0   20.34    0.436
ExpanDyNeRF   142.7   20.86    0.209

On the real-world DyNeRF dataset:

  • Scene “Coffee”: FID=132.4, PSNR=30.32, LPIPS=0.189 (a 2–6 dB PSNR gain over the best prior method)
  • Scene “Beef”: FID=135.8, PSNR=34.92, LPIPS=0.195

On the NVIDIA dynamic scenes dataset, ExpanDyNeRF is a close second (PSNR ≈29–30, LPIPS ≈0.03–0.08), indicating robust generalization.

Qualitatively, baseline models exhibit severe artifacts (fragmented geometry, “cardboard” flattening, misplaced limbs, or collapse of thin structures) when tested at ±30° deviations. ExpanDyNeRF uniquely maintains sharp geometry, correct occlusion, and temporal color stability under these conditions.

6. Implementation Considerations

  • Training is performed scene-wise for 300k iterations on dual NVIDIA A100 GPUs (approx. 15 h/scene).
  • Gaussian splatting priors are generated once per frame with negligible overhead.
  • The pseudo–ground-truth loss $L_{nv}$ is introduced only after $E_0 = 50$ warm-up epochs to allow the backbone to stabilize.

This schedule avoids the gradient pathologies caused by noisy or unstable priors early in optimization.
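
A sketch of the one-time prior generation described above; `fit_gaussians` (standing in for FreeSplatter) and `render_splats` are hypothetical interfaces used only to illustrate the per-frame caching pattern.

```python
def precompute_pseudo_gt(frames, dome, fit_gaussians, render_splats):
    """Generate Gaussian splatting priors once per frame and cache the
    pseudo-ground-truth renders for all novel dome poses."""
    cache = {}
    for t, frame in enumerate(frames):
        splats = fit_gaussians(frame)                 # per-frame Gaussian prior
        cache[t] = [render_splats(splats, pose)       # (c*, sigma*) per novel pose
                    for pose in dome]
    return cache
```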

7. Significance and Limitations

ExpanDyNeRF extends the applicability of dynamic NeRF frameworks to unconstrained, real-world video where only monocular, forward-facing camera input is available, and surpasses prior work in both metrics and qualitative stability under extreme viewpoint changes. The explicit use of 3D Gaussian splatting as a regularizing proxy, combined with the newly introduced SynDM dataset for large-deviation ground truth, establishes novel methodological and empirical standards in the field (Jiang et al., 16 Dec 2025).

A plausible implication is that while ExpanDyNeRF robustly outperforms competitors in wide-baseline evaluation, its reliance on synthetically generated pseudo–ground-truth supervision may limit effectiveness where no such proxy can be generated or where the alignment between the proxy and the real scene is poor. Nonetheless, its ability to preserve geometry and appearance under challenging view-synthesis scenarios suggests that its core techniques may inspire further advances in monocular dynamic scene reconstruction.
