ExpanDyNeRF: Monocular Dynamic Scene Reconstruction
- The paper introduces a dual-branch NeRF architecture with pseudo–ground-truth loss and 3D Gaussian splatting priors to achieve novel view synthesis despite extreme viewpoint deviations.
- It employs a rigorous training pipeline combining temporal continuity, photometric, and super-resolution losses to robustly separate and reconstruct static and dynamic scene elements.
- Empirical results show significant improvements over previous methods, with superior PSNR, FID, and LPIPS metrics on both synthetic and real-world dynamic scenes.
Expanded Dynamic NeRF (ExpanDyNeRF) is a monocular dynamic Neural Radiance Field (NeRF) framework designed for high-fidelity novel view synthesis of dynamic scenes under significant viewpoint deviations. The system advances the capability of monocular-only dynamic NeRFs by leveraging 3D Gaussian splatting priors and a pseudo–ground-truth generation strategy, achieving reliable scene reconstruction and photo-realistic renderings even when capturing scenes from a single, predominantly forward-facing camera with extreme viewpoint shifts up to and beyond ±45° (Jiang et al., 16 Dec 2025).
1. Architecture and System Components
ExpanDyNeRF is predicated on a dual-branch NeRF backbone augmented with a novel pseudo–ground-truth loss from Gaussian splatting priors to address the failure modes of prior systems under large view deviations.
- NeRF Backbone: The architecture separates static and dynamic scene elements (a minimal sketch follows this list).
- Static Branch ($F_s$): Processes all frames $\{I_t\}$, learning a shared time-invariant density and color field $\sigma_s(\mathbf{x}), \mathbf{c}_s(\mathbf{x}, \mathbf{d})$.
- Dynamic Branch ($F_d$): Implements a 3-frame sliding-window NeRF; it takes $(\mathbf{x}, \mathbf{d}, t)$ and predicts a time-varying density and appearance $\sigma_d(\mathbf{x}, t), \mathbf{c}_d(\mathbf{x}, \mathbf{d}, t)$. Temporal coherence is enforced through an $L_1$ loss on adjacent densities.
- Super-Resolution Module: Low-resolution patches $\hat{P}_{lr}$ from volume rendering are upsampled with a pretrained SR model and compared with high-resolution ground-truth patches $P_{hr}$.
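For concreteness, a minimal PyTorch-style sketch of the dual-branch split is given below. The layer widths, encoding dimensions, and the density-weighted color blend are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """Small MLP mapping an encoded input to (density, RGB)."""
    def __init__(self, in_dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 1 density + 3 color channels
        )

    def forward(self, x: torch.Tensor):
        out = self.mlp(x)
        sigma = torch.relu(out[..., :1])    # non-negative density
        rgb = torch.sigmoid(out[..., 1:])   # colors in [0, 1]
        return sigma, rgb

class DualBranchNeRF(nn.Module):
    """Static branch sees encoded (x, d); dynamic branch also sees time t."""
    def __init__(self, pos_dim: int = 63, dir_dim: int = 27):
        super().__init__()
        self.static = Branch(pos_dim + dir_dim)
        self.dynamic = Branch(pos_dim + dir_dim + 1)  # +1 time channel

    def forward(self, x_enc, d_enc, t):
        # x_enc/d_enc are positionally encoded inputs; t has shape [..., 1].
        sigma_s, rgb_s = self.static(torch.cat([x_enc, d_enc], dim=-1))
        sigma_d, rgb_d = self.dynamic(torch.cat([x_enc, d_enc, t], dim=-1))
        # Compose: densities add, colors blend by the dynamic density share
        # (one common composition choice, assumed here).
        sigma = sigma_s + sigma_d
        w = sigma_d / (sigma + 1e-8)
        rgb = (1.0 - w) * rgb_s + w * rgb_d
        return sigma, rgb
```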
The training process alternates or jointly optimizes backbone NeRF losses (on monocular views) with losses informed by the novel-view Gaussian priors.
2. Gaussian Splatting Priors and Alignment
Foremost in ExpanDyNeRF is the use of per-frame 3D Gaussian splatting priors, generated via FreeSplatter. Each foreground object is converted into a set of anisotropic Gaussians, yielding a volumetric proxy for appearance and geometry.
- Novel View Dome Sampling: For each frame, the system samples a dome of camera poses spanning a range of elevations and azimuths in 5° steps, with the camera–object distance held fixed.
- Pseudo–Ground-Truth Generation: The Gaussian splat scene is rendered from each novel pose to produce per-ray color and density pseudo–ground truth.
- Coordinate Alignment: Each novel pose is mapped into the NeRF world frame through a rigid transformation $T \in \mathrm{SE}(3)$, yielding aligned rays for effective cross-domain supervision (see the sketch after this list).
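A compact sketch of the dome sampling and rigid alignment is shown below; the elevation/azimuth ranges, the look-at convention, and the transform `T_gs_to_nerf` are illustrative assumptions.

```python
import numpy as np

def look_at(cam_pos, target=np.zeros(3), up=np.array([0.0, 1.0, 0.0])):
    """Camera-to-world pose (4x4) looking from cam_pos toward target."""
    fwd = target - cam_pos
    fwd = fwd / np.linalg.norm(fwd)
    right = np.cross(fwd, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, fwd)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2] = right, true_up, -fwd
    pose[:3, 3] = cam_pos
    return pose

def dome_poses(radius, elev_range=(-30, 30), azim_range=(-45, 45), step=5):
    """Sample a dome of object-centered poses at a fixed camera-object
    distance (angle ranges are placeholders, not the paper's values)."""
    poses = []
    for elev in range(elev_range[0], elev_range[1] + 1, step):
        for azim in range(azim_range[0], azim_range[1] + 1, step):
            e, a = np.deg2rad(elev), np.deg2rad(azim)
            cam = radius * np.array(
                [np.sin(a) * np.cos(e), np.sin(e), np.cos(a) * np.cos(e)]
            )
            poses.append(look_at(cam))
    return np.stack(poses)

def align_rays(origins, dirs, T_gs_to_nerf):
    """Map splat-frame rays into the NeRF world via a rigid 4x4 transform."""
    R, t = T_gs_to_nerf[:3, :3], T_gs_to_nerf[:3, 3]
    return origins @ R.T + t, dirs @ R.T
```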
3. Loss Functions and Training Objective
The total objective is a composite of temporal, photometric, perceptual super-resolution, and novel-view supervision losses:

$$\mathcal{L} = \lambda_{\text{temp}}\,\mathcal{L}_{\text{temp}} + \lambda_{\text{photo}}\,\mathcal{L}_{\text{photo}} + \lambda_{\text{SR}}\,\mathcal{L}_{\text{SR}} + \lambda_{\text{pgt}}\,\mathcal{L}_{\text{pgt}}$$

- Temporal Continuity: $\mathcal{L}_{\text{temp}} = \lVert \sigma_d(\mathbf{x}, t) - \sigma_d(\mathbf{x}, t+1) \rVert_1$, penalizing density changes across adjacent frames
- Primary-View Reconstruction: $\mathcal{L}_{\text{photo}} = \sum_{\mathbf{r}} \lVert \hat{C}(\mathbf{r}) - C(\mathbf{r}) \rVert_2^2$, using output composited from the static and dynamic branches
- Super-Resolution Perceptual Loss: $\mathcal{L}_{\text{SR}} = \lVert \phi(\mathrm{SR}(\hat{P}_{lr})) - \phi(P_{hr}) \rVert_1$, utilizing a VGG-19 feature extractor $\phi$
- Novel-View Pseudo–Ground-Truth Loss: $\mathcal{L}_{\text{pgt}} = \sum_{\mathbf{r}'} \lVert \hat{C}(\mathbf{r}') - C_{\text{GS}}(\mathbf{r}') \rVert_1$, with $\mathbf{r}' = T(\mathbf{r})$ the aligned novel-view rays and $C_{\text{GS}}$ the Gaussian-splat rendering
- This loss is omitted for the initial epochs to avoid instability, then fully incorporated.
The weights $\lambda_{\text{temp}}$, $\lambda_{\text{photo}}$, $\lambda_{\text{SR}}$, and $\lambda_{\text{pgt}}$ balance the four terms; a minimal sketch of the combined objective follows this list.
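Putting the terms together, the following is a minimal sketch of the combined objective, assuming PyTorch tensors; the dictionary keys, norm choices, weight values, and VGG-19 layer cut are assumptions, and the warm-up gating mirrors Section 6.

```python
import torch
import torch.nn.functional as F
import torchvision

# Frozen VGG-19 features as a perceptual embedding (layer cut assumed).
_vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def vgg_features(img: torch.Tensor) -> torch.Tensor:
    """img: [N, 3, H, W] in [0, 1]."""
    return _vgg(img)

def total_loss(pred: dict, gt: dict, weights: dict,
               epoch: int, warmup_epochs: int = 50) -> torch.Tensor:
    """Composite ExpanDyNeRF-style objective (weight values are placeholders)."""
    # L1 temporal continuity on adjacent dynamic densities.
    l_temp = F.l1_loss(pred["sigma_t"], pred["sigma_t1"])
    # Photometric reconstruction against the primary monocular view.
    l_photo = F.mse_loss(pred["rgb"], gt["rgb"])
    # Perceptual SR loss in VGG-19 feature space.
    l_sr = F.l1_loss(vgg_features(pred["sr_patch"]), vgg_features(gt["hr_patch"]))
    loss = (weights["temp"] * l_temp
            + weights["photo"] * l_photo
            + weights["sr"] * l_sr)
    # Pseudo-ground-truth term is gated off during warm-up (Section 6).
    if epoch >= warmup_epochs:
        l_pgt = F.l1_loss(pred["rgb_novel"], gt["gs_render"])
        loss = loss + weights["pgt"] * l_pgt
    return loss
```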
4. Synthetic Dynamic Multiview (SynDM) Dataset
ExpanDyNeRF’s novel-view supervision is enabled by SynDM, a synthetic multi-view dataset purpose-built using a GTA V-based rendering pipeline.
Key properties of SynDM:
| Attribute | Details |
|---|---|
| Multi-view sync | 22 cameras (±45° azimuth at 5° steps, 3 at ±45° elevation) |
| Temporal consistency | All cameras render in a single engine frame (0.2 ms latency) |
| Scene variety | Dynamic scenes (humans, vehicles, animals, unconstrained) |
| Camera motion | Main camera exhibits natural motion |
| Resolution/FOV | 1920×1080 pixels, fixed FOV |
| Supervision splits | Train: first 24 frames; Test: 12 novel cams (±5…±30°) |
This dataset facilitates held-out, large-deviation ground truth for rigorous evaluation beyond small baseline shifts.
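The rig geometry in the table can be made concrete. The sketch below lists the 22 camera directions as (elevation, azimuth) pairs; the table does not specify the azimuths of the 3 elevated cameras, so those placements are hypothetical.

```python
import numpy as np

def syndm_rig_angles() -> np.ndarray:
    """SynDM's 22-camera rig as (elevation, azimuth) pairs in degrees:
    19 cameras sweeping azimuth -45..+45 in 5-degree steps, plus 3
    elevated cameras at +/-45 degrees (azimuths assumed for illustration)."""
    cams = [(0.0, float(az)) for az in range(-45, 46, 5)]  # 19 azimuth cams
    cams += [(45.0, 0.0), (-45.0, 0.0), (45.0, 45.0)]      # 3 assumed elevated cams
    assert len(cams) == 22
    return np.array(cams)
```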
5. Empirical Performance and Ablations
ExpanDyNeRF demonstrates superior performance to state-of-the-art dynamic NeRF and splatting methods on both SynDM and real-world datasets, particularly for large azimuth deviations.
Quantitative summary (SynDM, avg. over 9 scenes, ±30° test):
| Method | FID | PSNR | LPIPS |
|---|---|---|---|
| D3DGS | 223.6 | 16.74 | 0.373 |
| DecNeRF | 225.7 | 15.78 | 0.516 |
| RoDynRF | 238.2 | 18.99 | 0.338 |
| DetRF | 239.0 | 20.34 | 0.436 |
| ExpanDyNeRF | 142.7 | 20.86 | 0.209 |
On the real-world DyNeRF dataset:
- Scene “Coffee”: FID = 132.4, PSNR = 30.32, LPIPS = 0.189 (a 2–6 dB PSNR gain over the best prior method)
- Scene “Beef”: FID = 135.8, PSNR = 34.92, LPIPS = 0.195
On the NVIDIA Dynamic Scenes dataset, ExpanDyNeRF is a close second (PSNR ≈ 29–30, LPIPS ≈ 0.03–0.08), indicating robust generalization.
Qualitatively, baseline models exhibit severe artifacts (fragmented geometry, “cardboard” flattening, misplaced limbs, or collapse of thin structures) when tested at ±30° deviations. ExpanDyNeRF uniquely maintains sharp geometry, correct occlusion, and temporal color stability under these conditions.
6. Implementation Considerations
- Training is performed scene-wise for 300k iterations on dual NVIDIA A100 GPUs (approx. 15 h/scene).
- Gaussian splatting priors are generated once per frame with negligible overhead.
- The pseudo–ground-truth loss is introduced after 50 warm-up epochs to allow the backbone to stabilize.
This schedule avoids the gradient pathologies associated with noisy or unstable priors early in optimization; a sketch of the gating follows.
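The gating can be expressed directly, with the full weight value as a placeholder:

```python
def pgt_weight(epoch: int, warmup_epochs: int = 50, full_weight: float = 1.0) -> float:
    """Pseudo-ground-truth loss weight: zero during warm-up so the backbone
    stabilizes first, then switched fully on (full_weight is a placeholder)."""
    return 0.0 if epoch < warmup_epochs else full_weight
```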
7. Significance and Limitations
ExpanDyNeRF extends the applicability of dynamic NeRF frameworks to unconstrained, real-world video where only monocular, forward-facing camera input is available, and surpasses prior work in both metrics and qualitative stability under extreme viewpoint changes. The explicit use of 3D Gaussian splatting as a regularizing proxy, combined with the newly introduced SynDM dataset for large-deviation ground truth, establishes novel methodological and empirical standards in the field (Jiang et al., 16 Dec 2025).
A plausible implication is that while ExpanDyNeRF robustly outperforms competitors on wide-baseline evaluation, its reliance on splat-based pseudo–ground-truth supervision may limit effectiveness where no such proxy can be generated or where the alignment between the proxy and the real scene is poor. Nonetheless, its ability to preserve geometry and appearance under challenging view-synthesis scenarios suggests its core techniques may inspire further advances in monocular dynamic scene reconstruction.