4D Neural Voxel Splatting
- The paper introduces 4D Neural Voxel Splatting, a method that decouples spatial and temporal representations to achieve efficient and photoreal dynamic scene synthesis.
- It employs a voxel-based neural representation combined with on-demand Gaussian generation and learned deformation fields to render arbitrary viewpoints at any time.
- Experimental results show notable improvements in memory efficiency, training speed, and rendering quality compared to traditional 4D and 3D novel-view synthesis methods.
4D Neural Voxel Splatting (4D-NVS) refers to a family of methods for dynamic scene representation and novel-view synthesis that combine voxelized, feature-driven neural representations with temporally controlled Gaussian splatting. This class of methods enables real-time, photorealistic rendering and efficient storage for 4D (space-time) scenes, supporting high-fidelity, geometrically consistent synthesis of arbitrary viewpoints at arbitrary times. It is a central paradigm for scalable modeling of deformable and dynamic environments in computer vision and graphics.
1. Motivation and Conceptual Foundations
Dynamic scene rendering and novel-view synthesis require not only representing 3D spatial geometry, but also capturing and efficiently modeling its latent temporal evolution. Simple replication of 3D Gaussians per frame incurs $O(N \cdot T)$ memory, where $N$ is the number of spatial Gaussians and $T$ is the number of time steps. 4D Neural Voxel Splatting (4D-NVS) introduces a separation between spatial parameterization (voxel grids or neural anchor points) and temporal deformation (learned continuous or discretized time fields), achieving sub-linear memory and computational scaling in $T$.
The key insight is to maintain a single, compact grid of spatial "neural voxels" or anchor points, each storing multi-dimensional feature vectors, and to augment this grid at render time with geometric deformations that encode temporal changes. This approach allows on-demand synthesis of 4D Gaussian clouds for each timestamp, avoiding storage of explicit per-frame Gaussians and enabling temporally coherent, spatio-temporally disentangled dynamic representations (Wu et al., 1 Nov 2025).
2. Core Methodology: Voxel Anchors, Deformation Fields, and Splatting
The general pipeline of 4D-NVS for a dynamic scene consists of several stages:
- Neural Voxel Representation:
- Initialize a sparse point cloud and build a uniform 3D voxel grid with anchor centers $\{v_i\}$.
- Each voxel stores a feature vector $f_i$, a scale vector $s_i$, and learnable offset vectors $\{o_{i,k}\}$.
- On-Demand Gaussian Generation:
- For each viewpoint and time $t$, cull voxels outside the view frustum.
- For each visible voxel, generate $K$ Gaussians with means $\mu_{i,k} = v_i + o_{i,k} \odot s_i$.
Covariances are derived from the scale and a decoded rotation, $\Sigma = R\,\mathrm{diag}(s)^{2}R^{\top}$. Geometric and appearance parameters are inferred by shallow, per-voxel-condition MLPs, which take $(f_i, \delta_i, \mathbf{d}_i)$ as input, where $\delta_i$ is the viewing distance and $\mathbf{d}_i$ the viewing direction (see the sketch after this list).
- Encoding Temporal Deformation:
- Use a HexPlane or K-Planes factorization of the 4D space-time domain: six 2D planes (XY, XZ, YZ, XT, YT, ZT) of suitable resolution are stored.
- At each Gaussian's space-time position $(x, y, z, t)$, bilinearly sample and aggregate all six planes, yielding a deformation feature $f_d$.
- Pass $f_d$ through small MLP decoders to produce geometric deltas $(\Delta\mu, \Delta s, \Delta r)$, applied additively to the Gaussian means, scales, and rotations.
Notably, only geometric parameters are deformed; appearance (color $c$, opacity $\alpha$) remains static to preserve stability (Wu et al., 1 Nov 2025).
- Rendering by Splatting:
- Transform the resulting Gaussians into camera (or world) coordinates.
- Project each to the image plane, sort by depth, and alpha-blend color and opacity using front-to-back compositing.
- View Refinement Stage:
- Identify systematically poor-performing views ("crude views") using PSNR- or gradient-based heuristics.
- Allocate optimization budget primarily to these views, densifying anchors and splitting Gaussians more aggressively where losses are high, yielding improved quality for challenging angles.
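A minimal PyTorch sketch of the two core components above may help make the data flow concrete: voxel anchors that decode Gaussians on demand from a (feature, viewing distance, viewing direction) input, and a HexPlane-style deformation module that samples six 2D planes at each Gaussian's $(x, y, z, t)$ and decodes geometric deltas. Class names, layer widths, the quaternion/scale parameterization, and the offset rule $\mu = v + o \odot s$ are illustrative assumptions, not the exact architecture of Wu et al.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralVoxelAnchors(nn.Module):
    """Voxel anchors that emit K Gaussians on demand for a given camera position."""
    def __init__(self, num_anchors, feat_dim=32, k_per_anchor=4):
        super().__init__()
        self.centers = nn.Parameter(torch.rand(num_anchors, 3))                  # anchor centers v_i
        self.feats = nn.Parameter(torch.randn(num_anchors, feat_dim))            # features f_i
        self.log_scales = nn.Parameter(torch.zeros(num_anchors, 3))              # scales s_i (log-space)
        self.offsets = nn.Parameter(0.01 * torch.randn(num_anchors, k_per_anchor, 3))  # offsets o_{i,k}
        # Shallow MLP decoding opacity, color, and rotation from (f_i, distance, direction).
        self.head = nn.Sequential(nn.Linear(feat_dim + 1 + 3, 64), nn.ReLU(),
                                  nn.Linear(64, k_per_anchor * (1 + 3 + 4)))

    def forward(self, cam_pos):                       # cam_pos: (3,) camera center
        v = self.centers
        delta = cam_pos - v
        dist = delta.norm(dim=-1, keepdim=True)
        direction = delta / (dist + 1e-8)
        out = self.head(torch.cat([self.feats, dist, direction], dim=-1))
        out = out.view(v.shape[0], -1, 1 + 3 + 4)
        opacity = torch.sigmoid(out[..., :1])
        color = torch.sigmoid(out[..., 1:4])
        rotation = F.normalize(out[..., 4:8], dim=-1)                             # unit quaternions
        scales = self.log_scales.exp().unsqueeze(1)
        means = v.unsqueeze(1) + self.offsets * scales                            # mu = v + o * s
        return means, scales.expand_as(means), rotation, color, opacity

class HexPlaneDeformation(nn.Module):
    """Six 2D feature planes (xy, xz, yz, xt, yt, zt) decoded to geometric deltas only."""
    def __init__(self, res=64, feat_dim=16):
        super().__init__()
        self.planes = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(1, feat_dim, res, res)) for _ in range(6)])
        self.decoder = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                     nn.Linear(64, 3 + 3 + 4))                    # d_mu, d_scale, d_rot

    def forward(self, xyz, t):                        # xyz in [-1, 1]^3, t in [-1, 1]
        x, y, z = xyz.unbind(-1)
        tt = t.expand_as(x)
        pairs = [(x, y), (x, z), (y, z), (x, tt), (y, tt), (z, tt)]
        feat = 0.0
        for plane, (u, w) in zip(self.planes, pairs):
            grid = torch.stack([u, w], dim=-1).view(1, -1, 1, 2)
            sampled = F.grid_sample(plane, grid, align_corners=True)              # (1, C, N, 1)
            feat = feat + sampled.squeeze(0).squeeze(-1).t()                      # aggregate to (N, C)
        d = self.decoder(feat)
        return d[:, :3], d[:, 3:6], d[:, 6:]          # applied as mu + d_mu, s + d_s, r + d_r
```

At render time the (N, K, 3) anchor outputs would be flattened, passed through the deformation module at the query timestamp, and handed to a standard Gaussian rasterizer for depth-sorted, front-to-back alpha compositing.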
This architecture achieves memory scaling $O(V + M)$, where $V$ is the number of voxel anchors and $M$ is the size of the deformation planes and MLPs, as opposed to $O(N \cdot T)$ for brute-force per-frame methods.
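To see why this scaling matters in practice, the back-of-the-envelope comparison below uses hypothetical sizes (all numbers are placeholders, not figures from the paper) to contrast per-frame Gaussian replication with a shared anchor grid plus deformation field.

```python
# Hypothetical sizes chosen for illustration only.
N, T = 200_000, 300                  # Gaussians per frame, number of time steps
per_gaussian = 59                    # floats per 3D Gaussian (mean 3 + scale 3 + rot 4 + SH 48 + opacity 1)
V, K, d = 50_000, 4, 32              # voxel anchors, Gaussians per anchor, feature dim
plane_params = 6 * 64 * 64 * 16      # six 2D feature planes
mlp_params = 50_000                  # rough budget for the small decoder MLPs

per_frame = N * T * per_gaussian                              # O(N*T): grows with sequence length
nvs = V * (3 + d + 3 + K * 3) + plane_params + mlp_params     # O(V + M): independent of T
print(f"per-frame replication : {per_frame / 1e6:.0f}M floats")
print(f"voxel anchors + field : {nvs / 1e6:.1f}M floats")
```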
3. Loss Formulation, Optimization, and Ablation Analysis
Losses typically integrate several terms:
- $\mathcal{L}_{\text{photo}}$ (photometric): a pixel-wise loss between rendered and ground-truth images,
- $\mathcal{L}_{\text{TV}}$ (total variation) on the 4D deformation field for temporal smoothness,
- $\mathcal{L}_{\text{scale}}$: penalizes oversized Gaussians to control over-blending (all three terms appear in the sketch below).
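A minimal sketch of how the three terms might be combined in PyTorch follows; the weights, the max-scale threshold, and the finite-difference form of the TV term are placeholder choices rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def total_loss(render, gt, planes, scales, w_tv=1e-4, w_scale=1e-3, max_scale=0.05):
    # Photometric term: L1 between rendered and ground-truth images.
    l_photo = F.l1_loss(render, gt)
    # Total variation over each 2D deformation plane (finite differences along H and W).
    l_tv = sum((p[..., 1:, :] - p[..., :-1, :]).abs().mean() +
               (p[..., :, 1:] - p[..., :, :-1]).abs().mean() for p in planes)
    # Scale regularizer: penalize Gaussians whose extent exceeds a threshold.
    l_scale = F.relu(scales - max_scale).mean()
    return l_photo + w_tv * l_tv + w_scale * l_scale
```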
Ablation studies demonstrate:
- Removing $\mathcal{L}_{\text{scale}}$ yields floating, oversized Gaussians and blurring artifacts.
- Omitting $\mathcal{L}_{\text{TV}}$ degrades temporal consistency.
- Deforming color and alpha parameters leads to unstable training and significant PSNR drop relative to geometry-only deformation.
- Skipping view refinement reduces PSNR by 0.5–2 dB on the hardest views.
During training, anchor growth (densification) and low-opacity pruning are applied dynamically to adapt scene complexity.
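The densify/prune cycle can be sketched as below; the opacity floor, gradient threshold, and jittered-clone growth rule are generic 3DGS-style heuristics used here for illustration, not the paper's exact procedure.

```python
import torch

def prune_and_densify(anchors, opacity, grad_norm,
                      opacity_floor=0.005, grad_thresh=2e-4):
    keep = opacity > opacity_floor                   # prune near-transparent anchors
    anchors, grad_norm = anchors[keep], grad_norm[keep]
    grow = grad_norm > grad_thresh                   # densify where view-space gradients are large
    clones = anchors[grow] + 0.01 * torch.randn_like(anchors[grow])
    return torch.cat([anchors, clones], dim=0)
```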
4. Memory, Speed, and Rendering Characteristics
4D-NVS drastically reduces memory usage and computational requirements:
- On HyperNeRF, peak memory: 3,050 MiB (4D-NVS) vs. 4,500 MiB (4D-GS), vs. 21,558 MiB for TiNeuVox-B.
- Training time: 10–13 min (4D-NVS) vs. 25 min (4D-GS), vs. 32 h (HyperNeRF), using a single RTX-4090.
- Inference speeds: 45 FPS (4D-NVS), 34 FPS (4D-GS), 5 FPS (other dynamic-NeRF variants).
- Rendering quality: PSNR 28.5 dB and MS-SSIM 0.872 on HyperNeRF (+3.3 dB over 4D-GS), and 33.12 dB vs. 31.91 dB on Neu3D, all at real-time speeds.
Selective deformation of only geometric parameters is crucial for stable and high-fidelity training. The use of targeted refinement for poor views yields up to 1.2 dB additional PSNR on those views with minor computational overhead.
5. Relation to Broader 4D Gaussian Splatting Methods
4D Neural Voxel Splatting is part of a broader landscape of methods that exploit explicit Gaussian primitives for dynamic scene modeling. Compared to fully 4D Gaussian parametric approaches—where each primitive carries a 4D (spatio-temporal) mean and covariance, and rendering requires 4D-to-3D slicing (Yang et al., 30 Dec 2024, Duan et al., 5 Feb 2024)—the neural-voxel paradigm leverages factorized deformation fields tied to sparse voxel grids for storage and optimization efficiency.
Hybrid representations leverage voxelized codes for both geometry and color (Gan et al., 2022), with auxiliary MLPs for density, radiance, and deformation prediction. Tokenized or quantized variants use neural compression (Chen et al., 26 Apr 2025) for multi-rate storage, while other approaches decompose temporal signals hierarchically (video/segment/frame) (Hou et al., 23 May 2025) or separate time and space analytically to reduce redundant computation (Feng et al., 28 Mar 2025).
Feed-forward generative models for 4D scene synthesis (Zhu et al., 27 Sep 2025, Lu et al., 24 Sep 2025) build directly on neural-voxel splatting, integrating diffusion-based latent generative modules to synthesize explicit 4D Gaussian representations, which can then be rendered or further refined with video diffusion backbones.
6. Applications, Benchmarks, and Empirical Results
4D-NVS methods are used in dynamic novel-view synthesis (NVS), high-fidelity scene relighting or editing, and the generation of training data for downstream tasks in robotics and autonomous driving. Key applications include:
- Real-time rendering of arbitrary camera trajectories and free-viewpoint video synthesis.
- Data-efficient representation for storage and streaming of dynamic content.
- Synthetic scene generation for self-supervised or simulation-based learning in perception stacks.
Benchmark results report:
- On HyperNeRF: 4D-NVS achieves PSNR 28.5 dB vs. 25.2 dB (4D-GS) at 45 FPS (Wu et al., 1 Nov 2025).
- On Neu3D: PSNR 33.12 dB vs. 31.91 dB and real-time (40 FPS).
- Memory: 32–86% reduction over prior 4D-GS and NeRF-grid baselines.
- Speed: training 2–100× faster, inference 2–10× faster than non-voxelized or implicit-field methods.
7. Limitations, Open Problems, and Future Directions
While 4D-NVS achieves high scalability and efficiency, certain limitations remain:
- Loss of fine detail if excessive pruning or aggressive compression of voxel features is performed.
- Temporal smoothness, though enforced via $\mathcal{L}_{\text{TV}}$, can trade off against temporal sharpness, especially for abrupt events.
- Selective deformation of only geometry may limit handling of non-rigid appearance changes, though naive deformation of appearance destabilizes optimization.
- View refinement loops improve hard views but introduce additional per-view adaptivity that is not strictly real-time.
Research directions include integration of language or semantics (Li et al., 13 Mar 2025, Fiebelman et al., 14 Oct 2024), dynamic control signals, fully generative pipelines, and further neural compression and streaming strategies. Advances in neural voxel representations and 4D Gaussian primitives continue to bridge real-time rendering requirements with state-of-the-art visual fidelity for dynamic scenes in computer vision, graphics, and simulation.