
4D Neural Voxels: Dynamic Scene Representation

Updated 26 November 2025
  • 4D neural voxels are neural fields defined over a four-dimensional domain (three spatial and one temporal or semantic axis) to capture dynamic and deformable 3D content.
  • They integrate explicit grids, factorized plane fields, and coordinate-based MLPs to offer compact storage, rapid training, and efficient spatiotemporal interpolation.
  • These representations underpin applications in dynamic scene rendering, point cloud compression, and avatar creation by balancing fidelity with computational efficiency.

A 4D neural voxel is a neural field defined on a discrete or continuous four-dimensional domain, typically with three spatial coordinates and one temporal axis or semantic index, which encodes volumetric, radiance, deformation, or attribute information. These representations generalize classic 3D voxels for dynamic scene modeling, compression, rendering, and shape deformation, underpinning multiple state-of-the-art approaches in neural scene synthesis, compression, and avatar generation. 4D neural voxels are implemented as explicit grids, factorized plane-based fields, or coordinate-based multi-layer perceptrons (MLPs). Their design enables highly compact storage, rapid training and inference, and efficient spatiotemporal interpolation for dynamic and deformable 3D content.

1. Formal Definitions and Parameterizations

4D neural voxels are instantiated as tensor-valued neural fields with a domain in $(x, y, z, t)$ for spatiotemporal applications, or $(x, y, z, i)$ when indexing an expression basis, as in facial avatar modeling. The explicit structure can be a dense 4D grid (e.g., $V \in \mathbb{R}^{h \times N_x \times N_y \times N_z \times N_t}$), though in practice memory and computational constraints drive most modern designs toward factorizations or sparse encodings. For example, HexPlane and K-Planes decompositions use one 2D feature plane per pair of axes ($(x,y)$, $(x,z)$, $(y,z)$, $(x,t)$, $(y,t)$, $(z,t)$), from which bilinear interpolation produces local features. The feature dimension $h$ typically varies between 16 and 128. For dynamic avatars, a "basis tensor" stacks $N$ 3D voxel volumes along an expression index to form a 5D tensor $V_d^{\text{basis}} \in \mathbb{R}^{N \times C_d \times L_d \times L_d \times L_d}$, fused on demand using learned coefficients (Xu et al., 2022).
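To make the factorized parameterization concrete, the following sketch queries a HexPlane/K-Planes-style 4D field: six 2D feature planes, one per axis pair, are bilinearly interpolated at a 4D coordinate and fused by summation. The function names, plane resolution, and sum fusion are illustrative assumptions, not any specific paper's API.

```python
import numpy as np

# Axis pairs for the six planes: (x,y), (x,z), (y,z), (x,t), (y,t), (z,t)
PLANE_AXES = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]

def bilinear(plane, u, v):
    """Bilinearly interpolate an (R, R, h) feature plane at continuous (u, v) in [0, R-1]."""
    r = plane.shape[0]
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, r - 1), min(v0 + 1, r - 1)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * plane[u0, v0] + du * (1 - dv) * plane[u1, v0]
            + (1 - du) * dv * plane[u0, v1] + du * dv * plane[u1, v1])

def query_4d_field(planes, coord):
    """Fuse features from six 2D planes at a 4D coordinate (x, y, z, t)."""
    feat = np.zeros(planes[0].shape[-1])
    for plane, (a, b) in zip(planes, PLANE_AXES):
        feat += bilinear(plane, coord[a], coord[b])  # sum fusion; product is another option
    return feat

R, h = 32, 16  # plane resolution and feature width (illustrative)
planes = [np.random.randn(R, R, h) for _ in PLANE_AXES]
f = query_4d_field(planes, np.array([3.2, 10.7, 21.1, 5.5]))
print(f.shape)  # (16,)
```

The fused feature would then be decoded by a small MLP head into density, color, or deformation parameters; the storage cost scales with six $R^2 h$ planes rather than one $R^4 h$ grid.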

2. Algorithmic Approaches for Dynamic Neural Scene Representation

Dynamic scene modeling in 4D neural voxels leverages several architectural paradigms:

  • Implicit Neural Fields: Coordinate-based MLPs receive $(x, y, z, t)$ as input and predict scene attributes, occupancy, or color, with positional encoding schemes (sinusoidal or conditional) increasing spatial/temporal resolution and expressive power (Ruan et al., 11 Dec 2024, Du et al., 2020). For instance, 4D-NeRC³ directly fits an occupancy function $f_{\text{occ}}: \mathbb{R}^4 \to [0, 1]$ and an attribute function $f_{\text{attr}}: \mathbb{R}^4 \to \mathbb{R}^c$ to jointly compress dynamic point clouds.
  • Voxel or Plane Factorizations: HexPlane and K-Planes approaches factorize a 4D grid into several 2D feature planes, dramatically reducing memory usage while permitting high-resolution interpolation. Features from each relevant plane are fused (summed, multiplied, or concatenated) and decoded by small MLP heads into per-Gaussian deformation or radiance parameters (Wu et al., 2023, Chen et al., 26 Apr 2025, Wu et al., 1 Nov 2025).
  • Hybrid Voxel-MLP Pipelines: Methods such as V4D treat spatial grids as primary, condition MLPs on temporal or semantic indices, and employ plug-in modules (e.g., LUT-based refinement) for pixel-level enhancement and rapid inference (Gan et al., 2022).
  • Mixed Static-Dynamic Voxels: MixVoxels partitions the scene into static and dynamic regions via a variation field, processing static voxels with lightweight models and dynamic voxels with higher-capacity temporal modules. Inner-product queries optimize efficiency, as decompressed per-voxel features can be rapidly mapped to multiple time-steps via learned time embeddings (Wang et al., 2022).
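The inner-product time query mentioned for MixVoxels can be sketched as follows: a dynamic voxel stores a single feature vector, and a shared matrix of learned per-time-step embeddings maps it to all time steps in one matrix-vector product. Shapes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, h = 60, 32                              # number of time steps, feature width
time_embed = rng.standard_normal((T, h))   # learned time embeddings (shared across voxels)
voxel_feat = rng.standard_normal(h)        # decompressed feature of one dynamic voxel

# All T temporal queries for this voxel cost a single (T, h) @ (h,) product,
# which is what makes per-voxel temporal decoding cheap at render time.
per_step = time_embed @ voxel_feat
print(per_step.shape)  # (60,)
```

Because the time embeddings are shared, the per-voxel cost is one inner product per time step rather than one MLP evaluation per (voxel, time) pair.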

3. Compression, Deformation, and Rendering Pipelines

4D neural voxels are foundational to compression, deformation-aware synthesis, and efficient rendering:

  • Point Cloud Compression: NeRC³ and its extension 4D-NeRC³ utilize 4D occupancy and attribute networks for geometry-attribute coding, quantizing network parameters, and transmitting auxiliary cube indices and per-frame thresholds for reconstruction. Temporal redundancy is mitigated via weight-space differencing (r-NeRC³) and Bezier parameter curves (c-NeRC³) (Ruan et al., 11 Dec 2024).
  • Dynamic Gaussian Splatting: 4D-GS and 4D-NVS approaches favor storing canonical 3D Gaussians and superimposing learned time-dependent deformations via 4D neural voxel fields. At each timestamp, Gaussians read from the 4D voxel, update position, scale, and rotation, and render via differentiable splatting (Wu et al., 2023, Wu et al., 1 Nov 2025). 4DGS-CC further compresses these voxel fields using Neural Voxel Contextual Coding (NVCC), exploiting spatial and temporal priors, while the canonical Gaussian codebook is compressed via Vector Quantization Contextual Coding (VQCC) for optimal storage (Chen et al., 26 Apr 2025).
  • Avatar and Deformable Reconstruction: AvatarMAV fuses 3DMM expression coefficients with per-basis 3D neural voxel grids, yielding rapid avatar photorealism and fast convergence (Xu et al., 2022). Ub4D leverages coordinate-based neural deformation maps and canonical signed distance fields for highly deformable monocular 4D reconstruction (Johnson et al., 2022).
  • Fast Multi-view Synthesis: MixVoxels, via static/dynamic partitioning and efficient temporal querying, achieves competitive rendering quality with reduced training time (Wang et al., 2022), while V4D demonstrates that voxel-MLP hybridization and LUT refinement drive rapid inference at high fidelity (Gan et al., 2022).
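The weight-space differencing used to mitigate temporal redundancy (in the spirit of r-NeRC³) can be sketched as transmitting the first frame's network weights in full, then only quantized residuals for subsequent frames. The quantization step size and encoding layout below are assumptions for illustration.

```python
import numpy as np

def encode_residual(w_prev, w_curr, step=1e-3):
    """Quantize the weight delta between consecutive frames to integer symbols."""
    return np.round((w_curr - w_prev) / step).astype(np.int32)

def decode_residual(w_prev, symbols, step=1e-3):
    """Reconstruct the current frame's weights from the previous frame plus symbols."""
    return w_prev + symbols.astype(np.float64) * step

rng = np.random.default_rng(1)
w0 = rng.standard_normal(1000)              # frame-0 weights (sent in full)
w1 = w0 + 0.01 * rng.standard_normal(1000)  # frame-1 weights drift only slightly

sym = encode_residual(w0, w1)
w1_hat = decode_residual(w0, sym)
print(np.max(np.abs(w1_hat - w1)) <= 5e-4)  # True: error bounded by half the step size
```

Because consecutive frames' weights are highly correlated, the integer symbols are small and entropy-code well, which is the source of the bitrate savings.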

4. Network Architecture and Positional Encoding Techniques

Architectural choices impact both quality and efficiency:

  • Residual Block MLPs: Occupancy and attribute networks in 4D-NeRC³ are two- and three-block fully connected networks, with ReLU and SIREN activations enhancing geometry and attribute expressiveness (Ruan et al., 11 Dec 2024).
  • Low-Rank/Tensor Factorizations: Systems such as MixVoxels and 4DGS-CC factorize large voxel grids for memory efficiency, often reducing storage overhead by an order of magnitude compared to dense representations (Wang et al., 2022, Chen et al., 26 Apr 2025).
  • Conditional and Sine-Based Positional Encoding: Separate encodings on spatial and temporal axes (e.g., $PE_x$ with $L_x = 12$, $PE_t$ with $L_t = 4$) are used in 4D-NeRC³; V4D employs time-conditioned phase shifts in positional encoding to synchronize with the temporal variation (Ruan et al., 11 Dec 2024, Gan et al., 2022). SIREN modules further unlock high-frequency information in attribute predictions.
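The separate spatial/temporal encodings can be sketched with a standard sinusoidal scheme: each coordinate is lifted to sine/cosine pairs over a geometric ladder of frequencies, with more bands for space ($L_x = 12$) than for time ($L_t = 4$). The helper name and the exact frequency convention are illustrative assumptions.

```python
import numpy as np

def positional_encoding(p, num_bands):
    """Map each coordinate to [sin(2^k * pi * p), cos(2^k * pi * p)] for k = 0..num_bands-1."""
    freqs = 2.0 ** np.arange(num_bands) * np.pi   # geometric frequency ladder
    angles = np.outer(freqs, p).ravel()           # (num_bands * len(p),)
    return np.concatenate([np.sin(angles), np.cos(angles)])

xyz = np.array([0.1, -0.4, 0.7])   # normalized spatial coordinate
t = np.array([0.25])               # normalized time

pe_x = positional_encoding(xyz, 12)   # 2 * 12 * 3 = 72 values
pe_t = positional_encoding(t, 4)      # 2 * 4 * 1 = 8 values
mlp_input = np.concatenate([pe_x, pe_t])
print(mlp_input.shape)  # (80,)
```

Giving the temporal axis fewer bands reflects that motion is typically smoother over time than geometry is over space, keeping the MLP input compact.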

5. Quantitative Performance, Memory, and Computational Trade-offs

Quantitative results underscore the efficacy of 4D neural voxels:

| Method | Task | Memory | Training Time | FPS | PSNR (dB) | SSIM/MS-SSIM | LPIPS |
|---|---|---|---|---|---|---|---|
| 4D-NeRC³ (Ruan et al., 11 Dec 2024) | Point cloud geometry + attr. | Param. MLP, cubes | -- | -- | See paper | -- | -- |
| 4D-GS (Wu et al., 2023) | Dyn. scene synth. | 18 MB | 20 min | 82 | 34.05 | 0.98 | 0.02 |
| 4D-NVS (Wu et al., 1 Nov 2025) | Dyn. voxel splatting | 3,050 MiB | 13 min | 44 | 28.5 | 0.872 | -- |
| MixVoxels-S (Wang et al., 2022) | Multi-view video | 500 MB | 15 min | 37.7 | 31.03 | 0.022 | 0.129 |
| V4D (Gan et al., 2022) | Novel view synth. | 377 MB | 6.9 h | -- | 33.72 | 0.98 | 0.02 |
| AvatarMAV (Xu et al., 2022) | Head avatar | -- | 5 min | -- | 30.4 | 0.96 | 0.038 |
| Ub4D (Johnson et al., 2022) | Monocular deform. | -- | ~17–26 h | -- | CD = 3.06 | -- | -- |

In point cloud compression, 4D-NeRC³ achieves up to −89.07% BD-BR against G-PCC, with geometry-distortion improvements versus multiple standards (Ruan et al., 11 Dec 2024). Dynamic Gaussian splatting with 4D neural voxels offers real-time synthesis across a range of datasets and outpaces NeRF-style architectures on speed and memory (Wu et al., 2023, Wu et al., 1 Nov 2025). Compression frameworks like 4DGS-CC reduce storage by ~12× with fidelity losses within 1–2% (Chen et al., 26 Apr 2025).
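A BD-BR figure like the −89.07% quoted above comes from the standard Bjøntegaard delta-rate computation: fit cubic polynomials of log-rate versus PSNR for the anchor and tested codecs, integrate both over the shared PSNR range, and convert the average log-rate gap to a percentage. The rate-distortion points below are made-up examples, not values from any cited paper.

```python
import numpy as np

def bd_rate(rates_a, psnr_a, rates_b, psnr_b):
    """Bjøntegaard delta bitrate of codec B vs. anchor A (negative = B saves bitrate)."""
    la, lb = np.log10(rates_a), np.log10(rates_b)
    pa = np.polyfit(psnr_a, la, 3)          # log-rate as a cubic in PSNR, anchor
    pb = np.polyfit(psnr_b, lb, 3)          # tested codec
    lo = max(min(psnr_a), min(psnr_b))      # shared PSNR interval
    hi = min(max(psnr_a), max(psnr_b))
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    ib = np.polyval(np.polyint(pb), hi) - np.polyval(np.polyint(pb), lo)
    avg_diff = (ib - ia) / (hi - lo)        # mean log-rate gap of B over A
    return (10.0 ** avg_diff - 1.0) * 100.0

# Hypothetical points: the tested codec hits the same PSNRs at 40% less bitrate.
anchor_rates = np.array([100.0, 200.0, 400.0, 800.0])   # kbps
anchor_psnr  = np.array([30.0, 33.0, 36.0, 39.0])
test_rates   = np.array([60.0, 120.0, 240.0, 480.0])
test_psnr    = np.array([30.0, 33.0, 36.0, 39.0])
print(round(bd_rate(anchor_rates, anchor_psnr, test_rates, test_psnr), 1))  # -40.0
```

Since the test curve here is exactly 0.6× the anchor's rate at every quality level, the metric recovers a clean −40% saving; real rate-distortion curves only overlap partially, which is why the integration is restricted to the shared PSNR range.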

6. Limitations and Future Directions

Key limitations include boundary artifacts at static-dynamic splits (Wang et al., 2022), memory overhead of full voxel grids (Gan et al., 2022), and the requirement for high-quality proxy deformation models in highly non-rigid reconstruction (Johnson et al., 2022). Temporal under-segmentation or inaccurate semantic priors can result in motion blurring or localized artifacts. Methods like 4DGS-CC address storage via advanced contextual coding but face scaling challenges for large-scale dynamic content.

Potential future developments include adaptive grid sparsification, richer semantic parameterizations for avatars (hair/body), online incremental learning for avatars (Xu et al., 2022), and learned deformation fields for topologically evolving scenes (Gan et al., 2022). Compression strategies continue to evolve toward joint optimization of voxel and codebook entropy losses. Factorized voxel and grid structures (TensoRF-style, low-rank planes) are proposed to further compress memory footprints and accelerate deployment (Wang et al., 2022).

7. Application Domains and Comparative Insights

4D neural voxels are deployed in compression, dynamic novel-view synthesis, real-time scene rendering, facial avatar creation, and non-rigid shape reconstruction. They form the technical core of frameworks for real-time rendering (4D-GS, 4D-NVS), high-fidelity compression (4D-NeRC³, 4DGS-CC), rapid avatar optimization (AvatarMAV), and multi-view dynamic video synthesis (MixVoxels, V4D). Comparative studies robustly favor 4D neural voxel approaches over MLP-only baselines and canonical-flow methods in terms of fidelity, rendering speed, and memory usage (Wu et al., 2023, Gan et al., 2022, Ruan et al., 11 Dec 2024). A plausible implication is that factorized 4D voxel architectures will increasingly dominate efficient and scalable modeling of dynamic and deformable 3D content across academic and industrial domains.
