4D Gaussian Models for Dynamic Scene Rendering
- 4D Gaussian Models are explicit scene representations defined by dynamic Gaussian primitives that couple spatial and temporal dimensions.
- They leverage hybrid static-dynamic decompositions and learnable deformation fields to capture evolving scene details with high fidelity.
- These models offer real-time rendering, memory efficiency, and scalability for applications like novel view synthesis, editing, and medical imaging.
4D Gaussian Models are explicit scene representations built from dynamic Gaussian primitives in four-dimensional (space-time) domains, supporting efficient, high-fidelity rendering, reconstruction, and manipulation of dynamic scenes. Emerging as a successor to per-frame 3D Gaussian Splatting, 4D Gaussian approaches introduce both spatial and temporal coupling, leveraging native 4D parametrizations, hybrid static-dynamic decompositions, and compact, learnable deformation fields. These models underpin numerous state-of-the-art techniques in dynamic scene synthesis, novel view/time rendering, segmentation, editing, and related vision and graphics tasks.
1. Formal Foundations and Representational Principles
A 4D Gaussian is defined by its mean $\mu \in \mathbb{R}^4$ and covariance $\Sigma \in \mathbb{R}^{4\times4}$ as an anisotropic ellipsoid in space-time:
$$G(x) = \exp\!\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right), \qquad \Sigma = R\,S\,S^\top R^\top,$$
with $R$ being a 4D rotation matrix (e.g., constructed from left/right isoclinic quaternions or geometric algebra rotors) and $S$ a diagonal scaling matrix. Each primitive thus covers a localized spatiotemporal region, spanning both a geometric extent and an associated time interval.
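As a concrete illustration, the following minimal NumPy sketch builds a 4D rotation from a pair of unit quaternions (the standard left/right-isoclinic construction referenced above) and assembles $\Sigma = R S S^\top R^\top$; the function names and toy values are illustrative, not taken from any particular implementation.

```python
import numpy as np

def left_isoclinic(q):
    """4x4 left-isoclinic rotation matrix from a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return np.array([
        [w, -x, -y, -z],
        [x,  w, -z,  y],
        [y,  z,  w, -x],
        [z, -y,  x,  w],
    ])

def right_isoclinic(q):
    """4x4 right-isoclinic rotation matrix from a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return np.array([
        [w, -x, -y, -z],
        [x,  w,  z, -y],
        [y, -z,  w,  x],
        [z,  y, -x,  w],
    ])

def covariance_4d(q_left, q_right, scale):
    """Sigma = R S S^T R^T with R = L(q_left) @ R(q_right) and S = diag(scale)."""
    q_left = q_left / np.linalg.norm(q_left)
    q_right = q_right / np.linalg.norm(q_right)
    R = left_isoclinic(q_left) @ right_isoclinic(q_right)
    S = np.diag(scale)                       # per-axis scales (x, y, z, t)
    return R @ S @ S.T @ R.T

# toy example: mildly anisotropic primitive with a short temporal extent
Sigma = covariance_4d(np.array([1.0, 0.1, 0.0, 0.0]),
                      np.array([1.0, 0.0, 0.1, 0.0]),
                      scale=np.array([0.5, 0.3, 0.3, 0.05]))
print(Sigma.shape)  # (4, 4)
```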
A 4D scene is the union of such Gaussians, $\{G_i\}_{i=1}^{N}$, optionally augmented with view- and time-dependent appearance via 4D Spherindrical Harmonics,
$$Z_{nl}^{m}(t, \theta, \phi) = \cos\!\left(\frac{2\pi n t}{T}\right) Y_l^m(\theta, \phi),$$
where $Y_l^m$ is a spherical harmonic, enabling view-time-varying color/appearance. Rendering at time $t$ involves slicing the 4D Gaussians to generate the 3D Gaussians active at $t$; these are further projected to the image plane and blended:
$$C = \sum_{i} c_i\, \alpha_i \prod_{j<i} (1 - \alpha_j),$$
where $c_i$ and $\alpha_i$ are the color and opacity of the $i$-th Gaussian, respectively.
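Time-slicing and compositing can be sketched directly from the Gaussian conditioning formulas; the NumPy snippet below is a simplified illustration (helper names are hypothetical, and it omits the camera projection of the conditional 3D Gaussian onto the image plane).

```python
import numpy as np

def slice_at_time(mu4, Sigma4, t):
    """Condition a 4D Gaussian (mu4, Sigma4) on time t.

    Returns the conditional 3D mean/covariance and the marginal temporal
    density used to attenuate the primitive's opacity at t.
    """
    mu_xyz, mu_t = mu4[:3], mu4[3]
    S_xx = Sigma4[:3, :3]          # spatial block
    S_xt = Sigma4[:3, 3]           # space-time cross terms
    S_tt = Sigma4[3, 3]            # temporal variance
    mu_cond = mu_xyz + S_xt * (t - mu_t) / S_tt
    Sigma_cond = S_xx - np.outer(S_xt, S_xt) / S_tt
    temporal_weight = np.exp(-0.5 * (t - mu_t) ** 2 / S_tt)
    return mu_cond, Sigma_cond, temporal_weight

def composite(colors, alphas):
    """Front-to-back alpha blending: C = sum_i c_i a_i prod_{j<i} (1 - a_j)."""
    C = np.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        C += transmittance * a * np.asarray(c)
        transmittance *= (1.0 - a)
    return C

# toy example: two sliced primitives composited front-to-back
print(composite(colors=[(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)], alphas=[0.6, 0.5]))
```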
2. Deformation Fields and Temporal Dynamics
To model evolving scenes, 4D Gaussian methods introduce deformation fields that parameterize how a canonical set of 3D Gaussians moves and changes over time. The general form for a per-Gaussian attribute $\theta_i$ (position, rotation, scale, etc.) at time $t$ is
$$\theta_i(t) = \theta_i + \Delta\theta_i(t),$$
with $\Delta\theta_i(t)$ the temporal residual. Various approaches exist:
- Neural deformation MLPs: Use a 4D HexPlane- or K-Planes-inspired decomposition, interpolating features from 2D planes (e.g., $(x,y)$, $(x,t)$, $(y,t)$, etc.), which are concatenated and passed through small MLPs for deformation prediction (2310.08528).
- Explicit deformation curve fitting: Model $\Delta\theta_i(t)$ as a polynomial (global, smooth) plus a truncated Fourier (local, high-frequency) series for each Gaussian, as in Gaussian-Flow (2312.03431), e.g.
$$\Delta\theta_i(t) = \sum_{k=1}^{K_p} a_k t^k + \sum_{k=1}^{K_f} \big(b_k \cos(2\pi k t) + c_k \sin(2\pi k t)\big)$$
(see the sketch at the end of this section).
- Velocity and lifespan parametrization: For real-time and scalable systems, the mean and orientation are evolved via learned velocities $v_i$ and angular velocities $\omega_i$, with a temporal falloff around each Gaussian's lifespan center $\mu_{t,i}$ (2406.10324, 2506.08015), e.g.
$$\mu_i(t) = \mu_i + v_i\,(t - \mu_{t,i}), \qquad \alpha_i(t) = \alpha_i \exp\!\left(-\frac{(t-\mu_{t,i})^2}{2\sigma_{t,i}^2}\right).$$
Slicing a 4D Gaussian at time $t$ yields a 3D Gaussian whose mean, shape, and influence evolve over time, naturally encoding both spatial structure and temporal motion.
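For the explicit curve-fitting variant above, a minimal NumPy sketch of evaluating a per-Gaussian polynomial-plus-Fourier residual is shown below; the coefficient layout and the normalization of $t$ to $[0,1]$ are assumptions for illustration, not the exact Gaussian-Flow parameterization.

```python
import numpy as np

def deformation_residual(t, poly_coeffs, fourier_cos, fourier_sin):
    """Evaluate Delta_theta(t) = sum_k a_k t^k + sum_k (b_k cos(2 pi k t) + c_k sin(2 pi k t)).

    t is assumed normalized to [0, 1] over the clip; each coefficient array has
    one row per order k and one column per deformed attribute dimension.
    """
    t = float(t)
    residual = np.zeros(poly_coeffs.shape[1])
    for k, a_k in enumerate(poly_coeffs, start=1):        # polynomial: smooth, global motion
        residual += a_k * t ** k
    for k, (b_k, c_k) in enumerate(zip(fourier_cos, fourier_sin), start=1):
        residual += b_k * np.cos(2 * np.pi * k * t)       # Fourier: local, high-frequency motion
        residual += c_k * np.sin(2 * np.pi * k * t)
    return residual

# toy example: one Gaussian's 3D position residual with 2 polynomial and 2 Fourier orders
rng = np.random.default_rng(0)
delta_xyz = deformation_residual(0.4,
                                 poly_coeffs=rng.normal(size=(2, 3)) * 0.01,
                                 fourier_cos=rng.normal(size=(2, 3)) * 0.01,
                                 fourier_sin=rng.normal(size=(2, 3)) * 0.01)
print(delta_xyz)  # position offset applied to the canonical mean at t = 0.4
```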
3. Memory Efficiency and Model Variants
Direct 4D Gaussian representations raise concerns of storage overhead—especially with large scenes and long videos. Multiple strategies are employed:
- Disentangled 3D/4D Hybrid: Static regions are represented by 3D Gaussians, while 4D Gaussians are reserved for truly dynamic regions. Iterative reassignment moves temporally invariant primitives into the static 3D set, reducing memory and computation (2505.13215); a minimal reassignment sketch follows this list.
- Color Parameter Compression: Replace per-Gaussian spherical harmonics (up to 144 parameters) with a direct color component and a shared, small MLP for dynamic color prediction (DC-AC model), yielding substantial storage reduction (2410.13613).
- Lightweight Feature Fields: Pool and condense neural voxel fields for deformation encoding, reducing redundancy; prune Gaussians and their attributes based on learned deformation or importance metrics (2406.16073).
- Sparsity and Densification: Explicit pruning and densification cycles, driven by spatial/temporal error signals and entropy losses, maintain only necessary, active Gaussians.
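As an illustration of the static/dynamic reassignment idea, the NumPy sketch below classifies Gaussians by their maximum displacement over a clip; the criterion and threshold are assumptions for illustration, not the exact rule used by any cited method.

```python
import numpy as np

def split_static_dynamic(positions_over_time, motion_threshold=1e-3):
    """Classify each Gaussian as static or dynamic from its positions over time.

    positions_over_time: array of shape (T, N, 3) with the deformed means of N
    Gaussians at T sampled timestamps. A Gaussian whose maximum displacement from
    its time-averaged position stays below motion_threshold is treated as static.
    """
    mean_pos = positions_over_time.mean(axis=0)                             # (N, 3)
    displacement = np.linalg.norm(positions_over_time - mean_pos, axis=-1)  # (T, N)
    is_static = displacement.max(axis=0) < motion_threshold                 # (N,)
    return is_static

# toy example: 5 Gaussians over 10 timestamps; only index 2 actually moves
T, N = 10, 5
traj = np.tile(np.random.default_rng(1).normal(size=(1, N, 3)), (T, 1, 1))
traj[:, 2, 0] += np.linspace(0.0, 0.5, T)          # inject motion on one Gaussian
static_mask = split_static_dynamic(traj)
print(static_mask)  # [ True  True False  True  True]
```

Static primitives identified this way can be rendered once per view rather than deformed per frame, which is the source of the memory and compute savings described above.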
4. Training and Optimization Methodologies
Training procedures are predominantly end-to-end, using differentiable 4D rasterization engines for both photometric and auxiliary supervision:
- Supervision: Per-frame rendering losses (RGB MSE, LPIPS, SSIM), semantic mask losses, sparse or dense geometric constraints (depth, normals, flow).
- Temporal Regularization: Encourage temporal smoothness and local spatiotemporal coherence using regularizations such as
$$\mathcal{L}_{\text{smooth}} = \sum_i \big\| \theta_i(t + \delta t) - \theta_i(t) \big\|_2^2$$
together with neighbor consistency (local rigidity) losses; see the sketch after this list.
- Hybrid Optimization & Feed-forward Inference: Recent large models perform direct scene prediction via neural architectures (U-Net, Transformer) from monocular or multiview video in a single pass (2406.10324, 2506.08015), with further training-stage pruning for density control in space-time.
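A minimal NumPy sketch of such temporal and neighbor-consistency terms is given below; the exact finite-difference form and neighbor weighting vary per method and are assumptions here.

```python
import numpy as np

def temporal_smoothness_loss(attrs_t, attrs_t_next):
    """Penalize large frame-to-frame changes of per-Gaussian attributes."""
    return np.mean(np.sum((attrs_t_next - attrs_t) ** 2, axis=-1))

def neighbor_consistency_loss(pos_t, pos_t_next, neighbor_idx):
    """Encourage nearby Gaussians to move coherently (local rigidity).

    neighbor_idx: (N, K) indices of each Gaussian's K nearest neighbors at time t.
    """
    motion = pos_t_next - pos_t                     # (N, 3) per-Gaussian displacement
    neighbor_motion = motion[neighbor_idx]          # (N, K, 3)
    return np.mean(np.sum((neighbor_motion - motion[:, None, :]) ** 2, axis=-1))

# toy example with 4 Gaussians and 2 neighbors each; coherent motion gives zero rigidity loss
pos_t = np.zeros((4, 3))
pos_t1 = pos_t + np.array([[0.1, 0.0, 0.0]] * 4)
nbrs = np.array([[1, 2], [0, 2], [1, 3], [2, 1]])
print(temporal_smoothness_loss(pos_t, pos_t1), neighbor_consistency_loss(pos_t, pos_t1, nbrs))
```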
5. Computational Efficiency and Real-Time Rendering
Native 4D Gaussian models, especially those with disentangled parameterization, offer significant efficiency:
- Real-time Rendering: Customized CUDA backends and rasterization yield hundreds to thousands of FPS at HD resolution (e.g., 4DRotorGS achieves 277–583 FPS (2402.03307); Disentangled4DGS, 343 FPS (2503.22159)).
- Memory and Storage: Techniques such as color compression, half-precision storage, and zip/delta coding enable large dynamic scenes to be represented in tens of MB (e.g., MEGA, 2410.13613).
- Scalability: Models scale to long, complex dynamic videos by limiting the per-frame Gaussian count (via temporal falloff or dynamic pruning) and adopting autoregressive or chunked inference when memory limits are reached.
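One simple way to bound the per-frame working set under the temporal-falloff parametrization is to rasterize only those Gaussians whose time-attenuated opacity exceeds a cutoff; the sketch below illustrates this, with the cutoff value chosen arbitrarily for the example.

```python
import numpy as np

def active_gaussians(base_opacity, mu_t, sigma_t, t, cutoff=1e-3):
    """Return indices of Gaussians worth rasterizing at time t.

    Each Gaussian's opacity is attenuated by a temporal falloff centered at its
    lifespan midpoint mu_t with spread sigma_t; primitives whose attenuated
    opacity falls below `cutoff` are skipped for this frame.
    """
    falloff = np.exp(-0.5 * ((t - mu_t) / sigma_t) ** 2)
    return np.nonzero(base_opacity * falloff >= cutoff)[0]

# toy example: 5 Gaussians with lifespans centered at different times
idx = active_gaussians(base_opacity=np.full(5, 0.8),
                       mu_t=np.array([0.0, 0.2, 0.5, 0.8, 1.0]),
                       sigma_t=np.full(5, 0.05),
                       t=0.5)
print(idx)  # only the Gaussian centered near t = 0.5 remains active
```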
6. Key Applications and Comparative Metrics
4D Gaussian Splatting models are broadly applicable:
- Dynamic Scene and Video Synthesis: Real-time novel view/time rendering of moving scenes, supporting free-viewpoint dynamics with photorealism and temporal consistency (2310.08528, 2402.03307, 2412.20720).
- 4D Content Generation and Animation: Fast, controllable dynamic asset generation from images or text, with explicit deformation and appearance interpolation (2312.17142).
- Segmentation and Object Tracking: Temporal identity feature fields allow robust object identification and segmentation in space-time, overcoming challenges like Gaussian drifting (2407.04504).
- Editing and Manipulation: Scalably support efficient appearance and geometry edits via static-dynamic separation and score distillation refinements (2502.02091).
- Medical and Scientific Imaging: Continuous-time tomographic reconstruction via radiative 4D Gaussian splatting with self-supervised periodicity for motion correction in CT (2503.21779).
When compared to NeRF-like and CNN-based volumetric approaches, 4D Gaussian frameworks typically exhibit:
| Method/Class | FPS↑ | PSNR↑ (dB) | Memory | Training Time | Dynamic Handling |
|---|---|---|---|---|---|
| NeRF/HyperNeRF | ≤1 | 19–27 | Large | 16–32 hr | Implicit neural fields |
| 3DGS (per frame) | ≤10 | 22–29 | Very large | 1+ hr | Static model duplicated per frame |
| Gaussian-Flow (4D) | 125 | 23–32 | Compact | 7–12 min | Explicit per-point DDDM |
| 4D-GS, Rotor4DGS, MEGA, etc. | 82–1250 | 30–35 | Minimal–tiny | 5–60 min | Native 4D representation |
| Hybrid 3D–4DGS (adaptive) | 200+ | ≥33 | Lowest | 12 min–1 hr | Adaptive static/dynamic assignment |
On standard benchmarks (Plenoptic Video, D-NeRF, HyperNeRF), 4DGS-based models match or exceed prior methods in PSNR, SSIM, and LPIPS, while rendering orders of magnitude faster and requiring far less storage.
7. Impact and Current Research Directions
The development of 4D Gaussian splatting has substantially advanced the efficiency and editability of dynamic scene representations. Key impacts are:
- Democratization of interactive, immersive dynamic graphics: Free-viewpoint and temporally resolved scene rendering in VR/AR, film, robotics, medical imaging.
- Scalability for long-form, large-scale data: Hybrid models and memory-efficient representation allow practical deployment in resource-constrained settings such as embedded robotics or surgical devices.
- Research avenues: integration with generative priors for unseen object synthesis, multimodal segmentation, language grounding, scene editing, and continuous-time tomographic reconstruction.
This suggests an ongoing convergence toward unified, explicit, memory- and computation-efficient spatiotemporal scene modeling frameworks that can serve a broad array of scientific, industrial, and creative applications.