4D Gaussian Splatting for Dynamic Scene Rendering
- 4D Gaussian Splatting is a framework that models dynamic scenes using explicit 4D Gaussian primitives capturing spatial geometry and temporal motion.
- It employs a splatting-based rendering pipeline that projects 4D Gaussians into 2D for efficient, real-time photorealistic scene reconstruction.
- End-to-end optimization and modular extensions enable practical applications in video compression, VR/AR, SLAM, and more.
4D Gaussian Splatting (4DGS) is an explicit, real-time volumetric scene representation framework that generalizes 3D Gaussian Splatting to unify spatio-temporal geometry and appearance for dynamic scene modeling, view synthesis, and downstream tasks. By representing a scene as a collection of anisotropic 4D Gaussian primitives—parameterized by $(x, y, z, t)$ mean vectors, full-rank 4×4 positive-definite covariance matrices that enable space–time rotation and stretching, and view- and time-dependent appearance models—4DGS allows direct, end-to-end optimization of photorealistic dynamic reconstructions from multi-view or monocular video. Its rasterization-based, splatting-centric rendering and rich explicit parameterization have made 4DGS a leading approach for dynamic radiance field modeling, motion-aware neural rendering, compact real-time video scene representation, and rate–distortion-optimized video compression.
1. Mathematical Formulation and 4D Gaussian Primitives
4D Gaussian Splatting represents dynamic scenes as a sum of explicit Gaussian lobes in ℝ⁴, each capturing not just spatial but also temporal support. Given a primitive with mean $\mu_i$ and covariance $\Sigma_i$, its (unnormalized) density is

$$p_i(\mathbf{x}) = \exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \mu_i)^\top \Sigma_i^{-1} (\mathbf{x} - \mu_i)\right),$$

where $\mathbf{x} = (x, y, z, t)^\top$ is a spatio-temporal query, $\mu_i \in \mathbb{R}^4$ is the 4D mean, and $\Sigma_i$ is a full-rank $4 \times 4$ covariance matrix. To capture arbitrary spatiotemporal anisotropy and orientation, $\Sigma_i$ is parameterized as

$$\Sigma_i = R_i S_i S_i^\top R_i^\top,$$

with $S_i = \mathrm{diag}(s_x, s_y, s_z, s_t)$ a scaling matrix and $R_i \in SO(4)$ an arbitrary 4D rotation, implemented via two unit quaternions $q_l, q_r$ (as a 4D isoclinic rotation). This construction allows each Gaussian to form any oriented ellipsoid in space–time, with axes aligned to both spatial shape and motion.

Writing $\Sigma_i$ in block form with spatial block $\Sigma_{xx}$, temporal variance $\Sigma_{tt}$, and cross-covariance $\Sigma_{xt}$, marginalizing and conditioning with respect to time yield closed-form formulas for efficient slicing:

$$p_i(t) = \exp\!\left(-\frac{(t - \mu_t)^2}{2\Sigma_{tt}}\right), \qquad \mu_{x|t} = \mu_x + \Sigma_{xt}\Sigma_{tt}^{-1}(t - \mu_t), \qquad \Sigma_{x|t} = \Sigma_{xx} - \Sigma_{xt}\Sigma_{tt}^{-1}\Sigma_{tx}.$$

These equations admit efficient evaluation and slicing at arbitrary timestamps during rendering or gradient-based optimization.
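The slicing step can be made concrete with a short NumPy sketch; the function names and the isoclinic factorization layout below are illustrative, not drawn from any particular codebase:

```python
import numpy as np

def build_cov4d(q_l, q_r, scales):
    """Assemble Sigma = R S S^T R^T from two unit quaternions
    (left/right isoclinic factors) and four per-axis scales."""
    a, b, c, d = q_l / np.linalg.norm(q_l)
    p, q, r, s = q_r / np.linalg.norm(q_r)
    # Left- and right-isoclinic rotation matrices; their product
    # realizes an arbitrary rotation in SO(4).
    L = np.array([[a, -b, -c, -d],
                  [b,  a, -d,  c],
                  [c,  d,  a, -b],
                  [d, -c,  b,  a]])
    Rm = np.array([[p, -q, -r, -s],
                   [q,  p,  s, -r],
                   [r, -s,  p,  q],
                   [s,  r, -q,  p]])
    R = L @ Rm
    S = np.diag(scales)  # (s_x, s_y, s_z, s_t)
    return R @ S @ S.T @ R.T

def slice_at_time(mu, Sigma, t):
    """Marginal temporal weight plus the conditional 3D Gaussian at time t."""
    mu_x, mu_t = mu[:3], mu[3]
    S_xx, S_xt, S_tt = Sigma[:3, :3], Sigma[:3, 3], Sigma[3, 3]
    w_t = np.exp(-0.5 * (t - mu_t) ** 2 / S_tt)    # p_i(t)
    mu_cond = mu_x + S_xt / S_tt * (t - mu_t)      # mu_{x|t}
    S_cond = S_xx - np.outer(S_xt, S_xt) / S_tt    # Sigma_{x|t}
    return w_t, mu_cond, S_cond
```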
2. Appearance Modeling via 4D Spherindrical Harmonics
For photorealistic rendering and view/time-dependent effects, each primitive is endowed with a small bank of RGB coefficients $c_{nlm}$ for a separable Fourier–spherical-harmonic basis ("4D spherindrical harmonics"):

$$Z_{nl}^{m}(t, \theta, \phi) = \cos\!\left(\frac{2\pi n t}{T}\right) Y_l^m(\theta, \phi),$$

where $n$ indexes the temporal Fourier frequency, $Y_l^m$ are angular spherical harmonics, and $(\theta, \phi)$ parameterize the observation direction. Each Gaussian's color is

$$c(t, \theta, \phi) = \sum_{n,l,m} c_{nlm}\, Z_{nl}^{m}(t, \theta, \phi),$$

capturing radiance that evolves across both viewpoints and time. View- and time-evolving appearance is crucial for faithfully modeling specularity, nontrivial illumination, and nonstationary surface effects.
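A toy evaluation of this basis is sketched below, truncated to temporal frequencies $n \le 1$ and SH degree $l \le 1$; the real-SH constants follow the convention common in 3DGS implementations, and the coefficient-array layout is an assumption for illustration:

```python
import numpy as np

SH_C0 = 0.28209479177387814   # Y_0^0
SH_C1 = 0.4886025119029199    # degree-1 real-SH prefactor

def spherindrical_color(coeffs, t, d, T=1.0, n_max=1):
    """Evaluate c(t, d) = sum_{n,l,m} c_{nlm} cos(2*pi*n*t/T) Y_l^m(d).
    coeffs: array of shape (n_max + 1, 4, 3) -> (freq, SH index, RGB)."""
    x, y, z = d / np.linalg.norm(d)
    basis = np.array([SH_C0, -SH_C1 * y, SH_C1 * z, -SH_C1 * x])  # l <= 1
    color = np.zeros(3)
    for n in range(n_max + 1):
        color += np.cos(2 * np.pi * n * t / T) * (basis @ coeffs[n])
    return color
```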
3. Efficient Splatting-Based Rendering Pipeline
Rendering in 4D Gaussian Splatting follows a real-time splatting paradigm. For a camera (extrinsics $E$, intrinsics $K$), pixel $(u, v)$, and query time $t$, the contributions from each Gaussian are:
- Temporal Marginalization: Compute the 1D weight $p_i(t) = \exp\!\left(-\frac{(t - \mu_t)^2}{2\Sigma_{tt}}\right)$ for alignment to time $t$.
- Spatial Conditional Slicing and Projection: Compute the conditional spatial Gaussian $\mathcal{N}(\mu_{x|t}, \Sigma_{x|t})$ and project it into camera space, obtaining a 2D Gaussian $\mathcal{N}(\mu_i^{2D}, \Sigma_i^{2D})$.
- Splatting and Compositing: Each primitive is rasterized as a 2D elliptical kernel with per-pixel weight $\alpha_i = o_i\, p_i(t)\, \exp\!\left(-\tfrac{1}{2}\Delta_i^\top (\Sigma_i^{2D})^{-1} \Delta_i\right)$, where $o_i$ is the opacity and $\Delta_i$ the offset of pixel $(u, v)$ from $\mu_i^{2D}$, textured with color $c_i$. Depth sorting then enables physically correct front-to-back alpha compositing:

$$C(u, v) = \sum_i c_i\, \alpha_i \prod_{j<i} (1 - \alpha_j).$$
This explicit 2D splatting—accumulating only the splats active at each time—yields real-time performance (e.g., 114 FPS at HD resolution) and is highly GPU-amenable due to its parallel structure.
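Production renderers implement this as a tile-based CUDA rasterizer; the per-pixel reference semantics reduce to the following sketch (illustrative names, depth-sorted inputs assumed):

```python
import numpy as np

def composite_pixel(alphas, colors):
    """Front-to-back alpha compositing at one pixel:
    C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j)."""
    C = np.zeros(3)
    transmittance = 1.0
    for alpha, color in zip(alphas, colors):  # splats sorted by depth
        C += transmittance * alpha * np.asarray(color)
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:              # early termination
            break
    return C
```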
4. End-to-End Learning and Optimization Strategies
All 4D Gaussian parameters are optimized jointly via gradient descent, typically minimizing a photometric loss between the rendered image $\hat{I}$ and the ground-truth image $I$, e.g.

$$\mathcal{L} = \lVert \hat{I} - I \rVert_1.$$

Perceptual losses (e.g., LPIPS) and structural similarity (D-SSIM) terms may be added for increased fidelity. To ensure temporal coherence, training batches often sample rays across multiple time instants (as opposed to slicing frame by frame).
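A minimal PyTorch sketch of such an objective, assuming the common 3DGS-style weighting of L1 against a windowed D-SSIM term (the uniform window and the value of λ are illustrative):

```python
import torch
import torch.nn.functional as F

def ssim(a, b, window=11, C1=0.01 ** 2, C2=0.03 ** 2):
    """Mean SSIM over (N, 3, H, W) images with a uniform window
    (Gaussian windows are more common in practice)."""
    pad = window // 2
    k = torch.ones(3, 1, window, window, device=a.device) / window ** 2
    mu_a = F.conv2d(a, k, padding=pad, groups=3)
    mu_b = F.conv2d(b, k, padding=pad, groups=3)
    var_a = F.conv2d(a * a, k, padding=pad, groups=3) - mu_a ** 2
    var_b = F.conv2d(b * b, k, padding=pad, groups=3) - mu_b ** 2
    cov = F.conv2d(a * b, k, padding=pad, groups=3) - mu_a * mu_b
    num = (2 * mu_a * mu_b + C1) * (2 * cov + C2)
    den = (mu_a ** 2 + mu_b ** 2 + C1) * (var_a + var_b + C2)
    return (num / den).mean()

def training_loss(pred, gt, lam=0.2):
    l1 = (pred - gt).abs().mean()
    return (1 - lam) * l1 + lam * (1 - ssim(pred, gt))
```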
Model complexity is adaptively controlled: spatial and temporal gradients of means are monitored, and underfit regions are densified by spawning new Gaussians, while redundancy is reduced via pruning. No additional deformation networks or explicit motion fields are required; the rotation and scale of each 4D ellipsoid suffice to model scene flow.
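The densify/prune decision reduces to simple thresholding; the sketch below uses 3DGS-style criteria with illustrative threshold values, not figures from the 4DGS paper:

```python
import numpy as np

def adaptive_density_control(mean_grads, opacities,
                             grad_thresh=2e-4, opacity_thresh=5e-3):
    """Masks of Gaussians to densify (clone/split) and to prune,
    based on accumulated gradients of the 4D means and on opacity."""
    densify = np.linalg.norm(mean_grads, axis=-1) > grad_thresh
    prune = opacities < opacity_thresh
    return densify, prune
```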
Initialization is commonly performed with 100K points from a static point cloud (e.g., from COLMAP), spread uniformly in time, with identity rotations and a large initial temporal scale $s_t$ so that each Gaussian starts with broad temporal coverage.
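A sketch of such an initialization, lifting a structure-from-motion point cloud to 4D (the scale value and function signature are assumptions for illustration):

```python
import numpy as np

def init_4d_gaussians(points_xyz, duration=1.0, s_t0=0.5):
    """Lift a static (n, 3) point cloud to 4D Gaussians: means spread
    uniformly in time, identity rotations, broad temporal scale."""
    n = len(points_xyz)
    t = np.random.uniform(0.0, duration, size=(n, 1))
    means = np.hstack([points_xyz, t])           # (n, 4) means
    q_l = np.tile([1.0, 0.0, 0.0, 0.0], (n, 1))  # identity quaternions
    q_r = np.tile([1.0, 0.0, 0.0, 0.0], (n, 1))
    s_t = np.full((n, 1), s_t0)                  # broad temporal extent
    return means, q_l, q_r, s_t
```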
5. Performance, Quality, and Trade-offs
The 4DGS framework achieves notable performance/quality trade-offs:
| Dataset | Resolution | PSNR (dB) | LPIPS | DSSIM | FPS |
|---|---|---|---|---|---|
| Plenoptic Video | 1352×1014 | 32.01 | 0.055 | 0.014 | 114 |
| D-NeRF (monocular) | 800×800 | 34.09 | – | – | – |
Temporal stability is maintained because, for static regions, the optimized temporal extent $\Sigma_{tt}$ of a Gaussian grows large, so the number of splats active in any given frame remains nearly constant as the total duration increases. This property ensures scalability to longer videos without linear growth in the number of Gaussians rendered per frame.
Compared to contemporaneous neural field methods (e.g., HexPlane), 4DGS shows improved visual fidelity and two orders-of-magnitude faster rendering (Yang et al., 2023), validating the efficacy of the explicit, splatting-centric approach.
6. Model Variants and Extensions
Variants have been derived to address memory and computational efficiency while maintaining quality:
- Pruning/Compression: Aggressive pruning and quantization via Sub-Vector Quantization (SVQ) have delivered 99% storage reduction with minimal PSNR drop (Lee et al., 4 Oct 2025).
- Hybrid Approaches: Hybrid 3D–4DGS representations employ 4D splats only for dynamic regions, with 3DGS for static background, yielding dramatic speed/memory benefits (Oh et al., 19 May 2025).
- Hierarchical/Residual Modeling: Cascaded temporal decomposition (e.g., CTRL-GS) adds hierarchical video-segment-frame residual prediction for greater expressivity in high-motion settings (Hou et al., 23 May 2025).
- Task-Driven Extensions: 4DGS has been adapted for real-time SLAM with dynamic/static splitting and optical flow supervision (Li et al., 20 Mar 2025), 4D language grounding (Fiebelman et al., 14 Oct 2024), style transfer (Liang et al., 14 Oct 2024), and surgical scene reconstruction on resource-limited hardware (Liu et al., 23 Jun 2024).
- Physically Constrained Variants: Wasserstein state-space filtering and temporal smoothness-aware compression further regularize and compress dynamic splat trajectories (Deng et al., 30 Nov 2024, Lee et al., 23 Jul 2025).
Such modularity enables 4DGS to be tailored for diverse real-world scenarios, including large-scale video, free-viewpoint VR/AR, and continuous medical tomography (Yu et al., 27 Mar 2025).
7. Impact, Limitations, and Prospects
Treating spacetime as a unified anisotropic domain and parameterizing each primitive with a full 4D Gaussian supplemented by harmonic coefficients allow 4DGS to achieve end-to-end photorealistic scene modeling and real-time dynamic rendering. The explicit separation of space and time avoids the need for motion priors, complex deformation fields, or regularization on implicit signals.
Limitations include:
- Large storage and memory demands in uncompressed form, motivating ongoing compression efforts.
- The smoothness and support size of Gaussians can limit the representation of high-frequency details and very abrupt occlusions.
- Absence of explicit long-range correlation mechanisms may hinder modeling in ultra-sparse regimes or highly articulated motion.
Ongoing research addresses these with more expressive basis functions, hierarchical or acceleration-encoded kernels, adaptive spatiotemporal splits, and integration with semantic or language-driven guidance.
4D Gaussian Splatting defines an explicit, optimizable, and splatting-based approach to dynamic scene representation that provides a foundation for real-time photorealistic rendering, highly efficient compression, and downstream applications in graphics, vision, medical imaging, and robotics. Its core mathematical structure and splatting-first rendering paradigm have catalyzed broad interdisciplinary interest and active methodological evolution.