4D Gaussian Splatting in Dynamic Scene Rendering

Updated 6 February 2026
  • 4D Gaussian Splatting is a method that models dynamic scenes as a set of explicit, anisotropic 4D Gaussians capturing spatial and temporal correlations.
  • It employs a compact spherindrical harmonics-based appearance model and differentiable splatting to achieve photoreal novel-view synthesis at over 100 FPS.
  • The framework outperforms prior models by delivering high PSNR and low LPIPS scores while enabling applications like real-time video capture, digital twins, and interactive scene editing.

4D Gaussian Splatting (4D-GS) is an explicit, volumetric representation and rendering framework for dynamic (time-varying) 3D scenes. Introduced by Yang et al. (Yang et al., 2023), 4D-GS addresses the inherent limitations of earlier neural implicit and deformable radiance field approaches by directly modeling the full 4D spatio-temporal volume $\mathbb{R}^3 \times \mathbb{R}$ (space and time) with a set of highly expressive, anisotropic, rotated 4D Gaussian primitives. The method achieves photorealistic novel-view synthesis in real time, supporting diverse downstream applications in video-based scene capture, digital twins, interactive editing, and efficient dynamic view rendering.

1. Core 4D Gaussian Primitive Representation

A dynamic scene is encoded as a set of $N$ explicit 4D Gaussian primitives $\{G_i\}$, each parameterized by a 4D mean $\mu_i = (\mu_x, \mu_y, \mu_z, \mu_t) \in \mathbb{R}^4$ and a full covariance $\Sigma_i \in \mathbb{R}^{4 \times 4}$:

$$G_i(x, t) = \exp\!\left(-\frac{1}{2}\left([x, t] - \mu_i\right)^\top \Sigma_i^{-1} \left([x, t] - \mu_i\right)\right)$$

where $x \in \mathbb{R}^3$ and $t \in \mathbb{R}$.

To enable stable optimization and full 4D anisotropy (including spatio-temporal orientation), $\Sigma_i$ is factored as

$$\Sigma_i = R_i S_i^2 R_i^\top$$

where $S_i = \mathrm{diag}(s_x, s_y, s_z, s_t)$ and $R_i \in \mathrm{SO}(4)$ is a 4D rotation, parameterized using two quaternions. Each primitive thus defines an oriented, ellipsoidal support in space–time: $s_t$ controls its temporal extent, $\mu_t$ encodes its temporal position, and the full $\Sigma_i$ enables modeling of non-axis-aligned motion (e.g., scene elements moving along oblique space–time paths).
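As a concrete sketch of this factorization (a minimal numpy version; the helper names are illustrative, not from the paper), an $\mathrm{SO}(4)$ rotation can be built from two unit quaternions via their left- and right-isoclinic $4 \times 4$ matrices, whose product spans all of $\mathrm{SO}(4)$:

```python
import numpy as np

def quat_to_left_isoclinic(q):
    """Left-isoclinic 4x4 matrix of a unit quaternion (a, b, c, d)."""
    a, b, c, d = q
    return np.array([[a, -b, -c, -d],
                     [b,  a, -d,  c],
                     [c,  d,  a, -b],
                     [d, -c,  b,  a]])

def quat_to_right_isoclinic(q):
    """Right-isoclinic 4x4 matrix of a unit quaternion (p, q1, r, s)."""
    p, q1, r, s = q
    return np.array([[p, -q1, -r, -s],
                     [q1,  p,  s, -r],
                     [r,  -s,  p, q1],
                     [s,   r, -q1, p]])

def covariance_4d(q_left, q_right, scales):
    """Build Sigma = R S^2 R^T with R in SO(4) from two unit quaternions."""
    q_left = q_left / np.linalg.norm(q_left)
    q_right = q_right / np.linalg.norm(q_right)
    R = quat_to_left_isoclinic(q_left) @ quat_to_right_isoclinic(q_right)
    S2 = np.diag(np.asarray(scales, dtype=float) ** 2)  # S^2 = diag(s^2)
    return R @ S2 @ R.T
```

With both quaternions set to the identity $(1, 0, 0, 0)$, the rotation is the identity and $\Sigma$ reduces to $\mathrm{diag}(s_x^2, s_y^2, s_z^2, s_t^2)$; any other pair yields a symmetric positive-definite covariance with the same eigenvalues, i.e., an oriented space–time ellipsoid.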

Conditional and marginalization identities from multivariate Gaussians yield:

  • The spatial “slice” at time $t$ is a 3D Gaussian with:

$$\begin{aligned} \mu_{xyz\,|\,t} &= \mu_{1:3} + \Sigma_{1:3,4}\,\Sigma_{4,4}^{-1}\,(t - \mu_t) \\ \Sigma_{xyz\,|\,t} &= \Sigma_{1:3,1:3} - \Sigma_{1:3,4}\,\Sigma_{4,4}^{-1}\,\Sigma_{4,1:3} \end{aligned}$$

  • The temporal marginal weight:

$$p_i(t) = \exp\!\left(-\frac{1}{2}(t - \mu_t)^2\,\Sigma_{4,4}^{-1}\right)$$
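These are the standard Gaussian conditioning and marginalization formulas applied to the $3{+}1$ block structure of $\Sigma_i$. A minimal numpy sketch (function name illustrative) that extracts the 3D slice and the temporal weight for one primitive:

```python
import numpy as np

def slice_at_time(mu, Sigma, t):
    """Condition a 4D Gaussian (mu, Sigma) on time t.

    Returns the conditional 3D mean and covariance plus the temporal
    marginal weight p(t) that fades the primitive in and out.
    """
    mu_xyz, mu_t = mu[:3], mu[3]
    S_xx = Sigma[:3, :3]      # spatial block  Sigma_{1:3,1:3}
    S_xt = Sigma[:3, 3:4]     # cross block    Sigma_{1:3,4}
    S_tt = Sigma[3, 3]        # temporal var   Sigma_{4,4}
    mu_cond = mu_xyz + (S_xt[:, 0] / S_tt) * (t - mu_t)
    Sigma_cond = S_xx - (S_xt @ S_xt.T) / S_tt
    p_t = np.exp(-0.5 * (t - mu_t) ** 2 / S_tt)
    return mu_cond, Sigma_cond, p_t
```

For a diagonal $\Sigma$ (no space–time correlation) the slice is static, while a nonzero cross block $\Sigma_{1:3,4}$ makes the conditional mean translate linearly in $t$, which is exactly how oblique motion is encoded.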

2. Appearance Model: 4D Spherindrical Harmonics

View- and time-dependent color is modeled with a compact, explicit expansion:

$$c_i(d, t) = \sum_{n=0}^{N} \sum_{l=0}^{L} \sum_{m=-l}^{l} a_{i,n,l,m}\, Z_{n,l}^m(t, \theta, \phi)$$

where $d = (\theta, \phi)$ is the viewing direction in spherical coordinates and $Z_{n,l}^m$ is the 4D spherindrical basis

$$Z_{n,l}^m(t, \theta, \phi) = \cos\!\left(\frac{2\pi n}{T} t\right) Y_l^m(\theta, \phi)$$

for a scene duration $T$, with $Y_l^m$ the spherical harmonics.

This separable basis efficiently captures both high-frequency view-dependent reflectance and time-evolving appearance, with the coefficients $a_{i,n,l,m}$ learned per Gaussian.
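The basis is cheap to evaluate because it is a product of a cosine in $t$ and a spherical harmonic in $(\theta, \phi)$. A small sketch (truncated to SH degree $l \le 1$ for brevity; function names are illustrative) evaluating a per-Gaussian RGB color from a coefficient tensor:

```python
import numpy as np

def real_sh_l01(theta, phi):
    """Real spherical harmonics for l = 0 and l = 1 (4 basis values)."""
    return np.array([
        0.5 * np.sqrt(1.0 / np.pi),                              # Y_0^0
        np.sqrt(3.0 / (4 * np.pi)) * np.sin(theta) * np.sin(phi),  # Y_1^-1
        np.sqrt(3.0 / (4 * np.pi)) * np.cos(theta),                # Y_1^0
        np.sqrt(3.0 / (4 * np.pi)) * np.sin(theta) * np.cos(phi),  # Y_1^1
    ])

def spherindrical_color(coeffs, t, theta, phi, T):
    """Evaluate c(d, t) = sum_n cos(2*pi*n*t/T) * sum_{l,m} a * Y_l^m.

    coeffs: shape (N+1, 4, 3) -> (temporal order, SH index, RGB channel).
    """
    sh = real_sh_l01(theta, phi)                    # (4,)
    n = np.arange(coeffs.shape[0])
    temporal = np.cos(2.0 * np.pi * n * t / T)      # (N+1,)
    return np.einsum('n,s,nsc->c', temporal, sh, coeffs)
```

Setting only the $n = 0$, $l = 0$ coefficient gives a constant color; higher $n$ terms add periodic temporal variation and higher $l$ terms add view dependence.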

3. Rendering Pipeline and Differentiable Splatting

The rendered color $\mathcal{I}(u, v, t)$ at pixel $(u, v)$ and time $t$ is computed by:

  • Projecting each conditional 3D Gaussian (at $t$) into image space using the camera parameters, linearizing the projection via its Jacobian $J$.
  • Computing the 2D projected Gaussian parameters:

$$\begin{aligned} \mu_i^{2d} &= \mathrm{Proj}(\mu_{xyz\,|\,t}; E, K)_{1:2} \\ \Sigma_i^{2d} &= \left(J E\, \Sigma_{xyz\,|\,t}\, E^\top J^\top\right)_{1:2,1:2} \end{aligned}$$

where $E$ and $K$ denote the camera extrinsics and intrinsics.

  • Compositing splats with per-pixel weights:

$$\mathcal{I}(u, v, t) = \sum_{i=1}^{N} p_i(t)\, p_i(u, v \mid t)\, \alpha_i\, c_i(d, t) \prod_{j < i} \left[1 - p_j(t)\, p_j(u, v \mid t)\, \alpha_j\right]$$

where $\alpha_i$ is a learned opacity.

GPU tile-based splat rasterization and depth-sorted blending (alpha compositing) yield efficient rendering above 100 FPS at high resolutions. Gaussians with negligible $p_i(t)$ are pruned per frame.
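The compositing sum above can be sketched as a front-to-back loop at a single pixel (a simplified scalar version assuming splats are already depth-sorted and the weights $w_i = p_i(t)\,p_i(u, v \mid t)$ are precomputed; production renderers do this per tile in CUDA):

```python
import numpy as np

def composite_pixel(weights, alphas, colors):
    """Front-to-back alpha compositing of depth-sorted splats at one pixel.

    weights: per-splat w_i = p_i(t) * p_i(u, v | t), near to far
    alphas:  learned opacities alpha_i
    colors:  per-splat RGB colors c_i(d, t), shape (N, 3)
    """
    out = np.zeros(3)
    transmittance = 1.0  # running product of (1 - w_j * alpha_j)
    for w, a, c in zip(weights, alphas, np.asarray(colors)):
        contrib = w * a
        out += transmittance * contrib * c
        transmittance *= (1.0 - contrib)
        if transmittance < 1e-4:  # early termination, as in tile rasterizers
            break
    return out
```

The running transmittance makes the loop equivalent to the product term in the compositing equation, and the early-out is why per-frame cost scales with visible, non-occluded splats rather than with $N$.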

4. Optimization and Training Protocol

Supervision is applied via a photometric $\ell_2$ loss on sampled (pixel, time) tuples:

$$\mathcal{L}_{\mathrm{photo}} = \sum_k \left\| \mathcal{I}(u_k, v_k, t_k) - \mathcal{I}^{\mathrm{gt}}(u_k, v_k, t_k) \right\|_2^2$$

Adaptive densification and pruning are performed using spatial/temporal gradient magnitudes:

  • Gaussians with consistently low spatial gradients (contributing little to reconstruction) are pruned.
  • High-gradient Gaussians are split in full 4D space–time to capture fine detail.
  • The mean temporal gradient of $\mu_t$ is monitored to ensure even coverage over time.

Training batches sample rays uniformly in $(u, v, t)$ rather than from sequential frames, which enforces temporal consistency and suppresses flicker.
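This uniform spatio-temporal sampling, together with the $\ell_2$ loss above, can be sketched as follows (a minimal numpy version; function names and the dense `frames` tensor are illustrative assumptions, since real training streams multi-view video):

```python
import numpy as np

def sample_training_batch(frames, batch_size, rng):
    """Draw a supervision batch uniformly over (u, v, t), not frame by frame.

    frames: ground-truth video, shape (T, H, W, 3)
    Returns pixel coordinates, time indices, and target RGB colors.
    """
    T, H, W, _ = frames.shape
    t = rng.integers(0, T, batch_size)
    v = rng.integers(0, H, batch_size)
    u = rng.integers(0, W, batch_size)
    return u, v, t, frames[t, v, u]

def photometric_loss(pred, target):
    """Sum of squared RGB errors over the K sampled (pixel, time) tuples."""
    return np.sum((pred - target) ** 2)
```

Because every batch mixes many time instants, gradients never favor one frame over its neighbors, which is the mechanism behind the flicker suppression noted above.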

Initialization uses colored point clouds (e.g., from COLMAP) at $t = 0$, with $\mu_t$ initialized uniformly at random in $[0, T]$ and temporal scale $s_t = T/2$. End-to-end training runs for $\sim$30k iterations (batch size 4), with the densification rate halved at the halfway point.
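The initialization step maps directly to a few lines of setup code (a sketch under the stated protocol; the function name and array layout are illustrative):

```python
import numpy as np

def init_gaussians(points, colors, T, rng):
    """Initialize 4D Gaussians from a colored SfM point cloud (e.g. COLMAP).

    Spatial means come from the points; temporal means are spread
    uniformly over [0, T] and every temporal scale starts at T / 2.
    """
    n = points.shape[0]
    mu_t = rng.uniform(0.0, T, size=(n, 1))
    mu = np.concatenate([points, mu_t], axis=1)   # (n, 4) spatio-temporal means
    s_t = np.full((n, 1), T / 2.0)                # broad initial temporal extent
    return mu, s_t, colors
```

Starting with $s_t = T/2$ makes every primitive initially visible across most of the sequence, so the temporal gradients during early training can decide where each Gaussian should localize in time.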

5. Empirical Performance and Benchmarks

On the Plenoptic Video (multi-view, real) benchmark, 4D-GS achieves:

  • PSNR = 32.01, DSSIM = 0.014, LPIPS = 0.055
  • $\sim$114 FPS on a single NVIDIA GPU

This surpasses prior neural dynamic scene models (DyNeRF, HexPlane, K-Planes, StreamRF, etc.) in both fidelity (PSNR, LPIPS) and real-time speed (often more than 10$\times$ faster than NeRF-based methods).

On monocular, under-constrained synthetic (D-NeRF) scenes, 4D-GS attains PSNR = 34.09 at real-time frame rates.

6. Methodological Distinctions and Theoretical Properties

  • True 4D Native Representation: By representing spacetime as an explicit collection of 4D Gaussians, 4D-GS avoids overparametrizing time via separate deformation fields or per-frame duplication. All space–time correlations (motion, temporal occlusion, appearance drift) are encoded natively via the full covariance $\Sigma_i$ and the spherindrical expansion.
  • Compact View-Time Appearance: Spherindrical harmonics provide a parsimonious but expressive basis for handling high-frequency view and time effects, enabling both photorealism and efficient memory use.
  • Scalability and Flexibility: The rasterization and compositing algorithm is GPU-friendly and scales with the number of visible Gaussians per frame, not the number of input images or total scene length.
  • Optimization Simplicity: No additional regularizers or motion priors are required. All geometry, appearance, and motion are learned end-to-end, with dynamic splitting and pruning providing automatic model adaptation.

7. Extensions and Applications

The 4D-GS framework has catalyzed further research on dynamic scene representation and has become a foundational approach for real-time, explicit, photorealistic dynamic scene modeling, providing both practical utility and a mathematically tractable paradigm for space–time visual modeling (Yang et al., 2023, Yang et al., 2024).
