
4DGS: Pose-free Feed-forward 4D Gaussian Splatting

Updated 9 April 2026
  • The paper introduces a pose-free, feed-forward paradigm that models dynamic scenes as continuous 4D volumes using anisotropic Gaussian primitives.
  • It employs 4D spherindrical harmonics for view- and time-dependent appearance, enabling the accurate capture of non-Lambertian effects and dynamic lighting.
  • Real-time performance above 100 FPS on consumer GPUs is achieved without ray marching, ensuring temporally coherent and high-definition novel view synthesis.

Pose-free Feed-forward 4D Gaussian Splatting (4DGS) is a data-driven paradigm for real-time photorealistic dynamic scene representation and rendering. 4DGS treats a dynamic scene as a continuous 4D spatio-temporal volume and models its radiance function as a mapping $\mathbb{R}^4 \times \mathbb{S}^2 \rightarrow \mathbb{R}^3$, that is, $(x, y, z, t, \theta, \phi) \mapsto \text{color}$, which is approximated by a finite set of anisotropic 4D Gaussian primitives carrying view- and time-dependent appearance coefficients. The system enables efficient real-time synthesis of temporally coherent, photorealistic novel views from either monocular or multi-view video, using a fully differentiable and end-to-end trainable feed-forward inference pipeline without multilayer perceptrons (MLPs), explicit pose networks, or ray marching (Yang et al., 2023).

1. Spatio-temporal 4D Gaussian Primitives

Dynamic scenes are represented as a collection of $N$ anisotropic 4D Gaussian "blobs" in $\mathbb{R}^4$, each parameterized by a mean $\mu_i = (\mu_x, \mu_y, \mu_z, \mu_t) \in \mathbb{R}^4$ and a covariance $\Sigma_i \in \mathbb{R}^{4\times4}$. The un-normalized density function for a point $x = (x, y, z, t)$ is:

$$p_i(x) = \exp\left(-\frac{1}{2}(x - \mu_i)^\top \Sigma_i^{-1} (x - \mu_i)\right),$$

with covariance structured for efficient optimization as $\Sigma_i = R_i S_i^2 R_i^\top$, where $S_i = \operatorname{diag}(s_x, s_y, s_z, s_t)$ is a scale matrix and $R_i$ is a 4D rotation, realized via quaternion parameterization. Each 4D Gaussian can be analytically marginalized and conditioned, factoring $p_i(x, y, z, t)$ into a temporal marginal $p_i(t)$ and a spatial conditional $p_i(x, y, z \mid t)$ with closed-form expressions for their means and covariances. This decomposition facilitates efficient "slicing" of the 4D volume at arbitrary times for rendering. The parametrization enables 4DGS to flexibly capture complex and deformable scene dynamics without explicit deformation modeling.
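The covariance construction and the time-slicing step can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code. It assumes the 4D rotation is built from a pair of unit quaternions (a left-isoclinic and a right-isoclinic factor, one standard way to parameterize 4D rotations; the text above only says "quaternion parameterization"), and that the temporal weight is the un-normalized 1D Gaussian, matching the un-normalized density. The function names are hypothetical. The slicing uses the standard Gaussian conditioning identities $\mu_{xyz \mid t} = \mu_{xyz} + \Sigma_{xyz,t} \Sigma_{tt}^{-1} (t - \mu_t)$ and $\Sigma_{xyz \mid t} = \Sigma_{xyz,xyz} - \Sigma_{xyz,t} \Sigma_{tt}^{-1} \Sigma_{t,xyz}$.

```python
import numpy as np

def left_isoclinic(q):
    """Matrix of left-multiplication by the unit quaternion q = (a, b, c, d)."""
    a, b, c, d = q
    return np.array([[a, -b, -c, -d],
                     [b,  a, -d,  c],
                     [c,  d,  a, -b],
                     [d, -c,  b,  a]])

def right_isoclinic(p):
    """Matrix of right-multiplication by the unit quaternion p = (a, b, c, d)."""
    a, b, c, d = p
    return np.array([[a, -b, -c, -d],
                     [b,  a,  d, -c],
                     [c, -d,  a,  b],
                     [d,  c, -b,  a]])

def covariance_4d(scales, q_left, q_right):
    """Sigma = R S^2 R^T, with R assembled from two normalized quaternions (assumed form)."""
    q_left = np.asarray(q_left, float) / np.linalg.norm(q_left)
    q_right = np.asarray(q_right, float) / np.linalg.norm(q_right)
    R = left_isoclinic(q_left) @ right_isoclinic(q_right)
    S = np.diag(np.asarray(scales, float))
    return R @ S @ S @ R.T

def slice_at_time(mu, sigma, t):
    """Condition a 4D Gaussian on time t.

    Returns the conditional 3D mean, the conditional 3D covariance, and the
    un-normalized temporal weight exp(-0.5 * (t - mu_t)^2 / sigma_tt).
    """
    mu = np.asarray(mu, float)
    mu_xyz, mu_t = mu[:3], mu[3]
    s_xx = sigma[:3, :3]          # spatial block
    s_xt = sigma[:3, 3]           # space-time cross-covariance
    s_tt = sigma[3, 3]            # temporal variance
    dt = t - mu_t
    mean_3d = mu_xyz + s_xt * (dt / s_tt)
    cov_3d = s_xx - np.outer(s_xt, s_xt) / s_tt
    weight = np.exp(-0.5 * dt * dt / s_tt)
    return mean_3d, cov_3d, weight
```

Because the cross-covariance between space and time is generally non-zero, the conditional mean moves linearly with $t$; this is how a single primitive can represent locally linear motion without a separate deformation field.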

2. View- and Time-dependent Appearance Modeling

Each Gaussian primitive is enriched with a time- and view-dependent appearance model based on 4D spherindrical harmonics. Specifically, for a viewing direction $(\theta, \phi)$ and temporal offset $\Delta t$, the basis functions are:

$$Z_{nl}^{m}(\Delta t, \theta, \phi) = \cos\left(\frac{2\pi n}{T}\,\Delta t\right) Y_l^m(\theta, \phi),$$

where $Y_l^m$ are standard spherical harmonics and $T$ is a temporal normalization constant. Each Gaussian learns a set of coefficients $c_{i,nl}^{m}$ per basis function, yielding the color it contributes to a pixel as:

$$c_i(\Delta t, \theta, \phi) = \sum_{n} \sum_{l} \sum_{m=-l}^{l} c_{i,nl}^{m}\, Z_{nl}^{m}(\Delta t, \theta, \phi).$$

The finite-banded construction permits compact modeling of non-Lambertian reflectance, dynamic lighting, and complex appearance variation over time.
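To make the appearance model concrete, the following sketch evaluates a small spherindrical basis and the resulting color for one Gaussian. It assumes the basis is a cosine Fourier term in the temporal offset multiplied by a real spherical harmonic, truncates the spherical part at degree 1 for brevity, and uses one common sign convention for the degree-1 harmonics. The function names, the coefficient layout `coeffs[n, k, channel]`, and the truncation orders are illustrative assumptions.

```python
import numpy as np

SH_C0 = 0.28209479177387814   # Y_0^0 = 1 / (2 sqrt(pi))
SH_C1 = 0.4886025119029199    # sqrt(3 / (4 pi)), magnitude of the degree-1 terms

def real_sh_l1(direction):
    """Real spherical harmonics up to degree 1 for a direction (x, y, z)."""
    x, y, z = direction / np.linalg.norm(direction)
    return np.array([SH_C0, SH_C1 * y, SH_C1 * z, SH_C1 * x])

def spherindrical_basis(direction, dt, period, n_max):
    """Z[n, k] = cos(2 pi n dt / period) * Y_k(direction) for n = 0..n_max."""
    n = np.arange(n_max + 1)
    temporal = np.cos(2.0 * np.pi * n * dt / period)
    return np.outer(temporal, real_sh_l1(np.asarray(direction, float)))

def gaussian_color(coeffs, direction, dt, period):
    """RGB color as the coefficient-weighted sum over all basis functions."""
    coeffs = np.asarray(coeffs, float)            # shape (n_max + 1, 4, 3)
    Z = spherindrical_basis(direction, dt, period, coeffs.shape[0] - 1)
    return np.einsum("nk,nkc->c", Z, coeffs)
```

With only the `n = 0` coefficients non-zero, the model reduces to ordinary view-dependent spherical-harmonic color; the higher temporal orders add variation over time on top of it.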

3. Feed-forward Inference and Rendering Pipeline

Inference is feed-forward and pose-free in the sense that no pose network is run: it requires only a timestamp $t$ and the camera intrinsics/extrinsics of the target view. The pipeline proceeds via:

  • Slicing: Each 4D Gaussian is analytically sliced at $t$ to produce a conditional 3D Gaussian $p_i(x, y, z \mid t)$ and a weight $p_i(t)$, using closed-form marginal and conditional computations.
  • Projection: The 3D conditional Gaussian is mapped to a 2D Gaussian in image coordinates via the linearized camera Jacobian $J$ and extrinsic matrix $E$, yielding projected mean $\mu_i^{2D}$ and covariance $\Sigma_i^{2D}$.
  • Rasterization: All 2D projected "splats" are blended in a single hierarchical GPU rasterizer pass. For each pixel $(u, v)$ at time $t$, the radiance is accumulated as:

$$I(u, v, t) = \sum_{i=1}^{N} p_i(t)\, p_i(u, v \mid t)\, \alpha_i\, c_i \prod_{j=1}^{i-1} \left(1 - p_j(t)\, p_j(u, v \mid t)\, \alpha_j\right),$$

where $\alpha_i$ denotes each Gaussian's opacity, $c_i$ its view- and time-dependent color, and the Gaussians are indexed in front-to-back depth order.

This architecture fully eliminates the need for slow ray marching or iterative pose estimation, yielding highly parallelizable and scalable rendering.
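The projection and blending steps above can be sketched on the CPU for a single pixel. This is a minimal illustration under stated assumptions, not the GPU rasterizer itself: it assumes a pinhole camera with a world-to-camera rotation `R_wc` and translation `t_wc`, a first-order (Jacobian) approximation of the projection, un-normalized 2D splat densities, and front-to-back alpha compositing. All function names and the dictionary layout of a splat are hypothetical.

```python
import numpy as np

def project_gaussian(mean_3d, cov_3d, R_wc, t_wc, fx, fy, cx, cy):
    """Project a world-space 3D Gaussian to an image-space 2D Gaussian."""
    x, y, z = R_wc @ mean_3d + t_wc               # mean in camera coordinates
    mean_2d = np.array([fx * x / z + cx, fy * y / z + cy])
    # Jacobian of the pinhole projection evaluated at the camera-space mean
    J = np.array([[fx / z, 0.0, -fx * x / z**2],
                  [0.0, fy / z, -fy * y / z**2]])
    cov_2d = J @ R_wc @ cov_3d @ R_wc.T @ J.T
    return mean_2d, cov_2d, z

def splat_density(pixel, mean_2d, cov_2d):
    """Un-normalized 2D Gaussian density at a pixel location."""
    d = np.asarray(pixel, float) - mean_2d
    return np.exp(-0.5 * d @ np.linalg.solve(cov_2d, d))

def composite_pixel(pixel, splats):
    """Front-to-back alpha compositing of splats at one pixel.

    Each splat is a dict with keys: mean_2d, cov_2d, depth,
    weight_t (temporal weight), opacity, and color (RGB array).
    """
    color = np.zeros(3)
    transmittance = 1.0
    for s in sorted(splats, key=lambda s: s["depth"]):
        a = s["weight_t"] * splat_density(pixel, s["mean_2d"], s["cov_2d"]) * s["opacity"]
        color += transmittance * a * np.asarray(s["color"], float)
        transmittance *= 1.0 - a
    return color
```

A real-time implementation would sort and blend splats per screen tile in parallel on the GPU; the per-pixel loop here only mirrors the arithmetic of the accumulation.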

4. Training Procedure and Optimization Strategies

Training is performed end-to-end using a differentiable photometric loss (L2 in image space), occasionally combined with LPIPS. No explicit geometry, deformation, or motion loss is imposed. All Gaussian parameters, including position, scale, rotation quaternions, and spherindrical harmonic appearance coefficients, are jointly optimized from scratch.

To prevent under- or over-reconstruction, densification (Gaussian "splitting") is interleaved with optimization during training; new primitives are inserted where the gradient is high, in both spatial and temporal directions, to ensure coverage. Randomized view and timestamp perturbations in sampling batches enforce temporal coherence across dynamic scene events. The entire process supports variable-length video and arbitrary dynamic phenomena without specialized supervision.
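A densification step of this kind might look roughly as follows. The sketch assumes that accumulated gradient magnitudes for the spatial and temporal means are already available, that a Gaussian is split whenever either exceeds a threshold, and that each split produces two children sampled jointly in space and time with shrunken scales. The thresholds, the shrink factor of 1.6, the two-children rule, and the omission of rotations are all illustrative assumptions rather than details taken from the source.

```python
import numpy as np

def densify(means, scales, grad_xyz, grad_t, thresh_xyz, thresh_t, rng, shrink=1.6):
    """Split Gaussians whose accumulated gradients are large.

    means, scales: (N, 4) arrays over (x, y, z, t); rotations are ignored here.
    grad_xyz, grad_t: (N,) average gradient magnitudes for spatial and temporal means.
    Returns the new (M, 4) means and scales, unsplit Gaussians first.
    """
    split = (grad_xyz > thresh_xyz) | (grad_t > thresh_t)
    parent_mu, parent_s = means[split], scales[split]
    # Two children per parent, drawn from the parent's space-time extent
    child_mu = np.concatenate(
        [parent_mu + rng.normal(size=parent_mu.shape) * parent_s for _ in range(2)]
    )
    child_s = np.tile(parent_s / shrink, (2, 1))
    new_means = np.concatenate([means[~split], child_mu])
    new_scales = np.concatenate([scales[~split], child_s])
    return new_means, new_scales
```

Including the temporal gradient in the split test lets the optimizer add primitives where appearance changes quickly over time, not only where spatial detail is missing.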

5. Real-Time Performance and Comparison with Prior Work

4DGS achieves greater than 100 FPS on consumer GPUs for high-definition (HD) frames by leveraging its GPU-based hierarchical splatting, eschewing computationally intensive ray marching. The approach faithfully captures volumetric effects, dynamic lighting, and non-Lambertian view dependence.

Empirical results show:

Benchmark                     PSNR    DSSIM   LPIPS   FPS
Multi-view Plenoptic Video    32.01   0.014   0.055   114
Monocular Synthetic D-NeRF    34.09   -       0.02    -

On the real multi-view Plenoptic Video benchmark, 4DGS surpasses previous state-of-the-art (SOTA) in both visual quality and speed (prior best PSNR ≲ 31.7 at < 20 FPS). For the monocular synthetic D-NeRF dataset, SOTA metrics are again attained, e.g., SSIM 0.98. Qualitatively, the approach reproduces rapid and fine-scale dynamic details without flicker or temporal artifacts (Yang et al., 2023).
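For readers comparing the reported numbers, the two simplest metrics can be computed as below. The PSNR definition is standard; the relation DSSIM = (1 - SSIM) / 2 is a common convention and is an assumption here, since the text does not state which definition the benchmark uses. Function names are hypothetical.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    diff = np.asarray(pred, float) - np.asarray(target, float)
    mse = np.mean(diff ** 2)          # identical images give mse = 0 and PSNR = inf
    return 10.0 * np.log10(max_val ** 2 / mse)

def dssim_from_ssim(ssim):
    """Structural dissimilarity under the (1 - SSIM) / 2 convention (assumed)."""
    return (1.0 - ssim) / 2.0
```

Under this convention an SSIM of 0.98 corresponds to a DSSIM of 0.01.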

6. Applications, Implications, and Limitations

4DGS is suitable for photorealistic dynamic scene capture and real-time novel-view synthesis in unconstrained settings. Its design affords efficiency, flexibility for arbitrary input video length, and scalability to complex scenes, without requiring explicit deformation tracking or pose estimation networks.

A plausible implication is that the unified 4D primitive representation enables modeling of highly non-rigid motion and fine temporal effects (e.g., flames, smoke, finger movements) previously difficult for both neural implicit and explicit deformation-based models. However, the reported results are limited to photometric and perceptual metrics; semantic or physical plausibility is not directly evaluated. The model is trained and validated on both real and synthetic benchmarks, but its generalization and limitations outside these domains remain to be evaluated.

7. Relationship to Broader Research Context

4DGS addresses persistent limitations in neural implicit modeling for dynamic scenes: ineffective spatial-temporal structure discovery and the impracticality of explicitly modeling scene element deformations for complex motion. By leveraging 4D volumetric primitives and feed-forward splat-based rendering, it obviates many drawbacks of prior MLP- and deformation-centric methods, providing a conceptually simple, end-to-end differentiable pipeline for real-time dynamic scene reconstruction and synthesis (Yang et al., 2023).
