4DGS: Pose-free Feed-forward 4D Gaussian Splatting

Updated 9 April 2026

The paper introduces a pose-free, feed-forward paradigm that models dynamic scenes as continuous 4D volumes using anisotropic Gaussian primitives.
It employs 4D spherindrical harmonics for view- and time-dependent appearance, enabling the accurate capture of non-Lambertian effects and dynamic lighting.
Real-time performance above 100 FPS on consumer GPUs is achieved without ray marching, ensuring temporally coherent and high-definition novel view synthesis.

Pose-free Feed-forward 4D Gaussian Splatting (4DGS) is a data-driven paradigm for real-time photorealistic dynamic scene representation and rendering. 4DGS treats a dynamic scene as a continuous 4D spatio-temporal volume and models its radiance function as a mapping $\mathbb{R}^4 \times \mathbb{S}^2 \rightarrow \mathbb{R}^3$, that is, $(x, y, z, t, \theta, \phi) \mapsto \text{color}$, approximated by a finite set of anisotropic 4D Gaussian primitives carrying view- and time-dependent appearance coefficients. The system synthesizes temporally coherent, photorealistic novel views in real time from either monocular or multi-view video, using a fully differentiable, end-to-end trainable feed-forward inference pipeline without Multilayer Perceptrons (MLPs), explicit pose networks, or ray marching (Yang et al., 2023).

1. Spatio-temporal 4D Gaussian Primitives

Dynamic scenes are represented as a collection of $N$ anisotropic 4D Gaussian "blobs" in $\mathbb{R}^4$, each parameterized by a mean $\mu_i = (\mu_x, \mu_y, \mu_z, \mu_t) \in \mathbb{R}^4$ and a covariance $\Sigma_i \in \mathbb{R}^{4\times4}$. The un-normalized density function for a point $\mathbf{x}=(x, y, z, t)$ is:

$p_i(\mathbf{x}) = \exp\left(-\frac{1}{2}(\mathbf{x} - \mu_i)^\top \Sigma_i^{-1} (\mathbf{x} - \mu_i)\right),$

with covariance structured for efficient optimization as $\Sigma_i = R_i S_i^2 R_i^\top$, where $S_i=\operatorname{diag}(s_x, s_y, s_z, s_t)$ is a scale matrix and $R_i$ is a 4D rotation, realized via quaternion parameterization. Each 4D Gaussian can be analytically marginalized and conditioned, factoring $p_i(x, y, z, t)$ into a temporal marginal $p_i(t)$ and a spatial conditional $p_i(x, y, z \mid t)$ with closed-form expressions for their means and covariances. This decomposition facilitates efficient "slicing" of the 4D volume at arbitrary times for rendering. The parametrization enables 4DGS to flexibly capture complex and deformable scene dynamics without explicit deformation modeling.
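The marginal and conditional factorization can be illustrated with a minimal NumPy sketch. It assumes the standard Gaussian conditioning identities and the un-normalized density used above; the function names and the example rotation (a single plane rotation rather than a quaternion-derived one) are hypothetical and not taken from the paper.

```python
import numpy as np

def build_covariance(R, scales):
    """Sigma = R S^2 R^T for a 4x4 rotation R and scales (s_x, s_y, s_z, s_t)."""
    S2 = np.diag(np.asarray(scales, dtype=float) ** 2)
    return R @ S2 @ R.T

def slice_at_time(mu, Sigma, t):
    """Condition a 4D Gaussian on time t.

    Returns the conditional 3D mean, the conditional 3D covariance, and the
    un-normalized temporal weight exp(-0.5 * (t - mu_t)^2 / Sigma_tt).
    """
    mu = np.asarray(mu, dtype=float)
    mu_xyz, mu_t = mu[:3], mu[3]
    S_xx = Sigma[:3, :3]   # spatial block
    S_xt = Sigma[:3, 3]    # space-time cross-covariance
    S_tt = Sigma[3, 3]     # temporal variance
    dt = t - mu_t
    mean_cond = mu_xyz + S_xt * (dt / S_tt)
    cov_cond = S_xx - np.outer(S_xt, S_xt) / S_tt
    weight = np.exp(-0.5 * dt * dt / S_tt)
    return mean_cond, cov_cond, weight

# Hypothetical example: a rotation in the x-t plane couples position and time,
# so the sliced 3D mean moves along x as t changes.
angle = 0.3
R = np.eye(4)
R[0, 0], R[0, 3] = np.cos(angle), -np.sin(angle)
R[3, 0], R[3, 3] = np.sin(angle), np.cos(angle)
Sigma = build_covariance(R, scales=(0.5, 0.2, 0.2, 1.0))
mean_cond, cov_cond, weight = slice_at_time([0.0, 0.0, 0.0, 0.0], Sigma, t=0.5)
```

With these definitions, the un-normalized 4D density at $(x, y, z, t)$ equals the temporal weight multiplied by the un-normalized conditional 3D density at $(x, y, z)$, which is what makes per-timestamp slicing exact rather than approximate.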

2. View- and Time-dependent Appearance Modeling

Each Gaussian primitive is enriched with a time- and view-dependent appearance model based on 4D spherindrical harmonics. Specifically, for a viewing direction $(\theta, \phi)$ and temporal offset $\Delta t$, the basis functions are:

$Z_{nl}^{m}(\Delta t, \theta, \phi) = \cos\left(\frac{2\pi n}{T}\,\Delta t\right) Y_l^m(\theta, \phi),$

where $Y_l^m$ are standard spherical harmonics and $T$ is a temporal normalization constant. Each Gaussian learns a set of coefficients $c_{i,nl}^{m} \in \mathbb{R}^3$ per basis function, yielding its color contribution to a pixel as:

$c_i(\Delta t, \theta, \phi) = \sum_{n=0}^{n_{\max}} \sum_{l=0}^{l_{\max}} \sum_{m=-l}^{l} c_{i,nl}^{m}\, Z_{nl}^{m}(\Delta t, \theta, \phi).$

The finite-banded construction permits compact modeling of non-Lambertian reflectance, dynamic lighting, and complex appearance variation over time.
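As an illustration of the appearance model, the following sketch evaluates a truncated basis formed as a cosine in time multiplied by real spherical harmonics up to degree 1, then sums coefficient-weighted basis values into an RGB color. The truncation, the sign convention of the degree-1 harmonics (no Condon-Shortley phase), the coefficient layout `coeffs[n, k, channel]`, and all function names are assumptions made for the example.

```python
import numpy as np

SH_C0 = 0.28209479177387814   # Y_0^0 = 1 / (2 * sqrt(pi))
SH_C1 = 0.4886025119029199    # sqrt(3 / (4 * pi))

def real_sh_deg1(direction):
    """Real spherical harmonics up to l = 1 for a direction vector (4 values)."""
    d = np.asarray(direction, dtype=float)
    x, y, z = d / np.linalg.norm(d)
    return np.array([SH_C0, SH_C1 * y, SH_C1 * z, SH_C1 * x])

def spherindrical_basis(direction, dt, period, n_max):
    """Z[n, k] = cos(2 * pi * n * dt / period) * Y_k(direction), n = 0..n_max."""
    n = np.arange(n_max + 1)
    temporal = np.cos(2.0 * np.pi * n * dt / period)     # shape (n_max + 1,)
    return np.outer(temporal, real_sh_deg1(direction))   # shape (n_max + 1, 4)

def gaussian_color(coeffs, direction, dt, period):
    """RGB color: sum over (n, k) of coeffs[n, k, :] * Z[n, k]."""
    Z = spherindrical_basis(direction, dt, period, n_max=coeffs.shape[0] - 1)
    return np.einsum("nk,nkc->c", Z, coeffs)

# Hypothetical example: only the constant term (n = 0, l = 0) is non-zero,
# so the color is the same for every direction and time offset.
coeffs = np.zeros((3, 4, 3))
coeffs[0, 0] = np.array([1.0, 0.5, 0.25]) / SH_C0
color = gaussian_color(coeffs, direction=[0.0, 0.0, 1.0], dt=0.37, period=2.0)
```

Higher temporal orders ($n \geq 1$) let the same Gaussian change color over its lifetime, while higher angular degrees add view dependence.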

3. Feed-forward Inference and Rendering Pipeline

Inference is pose-free and feed-forward: no pose is estimated, and rendering requires only a timestamp $t$ and the target camera's intrinsics and extrinsics. The pipeline proceeds via:

Slicing: Each 4D Gaussian is analytically sliced at $t$ to produce a conditional 3D Gaussian $p_i(x, y, z \mid t)$ and a weight $p_i(t)$, using closed-form marginal and conditional computations.
Projection: The 3D conditional Gaussian is mapped to a 2D Gaussian in image coordinates via the linearized camera Jacobian $J$ and extrinsic matrix $E$, yielding projected mean $\mu_i^{2D}$ and covariance $\Sigma_i^{2D}$.
Rasterization: All 2D projected "splats" are blended in a single hierarchical GPU rasterizer pass. For each pixel $(u, v)$ at time $t$, the radiance is accumulated as:

$I(u, v, t) = \sum_{i=1}^{N} p_i(t)\, p_i(u, v \mid t)\, \alpha_i\, c_i \prod_{j=1}^{i-1} \left(1 - p_j(t)\, p_j(u, v \mid t)\, \alpha_j\right),$

where $\alpha_i$ denotes each Gaussian's opacity, $c_i$ its view- and time-dependent color, and $p_i(u, v \mid t)$ the projected 2D Gaussian evaluated at the pixel, with Gaussians indexed in front-to-back depth order.
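The projection and blending steps can be sketched on the CPU for a single pixel. This is a reference illustration only, not the GPU rasterizer described above: it assumes a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, a 4x4 world-to-camera extrinsic matrix `E`, the usual first-order (Jacobian) linearization of perspective projection, and front-to-back alpha blending of depth-sorted splats. The dictionary layout of a splat is invented for the example.

```python
import numpy as np

def project_gaussian(mean_world, cov_world, E, fx, fy, cx, cy):
    """Project a 3D Gaussian to a 2D image-space Gaussian via linearization.

    Returns the 2D mean, the 2x2 covariance, and the camera-space depth.
    """
    W = E[:3, :3]                                   # rotation part of the extrinsics
    x, y, z = W @ np.asarray(mean_world, dtype=float) + E[:3, 3]
    mean_2d = np.array([fx * x / z + cx, fy * y / z + cy])
    # Jacobian of the perspective projection at the camera-space mean.
    J = np.array([[fx / z, 0.0, -fx * x / z**2],
                  [0.0, fy / z, -fy * y / z**2]])
    cov_2d = J @ W @ cov_world @ W.T @ J.T
    return mean_2d, cov_2d, z

def composite_pixel(pixel, splats):
    """Front-to-back alpha blending of splats at one pixel.

    Each splat is a dict with keys: mean_2d, cov_2d, depth, p_t (temporal
    weight), alpha (opacity), and color (RGB array).
    """
    pixel = np.asarray(pixel, dtype=float)
    color = np.zeros(3)
    transmittance = 1.0
    for s in sorted(splats, key=lambda s: s["depth"]):
        d = pixel - s["mean_2d"]
        p_uv = np.exp(-0.5 * d @ np.linalg.solve(s["cov_2d"], d))
        a = s["p_t"] * p_uv * s["alpha"]            # effective per-pixel opacity
        color += transmittance * a * s["color"]
        transmittance *= 1.0 - a
    return color
```

A production rasterizer would bin splats into screen tiles and process pixels in parallel; the loop here only mirrors the per-pixel arithmetic of the accumulation formula.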

This architecture fully eliminates the need for slow ray marching or iterative pose estimation, yielding highly parallelizable and scalable rendering.

4. Training Procedure and Optimization Strategies

Training is performed end-to-end using only a differentiable photometric loss (L2 in image space), occasionally supplemented by LPIPS. No explicit geometry, deformation, or motion loss is imposed. All Gaussian parameters, including position, scale, rotation quaternions, and spherindrical harmonic appearance coefficients, are jointly optimized from scratch.
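To make the optimized quantities concrete, the sketch below groups the per-Gaussian parameters in a hypothetical container and defines an image-space L2 loss. The field names, array shapes, and the use of log-scale and logit parameterizations (borrowed from common 3D Gaussian Splatting practice) are assumptions, not details confirmed by the source.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian4DParams:
    """Hypothetical container for the jointly optimized parameters of N Gaussians."""
    mean: np.ndarray           # (N, 4) position in (x, y, z, t)
    log_scale: np.ndarray      # (N, 4) log of (s_x, s_y, s_z, s_t); assumed parameterization
    rotation: np.ndarray       # (N, K) quaternion parameters of the 4D rotation; K left open
    opacity_logit: np.ndarray  # (N,) pre-sigmoid opacity; assumed parameterization
    sh_coeffs: np.ndarray      # (N, n_basis, 3) appearance coefficients per basis function

    def scale(self):
        """Positive scales recovered from the unconstrained log-scales."""
        return np.exp(self.log_scale)

    def opacity(self):
        """Opacity in (0, 1) recovered from the unconstrained logits."""
        return 1.0 / (1.0 + np.exp(-self.opacity_logit))

def photometric_l2(rendered, target):
    """Mean squared error between a rendered image and its ground-truth frame."""
    rendered = np.asarray(rendered, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((rendered - target) ** 2))
```

In an actual training loop the loss would be back-propagated through the renderer by an automatic-differentiation framework; NumPy is used here only to show the data layout and the loss arithmetic.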

To prevent under- or over-reconstruction, Gaussian "splitting" (termed densification) is interleaved with training; new primitives are inserted where the accumulated gradient is high, in both spatial and temporal directions, to ensure coverage. Randomized view and timestamp sampling within each batch encourages temporal coherence across dynamic scene events. The entire process supports variable-length video and arbitrary dynamic phenomena without specialized supervision.
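A gradient-driven densification test might look like the sketch below. It assumes the heuristic used in 3D Gaussian Splatting, where primitives with a large average positional gradient are cloned if small and split if large, and extends the size check to the temporal scale. The thresholds, argument names, and the clone/split distinction are assumptions for illustration.

```python
import numpy as np

def select_for_densification(grad_norm_accum, n_updates, scales,
                             grad_threshold, scale_threshold):
    """Return boolean masks (clone_mask, split_mask) over N Gaussians.

    grad_norm_accum: (N,) accumulated gradient norms of the Gaussian means.
    n_updates:       (N,) number of iterations each Gaussian received a gradient.
    scales:          (N, 4) per-axis scales (s_x, s_y, s_z, s_t).
    """
    avg_grad = grad_norm_accum / np.maximum(n_updates, 1)  # avoid division by zero
    needs_more = avg_grad > grad_threshold
    is_large = scales.max(axis=1) > scale_threshold
    clone_mask = needs_more & ~is_large   # small Gaussian in an under-covered region
    split_mask = needs_more & is_large    # large Gaussian spanning too much space or time
    return clone_mask, split_mask
```

Because the scale check includes $s_t$, a Gaussian that is spatially compact but stretched over a long time interval would also be split, which is one way to obtain densification along the temporal direction.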

5. Real-Time Performance and Comparison with Prior Work

4DGS achieves greater than 100 FPS on consumer GPUs for high-definition (HD) frames by leveraging its GPU-based hierarchical splatting, eschewing computationally intensive ray marching. The approach faithfully captures volumetric effects, dynamic lighting, and non-Lambertian view dependence.

Empirical results show:

Benchmark	PSNR	DSSIM	LPIPS	FPS
Multi-view Plenoptic Video	32.01	0.014	0.055	114
Monocular Synthetic D-NeRF	34.09	—	0.02	—

On the real multi-view Plenoptic Video benchmark, 4DGS surpasses previous state-of-the-art (SOTA) in both visual quality and speed (prior best PSNR ≲ 31.7 at < 20 FPS). For the monocular synthetic D-NeRF dataset, SOTA metrics are again attained, e.g., SSIM 0.98. Qualitatively, the approach reproduces rapid and fine-scale dynamic details without flicker or temporal artifacts (Yang et al., 2023).
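For readers interpreting the table, the sketch below shows how the reported quantities relate to simpler ones: PSNR from mean squared error, DSSIM from SSIM, and per-frame time from FPS. The DSSIM definition $(1 - \text{SSIM})/2$ is a common convention and is assumed here, since the source does not state which definition was used.

```python
import numpy as np

def psnr(rendered, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_value]."""
    diff = np.asarray(rendered, dtype=float) - np.asarray(target, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

def dssim_from_ssim(ssim):
    """Structural dissimilarity under the assumed convention (1 - SSIM) / 2."""
    return (1.0 - ssim) / 2.0

def frame_time_ms(fps):
    """Average time per frame in milliseconds."""
    return 1000.0 / fps

# 114 FPS corresponds to a budget of roughly 8.8 ms per frame.
budget_ms = frame_time_ms(114)
```

Under the assumed convention, an SSIM of 0.98 corresponds to a DSSIM of 0.01.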

6. Applications, Implications, and Limitations

4DGS is suitable for photorealistic dynamic scene capture and real-time novel-view synthesis in unconstrained settings. Its design affords efficiency, flexibility for arbitrary input video length, and scalability to complex scenes, without requiring explicit deformation tracking or pose estimation networks.

A plausible implication is that the unified 4D primitive representation enables modeling of highly non-rigid motion and fine temporal effects (e.g., flames, smoke, finger movements) that were previously difficult for both neural implicit and explicit deformation-based models. However, the reported results are limited to photometric and perceptual metrics; semantic or physical plausibility is not evaluated directly. The model is trained and validated on both real and synthetic benchmarks, but its generalization and limitations outside these domains remain to be evaluated.

7. Relationship to Broader Research Context

4DGS addresses persistent limitations in neural implicit modeling for dynamic scenes: ineffective spatial-temporal structure discovery and the impracticality of explicitly modeling scene element deformations for complex motion. By leveraging 4D volumetric primitives and feed-forward splat-based rendering, it obviates many drawbacks of prior MLP- and deformation-centric methods, providing a conceptually simple, end-to-end differentiable pipeline for real-time dynamic scene reconstruction and synthesis (Yang et al., 2023).


References

Yang et al. (2023). Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting.
