
Dynamic 3D Gaussian Splatting

Updated 26 November 2025
  • Dynamic 3D Gaussian Splatting is a spatio-temporal representation that deforms static 3D Gaussians using neural fields to capture non-rigid dynamic motion.
  • The method employs MLP-based deformation, trajectory interpolation, and segmentation-based masking to accurately separate dynamic from static scene regions.
  • Adaptive temporal partitioning and advanced compression techniques ensure efficient training and real-time high-fidelity rendering without significant quality loss.

Dynamic 3D Gaussian Splatting is a class of explicit spatio-temporal scene representations that extend the capabilities of static 3D Gaussian Splatting (3DGS) to model arbitrary dynamic content for high-fidelity, real-time novel view synthesis and reconstruction. These frameworks operate by deforming explicit canonical sets of 3D Gaussians across time, using parameterized neural deformation fields, motion priors, or direct trajectory interpolation, to encode per-frame spatial and appearance changes. Dynamic 3DGS approaches address the major challenges in dynamic scene modeling—including temporal incoherence, computational cost, model redundancy, and motion ambiguity—by partitioning time, leveraging motion cues, and compressing representations while maintaining geometric and photometric accuracy.

1. Mathematical Foundation and Core Model Structures

Dynamic 3DGS begins from the static formulation: a scene is represented by $N$ 3D Gaussians $\{\mu_i, \Sigma_i, \alpha_i, c_i\}_{i=1}^N$—center, covariance, opacity, color (potentially via spherical harmonics), all fixed. In the dynamic extension, each Gaussian is deformed at each time $t$ to model arbitrary non-rigid motion. The canonical set is warped to each frame by:

  • MLP-based deformation fields: For each short temporal window $w$, a shared MLP $\mathcal{F}_\theta$ predicts displacements $(\Delta x, \Delta r, \Delta s) = \mathcal{F}_\theta(\gamma(x), \gamma(t))$, using sinusoidal positional encoding $\gamma(\cdot)$. At time $t$,
    • $\mu_i(t) = x_i + \Delta x_i$
    • $R_i(t) = R(x_i) \cdot \exp(\Delta r_i)$
    • $S_i(t) = S(x_i) \odot (1 + \Delta s_i)$
  • Segmentation-based masking: Blending weights $\alpha_i \in [0, 1]$ gate the deformation for each Gaussian, enabling explicit disentangling of static ($\alpha_i \approx 0$) from dynamic ($\alpha_i \to 1$) regions.
  • Trajectory and keypoint interpolation: In explicit approaches, dynamic Gaussians store positions and rotations at sparse “keyframes” and interpolate between them via cubic Hermite splines for position and spherical linear interpolation (Slerp) for rotation (Lee et al., 21 Oct 2024).
  • Hybrid spatio-temporal (3D/4D): Full 4D Gaussians are used for genuinely dynamic regions, while static regions are represented with unchanging 3D Gaussians to optimize computational efficiency (Oh et al., 19 May 2025, Zhang et al., 17 Oct 2024).
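The keyframe interpolation above can be sketched with a cubic Hermite spline for positions and Slerp for rotations. This is a minimal illustration under the assumption that per-keyframe tangents are supplied; it is not the exact scheme of the cited work.

```python
import numpy as np

def cubic_hermite(p0, p1, m0, m1, u):
    """Interpolate a position between keyframes p0, p1 with tangents m0, m1
    at normalized time u in [0, 1]."""
    u2, u3 = u * u, u * u * u
    return ((2 * u3 - 3 * u2 + 1) * p0 + (u3 - 2 * u2 + u) * m0
            + (-2 * u3 + 3 * u2) * p1 + (u3 - u2) * m1)

def slerp(q0, q1, u):
    """Spherical linear interpolation between unit quaternions q0 and q1."""
    dot = np.dot(q0, q1)
    if dot < 0.0:            # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:         # nearly parallel: fall back to normalized lerp
        q = (1 - u) * q0 + u * q1
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1 - u) * theta) * q0 + np.sin(u * theta) * q1) / np.sin(theta)
```

At render time, each dynamic Gaussian evaluates these between its two bracketing keyframes; all frames between keyframes are thus recovered without explicit storage.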

Rendering proceeds by projecting temporally-deformed Gaussians onto the image plane and compositing via alpha blending, using the same differentiable rasterization pipelines as static 3DGS.
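The compositing step for a single pixel, assuming the overlapping Gaussians have already been depth-sorted and their projected per-pixel opacities evaluated, can be sketched as:

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha blending of depth-sorted Gaussian contributions
    for one pixel: C = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    color = np.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        color += transmittance * a * np.asarray(c, dtype=float)
        transmittance *= 1.0 - a      # remaining light after this Gaussian
    return color
```

The real rasterizer evaluates this in tiles on the GPU and backpropagates through it, but the accumulation rule is the same as in static 3DGS.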

2. Temporal Partitioning and Windowed Training Strategies

A central innovation for scalable dynamic 3DGS is partitioning long video sequences into “windows” of manageable temporal extent. For each window:

  • Adaptive window sizes: Windows are determined by measuring framewise motion with multi-view 2D optical flow; windows are shortened in high-motion regions and extended where static (Shaw et al., 2023).
  • Per-window canonical representations: Each window is initialized with a COLMAP-derived point cloud, and a dedicated MLP is trained for time-local deformations.
  • Sliding and overlapping: Windows overlap by one frame for temporal consistency, and each is processed in parallel before sequential fine-tuning.
  • Motion-aware partitioning: For highly dynamic scenes, Gaussians with large historical spatial displacement and variance are recursively split into temporally smaller segments, each with its own dedicated deformation network (Jiao et al., 27 Aug 2025).

Partitioning enables efficient memory use, robust modeling of geometrically complex regions, and mitigates the “temporal averaging” blur inherent in monolithic deformation fields.
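A minimal sketch of motion-adaptive windowing, assuming per-frame mean optical-flow magnitudes are precomputed; the motion-budget threshold used here is illustrative, not the exact criterion of the cited method:

```python
def partition_windows(flow_mag, budget=1.0, min_len=2, max_len=8):
    """Split a frame sequence into windows: a window closes once accumulated
    per-frame motion exceeds `budget` (or `max_len` is hit), so high-motion
    stretches get short windows and near-static stretches get long ones."""
    windows, start, acc = [], 0, 0.0
    for t, m in enumerate(flow_mag):
        acc += m
        length = t - start + 1
        if (acc >= budget and length >= min_len) or length >= max_len:
            windows.append((start, t))
            start, acc = t + 1, 0.0
    if start < len(flow_mag):
        windows.append((start, len(flow_mag) - 1))
    return windows
```

Each returned `(start, end)` window would then get its own COLMAP-initialized canonical point cloud and deformation MLP, with a one-frame overlap added for the consistency fine-tuning described below.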

3. Disentangling Static and Dynamic Geometry

Dynamic scenes typically feature imbalanced distributions—many regions exhibit little or no motion, risking wasted model capacity and degraded accuracy. Separation is achieved via:

  • Tunable MLP blending: Assign blending weights $\alpha_i$ to each Gaussian, initialized from intensity-difference masks and jointly optimized so the deformation field focuses on motion-intensive regions (Shaw et al., 2023).
  • Foreground/background splits: In sparse multi-view settings, an initial object/background segmentation is performed at $t=0$, yielding disjoint sets (foreground: full deformation; background: position-only) (Azzarelli et al., 7 Nov 2025).
  • Mask-driven losses: Foreground and background are trained independently with specialized losses, improving sparse-view training and enabling transparent/segmented reconstruction.
  • Dynamic masking for SLAM: In tracking and mapping contexts, probabilistic fusion of 2D optical-flow and depth-discontinuity masks yields robust dynamic pixel segmentation used for Gaussian updates and camera pose refinement (Li et al., 6 Jun 2025).

These techniques prevent overfitting on the static background, promote sharper dynamic boundaries, and improve overall rendering quality, particularly in scenes where dynamic content is sparsely distributed.
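The masking idea can be sketched as follows; the intensity-difference initialization and the hard threshold `tau` are simplifications of the learned, jointly optimized weights:

```python
import numpy as np

def init_blend_weights(frame0, frame_t, tau=0.1):
    """Initialize per-pixel blending weights from an intensity-difference
    mask: pixels that change between frames are flagged as dynamic (1)."""
    return (np.abs(frame_t - frame0) > tau).astype(float)

def gated_deform(x, delta_x, alpha):
    """Gate per-Gaussian displacements by blending weight alpha in [0, 1]:
    static Gaussians (alpha ~ 0) keep their canonical position, dynamic
    ones (alpha ~ 1) follow the predicted deformation."""
    return np.asarray(x) + np.asarray(alpha)[:, None] * np.asarray(delta_x)
```

During training the weights are free parameters, so a Gaussian misclassified by the initial mask can still migrate between the static and dynamic regimes.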

4. Temporal Consistency and Fine-Tuning Schemes

Temporal consistency is essential for cross-window or partitioned models, eliminating inter-frame flicker and seams. SWAGS (Shaw et al., 2023) achieves temporal smoothness via:

  • Self-supervised consistency loss: On overlapping frames between adjacent windows, random novel view poses are sampled using log-exponential averaging on $SE(3)$. Both window models render the shared frame, and $L_1$ consistency is enforced on their outputs.
  • Alternating fine-tuning: During consistency optimization, only canonical Gaussian parameters are updated, freezing MLP weights, and alternating with the standard photometric loss.
  • Cross-frame consistency for partitions: Partitioned models (e.g., MAPo) minimize discrepancy across partition boundaries by penalizing the $L_1$ difference between renderings from temporally adjacent segments, also anchored to ground-truth images (Jiao et al., 27 Aug 2025).

Empirically, such fine-tuning (a) substantially reduces flicker (FAST-VQA score improvements), (b) improves motion boundary sharpness, and (c) ensures seamless coverage across changing window or segment boundaries.
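Per shared frame, the overlap objective reduces to a photometric term plus a consistency term. In this sketch `lam` is an assumed weighting and the alternating update schedule (canonical parameters only, MLP frozen) is omitted:

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference between two rendered images."""
    return np.mean(np.abs(np.asarray(a, float) - np.asarray(b, float)))

def finetune_loss(render_a, render_b, gt_frame, lam=1.0):
    """Cross-window fine-tuning objective on a shared overlap frame:
    photometric L1 of both window models against ground truth, plus an
    L1 consistency term between the two models' renders of that frame."""
    photometric = 0.5 * (l1(render_a, gt_frame) + l1(render_b, gt_frame))
    consistency = l1(render_a, render_b)
    return photometric + lam * consistency
```

Anchoring to the ground-truth image keeps the consistency term from collapsing both models onto a shared but wrong solution.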

5. Motion Cues and Optical Flow Integration

Motion-aware approaches incorporate 2D optical flow as an auxiliary or supervisory signal to better align 3D Gaussian movement with true object motion:

  • 3D-2D correspondence: Per-pixel 2D flow is mapped to k-nearest 3D Gaussians in world space, projecting 3D displacements into the image plane for KL-divergence penalties (Guo et al., 18 Mar 2024).
  • Uncertainty-aware flow losses: The variance of flow supervision is modulated by the contribution of each Gaussian, allowing low-confidence points higher error tolerance.
  • Dynamic region weighting: Color losses are weighted by “dynamic maps” derived from flow magnitude, focusing gradient propagation on motion-intense regions.
  • Transient-aware motion injectors: Auxiliary modules infer instantaneous 3D velocity fields for dynamic Gaussians, adding explicit supervision and disambiguating ambiguous scene flow.

Such approaches yield improved dynamic reconstruction, reduced redundancy, and robustness to sparse input and monocular captures.
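Dynamic-region weighting can be sketched as: normalize flow magnitude into a per-pixel map and use it to re-weight the color loss. The `floor` term, which keeps some gradient flowing to static pixels, is an assumption of this sketch:

```python
import numpy as np

def dynamic_map(flow, eps=1e-8):
    """Per-pixel dynamic map from 2D optical flow (H, W, 2): normalized
    flow magnitude, so high-motion pixels get weights near 1."""
    mag = np.linalg.norm(flow, axis=-1)
    return mag / (mag.max() + eps)

def weighted_color_loss(pred, gt, weight, floor=0.1):
    """Color L1 loss re-weighted by the dynamic map, focusing gradient
    propagation on motion-intense regions."""
    w = floor + (1.0 - floor) * weight
    return np.mean(w * np.abs(pred - gt))
```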

6. Compression, Grouping, and Scalability

Practical deployment in real-time, low-power contexts demands substantial compression and reduced per-frame inference:

  • Sensitivity-based pruning: Temporal Hessian sensitivity scores identify and prune Gaussians with low impact on image quality. Annealed timestamp smoothing boosts robustness to pose noise (Tu et al., 9 Jun 2025).
  • GroupFlow clustering: Trajectory-similarity clustering assigns Gaussians to motion groups, each fitted with a per-group SE(3) transform. This reduces per-frame inference from $N$ per-Gaussian deformations to $J$ rigid transforms, enabling up to $58\times$ rendering speedup and $18\times$ model compression with minimal fidelity loss.
  • Gradient-aware quantization: Per-parameter gradient sensitivity controls mixed-precision quantization; unimportant parameters are compressed to low bit-width.
  • Keypoint trajectory sparsification: The Ramer-Douglas-Peucker algorithm selects sparse keyframes for explicit trajectory storage, interpolating the remaining frames at runtime (Javed et al., 7 Dec 2024).
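The grouping idea can be sketched with a plain k-means over flattened trajectories; the actual trajectory-similarity metric and per-group SE(3) fitting in the cited method are more involved:

```python
import numpy as np

def group_by_trajectory(trajs, n_groups, iters=10):
    """Cluster per-Gaussian trajectories (N, T, 3) into motion groups with
    plain k-means over flattened trajectories, so each group can then be
    driven by a single rigid transform per frame instead of per-Gaussian
    motion. Deterministic init from the first n_groups trajectories."""
    flat = np.asarray(trajs, dtype=float).reshape(len(trajs), -1)
    centers = flat[:n_groups].copy()
    labels = np.zeros(len(flat), dtype=int)
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(flat[:, None] - centers[None], axis=-1), axis=1)
        for j in range(n_groups):
            if np.any(labels == j):          # skip empty clusters
                centers[j] = flat[labels == j].mean(axis=0)
    return labels
```

With $J$ groups, the renderer stores one transform per group per frame and applies it to all member Gaussians, which is where the per-frame inference savings come from.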

Metrics confirm that strong compression strategies need not significantly degrade visual quality (often ≤0.1 dB PSNR drop), and global model sizes can be reduced from gigabytes to tens of megabytes, pushing dynamic 3DGS to AR/VR, mobile, and SLAM contexts.
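The keypoint sparsification above can be illustrated with a standard Ramer-Douglas-Peucker pass over a trajectory, keeping only points whose removal would introduce more than `epsilon` deviation; the clamped segment-distance used here is one common variant:

```python
import numpy as np

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker simplification of a polyline/trajectory:
    recursively keep the farthest point from the start-end chord whenever
    its deviation exceeds `epsilon`; dropped frames are interpolated at
    render time."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    seg = end - start
    seg_len = np.linalg.norm(seg)
    if seg_len == 0:
        dists = np.linalg.norm(points - start, axis=1)
    else:
        # distance of each point to the (clamped) start-end segment
        t = (points - start) @ seg / seg_len**2
        proj = start + np.outer(np.clip(t, 0, 1), seg)
        dists = np.linalg.norm(points - proj, axis=1)
    idx = int(np.argmax(dists))
    if dists[idx] > epsilon:
        left = rdp(points[: idx + 1], epsilon)
        right = rdp(points[idx:], epsilon)
        return np.vstack([left[:-1], right])
    return np.vstack([start, end])
```

Near-linear stretches of a Gaussian's trajectory collapse to two keyframes, while sharp direction changes are preserved, which is what makes the stored trajectories sparse.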

7. Implementation Protocols, Quantitative Results, and Empirical Validation

The protocol for dynamic 3DGS generally involves:

  • Initialization from sparse point cloud (usually COLMAP).
  • Parallelized windowed training on multi-GPU clusters, then sequential cross-window fine-tuning.
  • Adam optimization with independent learning rates per parameter group.
  • Evaluation on multi-view datasets (Neural 3D Video, Panoptic Sports, DyNeRF, HyperNeRF), tracking (SLAM), and filmmaking (Splatography datasets).

Representative empirical outcomes:

  • SWAGS (Shaw et al., 2023): Best average PSNR (32.05 dB), SSIM (0.949), second-best LPIPS (0.093), 71.5 FPS, state-of-the-art across dynamic NeRF/grid baselines.
  • MAPo (Jiao et al., 27 Aug 2025): PSNR 31.26, SSIM 0.943, LPIPS 0.044, doubling inference FPS over non-partitioned baselines.
  • SpeeDe3DGS (Tu et al., 9 Jun 2025): Rendering speed increased $10.37\times$, model size reduced $7.71\times$, with quantitative fidelity within $0.1$ dB PSNR of the baseline.
  • Splatography (Azzarelli et al., 7 Nov 2025): Up to $+3$ dB PSNR over prior methods with half the model size under sparse-view constraints.
  • TC3DGS (Javed et al., 7 Dec 2024): Up to $67\times$ compression, 17–30% FPS improvement, $<0.4$ dB PSNR drop.

Ablations confirm substantial benefits from tunable MLPs, windowed partitioning, cross-window consistency, and motion-based optimizations. Visualizations consistently show sharper motion boundaries, cleaner background, and robust handling of transient and occluded regions.


Dynamic 3D Gaussian Splatting marks a major step in explicit, differentiable scene modeling, scaling static 3DGS methods to arbitrary dynamic environments by windowed partitioning, motion-aware deformation, segmentation, compression, and real-time optimization. The fusion of explicit geometry, adaptive temporal subdivision, and photometric/supervisory losses defines the current standard for real-time, high-fidelity dynamic novel view synthesis and mapping.
