
Cross-Temporal 3D Gaussian Splatting

Updated 6 December 2025
  • The paper decouples static and dynamic Gaussian parameters, reducing per-primitive storage from O(T) to O(L) for scalable real-time rendering.
  • It leverages Fourier bases and quaternion interpolation to model time-dependent spatial shifts and rotations, ensuring smooth temporal coherence.
  • The study further integrates hybrid explicit–implicit techniques and compression schemes to optimize dynamic view synthesis while handling sparse temporal updates.

Cross-Temporal 3D Gaussian Splatting (Cross-Temporal 3DGS) denotes a family of scene modeling techniques that parameterize dynamic 3D environments using collections of anisotropic Gaussian kernels whose spatial parameters and/or appearance attributes are explicitly or implicitly controlled across time. These frameworks address the substantial memory, temporal-coherence, and computational bottlenecks that classical 3D Gaussian Splatting (3DGS) faces when applied beyond static geometries to long, dynamic, and/or sparsely sampled sequences. Central to Cross-Temporal 3DGS is the explicit modeling of per-Gaussian spatial and appearance evolution through shared, data-efficient bases and historical scene priors, with applications ranging from dynamic view synthesis and video editing to robust scene versioning.

1. Dynamic Parameterization of Cross-Temporal 3DGS

The central abstraction in Cross-Temporal 3DGS is the extension of 3DGS's fixed primitives into time-varying entities whose spatial location $\mu_n(t)$ and rotation $R_n(t)$ are functions of time, while scale $S_n$, color coefficients $h_n$, and opacity $o_n$ remain invariant. The canonical covariance is

$$\Sigma_n(t) = R_n(t)\, S_n S_n^\top\, R_n(t)^\top$$

ensuring that the 3D elliptical splats adaptively orient in world coordinates per frame. A highly memory-efficient instantiation models

  • Centers $\mu_n(t)$ with a learned Fourier basis:

$$x(t) = w_{x,0} + \sum_{i=1}^{L}\big[w_{x,2i-1}\sin(2\pi i t) + w_{x,2i}\cos(2\pi i t)\big]$$

with analogous forms for $y(t)$ and $z(t)$.

  • Rotations $R_n(t)$ via linearly interpolated quaternion coefficients:

$$q_j(t) = w_{q_j,0} + w_{q_j,1}\,t$$

(for $j \in \{x, y, z, w\}$), normalized and mapped to $\mathrm{SO}(3)$.

This separation of dynamic from static parameters yields a per-Gaussian storage scaling of $O(L)$ (with $L$ the number of harmonics), versus $O(T)$ (per-frame expansion), enabling real-time, temporally smooth modeling for arbitrarily long sequences under strict memory budgets (Katsumata et al., 2023).
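A minimal sketch of this parameterization in Python, assuming hypothetical per-Gaussian weight arrays `w_xyz` (shape $3 \times (2L+1)$) and quaternion coefficients `w_q` (shape $4 \times 2$); these names and layouts are illustrative rather than taken from the papers:

```python
import numpy as np

def center_at(w_xyz, t, L):
    """Evaluate the Fourier-basis center mu_n(t) of one Gaussian.

    w_xyz: (3, 2L+1) array; column 0 is the constant term, columns
           2i-1 / 2i hold the sin / cos weights of harmonic i.
    t:     normalized time in [0, 1].
    """
    i = np.arange(1, L + 1)
    harmonics = np.stack([np.sin(2 * np.pi * i * t),
                          np.cos(2 * np.pi * i * t)], axis=1).ravel()
    basis = np.concatenate(([1.0], harmonics))   # (2L+1,) Fourier features
    return w_xyz @ basis                         # (3,) center at time t

def rotation_at(w_q, t):
    """Linear quaternion interpolation q_j(t) = w0 + w1 * t, then normalize."""
    q = w_q[:, 0] + w_q[:, 1] * t                # raw (x, y, z, w)
    return q / np.linalg.norm(q)                 # unit quaternion -> SO(3)
```

Storing only $2L+1$ scalars per coordinate, rather than one value per frame, is what yields the $O(L)$ versus $O(T)$ scaling noted above.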

2. Cross-Temporal Training and Optimization Strategies

Training pipelines for Cross-Temporal 3DGS routinely employ staged optimization reflecting the decoupled parameterization. Static parameters ($S_n$, $h_n$, $o_n$, intercepts of $\mu_n(0)$, $q_n(0)$) are first fit for a coarse static solution, followed by dynamic expansion for all temporal coefficients. A reconstruction loss combining pixelwise $L_1$ and differentiable SSIM components is typical:

$$\mathcal{L}_{\text{recon}} = (1-\lambda)\,\|I_{\text{rend}}-I_{\text{gt}}\|_1 + \lambda\,\mathcal{L}_{\text{D-SSIM}}(I_{\text{rend}},I_{\text{gt}})$$

with $\lambda$ empirically set (e.g., $0.2$).
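As a hedged illustration, this loss can be sketched with an off-the-shelf SSIM implementation (here `skimage`); the exact D-SSIM convention used in the papers may differ:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def recon_loss(rendered, gt, lam=0.2):
    """(1 - lam) * L1 + lam * D-SSIM between rendered and ground-truth images.

    rendered, gt: float arrays in [0, 1], shape (H, W, 3).
    """
    l1 = np.abs(rendered - gt).mean()
    # One common D-SSIM convention is (1 - SSIM) / 2; papers may use a variant.
    d_ssim = (1.0 - ssim(rendered, gt, channel_axis=-1, data_range=1.0)) / 2.0
    return (1.0 - lam) * l1 + lam * d_ssim
```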

Temporal consistency is further promoted by explicit optical-flow supervision. Ground-truth and pseudo-flow fields are aligned by projecting temporal displacements (obtained from the Fourier basis) through the camera Jacobian. The aggregated flow is rendered:

$$\hat f_{\text{fwd}} = \sum_{i=1}^{N} \hat f_{\text{fwd},i}\,\alpha_i \prod_{j<i}(1-\alpha_j)$$

yielding a flow loss term

$$\mathcal{L}_{\text{flow}} = \|\hat F - F\|_1$$

with strong weighting ($\lambda_{\text{flow}} = 10^3$). Divide–clone–prune routines maintain representational parsimony by splitting high-gradient, large-scale splats, cloning small-scale Gaussians with high error, and removing primitives with negligible opacity (Katsumata et al., 2023).
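A schematic (not paper-exact) version of this divide–clone–prune bookkeeping, assuming per-Gaussian gradient norms, scales, and opacities are available as arrays and using illustrative thresholds:

```python
import numpy as np

def densify_and_prune(grad_norm, scale, opacity,
                      grad_thr=2e-4, scale_thr=0.01, opacity_thr=0.005):
    """Return boolean masks for splitting, cloning, and pruning Gaussians.

    grad_norm: (N,) accumulated view-space positional gradient magnitude.
    scale:     (N,) largest axis of each Gaussian's scale.
    opacity:   (N,) current opacity.
    Thresholds are illustrative defaults, not values from the papers.
    """
    high_grad = grad_norm > grad_thr
    split = high_grad & (scale > scale_thr)    # divide large, high-error splats
    clone = high_grad & (scale <= scale_thr)   # clone small, high-error splats
    prune = opacity < opacity_thr              # drop near-transparent primitives
    return split, clone, prune
```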

3. Cross-Temporal Scene and View Updates, Sparse-View Priors

Advanced Cross-Temporal 3DGS paradigms allow the reconstruction and update of scenes at non-continuous timepoints using extremely sparse view sets, leveraging dense historical priors. The method builds on three staged mechanisms (An et al., 29 Nov 2025):

  1. Cross-temporal camera alignment: Rigid+ICP registration synchronizes views and geometry across timestamps $(t_0, t_n)$.
  2. Interference-based confidence initialization: Regions are classified as unchanged via SSIM-thresholded photometric consistency between historical and updated renders, seeding subsequent updates only in low-confidence (likely changed) areas.
  3. Progressive optimization: Gaussians from the prior $G_0$ are iteratively updated for $t_n$ via photometric and confidence-regularized losses, expanding high-confidence supervision patches until convergence.

Average scene update quality metrics surpass naïve baselines and other editing pipelines (e.g., PSNR $23.89$ vs. $15.87$, SSIM $0.864$ vs. $0.683$ under $8$ sparsely sampled views for real scenes) (An et al., 29 Nov 2025).
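A minimal sketch of the interference-based confidence initialization (step 2 above), assuming aligned renders from the historical model and captures at the new timestamp; the patch size and SSIM threshold are illustrative choices, not values from the paper:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def confidence_mask(render_t0, capture_tn, patch=16, thr=0.9):
    """Mark patches as 'unchanged' (high confidence) where the historical
    render still matches the new capture; low-confidence patches seed updates.

    render_t0, capture_tn: (H, W, 3) float images in [0, 1], already aligned.
    Returns a boolean (H // patch, W // patch) grid: True = unchanged.
    """
    H, W, _ = render_t0.shape
    gh, gw = H // patch, W // patch
    mask = np.zeros((gh, gw), dtype=bool)
    for i in range(gh):
        for j in range(gw):
            a = render_t0[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
            b = capture_tn[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
            mask[i, j] = ssim(a, b, channel_axis=-1, data_range=1.0) > thr
    return mask
```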

4. Spectral Methods, Multi-Scale Temporal Encoding, and Hybrid Explicit-Implicit Dynamics

For dynamic scenes with nontrivial deformation spectra, hybrid explicit–implicit Cross-Temporal 3DGS schemes are employed. Spectral-aware Laplacian encoding augments multi-resolution hash grids with temporal modules; per-Gaussian deformation fields are learned through Laplacian or Fourier expansions:

$$L(t) = \sum_{k=0}^{K-1}\left[\alpha_k \cos(2\pi f_k t) + \beta_k \sin(2\pi f_k t)\right]$$
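A small sketch of such a spectral temporal encoding; the log-spaced frequency schedule suggested in the docstring is one plausible choice, not necessarily the one used in the cited work:

```python
import numpy as np

def spectral_encoding(t, coeffs, freqs):
    """Evaluate L(t) = sum_k [alpha_k cos(2 pi f_k t) + beta_k sin(2 pi f_k t)].

    t:      scalar or (T,) array of normalized times.
    coeffs: (K, 2) array of learned [alpha_k, beta_k] per frequency band.
    freqs:  (K,) array of frequencies f_k, e.g. np.logspace(0, 3, K, base=2).
    """
    t = np.atleast_1d(t)[:, None]                 # (T, 1)
    phase = 2.0 * np.pi * freqs[None, :] * t      # (T, K)
    return (coeffs[:, 0] * np.cos(phase) + coeffs[:, 1] * np.sin(phase)).sum(axis=-1)
```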

Adaptive Gaussian splitting via KDTree-guided thresholding concentrates primitives where dynamics and covariance anisotropy are maximal (Zhou et al., 7 Aug 2025).

Explicit–implicit hybrid pipelines combine explicit Gaussians for rasterization, implicit MLPs for decoding complex color/density along spatio-temporal rays, and dynamic attribute vectors $d_i$ to gate local temporal variation. Ablations across the spectral, dynamics, and splitting modules quantify each component's individual impact; all modules are essential for state-of-the-art dynamic scene fidelity (PSNR, SSIM, LPIPS) (Zhou et al., 7 Aug 2025).

5. Hybrid 3D-4D Splatting, Compression, and Rate–Distortion Optimization

To mitigate parameter and memory overhead from unwarranted dynamic modeling of static regions, hybrid Cross-Temporal 3D-4D Gaussian Splatting pipelines adaptively classify and “freeze” static primitives into pure 3D parameterizations, with only dynamic regions retaining full time-varying 4D attributes. Static 4D Gaussians are classified by their temporal extent

$$\exp(s_{t,i}) > \tau$$

and converted to 3D (discarding $\mu_t$, $\Sigma_{1:3,4}$, etc.). This yields dramatic savings (e.g., $273$ MB vs. $2.1$ GB, $208$ FPS vs. $114$ FPS, $3\times$ fewer primitives for a $10$ s N3V sequence) without loss in fidelity (Oh et al., 19 May 2025).
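A hedged sketch of this static/dynamic split, with `s_t` denoting the learned log temporal-extent parameter per 4D Gaussian (the name and threshold value are illustrative):

```python
import numpy as np

def split_static_dynamic(s_t, tau=0.5):
    """Classify 4D Gaussians whose temporal extent exceeds tau as static.

    s_t: (N,) learned log temporal-scale parameter per Gaussian.
    Returns indices of Gaussians to freeze as pure 3D primitives (dropping
    mu_t and the temporal covariance entries) and those kept as full 4D.
    """
    static = np.exp(s_t) > tau      # spans (nearly) the whole sequence
    return np.where(static)[0], np.where(~static)[0]
```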

Compression-oriented Cross-Temporal 3DGS frameworks such as P-4DGS (Wang et al., 11 Oct 2025) cast the problem in terms of spatial–temporal anchor prediction (akin to video coding), intra-frame spatial prediction, inter-frame deformation MLPs, adaptive quantization, and learned context-based entropy coding. These designs exploit redundancy for sub-$1$ MB footprints, achieving up to $40\times$–$90\times$ compression with real-time rendering ($200$–$300$ FPS), while maintaining quality competitive with uncompressed dynamic baselines. Training proceeds in discrete stages: canonical anchor learning, quantization-aware adjustments, learned deformation modeling, and final entropy model optimization.

6. Rendering, Temporal Coherence, and Multi-View Fidelity

Rendering in Cross-Temporal 3DGS typically involves frame-wise evaluation of dynamic splat locations and orientations, perspective-correct projection (often matrix-inversion-free via Plücker coordinates and analytic bounding-box computations (Hahlbohm et al., 10 Oct 2024)), and hybrid transparency schemes (e.g., exact blending for the top-$K$ sorted fragments, an order-independent “tail” for deeper splats). Such hybrid techniques eliminate “popping” artifacts and enforce cross-frame and multi-view coherence, critical for robust scene exploration and rapid fly-throughs. For image formation,

$$C = \sum_i c_i\,\alpha_i \prod_{j<i}(1-\alpha_j)$$

remains the standard compositing function, but the set $\{\mu_i(t), R_i(t)\}$ and the blending strategy are cross-temporally coherent (Katsumata et al., 2023, Hahlbohm et al., 10 Oct 2024).
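The compositing itself is straightforward to express; a per-pixel front-to-back sketch over depth-sorted splat contributions, with colors and alphas assumed already evaluated at time $t$:

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing: C = sum_i c_i a_i prod_{j<i} (1 - a_j).

    colors: (N, 3) per-splat colors along the ray, sorted front to back.
    alphas: (N,)   per-splat opacities after the 2D Gaussian falloff.
    """
    # transmittance[i] = prod_{j<i} (1 - a_j), with transmittance[0] = 1
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    return (transmittance[:, None] * alphas[:, None] * colors).sum(axis=0)
```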

7. Limitations, Challenges, and Future Directions

Despite substantial advances, several limitations persist:

  • Birth/death/topological transitions are still ill-modeled; further per-Gaussian lifetime flags or appearance models are needed to fully capture dynamic processes such as fluid motion or severe occlusion (Katsumata et al., 2023).
  • Under extreme view sparsity or rapid structural change, pose alignment and confidence initialization can fail, affecting both update fidelity and temporal stability (An et al., 29 Nov 2025).
  • Automated handling of lens distortion, scene-wide topology changes, and integration with SLAM/mapping pipelines remain open research topics.
  • Efficient cross-temporal densification strategies and learned motion regularizers suggest promising directions for entirely anchor- or code-based dynamic 3DGS (Oh et al., 19 May 2025).

These frameworks are integral for cross-temporal digital twins, cultural heritage recovery, video-based scene editing, and compressible large-scale 4D environment modeling, offering scalable, temporally consistent, and real-time solutions for dynamic vision applications.
