Multi-Video 4D Gaussian Splatting
- Multi-Video 4D Gaussian Splatting is a dynamic scene reconstruction method using explicit 4D Gaussian primitives that encode spatial, temporal, and appearance properties.
- It employs robust temporal alignment strategies, including coarse-to-fine and sub-frame refinements, to handle unsynchronized multi-view video inputs.
- Advanced optimization using multi-view photometric losses and compression techniques achieves state-of-the-art real-time rendering and compact 4D models.
Multi-Video 4D Gaussian Splatting (4DGS) refers to the end-to-end modeling and reconstruction of dynamic scenes from multiple time-resolved video streams, using explicit 4D Gaussian primitives to represent both spatial and temporal variation. 4DGS provides an explicit, differentiable, and real-time-capable spatiotemporal scene parameterization, where each Gaussian encodes geometry, appearance, and time-evolving properties. Recent advances have generalized the original 4DGS paradigm to handle unsynchronized multi-view inputs, memory-efficient representations, and robust alignment strategies, leading to state-of-the-art results in dynamic scene capture, view synthesis, and compact 4D modeling (Xu et al., 14 Nov 2025, Wu et al., 2023, Yang et al., 30 Dec 2024).
1. Spatiotemporal 4D Gaussian Splatting: Model Structure
At the core of Multi-Video 4DGS is a set of explicit Gaussian primitives inhabiting 4D spacetime, each parameterized as $G_i = (\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i, \alpha_i, \mathbf{c}_i)$, where the mean $\boldsymbol{\mu}_i \in \mathbb{R}^4$ encodes 3D space and time, the covariance $\boldsymbol{\Sigma}_i \in \mathbb{R}^{4\times 4}$ captures oriented space-time anisotropy, and $\alpha_i$, $\mathbf{c}_i$ reflect per-Gaussian opacity and appearance coefficients.
Color prediction is realized via a 4D spherindrical harmonic expansion, $\mathbf{c}_i(\mathbf{d}, t) = \sum_{n,l,m} c^{\,i}_{nlm}\,\cos\!\big(\tfrac{2\pi n}{T}\,t\big)\,Y_{lm}(\mathbf{d})$, with Fourier terms spanning time and real spherical harmonics $Y_{lm}$ spanning the viewing direction $\mathbf{d}$ (Yang et al., 30 Dec 2024, Yang et al., 2023).
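To make the expansion concrete, the following minimal NumPy sketch evaluates a low-order spherindrical basis (a Fourier series in time multiplied by real spherical harmonics in direction). The array shapes, the degree-1 truncation, and the cosine-only time basis are simplifying assumptions for illustration, not the exact basis of the cited works.

```python
import numpy as np

def sh_basis_deg1(d):
    """Real spherical harmonics up to degree 1 for a (normalized) view direction d."""
    x, y, z = d / np.linalg.norm(d)
    return np.array([
        0.28209479177387814,        # Y_0,0
        0.4886025119029199 * y,     # Y_1,-1
        0.4886025119029199 * z,     # Y_1,0
        0.4886025119029199 * x,     # Y_1,1
    ])

def spherindrical_color(coeffs, d, t, period):
    """Evaluate c(d, t) = sum_n sum_{l,m} coeffs[n, lm] * cos(2*pi*n*t/period) * Y_lm(d).

    coeffs: (N_fourier, 4, 3) array -- Fourier order, SH index, RGB channel.
    """
    Y = sh_basis_deg1(d)                              # (4,)
    n = np.arange(coeffs.shape[0])                    # Fourier orders 0..N-1
    fourier = np.cos(2.0 * np.pi * n * t / period)    # (N_fourier,)
    return np.einsum('n,l,nlc->c', fourier, Y, coeffs)
```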
The rendering equation first conditions each 4D Gaussian at the query time $t$ (reducing it to a 3D marginal), then projects it to image space as a 2D Gaussian, and composites all such contributions using classic alpha blending:

$$C(\mathbf{u}, t) = \sum_{i=1}^{N} \mathbf{c}_i(\mathbf{d}, t)\,\alpha_i'(\mathbf{u}, t)\prod_{j<i}\big(1-\alpha_j'(\mathbf{u}, t)\big),$$

where the Gaussians are depth-sorted and $\alpha_i'(\mathbf{u}, t)$ is the $i$-th Gaussian's opacity evaluated at pixel $\mathbf{u}$ after temporal conditioning.
This explicit structure supports GPU-accelerated splat rasterization at high frame rates (Wu et al., 2023).
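The time-conditioning and compositing steps follow directly from standard Gaussian identities. The snippet below is an illustrative NumPy sketch (not any paper's implementation) that conditions a full 4x4-covariance Gaussian on a query time, returns the temporal weight that modulates opacity, and composites depth-sorted contributions front to back.

```python
import numpy as np

def condition_4d_gaussian(mu4, cov4, t):
    """Condition a 4D (x, y, z, t) Gaussian on a query time t.

    Returns the 3D conditional mean/covariance and the peak-normalized
    temporal marginal used to fade the Gaussian's opacity in and out.
    """
    mu_xyz, mu_t = mu4[:3], mu4[3]
    S_xx = cov4[:3, :3]          # spatial block
    S_xt = cov4[:3, 3]           # space-time coupling
    S_tt = cov4[3, 3]            # temporal variance

    # Standard Gaussian conditioning formulas.
    mu_cond = mu_xyz + S_xt * (t - mu_t) / S_tt
    cov_cond = S_xx - np.outer(S_xt, S_xt) / S_tt

    # Temporal marginal (peak-normalized) that scales the opacity at time t.
    temporal_weight = np.exp(-0.5 * (t - mu_t) ** 2 / S_tt)
    return mu_cond, cov_cond, temporal_weight

def alpha_composite(colors, alphas):
    """Front-to-back alpha blending of depth-sorted splat contributions."""
    out = np.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        out += transmittance * a * c
        transmittance *= (1.0 - a)
    return out
```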
2. Multi-Video Pipeline and Temporal Alignment
Standard Multi-Video 4DGS assumes temporally synchronized, calibrated input videos, ingesting all frames and treating the union as a sample of 4D spacetime. Practical deployments, however, often encounter unsynchronized captures due to real-world recording variances. A series of works have addressed this by introducing temporal alignment modules:
- Coarse-to-Fine Alignment: Initial integer-frame temporal offsets ($\Delta t_k$ per camera $k$) are estimated by maximizing feature-based geometric consistency via LoFTR+RANSAC inlier counts on masked foreground regions; a minimal sketch of this coarse search appears after this list.
- Sub-frame Refinement: Differentiable, continuous shifts $\delta_k$ per camera are introduced as part of the main photometric loss and optimized jointly, so that camera $k$ queries the 4D representation at the corrected time $t_k' = t + \Delta t_k + \delta_k$.
These parameters are efficiently adapted via backpropagation under the global 4DGS photometric loss and regularization.
- Integration and Training Strategy: Alignment modules are inserted transparently before and during 4DGS training. State-of-the-art pipelines achieve robust synchronization even with up to 10-frame random jitters, yielding consistent reconstructions and eliminating ghosting artifacts (Xu et al., 14 Nov 2025, Lee et al., 3 Dec 2025).
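A minimal sketch of the coarse integer-offset search referenced above: for each candidate shift inside the search window, foreground feature matches between a reference camera and the target camera are scored, and the shift with the most geometric inliers wins. The `count_inliers` callable is assumed to be supplied by the caller (e.g., a wrapper around LoFTR matching plus RANSAC filtering on masked foregrounds); the default ±10-frame window mirrors the jitter range reported above.

```python
def coarse_offset(ref_frames, cam_frames, count_inliers, max_shift=10):
    """Estimate the integer frame offset between a reference camera and another
    camera by maximizing summed geometric inlier counts over candidate shifts.

    count_inliers(frame_a, frame_b) is caller-supplied, e.g. a function that
    matches masked foreground regions with LoFTR and filters with RANSAC.
    """
    best_shift, best_score = 0, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        score = 0
        for i, ref in enumerate(ref_frames):
            j = i + shift
            if 0 <= j < len(cam_frames):
                score += count_inliers(ref, cam_frames[j])
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift
```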
3. Optimization, Losses, and Regularization
Optimization in 4DGS proceeds by minimizing multi-view photometric and structural losses across all time steps:

$$\mathcal{L} = \sum_{k}\sum_{t} \Big[(1-\lambda)\,\big\|\hat{I}_k(t_k') - I_k(t)\big\|_1 + \lambda\,\mathcal{L}_{\mathrm{D\text{-}SSIM}}\big(\hat{I}_k(t_k'),\, I_k(t)\big)\Big] + \lambda_{\delta}\sum_k \delta_k^2,$$

where $\hat{I}_k(t_k')$ is the 4DGS-rendered image for camera $k$ at the temporally corrected query $t_k' = t + \Delta t_k + \delta_k$, and the final term penalizes drift in the sub-frame offsets. The renderer remains differentiable with respect to both model and alignment parameters.
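A minimal PyTorch sketch of jointly optimizing the sub-frame offsets under the photometric loss; the `render_fn` interface, loss weights, and camera count are assumptions, and the SSIM term is omitted for brevity.

```python
import torch

num_cams = 20
delta = torch.nn.Parameter(torch.zeros(num_cams))     # continuous per-camera shifts
optimizer = torch.optim.Adam([delta], lr=1e-3)        # Gaussian params would join this
lambda_ssim, lambda_delta = 0.2, 1e-4                 # assumed weights

def training_step(render_fn, frames, coarse_offsets, t):
    """One optimization step over all cameras at frame index t.

    render_fn(cam, t_query) is assumed to be a differentiable 4DGS renderer;
    frames[cam][t] is the observed image; coarse_offsets holds integer shifts.
    """
    loss = 0.0
    for cam in range(num_cams):
        t_query = t + coarse_offsets[cam] + delta[cam]    # temporally corrected query
        pred = render_fn(cam, t_query)
        gt = frames[cam][t]
        loss = loss + (1 - lambda_ssim) * (pred - gt).abs().mean()  # L1 (SSIM omitted)
    loss = loss + lambda_delta * (delta ** 2).sum()       # penalize offset drift
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```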
The pipeline typically incorporates:
- Foreground Segmentation: Binary masks to guide foreground feature matching and restrict dense correspondences to dynamic components (Xu et al., 14 Nov 2025).
- Dense Feature Matching: LoFTR descriptors for initial alignment, RANSAC filtering for robust inlier counting.
- Regularizers: Weight decay on offsets, constraints on Gaussian scales, and temporal smoothness (when required for highly dynamic content or in alias-free variants) (Chen et al., 23 Nov 2025).
For unsynchronized, unconstrained video sets, alternative pipelines employ dense 4D feature tracks and Fused Gromov-Wasserstein transport for track correspondence and dynamic time warping for global and local temporal alignment—culminating in sub-frame synchronization below $0.26$ frames mean error (Lee et al., 3 Dec 2025).
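For intuition, the global stage of such alignment reduces to warping per-video feature sequences onto a common timeline. The following is a generic dynamic-time-warping sketch over pooled per-frame track descriptors; it illustrates only the DTW stage, not the Fused Gromov-Wasserstein track-correspondence step of the cited pipeline.

```python
import numpy as np

def dtw_align(seq_a, seq_b):
    """Classic dynamic time warping between two feature sequences.

    seq_a, seq_b: arrays of shape (T_a, D) and (T_b, D), e.g. per-frame
    descriptors pooled from dense feature tracks. Returns the accumulated
    cost matrix and the optimal warping path as (i, j) index pairs.
    """
    Ta, Tb = len(seq_a), len(seq_b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal path from the bottom-right corner.
    path, (i, j) = [], (Ta, Tb)
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        i, j = [(i - 1, j - 1), (i - 1, j), (i, j - 1)][step]
    return D[1:, 1:], path[::-1]
```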
4. Compression, Scalability, and Efficiency
The fully explicit nature of 4DGS allows for several compression and efficiency strategies:
- Anchor-based Predictive Coding (P-4DGS): Scene Gaussians are predicted via a learned mapping from a small set of canonical 3D anchors, with intra- and inter-frame redundancy exploited via spatial groupings and deformation MLPs. Adaptive quantization and context entropy coding substantially shrink the model, achieving high compression ratios on real scenes at state-of-the-art quality (Wang et al., 11 Oct 2025).
- Hybrid 3D–4D Representation: Temporally invariant Gaussians, identified via learned time-scale thresholds, are converted to 3D, while the full 4D parametrization is reserved for dynamic regions. This reduces memory and computational load and accelerates training severalfold without loss of fidelity (Oh et al., 19 May 2025); a minimal sketch of the static/dynamic split follows this list.
- Scale-Adaptive Filtering: Maximum-sampling frequency analysis yields per-Gaussian low-pass filters, and scale losses ensure no primitive encodes frequencies above its own Nyquist limit, eliminating speckle, “inflation,” and high-frequency zoom artifacts in multi-view renderings (Chen et al., 23 Nov 2025).
- Segmented Residual Learning: Cascaded temporal models (CTRL-GS) hierarchically decompose motion into video-constant, segment-constant, and frame-specific residuals, each parameterized by MLPs, enabling robust modeling of large dynamic variations (Hou et al., 23 May 2025).
- Compact Attribute Coding: Residual vector quantization and Huffman coding are employed for high-frequency attributes, with minimal PSNR loss relative to full-precision storage (Yang et al., 30 Dec 2024).
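As referenced in the hybrid 3D–4D item above, the static/dynamic split can be sketched as a simple per-Gaussian test: a Gaussian whose learned temporal scale spans essentially the whole sequence is flagged as temporally invariant and can drop its time dimension. The threshold ratio and the dummy data are assumptions for illustration.

```python
import torch

def split_static_dynamic(temporal_scales, seq_duration, threshold_ratio=1.0):
    """Classify Gaussians as temporally invariant (-> 3D) or dynamic (-> 4D).

    temporal_scales: (N,) learned time-axis scales (std. dev.) per Gaussian.
    A Gaussian whose temporal extent covers the whole sequence is treated as
    static; the ratio is an assumed tuning knob.
    """
    static_mask = temporal_scales >= threshold_ratio * seq_duration
    return static_mask, ~static_mask

# Example: route static Gaussians through plain 3DGS parameters and keep the
# full 4D parameterization only for the dynamic subset.
scales_t = torch.rand(10000) * 3.0       # dummy per-Gaussian temporal scales
static, dynamic = split_static_dynamic(scales_t, seq_duration=1.0)
```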
5. Empirical Performance and Benchmarks
Multi-Video 4DGS consistently achieves real-time rendering and high-fidelity novel view synthesis across diverse benchmarks:
| Dataset | Baseline | PSNR↑ | With Enhancement | PSNR↑ (Δ) | SSIM↑ (Δ) | LPIPS↓ (Δ) |
|---|---|---|---|---|---|---|
| DyNeRF (6 scenes, ~20 views) | 4DGaussians | 26.94 | +Temporal Align (Xu et al., 14 Nov 2025) | 28.73 (+1.79dB) | +0.03 | -10% |
| Technicolor (3-view) | 4DGS | 21.78 | GC-4DGS (Li et al., 28 Nov 2025) | 23.96 (+2.18dB) | +0.051 | -0.114 |
| Panoptic Studio | SyncNeRF | 24.3 | SyncTrack4D (Lee et al., 3 Dec 2025) | 26.3 (+2.0dB) | — | — |
These improvements reflect not only increased photometric accuracy but qualitative enhancements: elimination of ghosting, improved dynamic sharpness (e.g., fast spins or fluid motion), and robust temporal coherence. On hardware-limited devices such as the NVIDIA Jetson AGX Orin, multi-video 4DGS can achieve interactive frame rates after cloud-side training (Li et al., 28 Nov 2025).
6. Design Choices, Implementation, and Hyperparameters
Key architecture and implementation choices include:
- Coarse search window: sized to cover the maximal expected jitter (on the order of ten integer frames) while remaining computationally feasible (Xu et al., 14 Nov 2025).
- Number of reference frames: up to roughly $50$ for temporal alignment.
- Regularization strength: a small weight-decay penalty on sub-frame offsets for unsynchronized videos; separate scale-loss weights for anti-aliasing variants.
- Initialization: Sparse 3D points from SfM for seed positions; foreground masks for focus on dynamic elements.
- Differentiable time queries: Either via neural MLPs (HexPlane/voxel-MLP) or as explicit finite-difference gradients in explicit covariance models.
- Dense per-video feature aggregation: For synchronization in highly unconstrained, object-agnostic settings (Lee et al., 3 Dec 2025).
- Densification/Pruning: On-the-fly splitting of underfit Gaussians and removal of low-significance ones, enabling adaptive fidelity; a minimal sketch follows this list.
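A minimal sketch of densification and pruning in the spirit of adaptive density control for Gaussian splatting; the thresholds are illustrative, and the split step is simplified to appending a single half-scale copy rather than resampling child Gaussians.

```python
import torch

def densify_and_prune(positions, scales, opacities, grad_accum,
                      grad_thresh=2e-4, opacity_thresh=0.005, scale_split=0.8):
    """Adaptive density control: clone/split under-fit Gaussians, prune faint ones.

    positions: (N, 3) means, scales: (N, 3) per-axis scales,
    opacities: (N, 1), grad_accum: (N,) accumulated view-space gradient norms.
    """
    keep = opacities.squeeze(-1) > opacity_thresh        # drop near-transparent Gaussians
    underfit = grad_accum > grad_thresh                   # large positional gradients
    large = scales.max(dim=-1).values > scale_split       # too coarse to simply clone

    clone = underfit & ~large & keep                      # duplicate small under-fit ones
    split = underfit & large & keep                       # add a half-scale copy of large ones

    new_pos = torch.cat([positions[keep], positions[clone], positions[split]])
    new_scl = torch.cat([scales[keep], scales[clone], scales[split] * 0.5])
    new_opa = torch.cat([opacities[keep], opacities[clone], opacities[split]])
    return new_pos, new_scl, new_opa
```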
Runtime rendering is universally implemented via massively parallel tiled rasterization, supporting $30$–$200$ FPS at moderate-to-high resolution on modern GPUs (Wu et al., 2023, Yang et al., 30 Dec 2024).
7. Current Limitations and Future Directions
While Multi-Video 4DGS offers significant advances over prior radiance field methods and per-frame 3DGS, several challenges remain:
- Sensitivity to extreme scene dynamics or occlusions without additional supervision (e.g., depth, optical flow).
- Handling very long sequences may require further parameter compression or hybrid representations.
- Synchronization of videos with extreme independent jitter, differing frame rates, or drift remains an open research problem, though coarse-to-fine alignment (Xu et al., 14 Nov 2025) and cross-video trajectory matching (Lee et al., 3 Dec 2025) partially address these.
- Integration of semantic priors (e.g., segmentation or object detection) and context/self-supervised features can boost temporal and geometric coherence in the presence of challenging scene dynamics (Song et al., 9 Mar 2025).
Ongoing work focuses on more memory-efficient variants, structured anchor-offset models, IoT deployment, and generalization to arbitrary asynchronous inputs.
In summary, Multi-Video 4D Gaussian Splatting unifies dynamic scene modeling, alignment, and rendering across synchronized and unsynchronized multi-view video, providing an explicit, differentiable, and highly efficient framework for spatiotemporal neural scene representation and high-quality 4D reconstruction (Xu et al., 14 Nov 2025, Wu et al., 2023, Yang et al., 30 Dec 2024, Lee et al., 3 Dec 2025).