
SyncTrack4D: 4D Gaussian Splatting Pipeline

Updated 6 December 2025
  • SyncTrack4D is a pipeline for synchronized multi-view 4D Gaussian Splatting that aligns unsynchronized video sets using dense feature tracking and FGW optimal transport.
  • It integrates dense 4D track extraction, dynamic time warping, and continuous spline-based sub-frame synchronization to achieve precise temporal alignment.
  • Empirical evaluations demonstrate sub-frame synchronization accuracy (error <0.26 frames) and high-quality renderings, validated on both synthetic and real-world datasets.

SyncTrack4D is a pipeline for synchronized multi-view 4D Gaussian Splatting (4DGS) from unsynchronized monocular or multi-view video sets. It couples dense 4D track matching, cross-video temporal alignment, and explicit continuous 4D Gaussian scene representations, enabling high-fidelity dynamic scene reconstruction and sub-frame video synchronization without requiring object templates, prior models, or hardware triggers. The framework formalizes cross-video motion alignment using fused Gromov-Wasserstein (FGW) optimal transport and continuous-time trajectory parameterization to achieve robust sub-frame alignment and rendering of dynamic, real-world scenes (Lee et al., 3 Dec 2025).

1. Multi-Video Input and Pipeline Structure

SyncTrack4D operates on $V$ unsynchronized videos,

$$V^1 = \{I^1_1, \ldots, I^1_{T_1}\}, \quad \ldots, \quad V^V = \{I^V_1, \ldots, I^V_{T_V}\},$$

with known camera intrinsics $K^v$ and extrinsics $P^v_t$. The pipeline consists of the following core stages (a minimal sketch of the input layout follows the list):

  1. Dense 4D Feature Track Extraction: For each video, 2D pixel tracks (e.g., via SpatialTracker) and optical flows are lifted into initial monocular 4DGS models (based on MoSca). Dense per-video 4D tracks

$$\tau^v_i = \{ q^v_i(1), \ldots, q^v_i(T_v) \}$$

(with $q^v_i(t) \in \mathbb{R}^3$) are extracted, each with a fixed feature vector $f^v_i \in \mathbb{R}^F$ (typically DINOv3 descriptors). Additionally, anchor (scaffold) tracks $\hat{\tau}^v_j$ are identified for motion compression.

  2. Cross-Video Correspondence via FGW: Dense matching across videos is achieved by casting the problem as FGW optimal transport over both feature similarity and the structure of track geometries.
  3. Coarse Global Frame-Level Temporal Alignment (DTW): Dynamic Time Warping is applied to the matched tracks to solve globally for a discrete frame offset per video.
  4. Sub-Frame Synchronization and Unified 4DGS: Continuous-time spline-based parameterization enables fine alignment (sub-frame offsets) and fuses all video tracks into a single synchronized 4DGS scene, optimized via photometric and geometric objectives.
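
The sketch below shows one plausible way to organize the per-video inputs in Python; `VideoStream` and its field names are hypothetical containers for illustration, not identifiers from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VideoStream:
    """One unsynchronized input video with its calibration (hypothetical container)."""
    frames: np.ndarray  # (T_v, H, W, 3) RGB frames I^v_t
    K: np.ndarray       # (3, 3) camera intrinsics K^v
    poses: np.ndarray   # (T_v, 4, 4) per-frame extrinsics P^v_t

# A capture session is a list of such streams, one per camera, with no
# assumption that frame index t refers to the same instant across videos.
videos: list[VideoStream] = []
```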

2. Dense 4D Tracks and Fused Gromov-Wasserstein Correspondence

Each per-video 4D track is

$$\tau^v_i = \{ q^v_i(1), \ldots, q^v_i(T_v) \}, \quad q^v_i(t) \in \mathbb{R}^3,$$

with a constant descriptor $f^v_i \in \mathbb{R}^F$ from DINOv3.

Pairwise FGW Matching:

For a pair of videos (reference $v_a$, query $v_b$):

  • Feature Cost: $M^{ab}_{ij} = 1 - \cos(f^a_i, f^b_j)$.
  • Intra-track Structure: $C^v_{ij} = \max_t \| q^v_i(t) - q^v_j(t) \|_2$.

FGW optimal transport solves:

$$\gamma^* = \arg\min_{\gamma \in \Pi(\mu^a, \mu^b)} \sum_{i,j,k,l} (C^a_{ik} - C^b_{jl})^2 \gamma_{ij} \gamma_{kl} + \alpha \sum_{i,j} M^{ab}_{ij} \gamma_{ij} + \epsilon H(\gamma)$$

with entropic regularization $H(\gamma)$ and uniform weights $\mu^a_i = 1/N_a$, $\mu^b_j = 1/N_b$. In practice, Sinkhorn-style iterations (e.g., via the POT library) yield $\gamma^*$, from which top-$k$ correspondences are discretized via the Hungarian algorithm.

This formulation leverages both geometric and semantic information, enforcing per-track feature similarity while also matching the relational structure of motion and scene geometry across videos.
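
A minimal Python sketch of this pairwise matching step, assuming the POT and SciPy libraries mentioned above; array shapes and function names are illustrative. Two hedges: POT's `alpha` convention weights the structure term (whereas the paper's $\alpha$ weights the feature term), and POT's exact conditional-gradient solver stands in here for the entropic Sinkhorn-style variant.

```python
import numpy as np
import ot  # POT: Python Optimal Transport
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def intra_track_cost(tracks):
    """C^v_{ij} = max_t ||q^v_i(t) - q^v_j(t)||_2 for tracks of shape (N, T, 3)."""
    diff = tracks[:, None, :, :] - tracks[None, :, :, :]  # (N, N, T, 3)
    return np.linalg.norm(diff, axis=-1).max(axis=-1)     # max over time

def fgw_track_matching(feat_a, feat_b, tracks_a, tracks_b, alpha=0.5):
    """Match dense 4D tracks of two videos via fused Gromov-Wasserstein.

    feat_*:   (N, F) per-track DINOv3 descriptors f^v_i
    tracks_*: (N, T, 3) 3D track points q^v_i(t)
    """
    M = cdist(feat_a, feat_b, metric="cosine")        # cosine distance = 1 - cos(f^a_i, f^b_j)
    C_a, C_b = intra_track_cost(tracks_a), intra_track_cost(tracks_b)
    p = np.full(len(feat_a), 1.0 / len(feat_a))       # uniform mu^a
    q = np.full(len(feat_b), 1.0 / len(feat_b))       # uniform mu^b
    # POT's exact solver; note POT's alpha weights the *structure* term,
    # and the entropic Sinkhorn-style variant is not used in this sketch.
    gamma = ot.gromov.fused_gromov_wasserstein(
        M, C_a, C_b, p, q, loss_fun="square_loss", alpha=alpha)
    # Discretize the soft plan into one-to-one matches (Hungarian algorithm).
    rows, cols = linear_sum_assignment(-gamma)
    return list(zip(rows, cols))
```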

3. Temporal Alignment: Frame-Level and Sub-Frame

Global Frame-Level Alignment:

Introduce an integer offset $\Delta t_v$ per video ($\Delta t_{\mathrm{ref}} = 0$ for the reference). Frame-to-frame geometric costs for the matched pairs $\tau^a_i \leftrightarrow \tau^b_j$ are computed as

$$D^{geo}_{ab}(t,u) = \frac{1}{|\mathcal{M}|} \sum_{(i,j)\in\mathcal{M}} \| q^a_i(t) - q^b_j(u) \|_1,$$

where $\mathcal{M}$ denotes the set of matched pairs. Dynamic Time Warping then finds a monotonic mapping between the two frame sequences, from which the optimal shift $\Delta t_v$ is deduced, usually by selecting the most frequent offset along the DTW path.
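
As an illustration of this step, the sketch below runs a textbook DTW over a precomputed cost matrix $D^{geo}_{ab}$ and extracts the most frequent offset along the warping path; it is a minimal reference implementation, not the paper's code.

```python
import numpy as np
from collections import Counter

def frame_offset_via_dtw(D):
    """Recover the integer frame offset from a (T_a, T_b) cost matrix
    D[t, u] = D^geo_ab(t, u); a textbook O(T_a * T_b) DTW, for illustration."""
    Ta, Tb = D.shape
    acc = np.full((Ta + 1, Tb + 1), np.inf)
    acc[0, 0] = 0.0
    for t in range(1, Ta + 1):
        for u in range(1, Tb + 1):
            acc[t, u] = D[t - 1, u - 1] + min(
                acc[t - 1, u], acc[t, u - 1], acc[t - 1, u - 1])
    # Backtrack the optimal monotonic warping path.
    path, (t, u) = [], (Ta, Tb)
    while (t, u) != (0, 0):
        path.append((t - 1, u - 1))
        t, u = min(((t - 1, u), (t, u - 1), (t - 1, u - 1)),
                   key=lambda s: acc[s] if min(s) >= 0 else np.inf)
    # The per-video shift Delta t_v is the most frequent offset on the path.
    return Counter(u - t for t, u in path).most_common(1)[0][0]
```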

Sub-Frame Synchronization and Spline-Based Trajectory Modeling:

To refine synchronization at the sub-frame level, each video's temporal offset $\Delta t_v$ is extended from an integer to a real value $\Delta t_v \in \mathbb{R}$. Anchor (scaffold) tracks are parameterized as cubic Hermite splines,

$$\hat{\tau}^v_j(t) = \mathrm{Spline}(t + \Delta t_v; \Phi^v_j),$$

with control points $\Phi^v_j$. This enables interpolation of trajectories at non-integer timesteps, so $\Delta t_v$ can be optimized by gradient descent to sub-frame accuracy.

Leaf Gaussian trajectories $\mu_i(t)$ are modeled as linear blends of nearby spline anchors, yielding a globally consistent, smoothly time-varying 4D scene graph.
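
The following PyTorch sketch shows why the spline parameterization makes the offset differentiable: query times enter the cubic Hermite basis continuously, so gradients flow back into a learnable $\Delta t_v$. The basis functions are standard; shapes, tangent handling, and names are illustrative.

```python
import torch

def hermite_spline(t, knots, values, tangents):
    """Evaluate a cubic Hermite spline at (possibly fractional) query times t.

    t:        (Q,) query times, e.g. frame times plus a learnable Delta t_v
    knots:    (K,) increasing knot times of the control points Phi^v_j
    values:   (K, 3) anchor positions; tangents: (K, 3) anchor velocities
    """
    idx = torch.clamp(torch.searchsorted(knots, t) - 1, 0, len(knots) - 2)
    t0, t1 = knots[idx], knots[idx + 1]
    h = (t1 - t0).unsqueeze(-1)
    s = ((t - t0) / (t1 - t0)).unsqueeze(-1)  # normalized position in [0, 1]
    h00 = 2 * s**3 - 3 * s**2 + 1             # standard cubic Hermite basis
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return (h00 * values[idx] + h10 * h * tangents[idx]
            + h01 * values[idx + 1] + h11 * h * tangents[idx + 1])

# Sub-frame query: gradients reach the learnable per-video offset Delta t_v.
delta_t = torch.tensor(0.37, requires_grad=True)
knots = torch.arange(10.0)                    # one knot per frame time
values, tangents = torch.randn(10, 3), torch.zeros(10, 3)
pos = hermite_spline(torch.tensor([3.0, 4.0]) + delta_t, knots, values, tangents)
pos.sum().backward()                          # d(pos)/d(delta_t) is well-defined
```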

4. 4D Gaussian Splatting Formulation and Joint Optimization

SyncTrack4D’s final stage constructs a unified multi-video 4DGS, with each Gaussian $G_i$ defined by:

  • Time-varying mean $\mu_i(t) \in \mathbb{R}^3$,
  • Covariance $\Sigma_i \in \mathbb{R}^{3\times 3}$,
  • Color $c_i \in \mathbb{R}^3$.

Its contribution at a point and time $(x, t)$ is:

$$G_i(x, t) = c_i \cdot \exp\left( -\frac{1}{2}(x - \mu_i(t))^\top \Sigma_i^{-1} (x - \mu_i(t)) \right).$$
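
For concreteness, a direct NumPy evaluation of this formula for a single Gaussian; `mu_fn` is a hypothetical callable standing in for the spline-blended mean, and this illustrates the equation only, not the splatting renderer.

```python
import numpy as np

def gaussian_contribution(x, t, mu_fn, Sigma, c):
    """G_i(x, t) = c_i * exp(-0.5 (x - mu_i(t))^T Sigma_i^{-1} (x - mu_i(t))).

    x: (3,) query point, t: query time, mu_fn: t -> (3,) time-varying mean,
    Sigma: (3, 3) covariance, c: (3,) color.
    """
    d = x - mu_fn(t)
    mahal = d @ np.linalg.solve(Sigma, d)  # d^T Sigma^{-1} d without an explicit inverse
    return c * np.exp(-0.5 * mahal)
```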

Videos are rendered by splatting all $G_i$ at times $t + \Delta t_v$ into view $v$ and comparing the outputs $R_v[\cdot; \{G_i(\cdot, t + \Delta t_v)\}]$ against the ground-truth frames $I^v_t$. The full loss is:

$$L = \lambda_{photo} L_{photo} + \lambda_{arap} L_{arap} + \lambda_{vel} L_{vel} + \lambda_{acc} L_{acc},$$

where:

  • $L_{photo}$: photometric difference,
  • $L_{arap}$: as-rigid-as-possible regularization on the spline scaffold,
  • $L_{vel}$: velocity smoothness,
  • $L_{acc}$: acceleration smoothness.

Optimization variables include the Gaussian parameters $\{\mu_i(\cdot), \Sigma_i, c_i\}$, the spline controls $\{\Phi^v_j\}$, and the sub-frame offsets $\{\Delta t_v\}$, all jointly optimized with Adam or a related optimizer.
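
A toy PyTorch sketch of this joint optimization loop; the parameter shapes, learning rates, loss weights, and stand-in loss terms are all illustrative assumptions. The real $L_{photo}$ would render via splatting at times $t + \Delta t_v$ and compare against the frames $I^v_t$.

```python
import torch

# Hypothetical parameter groups for the joint optimization (shapes and
# hyperparameters are illustrative assumptions, not the paper's values).
spline_controls = torch.randn(40, 3, requires_grad=True)    # spline controls {Phi^v_j}
gaussian_means  = torch.randn(1000, 3, requires_grad=True)  # means only (Sigma_i, c_i omitted)
delta_t         = torch.zeros(4, requires_grad=True)        # sub-frame offsets {Delta t_v}

opt = torch.optim.Adam([
    {"params": [spline_controls, gaussian_means], "lr": 1e-3},
    {"params": [delta_t], "lr": 1e-4},   # offsets get a gentler step in this sketch
])

w_photo, w_arap, w_vel, w_acc = 1.0, 0.1, 0.01, 0.01  # illustrative loss weights

for step in range(200):
    opt.zero_grad()
    # Stand-in terms: the real L_photo splats all Gaussians at t + Delta t_v
    # and compares renderings to I^v_t; L_arap acts on the spline scaffold.
    l_photo = gaussian_means.square().mean() + delta_t.square().mean()
    l_arap = torch.zeros(())                          # ARAP term omitted in this toy
    vel = spline_controls[1:] - spline_controls[:-1]  # finite-difference velocity
    l_vel = vel.square().mean()                       # velocity smoothness
    l_acc = (vel[1:] - vel[:-1]).square().mean()      # acceleration smoothness
    loss = w_photo * l_photo + w_arap * l_arap + w_vel * l_vel + w_acc * l_acc
    loss.backward()
    opt.step()
```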

5. Empirical Evaluation

SyncTrack4D’s evaluation is conducted on:

  • CMU Panoptic Studio: a large multi-camera array capturing multi-human activities, providing challenging real-world dynamic scenes.
  • SyncNeRF Blender: Synthetic benchmark with 14 views and multiple objects (Box/Fox/Deer).

Test sequences are unsynchronized, with artificial offsets of up to ±30 frames.

Results:

  • Synchronization Accuracy: Post-alignment, average error is less than 0.26 frames on Panoptic Studio (improved from over 5 frames before alignment).
  • Novel-View Synthesis:
    • Panoptic Studio: PSNR ≈ 26.3, SSIM ≈ 0.88, LPIPS ≈ 0.14.
    • Outperforms SyncNeRF in both synchronized and unsynchronized conditions.
  • Qualitative Observations: Yields temporally coherent 4D reconstructions with smooth cross-view motion; accurately preserves fine temporal details (e.g., fast object/human motions).
Dataset             Synchronization Error (frames)   PSNR    SSIM    LPIPS
Panoptic Studio     <0.26                            ≈26.3   ≈0.88   ≈0.14
SyncNeRF Blender    Not specified

These results demonstrate sub-frame accuracy without hardware synchronization or predefined templates.

6. Technical Significance and Context

SyncTrack4D is, to date, the first general 4D Gaussian Splatting framework tailored for unsynchronized video sets without reliance on prior object models or explicit scene segmentations. The key innovation is leveraging dense 4D feature tracks and FGW optimal transport to drive both correspondence and alignment, enabling robust synchronization and scene consolidation across diverse, real-world scenarios. The method’s coarse-to-fine alignment cascade (DTW and continuous spline optimization) supports sub-frame precision in both frame association and 4DGS parameter estimation. This configuration yields high-fidelity, temporally coherent reconstructions suitable for dynamic scene renderings and further research into multi-view temporal alignment (Lee et al., 3 Dec 2025).
