
SyncTrack4D: 4D Gaussian Splatting Pipeline

Updated 6 December 2025
  • SyncTrack4D is a pipeline for synchronized multi-view 4D Gaussian Splatting that aligns unsynchronized video sets using dense feature tracking and FGW optimal transport.
  • It integrates dense 4D track extraction, dynamic time warping, and continuous spline-based sub-frame synchronization to achieve precise temporal alignment.
  • Empirical evaluations demonstrate sub-frame synchronization accuracy (error <0.26 frames) and high-quality renderings, validated on both synthetic and real-world datasets.

SyncTrack4D is a pipeline for synchronized multi-view 4D Gaussian Splatting (4DGS) from unsynchronized monocular or multi-view video sets. It couples dense 4D track matching, cross-video temporal alignment, and explicit continuous 4D Gaussian scene representations, enabling high-fidelity dynamic scene reconstruction and sub-frame video synchronization without requiring object templates, prior models, or hardware triggers. The framework formalizes cross-video motion alignment using fused Gromov-Wasserstein (FGW) optimal transport and continuous-time trajectory parameterization to achieve robust sub-frame alignment and rendering of dynamic, real-world scenes (Lee et al., 3 Dec 2025).

1. Multi-Video Input and Pipeline Structure

SyncTrack4D operates on $V$ unsynchronized videos,

$$V^1 = \{I^1_1, \ldots, I^1_{T_1}\}, \quad \ldots, \quad V^V = \{I^V_1, \ldots, I^V_{T_V}\},$$

with known camera intrinsics $K^v$ and extrinsics $P^v_t$. The pipeline consists of the following core stages (a minimal sketch of the input layout follows the list):

  1. Dense 4D Feature Track Extraction: For each video, 2D pixel tracks (e.g., via SpatialTracker) and optical flows are lifted into initial monocular 4DGS models (based on MoSca). Dense per-video 4D tracks

$$\tau^v_i = \{ q^v_i(1), \ldots, q^v_i(T_v) \}$$

(with $q^v_i(t) \in \mathbb{R}^3$) are extracted, each with a fixed feature vector $f^v_i \in \mathbb{R}^F$ (typically DINOv3 descriptors). Additionally, anchor (scaffold) tracks $\hat{\tau}^v_j$ are identified for motion compression.

  2. Cross-Video Correspondence via FGW: Dense matching across videos is achieved by casting the problem as FGW optimal transport over both feature similarity and the structure of track geometries.
  3. Coarse Global Frame-Level Temporal Alignment (DTW): Dynamic Time Warping is applied to the matched tracks to solve globally for a discrete frame offset per video.
  4. Sub-Frame Synchronization and Unified 4DGS: Continuous-time spline-based parameterization enables fine alignment (sub-frame offsets) and fuses all video tracks into a single synchronized 4DGS scene, optimized via photometric and geometric objectives.
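
The sketch below shows one plausible way to organize the per-video inputs in Python; `VideoStream` and its field names are hypothetical containers for illustration, not identifiers from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VideoStream:
    """One unsynchronized input video with its calibration (hypothetical container)."""
    frames: np.ndarray  # (T_v, H, W, 3) RGB frames I^v_t
    K: np.ndarray       # (3, 3) camera intrinsics K^v
    poses: np.ndarray   # (T_v, 4, 4) per-frame extrinsics P^v_t

# A capture session is a list of such streams, one per camera, with no
# assumption that frame index t refers to the same instant across videos.
videos: list[VideoStream] = []
```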

2. Dense 4D Tracks and Fused Gromov-Wasserstein Correspondence

Each per-video 4D track is

$$\tau^v_i = \{ q^v_i(1), \ldots, q^v_i(T_v) \}, \quad q^v_i(t) \in \mathbb{R}^3,$$

with a constant descriptor $f^v_i \in \mathbb{R}^F$ from DINOv3.

Pairwise FGW Matching:

For a pair of videos (reference $v_a$, query $v_b$):

  • Feature Cost: $M^{ab}_{ij} = 1 - \cos(f^a_i, f^b_j)$.
  • Intra-track Structure: $C^v_{ij} = \max_t \| q^v_i(t) - q^v_j(t) \|_2$.

FGW optimal transport solves:

$$\gamma^* = \arg\min_{\gamma \in \Pi(\mu^a, \mu^b)} \sum_{i,j,k,l} (C^a_{ik} - C^b_{jl})^2 \gamma_{ij} \gamma_{kl} + \alpha \sum_{i,j} M^{ab}_{ij} \gamma_{ij} + \epsilon H(\gamma)$$

with entropic regularization $H(\gamma)$ and uniform weights $\mu^a_i = 1/N_a$, $\mu^b_j = 1/N_b$. In practice, Sinkhorn-style iterations (e.g., via the POT library) yield $\gamma^*$, from which top-$k$ correspondences are discretized via the Hungarian algorithm.

This formulation leverages both geometric and semantic information, enforcing per-track feature similarity while also matching the relational structure of motion and scene geometry across videos.
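
A minimal Python sketch of this pairwise matching step, assuming the POT and SciPy libraries mentioned above; array shapes and function names are illustrative. Two hedges: POT's `alpha` convention weights the structure term (whereas the paper's $\alpha$ weights the feature term), and POT's exact conditional-gradient solver stands in here for the entropic Sinkhorn-style variant.

```python
import numpy as np
import ot  # POT: Python Optimal Transport
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def intra_track_cost(tracks):
    """C^v_{ij} = max_t ||q^v_i(t) - q^v_j(t)||_2 for tracks of shape (N, T, 3)."""
    diff = tracks[:, None, :, :] - tracks[None, :, :, :]  # (N, N, T, 3)
    return np.linalg.norm(diff, axis=-1).max(axis=-1)     # max over time

def fgw_track_matching(feat_a, feat_b, tracks_a, tracks_b, alpha=0.5):
    """Match dense 4D tracks of two videos via fused Gromov-Wasserstein.

    feat_*:   (N, F) per-track DINOv3 descriptors f^v_i
    tracks_*: (N, T, 3) 3D track points q^v_i(t)
    """
    M = cdist(feat_a, feat_b, metric="cosine")        # cosine distance = 1 - cos(f^a_i, f^b_j)
    C_a, C_b = intra_track_cost(tracks_a), intra_track_cost(tracks_b)
    p = np.full(len(feat_a), 1.0 / len(feat_a))       # uniform mu^a
    q = np.full(len(feat_b), 1.0 / len(feat_b))       # uniform mu^b
    # POT's exact solver; note POT's alpha weights the *structure* term,
    # and the entropic Sinkhorn-style variant is not used in this sketch.
    gamma = ot.gromov.fused_gromov_wasserstein(
        M, C_a, C_b, p, q, loss_fun="square_loss", alpha=alpha)
    # Discretize the soft plan into one-to-one matches (Hungarian algorithm).
    rows, cols = linear_sum_assignment(-gamma)
    return list(zip(rows, cols))
```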

3. Temporal Alignment: Frame-Level and Sub-Frame

Global Frame-Level Alignment:

Introduce an integer offset $\Delta t_v$ per video ($\Delta t_{\mathrm{ref}} = 0$ for the reference). Frame-to-frame geometric costs for the matched pairs $\tau^a_i \leftrightarrow \tau^b_j$ are computed as

$$D^{geo}_{ab}(t,u) = \frac{1}{|\mathcal{M}|} \sum_{(i,j)\in\mathcal{M}} \| q^a_i(t) - q^b_j(u) \|_1,$$

where $\mathcal{M}$ denotes the set of matched pairs. Dynamic Time Warping then finds a monotonic mapping between the two frame sequences, from which the optimal shift $\Delta t_v$ is deduced, usually by selecting the most frequent offset along the DTW path.
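
As an illustration of this step, the sketch below runs a textbook DTW over a precomputed cost matrix $D^{geo}_{ab}$ and extracts the most frequent offset along the warping path; it is a minimal reference implementation, not the paper's code.

```python
import numpy as np
from collections import Counter

def frame_offset_via_dtw(D):
    """Recover the integer frame offset from a (T_a, T_b) cost matrix
    D[t, u] = D^geo_ab(t, u); a textbook O(T_a * T_b) DTW, for illustration."""
    Ta, Tb = D.shape
    acc = np.full((Ta + 1, Tb + 1), np.inf)
    acc[0, 0] = 0.0
    for t in range(1, Ta + 1):
        for u in range(1, Tb + 1):
            acc[t, u] = D[t - 1, u - 1] + min(
                acc[t - 1, u], acc[t, u - 1], acc[t - 1, u - 1])
    # Backtrack the optimal monotonic warping path.
    path, (t, u) = [], (Ta, Tb)
    while (t, u) != (0, 0):
        path.append((t - 1, u - 1))
        t, u = min(((t - 1, u), (t, u - 1), (t - 1, u - 1)),
                   key=lambda s: acc[s] if min(s) >= 0 else np.inf)
    # The per-video shift Delta t_v is the most frequent offset on the path.
    return Counter(u - t for t, u in path).most_common(1)[0][0]
```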

Sub-Frame Synchronization and Spline-Based Trajectory Modeling:

To refine synchronization at the sub-frame level, each video's temporal offset $\Delta t_v$ is extended from an integer to a real value $\Delta t_v \in \mathbb{R}$. Anchor (scaffold) tracks are parameterized as cubic Hermite splines,

$$\hat{\tau}^v_j(t) = \mathrm{Spline}(t + \Delta t_v; \Phi^v_j),$$

with control points $\Phi^v_j$. This enables interpolation of trajectories at non-integer timesteps, so $\Delta t_v$ can be optimized by gradient descent to sub-frame accuracy.

Leaf Gaussian trajectories $\mu_i(t)$ are modeled as linear blends of nearby spline anchors, yielding a globally consistent, smoothly time-varying 4D scene graph.
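
The following PyTorch sketch shows why the spline parameterization makes the offset differentiable: query times enter the cubic Hermite basis continuously, so gradients flow back into a learnable $\Delta t_v$. The basis functions are standard; shapes, tangent handling, and names are illustrative.

```python
import torch

def hermite_spline(t, knots, values, tangents):
    """Evaluate a cubic Hermite spline at (possibly fractional) query times t.

    t:        (Q,) query times, e.g. frame times plus a learnable Delta t_v
    knots:    (K,) increasing knot times of the control points Phi^v_j
    values:   (K, 3) anchor positions; tangents: (K, 3) anchor velocities
    """
    idx = torch.clamp(torch.searchsorted(knots, t) - 1, 0, len(knots) - 2)
    t0, t1 = knots[idx], knots[idx + 1]
    h = (t1 - t0).unsqueeze(-1)
    s = ((t - t0) / (t1 - t0)).unsqueeze(-1)  # normalized position in [0, 1]
    h00 = 2 * s**3 - 3 * s**2 + 1             # standard cubic Hermite basis
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return (h00 * values[idx] + h10 * h * tangents[idx]
            + h01 * values[idx + 1] + h11 * h * tangents[idx + 1])

# Sub-frame query: gradients reach the learnable per-video offset Delta t_v.
delta_t = torch.tensor(0.37, requires_grad=True)
knots = torch.arange(10.0)                    # one knot per frame time
values, tangents = torch.randn(10, 3), torch.zeros(10, 3)
pos = hermite_spline(torch.tensor([3.0, 4.0]) + delta_t, knots, values, tangents)
pos.sum().backward()                          # d(pos)/d(delta_t) is well-defined
```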

4. 4D Gaussian Splatting Formulation and Joint Optimization

SyncTrack4D’s final stage constructs a unified multi-video 4DGS, with each Gaussian $G_i$ defined by:

  • Time-varying mean $\mu_i(t) \in \mathbb{R}^3$,
  • Covariance $\Sigma_i \in \mathbb{R}^{3\times 3}$,
  • Color $c_i \in \mathbb{R}^3$.

Its contribution at a point and time $(x, t)$ is:

$$G_i(x, t) = c_i \cdot \exp\left( -\frac{1}{2}(x - \mu_i(t))^\top \Sigma_i^{-1} (x - \mu_i(t)) \right).$$
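
For concreteness, a direct NumPy evaluation of this formula for a single Gaussian; `mu_fn` is a hypothetical callable standing in for the spline-blended mean, and this illustrates the equation only, not the splatting renderer.

```python
import numpy as np

def gaussian_contribution(x, t, mu_fn, Sigma, c):
    """G_i(x, t) = c_i * exp(-0.5 (x - mu_i(t))^T Sigma_i^{-1} (x - mu_i(t))).

    x: (3,) query point, t: query time, mu_fn: t -> (3,) time-varying mean,
    Sigma: (3, 3) covariance, c: (3,) color.
    """
    d = x - mu_fn(t)
    mahal = d @ np.linalg.solve(Sigma, d)  # d^T Sigma^{-1} d without an explicit inverse
    return c * np.exp(-0.5 * mahal)
```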

Videos are rendered by splatting all $G_i$ at times $t + \Delta t_v$ into view $v$ and comparing the outputs $R_v[\cdot; \{G_i(\cdot, t + \Delta t_v)\}]$ against the ground-truth frames $I^v_t$. The full loss is:

$$L = \lambda_{photo} L_{photo} + \lambda_{arap} L_{arap} + \lambda_{vel} L_{vel} + \lambda_{acc} L_{acc},$$

where:

  • $L_{photo}$: photometric difference,
  • $L_{arap}$: as-rigid-as-possible regularization on the spline scaffold,
  • $L_{vel}$: velocity smoothness,
  • $L_{acc}$: acceleration smoothness.

Optimization variables include the Gaussian parameters $\{\mu_i(\cdot), \Sigma_i, c_i\}$, the spline controls $\{\Phi^v_j\}$, and the sub-frame offsets $\{\Delta t_v\}$, all jointly optimized with Adam or a related optimizer.
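
A toy PyTorch sketch of this joint optimization loop; the parameter shapes, learning rates, loss weights, and stand-in loss terms are all illustrative assumptions. The real $L_{photo}$ would render via splatting at times $t + \Delta t_v$ and compare against the frames $I^v_t$.

```python
import torch

# Hypothetical parameter groups for the joint optimization (shapes and
# hyperparameters are illustrative assumptions, not the paper's values).
spline_controls = torch.randn(40, 3, requires_grad=True)    # spline controls {Phi^v_j}
gaussian_means  = torch.randn(1000, 3, requires_grad=True)  # means only (Sigma_i, c_i omitted)
delta_t         = torch.zeros(4, requires_grad=True)        # sub-frame offsets {Delta t_v}

opt = torch.optim.Adam([
    {"params": [spline_controls, gaussian_means], "lr": 1e-3},
    {"params": [delta_t], "lr": 1e-4},   # offsets get a gentler step in this sketch
])

w_photo, w_arap, w_vel, w_acc = 1.0, 0.1, 0.01, 0.01  # illustrative loss weights

for step in range(200):
    opt.zero_grad()
    # Stand-in terms: the real L_photo splats all Gaussians at t + Delta t_v
    # and compares renderings to I^v_t; L_arap acts on the spline scaffold.
    l_photo = gaussian_means.square().mean() + delta_t.square().mean()
    l_arap = torch.zeros(())                          # ARAP term omitted in this toy
    vel = spline_controls[1:] - spline_controls[:-1]  # finite-difference velocity
    l_vel = vel.square().mean()                       # velocity smoothness
    l_acc = (vel[1:] - vel[:-1]).square().mean()      # acceleration smoothness
    loss = w_photo * l_photo + w_arap * l_arap + w_vel * l_vel + w_acc * l_acc
    loss.backward()
    opt.step()
```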

5. Empirical Evaluation

SyncTrack4D’s evaluation is conducted on:

  • CMU Panoptic Studio: a large multi-camera array capturing multi-human activities, providing challenging real-world dynamic scenes.
  • SyncNeRF Blender: Synthetic benchmark with 14 views and multiple objects (Box/Fox/Deer).

Test sequences are unsynchronized, with artificial offsets of up to ±30 frames.

Results:

  • Synchronization Accuracy: Post-alignment, average error is less than 0.26 frames on Panoptic Studio (improved from over 5 frames before alignment).
  • Novel-View Synthesis:
    • Panoptic Studio: PSNR ≈ 26.3, SSIM ≈ 0.88, LPIPS ≈ 0.14.
    • Outperforms SyncNeRF in both synchronized and unsynchronized conditions.
  • Qualitative Observations: Yields temporally coherent 4D reconstructions with smooth cross-view motion; accurately preserves fine temporal details (e.g., fast object/human motions).
Dataset             Synchronization Error (frames)   PSNR    SSIM    LPIPS
Panoptic Studio     <0.26                            ≈26.3   ≈0.88   ≈0.14
SyncNeRF Blender    Not specified

These results demonstrate sub-frame accuracy without hardware synchronization or predefined templates.

6. Technical Significance and Context

SyncTrack4D is, to date, the first general 4D Gaussian Splatting framework tailored for unsynchronized video sets without reliance on prior object models or explicit scene segmentations. The key innovation is leveraging dense 4D feature tracks and FGW optimal transport to drive both correspondence and alignment, enabling robust synchronization and scene consolidation across diverse, real-world scenarios. The method’s coarse-to-fine alignment cascade (DTW and continuous spline optimization) supports sub-frame precision in both frame association and 4DGS parameter estimation. This configuration yields high-fidelity, temporally coherent reconstructions suitable for dynamic scene renderings and further research into multi-view temporal alignment (Lee et al., 3 Dec 2025).
