
Video Gaussian Masked Autoencoders

Updated 3 January 2026
  • The paper introduces a self-supervised framework that represents video frames as 3-D Gaussian splats, enforcing temporal correspondences for zero-shot tracking.
  • It employs heavy spatio-temporal masking and a differentiable volumetric renderer to reconstruct masked video patches through evolving 3-D Gaussian parameters.
  • Empirical results show that Video-GMAE outperforms prior methods on datasets like Kubric, DAVIS, and Kinetics, achieving superior zero-shot and fine-tuned tracking performance.

Video Gaussian Masked Autoencoders (Video-GMAE) is a self-supervised representation learning framework that encodes a sequence of video frames as a temporally evolving set of 3-D Gaussian primitives, or "splats." The architecture is built on a spatio-temporal masked autoencoder backbone with a differentiable Gaussian splatting volumetric renderer. By structuring the pretraining task as the reconstruction of masked videos through these moving Gaussians, the model enforces an inductive bias towards learning the underlying dynamic 3-D structure, which in turn facilitates emergent zero-shot object tracking performance and strong transfer to supervised tracking tasks (Baranwal et al., 27 Dec 2025).

1. Spatio-Temporal 3-D Gaussian Representation and Rendering

Each of the $K$ frames in a video is represented as a set of $n$ 3-D Gaussian splats, with each Gaussian $g_i^{(t)}$ parameterized as a 14-dimensional vector:

$$g_i = (\mu_i,\, s_i,\, \varphi_i,\, r_i,\, o_i)$$

where $\mu_i \in \mathbb{R}^3$ is the 3-D center, $s_i \in \mathbb{R}^3$ defines a diagonal scale matrix $S$, $\varphi_i \in \mathbb{R}^4$ is a unit quaternion specifying rotation $R$, $r_i \in \mathbb{R}^3$ is RGB color, and $o_i \in \mathbb{R}$ is opacity.
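As a concrete illustration, the 14-dimensional layout above can be packed as a flat vector. This is a minimal NumPy sketch; the function name and argument conventions are assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical packing of the 14-D per-Gaussian parameter vector
# (function name and argument order are assumptions, not from the paper).
def pack_gaussian(mu, s, phi, r, o):
    """Concatenate center (3) + scale (3) + quaternion (4) + color (3) + opacity (1)."""
    phi = np.asarray(phi, dtype=float)
    phi = phi / np.linalg.norm(phi)      # keep the rotation quaternion on the unit sphere
    return np.concatenate([np.asarray(mu, float), np.asarray(s, float), phi,
                           np.asarray(r, float), [o]])

g = pack_gaussian(mu=[0.1, -0.2, 1.5], s=[0.05, 0.05, 0.05],
                  phi=[1.0, 0.0, 0.0, 0.0], r=[0.8, 0.4, 0.2], o=0.9)
```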

In the camera's frame, the 3-D Gaussian is projected onto the image plane via standard intrinsic and extrinsic calibration, inducing a 2-D elliptical Gaussian. The projected center $x_i = \Pi(K, [R|t], \mu_i)$ and the projected covariance $\Sigma_i^2$ define a per-pixel contribution:

$$G_i(u) = \exp\left(-\tfrac{1}{2}(u-x_i)^\top \Sigma_i^{-2}(u-x_i)\right)$$

Rendering uses differentiable volume compositing to accumulate per-pixel opacities $\alpha_i(u)$ and colors [Kerbl et al. 2023].
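The footprint and compositing steps can be sketched as follows, assuming the 2-D center $x_i$ and projected covariance have already been computed by $\Pi$. Function names and shapes are illustrative, not from the paper.

```python
import numpy as np

# Illustrative sketch of a single splat's footprint and front-to-back
# compositing; Sigma2 stands for the projected 2-D covariance above.
def gaussian_footprint(u, x_i, Sigma2):
    """G_i(u) = exp(-0.5 (u - x_i)^T Sigma2^{-1} (u - x_i))."""
    d = u - x_i
    return float(np.exp(-0.5 * d @ np.linalg.inv(Sigma2) @ d))

def composite(colors, alphas):
    """Accumulate colors front to back: C = sum_i c_i a_i prod_{j<i} (1 - a_j)."""
    out, transmittance = np.zeros(3), 1.0
    for c, a in zip(colors, alphas):
        out += transmittance * a * np.asarray(c, dtype=float)
        transmittance *= 1.0 - a
    return out
```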

The encoder is a Vision Transformer (ViT) processing the $K$ video frames (split into $16 \times 16$ spatial patches), using learned spatial and temporal embeddings and heavy spatio-temporal masking. The decoder, a lightweight ViT, receives the latent tokens together with $M = K \times n$ learnable query tokens to predict Gaussian parameters: the first $n$ queries decode all parameters for frame $t=1$, and the remaining $(K-1) \times n$ predict residuals $(\Delta\mu_i^{(t)}, \Delta r_i^{(t)})$ for subsequent frames. Residuals are recursively integrated to yield complete splat sets for reconstruction.
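The recursive residual integration amounts to a cumulative sum over the predicted deltas. A hedged sketch with assumed shapes:

```python
import numpy as np

# Sketch of integrating the decoder's residual predictions into full splat
# trajectories; shapes are assumptions: mu1, r1 are (n, 3), deltas are (K-1, n, 3).
def integrate_residuals(mu1, r1, delta_mu, delta_r):
    """Frame 1 keeps its full parameters; frames 2..K accumulate the deltas."""
    mu = np.concatenate([mu1[None], mu1[None] + np.cumsum(delta_mu, axis=0)])
    rgb = np.concatenate([r1[None], r1[None] + np.cumsum(delta_r, axis=0)])
    return mu, rgb
```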

2. Masking Strategy and Pretraining Objective

Video-GMAE utilizes 95% random spatio-temporal masking of video patches, following MAE-ST [Feichtenhofer et al. 2022], with only 5% of tokens exposed to the encoder. The decoder must reconstruct both initial-frame Gaussian parameters and per-frame deltas entirely from these visible tokens and learned query tokens.
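A minimal sketch of this masking scheme over a flattened token grid; the helper name and fixed seed are illustrative, not from the paper.

```python
import numpy as np

# MAE-style random masking sketch: keep 5% of the K x P token grid visible
# (helper name and fixed seed are assumptions for illustration).
def sample_visible(num_tokens, mask_ratio=0.95, seed=0):
    keep = max(1, int(round(num_tokens * (1.0 - mask_ratio))))
    perm = np.random.default_rng(seed).permutation(num_tokens)
    return np.sort(perm[:keep])              # sorted indices of visible tokens

visible = sample_visible(8 * 196)            # e.g. K=8 frames of 14x14 patches
```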

The reconstruction objective is a per-pixel L2 loss summed over all video frames:

$$L_{\text{rec}} = \sum_{t=1}^{K} \sum_{u \in \text{pixels}} \left\| I_t(u) - \hat{I}_t(u) \right\|_2^2$$

While an optional perceptual loss can be added, most gains are attributed to the pixelwise L2 loss. The forced prediction of both full initial Gaussians and their per-frame residuals ensures alignment of Gaussian identities across frames, effectively imposing a temporal correspondence prior.
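The pixelwise objective is straightforward to express in code. An illustrative sketch with assumed $(K, H, W, 3)$ tensor shapes, not the paper's implementation:

```python
import numpy as np

# Illustrative sketch of L_rec as defined above; shapes (K, H, W, 3) are assumed.
def reconstruction_loss(video, rendered):
    """Squared L2 error summed over all frames and pixels."""
    return float(np.sum((video - rendered) ** 2))

K, H, W = 8, 16, 16
loss = reconstruction_loss(np.zeros((K, H, W, 3)), np.full((K, H, W, 3), 0.5))
```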

Masking prevents the model from exploiting pixel-level shortcuts, requiring temporally structured latent representations to accomplish the reconstruction objective.

3. Emergence of Tracking and Zero-Shot Correspondence

In a pretrained model, tracking "emerges" by projecting the recovered 3-D Gaussian trajectories onto the 2-D image plane at each time step $t$, yielding $x_i^{(t)} = \Pi(K, [R|t], \mu_i^{(t)})$. Per-Gaussian pixel displacement vectors $\Delta x_i^{(t)} = x_i^{(t)} - x_i^{(t-1)}$ are encoded as pseudo-RGB flow splats, and a dense optical flow field $F^{(t)}(u)$ is formed via opacity-weighted summation over all $n$ Gaussians.
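The opacity-weighted flow construction can be sketched as follows; all names, shapes, and the small `eps` normalizer are assumptions for illustration.

```python
import numpy as np

# Sketch of the dense flow field F^(t)(u): each Gaussian splats its 2-D
# displacement, weighted by opacity times its projected footprint.
def dense_flow(u_grid, centers_prev, centers_cur, Sigma2s, opacities, eps=1e-8):
    """u_grid: (H, W, 2); centers_*: (n, 2); Sigma2s: (n, 2, 2); opacities: (n,)."""
    H, W, _ = u_grid.shape
    flow = np.zeros((H, W, 2))
    weight = np.full((H, W), eps)            # avoids division by zero far from splats
    for x_prev, x_cur, S, o in zip(centers_prev, centers_cur, Sigma2s, opacities):
        d = u_grid - x_cur
        m = np.einsum('hwi,ij,hwj->hw', d, np.linalg.inv(S), d)
        w = o * np.exp(-0.5 * m)             # opacity-weighted footprint
        flow += w[..., None] * (x_cur - x_prev)
        weight += w
    return flow / weight[..., None]
```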

Given a 2-D query $p^{(0)}$, zero-shot tracking operates by advecting the point through the flow field and refining via anchor proposals weighted by local Gaussian opacities, using a visibility threshold and blending parameter for robust, occlusion-aware point propagation.

With $(K=8,\ \tau_{\text{vis}}=0.5,\ \beta=0.3)$, the resulting tracker demonstrates robust handling of occlusions and can generalize without fine-tuning to new video data.
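A bare-bones version of the advection step, omitting the anchor-proposal refinement, visibility threshold, and blending described above (names assumed):

```python
import numpy as np

# Minimal zero-shot propagation sketch: advect the query through successive
# flow fields via nearest-pixel lookup (refinement steps omitted for brevity).
def advect(p0, flows):
    """p0: (x, y) in pixels; flows: list of (H, W, 2) fields -> (T+1, 2) trajectory."""
    traj = [np.asarray(p0, dtype=float)]
    for F in flows:
        H, W, _ = F.shape
        x = int(np.clip(np.round(traj[-1][0]), 0, W - 1))
        y = int(np.clip(np.round(traj[-1][1]), 0, H - 1))
        traj.append(traj[-1] + F[y, x])      # displacement at the nearest pixel
    return np.stack(traj)
```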

4. Empirical Performance and Comparative Evaluation

Zero-Shot Tracking Results

Video-GMAE achieves the following metrics (for stride=5):

| Dataset | AJ | $\delta_{\text{avg}}^x$ | OA |
|---|---|---|---|
| Kubric | 54.3 | 67.0 | 91.9 |
| DAVIS | 41.3 | 55.7 | 85.2 |
| Kinetics | 60.1 | 69.1 | 90.7 |

Video-GMAE matches or outperforms the strongest self-supervised baseline (GMRW-C) on all three datasets.

Frozen-Encoder Transfer

In a frozen-encoder setup, compared to MAE-ST and VideoMAE, Video-GMAE exhibits superior performance on both Kinetics and Kubric:

| Model | Kinetics (AJ / $\delta_{\text{avg}}^x$ / OA) | Kubric (AJ / $\delta_{\text{avg}}^x$ / OA) |
|---|---|---|
| MAE-ST | 42.3 / 49.8 / 95.4 | 41.5 / 51.6 / 95.9 |
| VideoMAE | 46.9 / 54.3 / 94.9 | 44.8 / 55.2 / 95.8 |
| Video-GMAE | 65.1 / 72.0 / 97.4 | 62.4 / 71.9 / 96.6 |

Fine-Tuning

After supervised fine-tuning (TAP-Vid, stride=5), Video-GMAE further improves:

| Dataset | AJ | $\delta_{\text{avg}}^x$ | OA |
|---|---|---|---|
| Kubric | 73.6 | 82.3 | 97.5 |
| DAVIS | 55.7 | 66.1 | 92.1 |
| Kinetics | 75.0 | 81.7 | 97.7 |

These results surpass previous supervised and self-supervised approaches.

5. Ablative Analysis and Inductive Priors

Ablation studies corroborate that motion deltas $(\Delta\mu)$ are the principal contributors to downstream tracking capability: the joint $(\Delta\mu + \Delta r)$ configuration reaches 44.7 AJ on TAP-Vid DAVIS, versus 39.1 AJ for a model with no deltas, while color deltas alone yield a substantially lower 42.5 AJ, indicating that spatial movement is paramount for correspondence.

Frame-length scaling experiments indicate that excessively long temporal horizons during pretraining degrade tracking performance, with an optimal window at $F = 4$ to $8$ frames.

The core inductive bias of reconstructing video via 3-D moving Gaussians ensures that each latent Gaussian tracks a coherent scene part over time. This prior aligns the encoder’s features with point correspondences and induces latent spaces amenable to zero-shot any-point tracking.

Limitations include the assumption of a static camera, a fixed 256-Gaussian budget limiting fine spatial fidelity, and diminishing returns for larger $K$.

6. Context, Applications, and Limitations

By coupling a masked spatio-temporal autoencoding objective with a differentiable 3-D Gaussian splatting renderer, Video-GMAE establishes a robust self-supervised paradigm for learning temporally structured video representations. These representations not only yield strong zero-shot and fine-tuned tracking benchmarks, but also explicitly encode dynamic scene structure aligned with physical intuition about videos as 2-D projections of evolving 3-D environments.

A plausible implication is that similar generative priors may benefit other video understanding tasks, although the model’s dependency on static camera assumptions and the constraint on the number of Gaussians may limit applicability in highly dynamic or fine-detail settings. The method’s superiority on tracking benchmarks for both self-supervised and supervised transfer sets a new standard for spatio-temporal representation learning in the context of video tracking (Baranwal et al., 27 Dec 2025).
