Video Gaussian Masked Autoencoders
- The paper introduces a self-supervised framework that represents video frames as 3-D Gaussian splats, enforcing temporal correspondences for zero-shot tracking.
- It employs heavy spatio-temporal masking and a differentiable volumetric renderer to reconstruct masked video patches through evolving 3-D Gaussian parameters.
- Empirical results show that Video-GMAE outperforms prior methods on datasets like Kubric, DAVIS, and Kinetics, achieving superior zero-shot and fine-tuned tracking performance.
Video Gaussian Masked Autoencoders (Video-GMAE) is a self-supervised representation learning framework that encodes a sequence of video frames as a temporally evolving set of 3-D Gaussian primitives, or "splats." The architecture is built on a spatio-temporal masked autoencoder backbone with a differentiable Gaussian splatting volumetric renderer. By structuring the pretraining task as the reconstruction of masked videos through these moving Gaussians, the model enforces an inductive bias towards learning the underlying dynamic 3-D structure, which in turn facilitates emergent zero-shot object tracking performance and strong transfer to supervised tracking tasks (Baranwal et al., 27 Dec 2025).
1. Spatio-Temporal 3-D Gaussian Representation and Rendering
Each frame in a video is represented as a set of 3-D Gaussian splats, with each Gaussian parameterized as a 14-dimensional vector:

$$g = (\mu, s, q, c, o) \in \mathbb{R}^{14},$$

where $\mu \in \mathbb{R}^3$ is the 3-D center, $s \in \mathbb{R}^3$ defines a diagonal scale matrix $S = \mathrm{diag}(s)$, $q \in \mathbb{R}^4$ is a unit quaternion specifying rotation $R$, $c \in \mathbb{R}^3$ is RGB color, and $o \in [0,1]$ is opacity. The world-space covariance follows as $\Sigma = R S S^\top R^\top$.
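The parameterization above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the helper names (`quat_to_rot`, `gaussian_covariance`) are ours, and the quaternion is assumed to be in $(w, x, y, z)$ order.

```python
import numpy as np

def quat_to_rot(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def gaussian_covariance(scale, quat):
    """World-space covariance Sigma = R S S^T R^T of one anisotropic splat."""
    R = quat_to_rot(np.asarray(quat, dtype=float))
    S = np.diag(scale)
    return R @ S @ S.T @ R.T

# One 14-D splat: center (3) + scale (3) + quaternion (4) + RGB (3) + opacity (1)
g = np.concatenate([np.zeros(3), [0.1, 0.2, 0.3], [1, 0, 0, 0], [0.5, 0.5, 0.5], [0.9]])
Sigma = gaussian_covariance(g[3:6], g[6:10])
```

With the identity quaternion, the covariance reduces to the diagonal of squared scales, which is a quick sanity check on the construction.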
In the camera's frame, the 3-D Gaussian is projected onto the image plane via standard intrinsic and extrinsic calibration, inducing a 2-D elliptical Gaussian. The projected center $\mu'_i$ and projected covariance $\Sigma'_i$ define a per-pixel contribution:

$$\alpha_i(x) = o_i \exp\!\left(-\tfrac{1}{2}\,(x - \mu'_i)^\top \Sigma_i'^{-1} (x - \mu'_i)\right).$$
Rendering uses differentiable volume compositing to accumulate per-pixel opacities and colors [Kerbl et al. 2023].
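The compositing step can be illustrated for a single pixel. This is a schematic sketch (the `composite_pixel` helper is ours, and Gaussians are assumed pre-sorted front to back), not the paper's renderer:

```python
import numpy as np

def composite_pixel(colors, alphas):
    """Front-to-back volume compositing for one pixel:
    C = sum_i c_i * a_i * prod_{j<i} (1 - a_j),
    where a_i is the i-th Gaussian's opacity contribution at this pixel."""
    out, transmittance = np.zeros(3), 1.0
    for c, a in zip(colors, alphas):
        out += transmittance * a * c   # accumulate color weighted by remaining light
        transmittance *= (1.0 - a)     # attenuate light passing to the next splat
    return out

red, blue = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
pixel = composite_pixel([red, blue], [0.5, 1.0])
```

A half-opaque red splat in front of an opaque blue one blends to purple, as expected from the transmittance recursion.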
The encoder is a Vision Transformer (ViT) that processes video frames (split into spatial patches), using learned spatial and temporal embeddings and heavy spatio-temporal masking. The decoder, a lightweight ViT, receives latent tokens and learnable query tokens to predict Gaussian parameters: the first group of query tokens decodes the full parameter set for the first frame, while the remaining tokens predict residuals for each subsequent frame. Residuals are recursively integrated to yield complete splat sets for reconstruction.
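The recursive residual integration amounts to a running sum over predicted deltas. A minimal sketch, with hypothetical shapes (the paper's actual tensor layout may differ):

```python
import numpy as np

def integrate_splats(g0, deltas):
    """Recursively add per-frame residuals to the first-frame splat set.

    g0:     (N, 14)      full Gaussian parameters for frame 1
    deltas: (T-1, N, 14) predicted residuals for the remaining frames
    returns (T, N, 14)   complete splat sets for every frame
    """
    frames = [g0]
    for d in deltas:
        frames.append(frames[-1] + d)  # frame t = frame t-1 + residual
    return np.stack(frames)

g0 = np.zeros((4, 14))          # 4 toy Gaussians, all parameters zero
deltas = np.ones((2, 4, 14))    # unit residual at each of 2 transitions
splats = integrate_splats(g0, deltas)
```

Because each frame accumulates all earlier residuals, each latent Gaussian keeps a consistent identity across time, which is what the correspondence prior relies on.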
2. Masking Strategy and Pretraining Objective
Video-GMAE utilizes 95% random spatio-temporal masking of video patches, following MAE-ST [Feichtenhofer et al. 2022], with only 5% of tokens exposed to the encoder. The decoder must reconstruct both initial-frame Gaussian parameters and per-frame deltas entirely from these visible tokens and learned query tokens.
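The masking itself is a uniform random subsampling of patch tokens across space and time. A minimal sketch (the function name and interface are ours, following the MAE-ST recipe of keeping a fixed fraction of a shuffled token sequence):

```python
import numpy as np

def random_spacetime_mask(num_tokens, mask_ratio=0.95, seed=0):
    """Return sorted indices of the tokens left VISIBLE to the encoder
    under uniform random spatio-temporal masking."""
    rng = np.random.default_rng(seed)
    num_keep = int(num_tokens * (1 - mask_ratio))  # e.g. 5% of all patch tokens
    perm = rng.permutation(num_tokens)             # one shuffle over space AND time
    return np.sort(perm[:num_keep])

# e.g. 16 frames x 196 patches = 3136 tokens; only ~157 reach the encoder
visible = random_spacetime_mask(16 * 196, mask_ratio=0.95)
```

Because the shuffle spans the full space-time token grid, entire regions can be hidden for many consecutive frames, which is what forces the model to rely on temporal structure rather than local pixel copying.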
The reconstruction objective is a per-pixel L2 loss summed over all video frames:

$$\mathcal{L} = \sum_{t=1}^{T} \left\lVert \hat{I}_t - I_t \right\rVert_2^2,$$

where $\hat{I}_t$ is the image rendered from the frame-$t$ splat set and $I_t$ is the ground-truth frame.
While an optional perceptual loss can be added, most gains are attributed to the pixelwise L2 loss. The forced prediction of both full initial Gaussians and their per-frame residuals ensures alignment of Gaussian identities across frames, effectively imposing a temporal correspondence prior.
Masking prevents the model from exploiting pixel-level shortcuts, requiring temporally structured latent representations to accomplish the reconstruction objective.
3. Emergence of Tracking and Zero-Shot Correspondence
On pretrained models, tracking "emerges" by projecting the recovered 3-D Gaussian trajectories onto the 2-D image plane at each time step $t$, yielding per-Gaussian 2-D tracks. The pixel displacement of each Gaussian between consecutive frames is encoded as a pseudo-RGB flow splat, and a dense optical flow field is formed via opacity-weighted summation over all Gaussians.
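The opacity-weighted splatting of displacements can be sketched as follows. This is an illustrative simplification, not the paper's renderer: we assume isotropic Gaussian footprints of a fixed width `sigma` instead of the projected covariances, and the function name is ours.

```python
import numpy as np

def splat_flow(mu2d, disp, opacity, H, W, sigma=2.0):
    """Dense flow field from per-Gaussian 2-D displacements.

    mu2d:    (N, 2) projected centers at time t, as (x, y) pixel coords
    disp:    (N, 2) displacement of each center from t to t+1
    opacity: (N,)   per-Gaussian opacity
    Each Gaussian splats its displacement, weighted by opacity times an
    isotropic spatial falloff; the field is normalized by total weight.
    """
    ys, xs = np.mgrid[0:H, 0:W]
    flow = np.zeros((H, W, 2))
    weight = np.zeros((H, W))
    for (cx, cy), d, o in zip(mu2d, disp, opacity):
        w = o * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        flow += w[..., None] * d
        weight += w
    return flow / np.maximum(weight, 1e-8)[..., None]

# One Gaussian at (8, 8) moving one pixel right in a 16x16 frame
field = splat_flow(np.array([[8.0, 8.0]]), np.array([[1.0, 0.0]]),
                   np.array([0.9]), H=16, W=16)
```

Near a single Gaussian's center the normalized field simply reproduces its displacement; with many overlapping Gaussians, the opacity weighting arbitrates between them.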
Given a 2-D query point, zero-shot tracking operates by advecting the point through the flow field and refining it via anchor proposals weighted by local Gaussian opacities, using a visibility threshold and a blending parameter for robust, occlusion-aware point propagation.
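The advection step alone reduces to repeatedly reading the flow at the point's current location and stepping by it. A minimal sketch, omitting the anchor-proposal refinement and using nearest-pixel lookup rather than bilinear sampling:

```python
def advect_point(query, flows):
    """Track a 2-D query point through a sequence of flow fields.

    query: (x, y) pixel coordinates in the first frame
    flows: list of (H, W, 2) dense flow fields, one per frame transition
    Returns the list of positions, one per frame. Nearest-pixel lookup
    stands in for bilinear sampling to keep the sketch short.
    """
    x, y = query
    track = [(x, y)]
    for flow in flows:
        H, W = flow.shape[:2]
        xi = int(round(min(max(x, 0), W - 1)))  # clamp to the frame
        yi = int(round(min(max(y, 0), H - 1)))
        dx, dy = flow[yi, xi]                   # flow stored as (dx, dy)
        x, y = x + dx, y + dy
        track.append((x, y))
    return track

import numpy as np
uniform = np.zeros((16, 16, 2)); uniform[..., 0] = 1.0  # everything moves right
track = advect_point((5.0, 5.0), [uniform] * 3)
```

In the full method the propagated point is additionally blended with opacity-weighted anchor proposals and gated by the visibility threshold, which is what makes the tracker occlusion-aware.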
With this scheme, the resulting tracker demonstrates robust handling of occlusions and can generalize without fine-tuning to new video data.
4. Empirical Performance and Comparative Evaluation
Zero-Shot Tracking Results
Video-GMAE achieves the following metrics (for stride=5):
| Dataset | AJ | δ_avg | OA |
|---|---|---|---|
| Kubric | 54.3 | 67.0 | 91.9 |
| DAVIS | 41.3 | 55.7 | 85.2 |
| Kinetics | 60.1 | 69.1 | 90.7 |
Video-GMAE matches or outperforms the best self-supervised baseline (GMRW-C) on all three datasets.
Frozen-Encoder Transfer
In a frozen-encoder setup, compared to MAE-ST and VideoMAE, Video-GMAE exhibits superior performance on both Kinetics and Kubric:
| Model | Kinetics (AJ / δ_avg / OA) | Kubric (AJ / δ_avg / OA) |
|---|---|---|
| MAE-ST | 42.3 / 49.8 / 95.4 | 41.5 / 51.6 / 95.9 |
| VideoMAE | 46.9 / 54.3 / 94.9 | 44.8 / 55.2 / 95.8 |
| Video-GMAE | 65.1 / 72.0 / 97.4 | 62.4 / 71.9 / 96.6 |
Fine-Tuning
After supervised fine-tuning (TAP-Vid, stride=5), Video-GMAE further improves:
| Dataset | AJ | δ_avg | OA |
|---|---|---|---|
| Kubric | 73.6 | 82.3 | 97.5 |
| DAVIS | 55.7 | 66.1 | 92.1 |
| Kinetics | 75.0 | 81.7 | 97.7 |
These results surpass previous supervised and self-supervised approaches.
5. Ablative Analysis and Inductive Priors
Ablation studies corroborate that motion deltas are the principal contributors to downstream tracking capability: the full model with joint deltas reaches 44.7 AJ on TAP-Vid DAVIS, versus 39.1 AJ for a model with no deltas. Color deltas alone yield a substantially lower score (42.5 AJ), indicating that spatial movement is paramount for correspondence.
Frame-length scaling experiments indicate that excessively long temporal horizons during pretraining degrade tracking performance, with performance peaking at windows of up to 8 frames.
The core inductive bias of reconstructing video via 3-D moving Gaussians ensures that each latent Gaussian tracks a coherent scene part over time. This prior aligns the encoder’s features with point correspondences and induces latent spaces amenable to zero-shot any-point tracking.
Limitations include the assumption of a static camera, a fixed 256-Gaussian budget limiting fine spatial fidelity, and diminishing returns as the Gaussian budget grows.
6. Context, Applications, and Limitations
By coupling a masked spatio-temporal autoencoding objective with a differentiable 3-D Gaussian splatting renderer, Video-GMAE establishes a robust self-supervised paradigm for learning temporally structured video representations. These representations not only yield strong zero-shot and fine-tuned tracking performance, but also explicitly encode dynamic scene structure aligned with the physical intuition that videos are 2-D projections of evolving 3-D environments.
A plausible implication is that similar generative priors may benefit other video understanding tasks, although the model’s dependency on static camera assumptions and the constraint on the number of Gaussians may limit applicability in highly dynamic or fine-detail settings. The method’s superiority on tracking benchmarks for both self-supervised and supervised transfer sets a new standard for spatio-temporal representation learning in the context of video tracking (Baranwal et al., 27 Dec 2025).