Video Gaussian Masked Autoencoders
- The paper introduces a self-supervised framework that represents video frames as 3-D Gaussian splats, enforcing temporal correspondences for zero-shot tracking.
- It employs heavy spatio-temporal masking and a differentiable volumetric renderer to reconstruct masked video patches through evolving 3-D Gaussian parameters.
- Empirical results show that Video-GMAE outperforms prior methods on datasets like Kubric, DAVIS, and Kinetics, achieving superior zero-shot and fine-tuned tracking performance.
Video Gaussian Masked Autoencoders (Video-GMAE) is a self-supervised representation learning framework that encodes a sequence of video frames as a temporally evolving set of 3-D Gaussian primitives, or "splats." The architecture is built on a spatio-temporal masked autoencoder backbone with a differentiable Gaussian splatting volumetric renderer. By structuring the pretraining task as the reconstruction of masked videos through these moving Gaussians, the model enforces an inductive bias towards learning the underlying dynamic 3-D structure, which in turn facilitates emergent zero-shot object tracking performance and strong transfer to supervised tracking tasks (Baranwal et al., 27 Dec 2025).
1. Spatio-Temporal 3-D Gaussian Representation and Rendering
Each frame in a video is represented as a set of 3-D Gaussian splats, with each Gaussian parameterized as a 14-dimensional vector:

$$g = (\mu, s, q, c, o) \in \mathbb{R}^{14},$$

where $\mu \in \mathbb{R}^3$ is the 3-D center, $s \in \mathbb{R}^3$ defines a diagonal scale matrix $S = \mathrm{diag}(s)$, $q \in \mathbb{R}^4$ is a unit quaternion specifying rotation $R$, $c \in \mathbb{R}^3$ is RGB color, and $o \in [0,1]$ is opacity. The world-space covariance follows as $\Sigma = R S S^\top R^\top$.
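The parameterization above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the helper names (`quat_to_rot`, `gaussian_covariance`) are ours, and the quaternion is assumed to be in $(w, x, y, z)$ order.

```python
import numpy as np

def quat_to_rot(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def gaussian_covariance(scale, quat):
    """World-space covariance Sigma = R S S^T R^T of one anisotropic splat."""
    R = quat_to_rot(np.asarray(quat, dtype=float))
    S = np.diag(scale)
    return R @ S @ S.T @ R.T

# One 14-D splat: center (3) + scale (3) + quaternion (4) + RGB (3) + opacity (1)
g = np.concatenate([np.zeros(3), [0.1, 0.2, 0.3], [1, 0, 0, 0], [0.5, 0.5, 0.5], [0.9]])
Sigma = gaussian_covariance(g[3:6], g[6:10])
```

With the identity quaternion, the covariance reduces to the diagonal of squared scales, which is a quick sanity check on the construction.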
In the camera's frame, the 3-D Gaussian is projected onto the image plane via standard intrinsic and extrinsic calibration, inducing a 2-D elliptical Gaussian. The projected center $\mu'_i$ and projected covariance $\Sigma'_i$ define a per-pixel contribution:

$$\alpha_i(x) = o_i \exp\!\left(-\tfrac{1}{2}\,(x - \mu'_i)^\top \Sigma_i'^{-1} (x - \mu'_i)\right).$$
Rendering uses differentiable volume compositing to accumulate per-pixel opacities and colors [Kerbl et al. 2023].
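The compositing step can be illustrated for a single pixel. This is a schematic sketch (the `composite_pixel` helper is ours, and Gaussians are assumed pre-sorted front to back), not the paper's renderer:

```python
import numpy as np

def composite_pixel(colors, alphas):
    """Front-to-back volume compositing for one pixel:
    C = sum_i c_i * a_i * prod_{j<i} (1 - a_j),
    where a_i is the i-th Gaussian's opacity contribution at this pixel."""
    out, transmittance = np.zeros(3), 1.0
    for c, a in zip(colors, alphas):
        out += transmittance * a * c   # accumulate color weighted by remaining light
        transmittance *= (1.0 - a)     # attenuate light passing to the next splat
    return out

red, blue = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
pixel = composite_pixel([red, blue], [0.5, 1.0])
```

A half-opaque red splat in front of an opaque blue one blends to purple, as expected from the transmittance recursion.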
The encoder is a Vision Transformer (ViT) that processes video frames (split into spatial patches), using learned spatial and temporal embeddings and heavy spatio-temporal masking. The decoder, a lightweight ViT, receives latent tokens and learnable query tokens to predict Gaussian parameters: the first group of query tokens decodes the full parameter set for the first frame, while the remaining tokens predict residuals for each subsequent frame. Residuals are recursively integrated to yield complete splat sets for reconstruction.
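The recursive residual integration amounts to a running sum over predicted deltas. A minimal sketch, with hypothetical shapes (the paper's actual tensor layout may differ):

```python
import numpy as np

def integrate_splats(g0, deltas):
    """Recursively add per-frame residuals to the first-frame splat set.

    g0:     (N, 14)      full Gaussian parameters for frame 1
    deltas: (T-1, N, 14) predicted residuals for the remaining frames
    returns (T, N, 14)   complete splat sets for every frame
    """
    frames = [g0]
    for d in deltas:
        frames.append(frames[-1] + d)  # frame t = frame t-1 + residual
    return np.stack(frames)

g0 = np.zeros((4, 14))          # 4 toy Gaussians, all parameters zero
deltas = np.ones((2, 4, 14))    # unit residual at each of 2 transitions
splats = integrate_splats(g0, deltas)
```

Because each frame accumulates all earlier residuals, each latent Gaussian keeps a consistent identity across time, which is what the correspondence prior relies on.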
2. Masking Strategy and Pretraining Objective
Video-GMAE utilizes 95% random spatio-temporal masking of video patches, following MAE-ST [Feichtenhofer et al. 2022], with only 5% of tokens exposed to the encoder. The decoder must reconstruct both initial-frame Gaussian parameters and per-frame deltas entirely from these visible tokens and learned query tokens.
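The masking itself is a uniform random subsampling of patch tokens across space and time. A minimal sketch (the function name and interface are ours, following the MAE-ST recipe of keeping a fixed fraction of a shuffled token sequence):

```python
import numpy as np

def random_spacetime_mask(num_tokens, mask_ratio=0.95, seed=0):
    """Return sorted indices of the tokens left VISIBLE to the encoder
    under uniform random spatio-temporal masking."""
    rng = np.random.default_rng(seed)
    num_keep = int(num_tokens * (1 - mask_ratio))  # e.g. 5% of all patch tokens
    perm = rng.permutation(num_tokens)             # one shuffle over space AND time
    return np.sort(perm[:num_keep])

# e.g. 16 frames x 196 patches = 3136 tokens; only ~157 reach the encoder
visible = random_spacetime_mask(16 * 196, mask_ratio=0.95)
```

Because the shuffle spans the full space-time token grid, entire regions can be hidden for many consecutive frames, which is what forces the model to rely on temporal structure rather than local pixel copying.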
The reconstruction objective is a per-pixel L2 loss summed over all video frames:

$$\mathcal{L} = \sum_{t=1}^{T} \left\lVert \hat{I}_t - I_t \right\rVert_2^2,$$

where $\hat{I}_t$ is the image rendered from the frame-$t$ splat set and $I_t$ is the ground-truth frame.
While an optional perceptual loss can be added, most gains are attributed to the pixelwise L2 loss. The forced prediction of both full initial Gaussians and their per-frame residuals ensures alignment of Gaussian identities across frames, effectively imposing a temporal correspondence prior.
Masking prevents the model from exploiting pixel-level shortcuts, requiring temporally structured latent representations to accomplish the reconstruction objective.
3. Emergence of Tracking and Zero-Shot Correspondence
On pretrained models, tracking "emerges" by projecting the recovered 3-D Gaussian trajectories onto the 2-D image plane at each time step $t$, yielding per-Gaussian 2-D tracks. The pixel displacement of each Gaussian between consecutive frames is encoded as a pseudo-RGB flow splat, and a dense optical flow field is formed via opacity-weighted summation over all Gaussians.
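The opacity-weighted splatting of displacements can be sketched as follows. This is an illustrative simplification, not the paper's renderer: we assume isotropic Gaussian footprints of a fixed width `sigma` instead of the projected covariances, and the function name is ours.

```python
import numpy as np

def splat_flow(mu2d, disp, opacity, H, W, sigma=2.0):
    """Dense flow field from per-Gaussian 2-D displacements.

    mu2d:    (N, 2) projected centers at time t, as (x, y) pixel coords
    disp:    (N, 2) displacement of each center from t to t+1
    opacity: (N,)   per-Gaussian opacity
    Each Gaussian splats its displacement, weighted by opacity times an
    isotropic spatial falloff; the field is normalized by total weight.
    """
    ys, xs = np.mgrid[0:H, 0:W]
    flow = np.zeros((H, W, 2))
    weight = np.zeros((H, W))
    for (cx, cy), d, o in zip(mu2d, disp, opacity):
        w = o * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        flow += w[..., None] * d
        weight += w
    return flow / np.maximum(weight, 1e-8)[..., None]

# One Gaussian at (8, 8) moving one pixel right in a 16x16 frame
field = splat_flow(np.array([[8.0, 8.0]]), np.array([[1.0, 0.0]]),
                   np.array([0.9]), H=16, W=16)
```

Near a single Gaussian's center the normalized field simply reproduces its displacement; with many overlapping Gaussians, the opacity weighting arbitrates between them.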
Given a 2-D query point, zero-shot tracking operates by advecting the point through the flow field and refining it via anchor proposals weighted by local Gaussian opacities, using a visibility threshold and a blending parameter for robust, occlusion-aware point propagation.
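The advection step alone reduces to repeatedly reading the flow at the point's current location and stepping by it. A minimal sketch, omitting the anchor-proposal refinement and using nearest-pixel lookup rather than bilinear sampling:

```python
def advect_point(query, flows):
    """Track a 2-D query point through a sequence of flow fields.

    query: (x, y) pixel coordinates in the first frame
    flows: list of (H, W, 2) dense flow fields, one per frame transition
    Returns the list of positions, one per frame. Nearest-pixel lookup
    stands in for bilinear sampling to keep the sketch short.
    """
    x, y = query
    track = [(x, y)]
    for flow in flows:
        H, W = flow.shape[:2]
        xi = int(round(min(max(x, 0), W - 1)))  # clamp to the frame
        yi = int(round(min(max(y, 0), H - 1)))
        dx, dy = flow[yi, xi]                   # flow stored as (dx, dy)
        x, y = x + dx, y + dy
        track.append((x, y))
    return track

import numpy as np
uniform = np.zeros((16, 16, 2)); uniform[..., 0] = 1.0  # everything moves right
track = advect_point((5.0, 5.0), [uniform] * 3)
```

In the full method the propagated point is additionally blended with opacity-weighted anchor proposals and gated by the visibility threshold, which is what makes the tracker occlusion-aware.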
With this scheme, the resulting tracker demonstrates robust handling of occlusions and can generalize without fine-tuning to new video data.
4. Empirical Performance and Comparative Evaluation
Zero-Shot Tracking Results
Video-GMAE achieves the following metrics (for stride=5):
| Dataset | AJ | δ_avg | OA |
|---|---|---|---|
| Kubric | 54.3 | 67.0 | 91.9 |
| DAVIS | 41.3 | 55.7 | 85.2 |
| Kinetics | 60.1 | 69.1 | 90.7 |
Video-GMAE matches or outperforms the best self-supervised baseline (GMRW-C) on all three datasets.
Frozen-Encoder Transfer
In a frozen-encoder setup, compared to MAE-ST and VideoMAE, Video-GMAE exhibits superior performance on both Kinetics and Kubric:
| Model | Kinetics (AJ / δ_avg / OA) | Kubric (AJ / δ_avg / OA) |
|---|---|---|
| MAE-ST | 42.3 / 49.8 / 95.4 | 41.5 / 51.6 / 95.9 |
| VideoMAE | 46.9 / 54.3 / 94.9 | 44.8 / 55.2 / 95.8 |
| Video-GMAE | 65.1 / 72.0 / 97.4 | 62.4 / 71.9 / 96.6 |
Fine-Tuning
After supervised fine-tuning (TAP-Vid, stride=5), Video-GMAE further improves:
| Dataset | AJ | δ_avg | OA |
|---|---|---|---|
| Kubric | 73.6 | 82.3 | 97.5 |
| DAVIS | 55.7 | 66.1 | 92.1 |
| Kinetics | 75.0 | 81.7 | 97.7 |
These results surpass previous supervised and self-supervised approaches.
5. Ablative Analysis and Inductive Priors
Ablation studies corroborate that motion deltas are the principal contributors to downstream tracking capability: the full model with joint deltas reaches 44.7 AJ on TAP-Vid DAVIS, versus 39.1 AJ for a model with no deltas. Color deltas alone yield a substantially lower score (42.5 AJ), indicating that spatial movement is paramount for correspondence.
Frame-length scaling experiments indicate that excessively long temporal horizons during pretraining degrade tracking performance, with performance peaking at windows of up to 8 frames.
The core inductive bias of reconstructing video via 3-D moving Gaussians ensures that each latent Gaussian tracks a coherent scene part over time. This prior aligns the encoder’s features with point correspondences and induces latent spaces amenable to zero-shot any-point tracking.
Limitations include the assumption of a static camera, a fixed 256-Gaussian budget limiting fine spatial fidelity, and diminishing returns as the Gaussian budget grows.
6. Context, Applications, and Limitations
By coupling a masked spatio-temporal autoencoding objective with a differentiable 3-D Gaussian splatting renderer, Video-GMAE establishes a robust self-supervised paradigm for learning temporally structured video representations. These representations not only yield strong zero-shot and fine-tuned tracking performance, but also explicitly encode dynamic scene structure aligned with the physical intuition that videos are 2-D projections of evolving 3-D environments.
A plausible implication is that similar generative priors may benefit other video understanding tasks, although the model’s dependency on static camera assumptions and the constraint on the number of Gaussians may limit applicability in highly dynamic or fine-detail settings. The method’s superiority on tracking benchmarks for both self-supervised and supervised transfer sets a new standard for spatio-temporal representation learning in the context of video tracking (Baranwal et al., 27 Dec 2025).