Systematic evaluation of alternative motion-aware masking strategies

Investigate and quantitatively compare alternative motion-aware visible-token sampling strategies within the TrackMAE masked video pretraining framework, specifically the motion bins, Bernoulli high/low, sorting, uniform top‑k, and exclude top‑k schemes. The comparison should span different pretraining datasets (such as Kinetics-400 and Something-Something V2) and diverse downstream tasks, to determine which strategies and hyperparameters yield the most robust and generalizable video representations.

Background

TrackMAE introduces motion-aware masking that leverages point-tracked trajectories to bias the selection of visible tokens during masked video pretraining. While the main experiments use a motion bins strategy that samples visible tokens from high- and low-motion regions, the authors describe several other plausible sampling variants.

In the supplementary discussion, the authors outline additional strategies that could influence what the model learns: Bernoulli high/low sampling from the motion distribution, sorting-based selection, uniform top‑k selection among the highest-motion tokens, and exclude top‑k approaches that avoid sampling the highest-motion tokens. However, they evaluate only the motion bins strategy and explicitly defer a comprehensive exploration of these alternatives and their cross-dataset, cross-task effects.
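To make the five strategy names concrete, the following is a minimal Python sketch of one plausible reading of each. It is not the authors' implementation: the function name `sample_visible`, the parameters `k` and `p_high`, the per-token motion scores, and the two-bin split at the median are all assumptions introduced here for illustration, since the summary above gives only a one-line description of each scheme.

```python
import random


def sample_visible(motion, n_visible, strategy, rng, k=None, p_high=0.5):
    """Pick n_visible token indices from per-token motion scores.

    Hypothetical interpretations of the strategy names, not TrackMAE's code.
    """
    n = len(motion)
    if not 0 < n_visible <= n:
        raise ValueError("n_visible must be between 1 and len(motion)")

    # Token indices ordered from highest to lowest motion score.
    order = sorted(range(n), key=lambda i: -motion[i])
    # Assumption: two bins, split at the median rank.
    high, low = order[: n // 2], order[n // 2:]

    def split_draw(n_high):
        # Clamp so neither bin is asked for more tokens than it holds.
        n_high = max(n_visible - len(low), min(n_high, n_visible, len(high)))
        return rng.sample(high, n_high) + rng.sample(low, n_visible - n_high)

    if strategy == "motion_bins":
        # Fixed share of visible tokens from the high-motion bin.
        picked = split_draw(round(p_high * n_visible))
    elif strategy == "bernoulli":
        # One Bernoulli(p_high) draw per visible slot decides high vs. low.
        picked = split_draw(sum(rng.random() < p_high for _ in range(n_visible)))
    elif strategy == "sorting":
        # Deterministic: keep the n_visible highest-motion tokens.
        picked = order[:n_visible]
    elif strategy in ("uniform_topk", "exclude_topk"):
        if k is None:
            raise ValueError("k is required for top-k strategies")
        # Sample uniformly inside (or outside) the k highest-motion tokens.
        pool = order[:k] if strategy == "uniform_topk" else order[k:]
        if len(pool) < n_visible:
            raise ValueError("candidate pool is smaller than n_visible")
        picked = rng.sample(pool, n_visible)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(picked)


rng = random.Random(0)
motion = [float(i) for i in range(16)]  # toy scores: token 15 moves most
for name in ("motion_bins", "bernoulli", "sorting", "uniform_topk", "exclude_topk"):
    print(name, sample_visible(motion, 4, name, rng, k=6))
```

Under this reading, `p_high` and `k` are the natural hyperparameters to sweep when comparing the strategies across pretraining datasets and downstream tasks; an actual study would replace the toy scores with motion magnitudes derived from the point-tracked trajectories.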

References

We leave the exploration of such masking strategies, their impact on different pretraining data and downstream tasks for future work.

TrackMAE: Video Representation Learning via Track Mask and Predict (2603.27268 - Vandeghen et al., 28 Mar 2026) in Supplementary, Section: Discussion on Motion Masking; Subsection: Sampling Strategy in Masking