Video Masked Autoencoding (MAE)
- Video MAE is a self-supervised learning approach that masks random spatiotemporal patches of a video and trains an autoencoder to reconstruct the missing content.
- It uses transformer-based encoder-decoder architectures with advanced masking strategies like adaptive, motion-guided, and text-guided schemes to enhance feature learning.
- Empirical results show video MAEs outperform supervised pretraining on benchmarks such as Kinetics-400, achieving higher accuracy while being substantially more computationally efficient.
Video masked-autoencoding (video MAE) is a self-supervised representation learning paradigm in which spatiotemporal regions (patches) of video clips are randomly hidden (“masked”) and a neural autoencoder is trained to reconstruct the original video content from the limited visible information. The extension of masked autoencoding from images to videos unlocks efficient representation learning driven solely by input statistics, with minimal domain-specific inductive biases. While video MAE strategies initially relied on simple random masking, recent advancements have introduced context-aware, motion-guided, and adaptive schemes to improve the informativeness and robustness of learned features. Video MAE frameworks typically leverage high masking ratios (up to 90%), and their encoder–decoder architectures are often transformer-based but sometimes incorporate convolutional backbones or multimodal fusion. Empirical results indicate that video MAEs outperform conventional supervised pretraining on large-scale video recognition tasks, generalize across modalities, and efficiently scale to settings with vast, unlabeled data.
1. Spatiotemporal Masking Strategies
The fundamental innovation introduced by "Masked Autoencoders As Spatiotemporal Learners" (Feichtenhofer et al., 2022) lies in treating a video as a collection of non-overlapping spatiotemporal cubes (patches), which are then randomly masked at a very high ratio (typically 90%). Notably, this random, spacetime-agnostic masking omits any inductive bias concerning spatial or temporal structure beyond patch tokenization and positional embedding. The encoder processes only the visible patches, significantly reducing computational load, while the masked regions are reconstructed from these limited cues by a lightweight transformer decoder. The optimal masking ratio in videos is empirically found to be higher than in images (video: 90%; image: 75%), reflecting the greater information redundancy induced by temporal coherence.
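A minimal sketch of this spacetime-agnostic masking step (PyTorch, with illustrative names and shapes; not the reference implementation) shows how only a small visible subset is ever handed to the encoder:

```python
import torch

def random_cube_masking(tokens: torch.Tensor, mask_ratio: float = 0.9):
    """Randomly keep a small subset of spatiotemporal cube tokens.

    tokens: (B, N, D) -- B clips, N cube tokens per clip, D embedding dim.
    Returns the visible tokens, a 0/1 mask marking hidden tokens, and the
    indices needed to restore the original token order later.
    """
    B, N, D = tokens.shape
    num_keep = int(N * (1.0 - mask_ratio))            # e.g. 10% of tokens visible

    noise = torch.rand(B, N, device=tokens.device)     # one random score per token
    ids_shuffle = noise.argsort(dim=1)                 # random permutation per clip
    ids_restore = ids_shuffle.argsort(dim=1)           # inverse permutation

    ids_keep = ids_shuffle[:, :num_keep]               # spacetime-agnostic choice
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=tokens.device)      # 1 = masked, 0 = visible
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)          # back to original token order
    return visible, mask, ids_restore
```

Only `visible` is passed to the encoder; the 0/1 `mask` and `ids_restore` are kept so that the decoder can restore the original token order and the loss can be restricted to masked positions.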
Subsequent works have developed specialized masking strategies. Adaptive masking methods (AdaMAE; Bandara et al., 2022) use an auxiliary sampling network trained via policy gradient to select visible tokens carrying high semantic or motion information, enabling masking ratios of up to 95%. Motion-guided masking (MotionMAE (Yang et al., 2022), MGMAE (Huang et al., 2023), MGM (Fan et al., 2023)) leverages motion cues such as optical flow or compressed-domain motion vectors to steer the split between visible and masked tokens toward salient, dynamic regions, reducing temporal information leakage. Specialized approaches for domain adaptation, long videos, or multimodal input may sample regions based on high spatiotemporal change (SurgMAE (Jamal et al., 2023)), synchronize masks across sensor and video modalities (MU-MAE (Liu et al., 8 Aug 2024)), or utilize text-guided saliency cues derived from video captions (Text-Guided Video MAE (Fan et al., 1 Aug 2024)).
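As a rough illustration of motion-biased token selection (not the exact mechanism of any cited method: AdaMAE learns its sampler via policy gradient, MGMAE warps masks with optical flow, and MGM reads compressed-domain motion vectors), one could score cube tokens by frame-difference energy and keep visible tokens preferentially in dynamic regions:

```python
import torch

def motion_biased_keep_ids(frame_diff_energy: torch.Tensor, keep_ratio: float = 0.05):
    """Pick visible tokens preferentially in dynamic regions.

    frame_diff_energy: (B, N) -- per-cube mean absolute temporal difference,
    used here as a crude stand-in for optical-flow or motion-vector saliency.
    Returns indices of tokens to keep visible; all other tokens are masked.
    """
    B, N = frame_diff_energy.shape
    num_keep = max(1, int(N * keep_ratio))
    # Sample without replacement, with probability proportional to motion energy,
    # so static background tokens are rarely kept visible.
    probs = frame_diff_energy.clamp(min=1e-6)
    probs = probs / probs.sum(dim=1, keepdim=True)
    ids_keep = torch.multinomial(probs, num_keep, replacement=False)
    return ids_keep
```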
2. Encoder–Decoder Architectures and Loss Functions
The canonical architecture builds on the Vision Transformer (ViT) backbone: the encoder receives only visible (unmasked) tokens, while the decoder is typically smaller and receives both encoder outputs and mask tokens to reconstruct the input pixel values of all patches. Spatiotemporal positional embeddings are crucial for aligning cubic tokens across time and space. The core optimization employs a mean squared error (MSE) objective computed over the masked regions only, typically of the form $\mathcal{L}_{\text{recon}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \lVert \hat{x}_i - x_i \rVert_2^2$, where $\mathcal{M}$ is the set of masked patch indices, $x_i$ the (optionally normalized) target pixels of patch $i$, and $\hat{x}_i$ its reconstruction.
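A schematic of this objective, reusing the masking helper sketched above and assuming `encoder` (taken here to include the patch-embedding projection) and `decoder` are transformer modules with a learnable `mask_token` (a sketch under those assumptions, not the reference code):

```python
import torch
import torch.nn.functional as F

def mae_reconstruction_loss(patches, encoder, decoder, mask_token, mask_ratio=0.9):
    """Masked-MSE pretraining step.

    patches: (B, N, P) -- per-cube target pixel values (optionally normalized).
    The encoder sees only visible tokens; the decoder sees encoder outputs plus
    learnable mask tokens, re-ordered to the original sequence, and its head
    projects back to P pixel values per patch.
    """
    visible, mask, ids_restore = random_cube_masking(patches, mask_ratio)
    latent = encoder(visible)                                   # (B, N_vis, D)

    B, N = patches.shape[:2]
    n_masked = N - latent.shape[1]
    dec_in = torch.cat([latent, mask_token.expand(B, n_masked, -1)], dim=1)
    dec_in = torch.gather(
        dec_in, 1, ids_restore.unsqueeze(-1).expand(-1, -1, dec_in.shape[-1]))
    pred = decoder(dec_in)                                      # (B, N, P)

    # MSE computed on masked positions only.
    per_patch = F.mse_loss(pred, patches, reduction="none").mean(dim=-1)  # (B, N)
    return (per_patch * mask).sum() / mask.sum()
```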
Extensions introduce additional loss components. MotionMAE (Yang et al., 2022) includes a parallel "motion head" that predicts frame differences alongside appearance reconstruction, yielding a composite loss of the form $\mathcal{L} = \mathcal{L}_{\text{recon}} + \lambda\, \mathcal{L}_{\text{motion}}$, where $\mathcal{L}_{\text{motion}}$ is a reconstruction error on the temporal differences of masked patches and $\lambda$ balances the two terms.
MV2MAE (Shah et al., 29 Jan 2024) introduces a motion-weighted reconstruction loss: spatial patches with greater temporal difference are upweighted via a softmax temperature, emphasizing dynamic scenes over static backgrounds. SiamMAE (Gupta et al., 2023) utilizes an asymmetric masking scheme across frame pairs, a siamese encoder design, and a cross-attention decoder to learn correspondences.
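The motion-weighting idea can be sketched as a softmax re-weighting of the per-patch errors from the previous sketch; the exact formula and temperature schedule in MV2MAE may differ:

```python
import torch

def motion_weighted_mse(per_patch_err, temporal_diff, mask, tau: float = 1.0):
    """Up-weight reconstruction error on dynamic patches.

    per_patch_err: (B, N) -- per-patch MSE (as in the previous sketch).
    temporal_diff: (B, N) -- magnitude of each patch's change across frames.
    mask:          (B, N) -- 1 for masked patches, 0 for visible ones.
    tau:           softmax temperature; smaller tau sharpens the focus on motion.
    """
    weights = torch.softmax(temporal_diff / tau, dim=1)     # sums to 1 per clip
    weighted = per_patch_err * weights * mask
    return weighted.sum() / (weights * mask).sum().clamp(min=1e-8)
```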
In multimodal settings, cross-attention fusion modules are employed (MU-MAE (Liu et al., 8 Aug 2024)), or contrastive losses are imposed (ViC-MAE (Hernandez et al., 2023), CrossVideoMAE (Ahamed et al., 8 Feb 2025)) to align global representations between modalities, frames, or augmentation views.
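The contrastive term in such hybrid objectives is typically a symmetric InfoNCE loss between paired global embeddings (two modalities, two frames, or two augmented views); a generic sketch, not tied to any one of the cited methods:

```python
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE between two batches of paired global embeddings.

    z_a, z_b: (B, D) -- e.g. a video clip embedding and a sensor/text/frame
    embedding, where row i of z_a is the positive pair of row i of z_b.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(z_a.shape[0], device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```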
3. Empirical Performance and Computational Efficiency
Video MAE methods have demonstrated competitive and often state-of-the-art results on standard benchmarks such as Kinetics-400 (K400), Something-Something V2 (SSv2), UCF101, and HMDB51. For example, standard MAE pretraining with a ViT-L backbone on Kinetics-400 can yield an absolute improvement of 13% in top-1 accuracy over training from scratch (Feichtenhofer et al., 2022). Adaptive and motion-guided masking schemes further improve accuracy (AdaMAE: 81.7% on K400 (Bandara et al., 2022); MotionMAE: 85.3% on K400, 96.3% on UCF101 (Yang et al., 2022)). Specialized video MAE architectures outperform baselines on domain-specific tasks (SurgMAE: 68.91% mAP on surgical videos with 5% labels (Jamal et al., 2023)), demonstrate improved cross-domain generalization (+4.9% on UCF101 over baselines with MGM (Fan et al., 2023)), and advance multimodal settings (MU-MAE: 80.17% accuracy for one-shot multimodal classification (Liu et al., 8 Aug 2024)).
High masking ratios provide substantial computational benefits; encoding only 10% of tokens achieves a theoretical 7.7× reduction in FLOPs and observed >4× speedup in wall-clock time for video MAE training (Feichtenhofer et al., 2022). Content-aware sampling enables further efficiency, attaining similar accuracy with up to 66% fewer epochs compared to random masking (Fan et al., 2023).
4. Advanced Extensions: Multimodal, Long-Video, and Specialized Domains
Video MAE frameworks have been extended to address multimodality, long-range video understanding, and physiologically specialized signals.
- Multimodal architectures (MU-MAE (Liu et al., 8 Aug 2024)) synchronize masking strategies across modalities—tube masking for video and simultaneous masking for all sensor inputs—to facilitate joint spatiotemporal representation and cross-attention fusion.
- Long-video MAE approaches (LV-MAE (Naiman et al., 4 Apr 2025)) decouple local (short-span) and global (long-span) dependencies by hierarchically encoding short video segments using pretrained models, reducing them to a sequence of embeddings, then performing masked autoencoding on this much shorter sequence to capture narrative-scale structure. This design delivers efficient training (processing 20-minute videos with ∼24 tokens per clip) and state-of-the-art results on benchmarks such as LVU and Breakfast.
- Specialized physiological tasks (Periodic-MAE (Choi et al., 27 Jun 2025)) incorporate periodic masking during pretraining to learn quasi-periodic signals characteristic of remote photoplethysmography (rPPG), enforcing domain-relevant frequency constraints (e.g., bandwidth and spectral peak losses in 0.66–3 Hz for pulse estimation).
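The frequency-domain constraints mentioned for rPPG can be illustrated by a simple band-power penalty that discourages predicted-signal energy outside the plausible pulse band of 0.66–3 Hz; the actual Periodic-MAE losses (bandwidth and spectral-peak terms) may be formulated differently:

```python
import torch

def out_of_band_power_loss(signal, fps: float, f_lo: float = 0.66, f_hi: float = 3.0):
    """Penalize spectral energy outside the physiological pulse band.

    signal: (B, T) -- predicted rPPG waveform sampled at `fps` frames per second.
    Returns the mean fraction of spectral power lying outside [f_lo, f_hi] Hz.
    """
    spec = torch.fft.rfft(signal - signal.mean(dim=1, keepdim=True), dim=1)
    power = spec.abs() ** 2                                    # (B, T//2 + 1)
    freqs = torch.fft.rfftfreq(signal.shape[1], d=1.0 / fps).to(signal.device)
    in_band = (freqs >= f_lo) & (freqs <= f_hi)
    total = power.sum(dim=1).clamp(min=1e-8)
    return ((power * (~in_band)).sum(dim=1) / total).mean()
```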
5. Masking Guided by Saliency, Motion, or Text Semantics
Recent video MAEs leverage advanced masking algorithms that go beyond simple random sampling:
- Motion-guided masking strategies use compressed-domain motion vectors (MGM (Fan et al., 2023)) or optical flow-based warping (MGMAE (Huang et al., 2023)) to construct temporally consistent masks that follow moving regions, reducing information redundancy and leakage and improving representation learning in motion-centric datasets.
- Adaptive masking networks (AdaMAE (Bandara et al., 2022), AutoMAE (Chen et al., 2023)) estimate token informativeness and sample tokens to maximize reconstruction error, optimizing selection via reinforcement learning (policy gradient, Gumbel-Softmax).
- Text-guided masking (TGM (Fan et al., 1 Aug 2024)) computes cosine similarity between patch features and text embeddings derived from captions (e.g., via CLIP), masking regions that most strongly align with the language description. This approach captures semantic saliency independent of visual motion and yields competitive performance with motion-based algorithms.
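A rough sketch of the text-guided saliency step, assuming patch features and a caption embedding already live in a shared CLIP-like space (the exact scoring and mask construction in TGM may differ):

```python
import torch
import torch.nn.functional as F

def text_guided_mask(patch_feats, text_emb, mask_ratio: float = 0.9):
    """Mask the patches that align most strongly with the caption.

    patch_feats: (B, N, D) -- per-patch visual features in a shared text-image space.
    text_emb:    (B, D)    -- caption embedding (e.g. from a CLIP text encoder).
    Returns a (B, N) mask with 1 for masked (caption-salient) patches.
    """
    sim = F.cosine_similarity(patch_feats, text_emb.unsqueeze(1), dim=-1)  # (B, N)
    B, N = sim.shape
    num_mask = int(N * mask_ratio)
    ids_mask = sim.topk(num_mask, dim=1).indices         # most caption-aligned patches
    mask = torch.zeros(B, N, device=patch_feats.device)
    mask.scatter_(1, ids_mask, 1.0)
    return mask
```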
Unified frameworks combine generative (MAE) and discriminative (contrastive) objectives (e.g., video-text contrastive InfoNCE losses), which empirically improve downstream performance for both finetuning and linear probe settings, particularly when used with saliency-aware masking strategies.
6. Practical Applications, Impact, and Future Directions
Video MAEs transfer to downstream tasks more effectively than supervised pretraining (Feichtenhofer et al., 2022). Performance generalizes robustly to uncurated, real-world data (e.g., Instagram videos), domain-specific activity recognition (surgery, action, physiology), and multimodal fusion settings.
Applications include:
- Action recognition, video object segmentation, and pose propagation, which benefit from enhanced temporal dynamics and object-centric features (MotionMAE (Yang et al., 2022), SiamMAE (Gupta et al., 2023)).
- Multimodal activity analysis—improved by synchronized masking and cross-attention fusion (MU-MAE).
- Long video understanding—enabled by hierarchical, segment-wise embeddings (LV-MAE).
- Healthcare—rPPG estimation via periodic masking and frequency-domain constraints (Periodic-MAE).
A central conclusion is that masked autoencoding constitutes a unified methodology for self-supervised representation learning across modalities and domains (images, video, text, sensors), effective with minimal prior knowledge. Extensions to multimodal contrastive objectives, adaptive saliency detection, and efficient cross-domain transfer learning suggest ongoing utility in large-scale, data-scarce, and specialized environments.
Future research directions (as identified in the works surveyed above) include investigation of alternative motion or semantic cues for masking, optimized decoder architectures and loss weightings, domain adaptation, and tighter integration of video MAE frameworks with advanced LLMs and multimodal synthesis. These developments are expected to further enhance the richness and applicability of representations learned via video masked autoencoding.