Recurrent Video Masked-Autoencoders
- The paper introduces a recurrent masked-autoencoder that aggregates per-frame tokens using a transformer-based GRU core to capture temporal dynamics.
- It employs an asymmetric masking strategy, processing unmasked history frames and predicting heavily masked future frames to ensure efficient, scalable reconstruction.
- Empirical results demonstrate state-of-the-art performance in video benchmarks and dense spatial tasks with up to 30× parameter efficiency over comparable models.
Recurrent Video Masked-Autoencoders (RVM) are a video representation learning framework that combines a transformer-based recurrent neural network with an asymmetric masked prediction task, in which the model reconstructs masked future frames from a short history of unmasked frames. The approach is distinguished by its efficient aggregation of dense image features across time, linear computational scaling with the temporal horizon, and a unified "generalist" encoder, enabling strong performance on video and dense spatial tasks at significantly lower parameter counts than previous state-of-the-art video autoencoders (Zoran et al., 15 Dec 2025).
1. Model Architecture
1.1 Asymmetric Encoder–Decoder Pipeline
RVM operates on two video streams:
- Source stream: Builds a recurrent state from consecutive history frames, processed unmasked through a shared Vision Transformer (ViT) encoder.
- Target stream: Contains one or more future frames which are subjected to heavy spatial masking; these are reconstructed by the decoder conditioned on the recurrent state.
Each frame is divided into non-overlapping patches and linearly projected to an embedding dimension $D$, yielding $N$ tokens per frame with added Fourier positional embeddings. The encoder (a standard ViT with a fixed number of blocks, attention heads, and MLP ratio) independently embeds each frame, producing per-frame tokens $z_t \in \mathbb{R}^{N \times D}$.
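As a concrete illustration of the tokenization step, the following PyTorch sketch patchifies a frame with a strided convolution and adds fixed sin/cos Fourier positional features. The patch size, embedding dimension, and the exact form of the Fourier embedding are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a frame into non-overlapping patches, project to D dims,
    and add fixed sin/cos Fourier positional embeddings (illustrative)."""
    def __init__(self, img_size=224, patch_size=16, dim=384, in_chans=3):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        n_patches = (img_size // patch_size) ** 2
        self.register_buffer("pos_emb", self._fourier_pos(n_patches, dim))

    @staticmethod
    def _fourier_pos(n, dim):
        # Simple 1D sin/cos features over the patch index; the paper's exact
        # Fourier parameterization may differ.
        pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)            # (N, 1)
        freqs = torch.pow(10000.0, -torch.arange(0, dim, 2).float() / dim)  # (dim/2,)
        angles = pos * freqs                                               # (N, dim/2)
        return torch.cat([angles.sin(), angles.cos()], dim=-1)             # (N, dim)

    def forward(self, frame):                      # frame: (B, 3, H, W)
        x = self.proj(frame)                       # (B, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)           # (B, N, D)
        return x + self.pos_emb                    # add positional embedding
```

Under these example settings, a 224×224 frame yields 196 tokens of dimension 384 per frame.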
The recurrence is realized by the transformer-based GRU core ("RecurBlock"), which aggregates the per-frame tokens $z_t$ into a recurrent state $s_t$:

$$g_t = \sigma\left(W_z z_t + W_s s_{t-1}\right), \qquad \tilde{s}_t = \mathrm{TF}\left([z_t;\, s_{t-1}]\right), \qquad s_t = (1 - g_t) \odot s_{t-1} + g_t \odot \tilde{s}_t,$$

where $\sigma$ is the sigmoid, $\mathrm{TF}$ denotes a lightweight transformer block, and $W_z, W_s$ are trainable projections.
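The gated update can be sketched as follows in PyTorch; the use of `nn.TransformerEncoderLayer` as the lightweight block, the token-wise concatenation, and all widths are illustrative assumptions rather than the paper's exact RecurBlock.

```python
import torch
import torch.nn as nn

class RecurBlock(nn.Module):
    """GRU-style recurrent core: a lightweight transformer block proposes a
    candidate state from the current frame tokens and the previous state, and
    a sigmoid gate blends it with the previous state (illustrative layout)."""
    def __init__(self, dim=384, n_heads=6):
        super().__init__()
        self.candidate = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        # One linear over the concatenation is equivalent to W_z z_t + W_s s_{t-1}.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, z_t, s_prev):                # both: (B, N, D)
        h = torch.cat([z_t, s_prev], dim=1)        # concatenate along the token axis
        cand = self.candidate(h)[:, : z_t.size(1)] # candidate state tokens
        g = torch.sigmoid(self.gate(torch.cat([z_t, s_prev], dim=-1)))
        return (1.0 - g) * s_prev + g * cand       # gated update -> s_t
```

As in a standard GRU, the sigmoid gate lets the state pass through unchanged wherever the new frame adds little, which is consistent with the long-horizon stability discussed in Section 5.2.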
The decoder (a lightweight cross-attention–self-attention transformer) reconstructs pixels at the masked patch locations, conditioned on the recurrent core output $s_t$.
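A minimal sketch of one such decoder block follows, assuming a pre-norm cross-attention → self-attention → MLP ordering; the actual decoder's depth, widths, and final pixel-projection head are not specified here.

```python
import torch
import torch.nn as nn

class RVMDecoderBlock(nn.Module):
    """One decoder block: masked target tokens cross-attend to the recurrent
    state s_t, then self-attend, then pass through an MLP (illustrative)."""
    def __init__(self, dim=384, n_heads=6):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, q, state):                          # q: masked target tokens, state: s_t
        q = q + self.cross(self.n1(q), state, state)[0]   # cross-attend to the recurrent state
        qn = self.n2(q)
        q = q + self.self_attn(qn, qn, qn)[0]             # self-attention among target tokens
        return q + self.mlp(self.n3(q))                   # a linear head would map to pixels
```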
1.2 Asymmetric Masking Strategy
During training, source frames are unmasked, while target frames at random future intervals are masked at 95% of spatial tokens. The encoder processes unmasked source frames and masked target frames; the decoder receives all target token positions (with masked locations replaced by a learned embedding) plus positional encodings, attending over the recurrent state outputs for pixel reconstruction.
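The masking itself reduces to sampling a random 95% subset of a target frame's token positions and substituting a learned embedding. A minimal sketch under that assumption follows (a zero placeholder stands in for the learned mask token).

```python
import torch

def mask_target_tokens(tokens, mask_ratio=0.95, mask_token=None):
    """Replace a random fraction of a target frame's tokens with a mask
    embedding; returns the masked tokens and the boolean mask (illustrative)."""
    B, N, D = tokens.shape
    n_masked = int(round(mask_ratio * N))
    # Per-sample random permutation; the first n_masked indices are masked.
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)
    mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
    batch_idx = torch.arange(B, device=tokens.device).unsqueeze(1)
    mask[batch_idx, idx[:, :n_masked]] = True
    if mask_token is None:
        # In practice this is a learned embedding; zeros are a placeholder.
        mask_token = torch.zeros(D, device=tokens.device)
    out = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), tokens)
    return out, mask
```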
2. Training Objective
RVM utilizes a standard mean squared error over all reconstructed target pixels:

$$\mathcal{L} = \frac{1}{T_{\mathrm{tgt}}} \sum_{t=1}^{T_{\mathrm{tgt}}} \left\lVert \hat{x}_t - x_t \right\rVert_F^2,$$

where $T_{\mathrm{tgt}}$ is the number of target (future) frames and $\lVert \cdot \rVert_F$ is the pixelwise Frobenius norm. No per-patch normalization is applied; averaging occurs over both time and spatial dimensions.
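In code, the objective is just a plain mean over the squared pixel errors of the target frames; the tensor layout below is an assumption.

```python
def rvm_reconstruction_loss(pred, target):
    """MSE over all reconstructed target pixels; the mean runs over the
    T_tgt target frames and every spatial position (no per-patch norm)."""
    # pred, target: (B, T_tgt, C, H, W)
    sq_err = (pred - target) ** 2          # squared Frobenius-norm terms
    return sq_err.mean()                   # average over batch, time, and space
```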
3. Computational and Parameter Efficiency
3.1 Temporal Complexity
RVM offers linear scaling with respect to the temporal length $T$:
- Spatio-temporal transformers (e.g., VideoMAE): full self-attention over all $TN$ tokens of a clip costs $O(T^2 N^2)$ per layer.
- RVM recurrent core: at each frame step, only the queries of the current frame attend to the keys/values of the previous state, an $O(N^2)$ step per frame; over $T$ frames this totals $O(T N^2)$, i.e., linear ($O(T)$) temporal scaling (see the token-count sketch below).
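The asymptotics can be made concrete by counting attention scores per layer; the factor of two for attending over the current frame plus the state is an illustrative assumption.

```python
def attention_scores_per_layer(T, N):
    """Count pairwise attention scores per layer (constants are illustrative)."""
    joint = (T * N) ** 2            # full spatio-temporal self-attention: O(T^2 N^2)
    recurrent = T * N * (2 * N)     # per frame: N queries over ~2N keys/values: O(T N^2)
    return joint, recurrent

# Example: a 16-frame clip with 196 tokens per frame.
joint, recurrent = attention_scores_per_layer(16, 196)
print(f"joint / recurrent = {joint / recurrent:.1f}x")   # gap grows linearly with T
```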
3.2 Model Size and Efficiency
A summary of parameter counts (in millions):
| Model | Small (S) | Base (B) | Large (L) | Huge (H/g) |
|---|---|---|---|---|
| RVM | 34 | 117 | 375 | 743 |
| VideoMAE | – | 87 | 305 | 1013 |
| V-JEPA | – | – | 307 | – |
| DINOv2 | – | – | 303 | 1135 |
RVM-S (34M) matches or surpasses VideoMAE-B (87M) and distilled 4DS-B (91M) without knowledge distillation, corresponding to up to 30× parameter efficiency.
4. Empirical Performance
4.1 Video-Level Tasks
RVM-L and RVM-H achieve state-of-the-art results on standard video benchmarks:
- Something–Something v2 Top-1: RVM-L 66.7% (VideoMAE-L 62.7%, V-JEPA-L 66.0%)
- Kinetics-700: RVM-L 57.3% (VideoMAE-L 52.5%)
- Perception Test point tracking: RVM-L 77.3 (VideoMAE-L 78.3, comparable)
- Small models: RVM-S 59.7% SSv2 (SiamMAE-S 56.0%, 4DS-S 39.9%)
4.2 Dense Spatial and Geometric Tasks
RVM leads the "generalist" Pareto frontier across geometry and dense correspondence tasks:
- ScanNet AbsRel (depth): RVM-L 0.91 (DINOv2-L 1.02, VideoMAE-L 1.10)
- DAVIS J & F (segmentation): RVM-L 66.0% (DINOv2-L 61.7%, VideoMAE-L 54.3%)
- VIP mIoU: RVM-L 38.0% (DINOv2-L 40.6%, VideoMAE-L 18.9%)
Average normalized performance for RVM-L/H is approximately 95% of each task's best model, compared to 82% for DINOv2 or VideoMAE large variants.
4.3 Small-Model Regime
RVM-S/B produce strong performance without distillation, outperforming or matching models up to 30× larger (SiamMAE, 4DS) on the average normalized metric.
5. Qualitative Feature Analyses
5.1 Feature Visualizations
Unsupervised visualizations (PCA/RGB mapping and k-means clustering; see the sketch after this list) demonstrate:
- RVM feature embeddings align with semantically coherent, temporally stable object structures (foreground and background).
- Competing models like VideoMAE and DINOv2 display temporal "flicker" and diminished object coherence under the same analysis.
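The PCA-to-RGB mapping referenced above is straightforward to reproduce; a minimal PyTorch sketch follows, where the normalization and use of the top three components are conventional choices rather than details from the paper.

```python
import torch

def pca_rgb_map(feats):
    """Map per-patch features (N, D) to RGB via their top-3 principal
    components, a common recipe for qualitative feature visualization."""
    feats = feats - feats.mean(dim=0, keepdim=True)        # center the features
    _, _, vh = torch.linalg.svd(feats, full_matrices=False)
    rgb = feats @ vh[:3].T                                 # project onto top-3 PCs
    rgb = rgb - rgb.amin(dim=0)                            # rescale each channel
    rgb = rgb / (rgb.amax(dim=0) + 1e-8)                   # to the [0, 1] range
    return rgb                                             # reshape to the patch grid for display
```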
5.2 Long-Horizon Feature Propagation
Label propagation experiments on DAVIS-2017 for sequences >80 frames show RVM features exhibit superior long-term segmentation accuracy retention compared to full-attention video models or frame-independent image models, indicating durable temporal information in the recurrent state.
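Label propagation of this kind is typically implemented as temperature-scaled nearest-neighbour voting in feature space; the sketch below is a single-step version under that assumption (the exact protocol, context length, and hyperparameters used in the paper are not specified here).

```python
import torch
import torch.nn.functional as F

def propagate_labels(feat_prev, labels_prev, feat_next, topk=5, temp=0.07):
    """Nearest-neighbour label propagation: each patch in the next frame takes
    a soft vote from its most similar patches in the previous frame."""
    # feat_*: (N, D) patch features; labels_prev: (N_prev, C) soft/one-hot masks.
    feat_prev = F.normalize(feat_prev, dim=-1)
    feat_next = F.normalize(feat_next, dim=-1)
    sim = feat_next @ feat_prev.T                      # (N_next, N_prev) cosine similarities
    vals, idx = sim.topk(topk, dim=-1)                 # restrict to top-k neighbours
    weights = torch.softmax(vals / temp, dim=-1)       # temperature-scaled voting
    return (weights.unsqueeze(-1) * labels_prev[idx]).sum(dim=1)   # (N_next, C)
```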
6. Discussion and Directions
6.1 Notable Contributions
- Recurrent video masked-autoencoding with transformer-based GRU aggregation of per-frame tokens.
- Asymmetric masking: only history frames are available to the encoder; prediction is made for highly masked future frames, enforcing causal structure.
- A "generalist" encoder suitable for both video-level tasks (e.g., action recognition, tracking) and image-level spatial tasks (e.g., depth estimation, correspondence).
- Small-model parameter efficiency over VideoMAE and comparable baselines (up to 30×), achieved without knowledge distillation.
- Linear memory/computation scaling and emergent stability over long temporal horizons.
6.2 Limitations
- For short clips, recurrence may be less efficient than tube-based spatio-temporal attention approaches, since RVM performs no joint spatio-temporal patching.
- Backpropagation through time across full ViT encoder steps incurs high memory usage.
- Data scaling has been probed only up to roughly 2 billion video training clips; no saturation was observed, so behavior at larger scales remains uncharacterized.
6.3 Future Research
- Formalizing compute–data scaling laws for optimal resource allocation.
- Extending RVM to multi-modal video (e.g., with audio or language) and embodied control.
- Incorporating sparse or compressive recurrent updates for ultra-long contexts.
- Combining RVM with generative or contrastive objectives to enhance representation quality.
RVM demonstrates that transformer-based recurrence, in combination with pixel-level reconstruction and an asymmetric masking scheme, can deliver both highly efficient and highly general visual representations from large-scale unlabeled video data (Zoran et al., 15 Dec 2025).