Recurrent Video Masked-Autoencoders
- The paper introduces a recurrent masked-autoencoder that aggregates per-frame tokens using a transformer-based GRU core to capture temporal dynamics.
- It employs an asymmetric masking strategy, processing unmasked history frames and predicting heavily masked future frames to ensure efficient, scalable reconstruction.
- Empirical results demonstrate state-of-the-art performance in video benchmarks and dense spatial tasks with up to 30× parameter efficiency over comparable models.
Recurrent Video Masked-Autoencoders (RVM) are a video representation learning framework that combines a transformer-based recurrent neural network with an asymmetric masked prediction task, in which the model reconstructs masked future frames from a short history of unmasked frames. The approach is distinguished by its efficient aggregation of dense image features across time, linear computational scaling with the temporal horizon, and a unified "generalist" encoder, enabling strong performance on video and dense spatial tasks at significantly lower parameter counts than previous state-of-the-art video autoencoders (Zoran et al., 15 Dec 2025).
1. Model Architecture
1.1 Asymmetric Encoder–Decoder Pipeline
RVM operates on two video streams:
- Source stream: Builds a recurrent state from consecutive history frames, processed unmasked through a shared Vision Transformer (ViT) encoder.
- Target stream: Contains one or more future frames which are subjected to heavy spatial masking; these are reconstructed by the decoder conditioned on the recurrent state.
Each frame is divided into non-overlapping patches and linearly projected to an embedding dimension $D$, yielding $N$ tokens per frame with added Fourier positional embeddings. The encoder (a standard ViT with a fixed number of blocks, attention heads, and MLP ratio) independently embeds each frame, producing per-frame tokens $z_t \in \mathbb{R}^{N \times D}$.
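As a concrete illustration of the tokenization step, the following PyTorch sketch patchifies a frame with a strided convolution and adds fixed sin/cos Fourier positional features. The patch size, embedding dimension, and the exact form of the Fourier embedding are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a frame into non-overlapping patches, project to D dims,
    and add fixed sin/cos Fourier positional embeddings (illustrative)."""
    def __init__(self, img_size=224, patch_size=16, dim=384, in_chans=3):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        n_patches = (img_size // patch_size) ** 2
        self.register_buffer("pos_emb", self._fourier_pos(n_patches, dim))

    @staticmethod
    def _fourier_pos(n, dim):
        # Simple 1D sin/cos features over the patch index; the paper's exact
        # Fourier parameterization may differ.
        pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)            # (N, 1)
        freqs = torch.pow(10000.0, -torch.arange(0, dim, 2).float() / dim)  # (dim/2,)
        angles = pos * freqs                                               # (N, dim/2)
        return torch.cat([angles.sin(), angles.cos()], dim=-1)             # (N, dim)

    def forward(self, frame):                      # frame: (B, 3, H, W)
        x = self.proj(frame)                       # (B, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)           # (B, N, D)
        return x + self.pos_emb                    # add positional embedding
```

Under these example settings, a 224×224 frame yields 196 tokens of dimension 384 per frame.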
The recurrence is realized by the transformer-based GRU core ("RecurBlock"), which aggregates the per-frame tokens $z_t$ into a recurrent state $s_t$:

$$g_t = \sigma\left(W_z z_t + W_s s_{t-1}\right), \qquad \tilde{s}_t = \mathrm{TF}\left([z_t;\, s_{t-1}]\right), \qquad s_t = (1 - g_t) \odot s_{t-1} + g_t \odot \tilde{s}_t,$$

where $\sigma$ is the sigmoid, $\mathrm{TF}$ denotes a lightweight transformer block, and $W_z, W_s$ are trainable projections.
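The gated update can be sketched as follows in PyTorch; the use of `nn.TransformerEncoderLayer` as the lightweight block, the token-wise concatenation, and all widths are illustrative assumptions rather than the paper's exact RecurBlock.

```python
import torch
import torch.nn as nn

class RecurBlock(nn.Module):
    """GRU-style recurrent core: a lightweight transformer block proposes a
    candidate state from the current frame tokens and the previous state, and
    a sigmoid gate blends it with the previous state (illustrative layout)."""
    def __init__(self, dim=384, n_heads=6):
        super().__init__()
        self.candidate = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        # One linear over the concatenation is equivalent to W_z z_t + W_s s_{t-1}.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, z_t, s_prev):                # both: (B, N, D)
        h = torch.cat([z_t, s_prev], dim=1)        # concatenate along the token axis
        cand = self.candidate(h)[:, : z_t.size(1)] # candidate state tokens
        g = torch.sigmoid(self.gate(torch.cat([z_t, s_prev], dim=-1)))
        return (1.0 - g) * s_prev + g * cand       # gated update -> s_t
```

As in a standard GRU, the sigmoid gate lets the state pass through unchanged wherever the new frame adds little, which is consistent with the long-horizon stability discussed in Section 5.2.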
The decoder (a lightweight cross-attention–self-attention transformer) reconstructs pixels at the masked patch locations, conditioned on the recurrent core output $s_t$.
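A minimal sketch of one such decoder block follows, assuming a pre-norm cross-attention → self-attention → MLP ordering; the actual decoder's depth, widths, and final pixel-projection head are not specified here.

```python
import torch
import torch.nn as nn

class RVMDecoderBlock(nn.Module):
    """One decoder block: masked target tokens cross-attend to the recurrent
    state s_t, then self-attend, then pass through an MLP (illustrative)."""
    def __init__(self, dim=384, n_heads=6):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, q, state):                          # q: masked target tokens, state: s_t
        q = q + self.cross(self.n1(q), state, state)[0]   # cross-attend to the recurrent state
        qn = self.n2(q)
        q = q + self.self_attn(qn, qn, qn)[0]             # self-attention among target tokens
        return q + self.mlp(self.n3(q))                   # a linear head would map to pixels
```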
1.2 Asymmetric Masking Strategy
During training, source frames are unmasked, while target frames at random future intervals are masked at 95% of spatial tokens. The encoder processes unmasked source frames and masked target frames; the decoder receives all target token positions (with masked locations replaced by a learned embedding) plus positional encodings, attending over the recurrent state outputs for pixel reconstruction.
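The masking itself reduces to sampling a random 95% subset of a target frame's token positions and substituting a learned embedding. A minimal sketch under that assumption follows (a zero placeholder stands in for the learned mask token).

```python
import torch

def mask_target_tokens(tokens, mask_ratio=0.95, mask_token=None):
    """Replace a random fraction of a target frame's tokens with a mask
    embedding; returns the masked tokens and the boolean mask (illustrative)."""
    B, N, D = tokens.shape
    n_masked = int(round(mask_ratio * N))
    # Per-sample random permutation; the first n_masked indices are masked.
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)
    mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
    batch_idx = torch.arange(B, device=tokens.device).unsqueeze(1)
    mask[batch_idx, idx[:, :n_masked]] = True
    if mask_token is None:
        # In practice this is a learned embedding; zeros are a placeholder.
        mask_token = torch.zeros(D, device=tokens.device)
    out = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), tokens)
    return out, mask
```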
2. Training Objective
RVM utilizes a standard mean squared error over all reconstructed target pixels:

$$\mathcal{L} = \frac{1}{T_{\mathrm{tgt}}} \sum_{t=1}^{T_{\mathrm{tgt}}} \left\lVert \hat{x}_t - x_t \right\rVert_F^2,$$

where $T_{\mathrm{tgt}}$ is the number of target (future) frames and $\lVert \cdot \rVert_F$ is the pixelwise Frobenius norm. No per-patch normalization is applied; averaging occurs over both time and spatial dimensions.
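In code, the objective is just a plain mean over the squared pixel errors of the target frames; the tensor layout below is an assumption.

```python
def rvm_reconstruction_loss(pred, target):
    """MSE over all reconstructed target pixels; the mean runs over the
    T_tgt target frames and every spatial position (no per-patch norm)."""
    # pred, target: (B, T_tgt, C, H, W)
    sq_err = (pred - target) ** 2          # squared Frobenius-norm terms
    return sq_err.mean()                   # average over batch, time, and space
```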
3. Computational and Parameter Efficiency
3.1 Temporal Complexity
RVM offers linear scaling with respect to the temporal length $T$:
- Spatio-temporal transformers (e.g., VideoMAE): full self-attention over all $TN$ tokens of a clip costs $O(T^2 N^2)$ per layer.
- RVM recurrent core: at each frame step, only the queries of the current frame attend to the keys/values of the previous state, an $O(N^2)$ step per frame; over $T$ frames this totals $O(T N^2)$, i.e., linear ($O(T)$) temporal scaling (see the token-count sketch below).
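The asymptotics can be made concrete by counting attention scores per layer; the factor of two for attending over the current frame plus the state is an illustrative assumption.

```python
def attention_scores_per_layer(T, N):
    """Count pairwise attention scores per layer (constants are illustrative)."""
    joint = (T * N) ** 2            # full spatio-temporal self-attention: O(T^2 N^2)
    recurrent = T * N * (2 * N)     # per frame: N queries over ~2N keys/values: O(T N^2)
    return joint, recurrent

# Example: a 16-frame clip with 196 tokens per frame.
joint, recurrent = attention_scores_per_layer(16, 196)
print(f"joint / recurrent = {joint / recurrent:.1f}x")   # gap grows linearly with T
```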
3.2 Model Size and Efficiency
A summary of parameter counts (in millions):
| Model | Small (S) | Base (B) | Large (L) | Huge (H/g) |
|---|---|---|---|---|
| RVM | 34 | 117 | 375 | 743 |
| VideoMAE | – | 87 | 305 | 1013 |
| V-JEPA | – | – | 307 | – |
| DINOv2 | – | – | 303 | 1135 |
RVM-S (34M) matches or surpasses VideoMAE-B (87M) and distilled 4DS-B (91M) without knowledge distillation, corresponding to up to 30× parameter efficiency.
4. Empirical Performance
4.1 Video-Level Tasks
RVM-L and RVM-H achieve state-of-the-art results on standard video benchmarks:
- Something–Something v2 Top-1: RVM-L 66.7% (VideoMAE-L 62.7%, V-JEPA-L 66.0%)
- Kinetics-700: RVM-L 57.3% (VideoMAE-L 52.5%)
- Perception Test point tracking: RVM-L 77.3 (VideoMAE-L 78.3, comparable)
- Small models: RVM-S 59.7% SSv2 (SiamMAE-S 56.0%, 4DS-S 39.9%)
4.2 Dense Spatial and Geometric Tasks
RVM leads the "generalist" Pareto frontier across geometry and dense correspondence tasks:
- ScanNet AbsRel (depth): RVM-L 0.91 (DINOv2-L 1.02, VideoMAE-L 1.10)
- DAVIS J & F (segmentation): RVM-L 66.0% (DINOv2-L 61.7%, VideoMAE-L 54.3%)
- VIP mIoU: RVM-L 38.0% (DINOv2-L 40.6%, VideoMAE-L 18.9%)
Average normalized performance for RVM-L/H is approximately 95% of each task's best model, compared to 82% for DINOv2 or VideoMAE large variants.
4.3 Small-Model Regime
RVM-S/B produce strong performance without distillation, outperforming or matching models up to 30× larger (SiamMAE, 4DS) on the average normalized metric.
5. Qualitative Feature Analyses
5.1 Feature Visualizations
Unsupervised visualizations (PCA/RGB mapping and k-means clustering; see the sketch after this list) demonstrate:
- RVM feature embeddings align with semantically coherent, temporally stable object structures (foreground and background).
- Competing models like VideoMAE and DINOv2 display temporal "flicker" and diminished object coherence under the same analysis.
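The PCA-to-RGB mapping referenced above is straightforward to reproduce; a minimal PyTorch sketch follows, where the normalization and use of the top three components are conventional choices rather than details from the paper.

```python
import torch

def pca_rgb_map(feats):
    """Map per-patch features (N, D) to RGB via their top-3 principal
    components, a common recipe for qualitative feature visualization."""
    feats = feats - feats.mean(dim=0, keepdim=True)        # center the features
    _, _, vh = torch.linalg.svd(feats, full_matrices=False)
    rgb = feats @ vh[:3].T                                 # project onto top-3 PCs
    rgb = rgb - rgb.amin(dim=0)                            # rescale each channel
    rgb = rgb / (rgb.amax(dim=0) + 1e-8)                   # to the [0, 1] range
    return rgb                                             # reshape to the patch grid for display
```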
5.2 Long-Horizon Feature Propagation
Label propagation experiments on DAVIS-2017 for sequences >80 frames show RVM features exhibit superior long-term segmentation accuracy retention compared to full-attention video models or frame-independent image models, indicating durable temporal information in the recurrent state.
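Label propagation of this kind is typically implemented as temperature-scaled nearest-neighbour voting in feature space; the sketch below is a single-step version under that assumption (the exact protocol, context length, and hyperparameters used in the paper are not specified here).

```python
import torch
import torch.nn.functional as F

def propagate_labels(feat_prev, labels_prev, feat_next, topk=5, temp=0.07):
    """Nearest-neighbour label propagation: each patch in the next frame takes
    a soft vote from its most similar patches in the previous frame."""
    # feat_*: (N, D) patch features; labels_prev: (N_prev, C) soft/one-hot masks.
    feat_prev = F.normalize(feat_prev, dim=-1)
    feat_next = F.normalize(feat_next, dim=-1)
    sim = feat_next @ feat_prev.T                      # (N_next, N_prev) cosine similarities
    vals, idx = sim.topk(topk, dim=-1)                 # restrict to top-k neighbours
    weights = torch.softmax(vals / temp, dim=-1)       # temperature-scaled voting
    return (weights.unsqueeze(-1) * labels_prev[idx]).sum(dim=1)   # (N_next, C)
```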
6. Discussion and Directions
6.1 Notable Contributions
- Recurrent video masked-autoencoding with transformer-based GRU aggregation of per-frame tokens.
- Asymmetric masking: only history frames are available to the encoder; prediction is made for highly masked future frames, enforcing causal structure.
- A "generalist" encoder suitable for both video-level tasks (e.g., action recognition, tracking) and image-level spatial tasks (e.g., depth estimation, correspondence).
- Small-model parameter efficiency over VideoMAE and comparable baselines (up to 30×), achieved without knowledge distillation.
- Linear memory/computation scaling and emergent stability over long temporal horizons.
6.2 Limitations
- For short clips, recurrence may be less efficient than tube-based spatio-temporal attention approaches, since RVM performs no joint spatio-temporal patching.
- Backpropagation through time across full ViT encoder steps incurs high memory usage.
- Data scaling has been probed only up to roughly 2 billion video training clips; no saturation was observed, so behavior at larger scales remains uncharacterized.
6.3 Future Research
- Formalizing compute–data scaling laws for optimal resource allocation.
- Extending RVM to multi-modal video (e.g., with audio or language) and embodied control.
- Incorporating sparse or compressive recurrent updates for ultra-long contexts.
- Combining RVM with generative or contrastive objectives to enhance representation quality.
RVM demonstrates that transformer-based recurrence, in combination with pixel-level reconstruction and an asymmetric masking scheme, can deliver both highly efficient and highly general visual representations from large-scale unlabeled video data (Zoran et al., 15 Dec 2025).