
Recurrent Video Masked-Autoencoders

Updated 22 December 2025
  • The paper introduces a recurrent masked-autoencoder that aggregates per-frame tokens using a transformer-based GRU core to capture temporal dynamics.
  • It employs an asymmetric masking strategy, processing unmasked history frames and predicting heavily masked future frames to ensure efficient, scalable reconstruction.
  • Empirical results demonstrate state-of-the-art performance in video benchmarks and dense spatial tasks with up to 30× parameter efficiency over comparable models.

Recurrent Video Masked-Autoencoders (RVM) are a video representation learning framework that integrates a transformer-based recurrent neural network with an asymmetric masked prediction task, in which the model reconstructs masked future frames from a short history of unmasked frames. The approach is distinguished by its efficient aggregation of dense image features across time, linear computational scaling with the temporal horizon, and a unified "generalist" encoder, enabling high performance on video and dense spatial tasks at significantly lower parameter counts than previous state-of-the-art video autoencoders (Zoran et al., 15 Dec 2025).

1. Model Architecture

1.1 Asymmetric Encoder–Decoder Pipeline

RVM operates on two video streams:

  • Source stream: Builds a recurrent state from $K$ consecutive history frames, processed unmasked through a shared Vision Transformer (ViT) encoder.
  • Target stream: Contains one or more future frames which are subjected to heavy spatial masking; these are reconstructed by the decoder conditioned on the recurrent state.

Each frame $x \in \mathbb{R}^{H \times W \times 3}$ is divided into non-overlapping $P \times P$ patches, linearly projected to dimension $D$, yielding $N = (H/P)\,(W/P)$ tokens with added Fourier positional embeddings. The encoder $E$ (a standard ViT with $L$ blocks, $H$ heads, and MLP ratio $r$) embeds each frame independently, producing per-frame tokens $\hat{e}_t \in \mathbb{R}^{(N+1)\times D}$.
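
As a concrete illustration, the following is a minimal sketch of the per-frame tokenization in PyTorch. The module name `PatchEmbed`, the sin/cos form of the Fourier embeddings, and the frequency schedule are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Hypothetical sketch: one frame -> (N+1) tokens of dimension D, as described above."""
    def __init__(self, img_size=224, patch_size=16, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2        # N = (H/P)(W/P)
        # Non-overlapping P x P patches, linearly projected to dimension D.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # accounts for the (N+1)-th token
        self.register_buffer("pos_emb", self._sincos(self.num_patches, dim))

    @staticmethod
    def _sincos(n, dim):
        # Fixed sin/cos positional embeddings (one plausible form of "Fourier" embeddings).
        pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)            # (N, 1)
        freqs = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))   # (D/2,)
        emb = torch.zeros(n, dim)
        emb[:, 0::2] = torch.sin(pos * freqs)
        emb[:, 1::2] = torch.cos(pos * freqs)
        return emb

    def forward(self, x):                                  # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, D)
        tokens = tokens + self.pos_emb                     # add positional embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)    # (B, 1, D)
        return torch.cat([cls, tokens], dim=1)             # (B, N+1, D), fed to the ViT blocks
```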

The recurrence is realized by the transformer-based GRU core ("RecurBlock"), which aggregates per-frame tokens into a recurrent state $h_t$:

$$
\begin{aligned}
y_t &= E\bigl(x_t \odot m_t\bigr) \in \mathbb{R}^{(N+1)\times D} \\
u_t &= \sigma\bigl(W^u_e\, y_t + W^u_h\, h_{t-1}\bigr) \\
r_t &= \sigma\bigl(W^r_e\, y_t + W^r_h\, h_{t-1}\bigr) \\
\tilde h_t &= \mathrm{Tx}\bigl(q = y_t,\; kv = r_t \odot h_{t-1}\bigr) \\
h_t &= (1 - u_t) \odot h_{t-1} + u_t \odot \tilde h_t
\end{aligned}
$$

where $\sigma$ is the sigmoid, $\mathrm{Tx}$ denotes a lightweight transformer block, and the $W$ matrices are trainable projections.
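
A compact PyTorch sketch of this gating recurrence is given below. The class name, the single-block form of $\mathrm{Tx}$, and the pre-norm placement are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class RecurBlock(nn.Module):
    """Hypothetical transformer-based GRU core: fuses y_t with the previous state h_{t-1}."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.W_eu, self.W_hu = nn.Linear(dim, dim), nn.Linear(dim, dim)   # update-gate projections
        self.W_er, self.W_hr = nn.Linear(dim, dim), nn.Linear(dim, dim)   # reset-gate projections
        # Lightweight transformer block Tx: queries from y_t, keys/values from the gated state.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)

    def forward(self, y_t, h_prev):                               # both (B, N+1, D)
        u = torch.sigmoid(self.W_eu(y_t) + self.W_hu(h_prev))     # update gate u_t
        r = torch.sigmoid(self.W_er(y_t) + self.W_hr(h_prev))     # reset gate r_t
        kv = self.norm_kv(r * h_prev)                             # gated previous state
        h_tilde, _ = self.attn(self.norm_q(y_t), kv, kv)          # candidate state via cross-attention
        h_tilde = h_tilde + self.mlp(self.norm_mlp(h_tilde))
        return (1 - u) * h_prev + u * h_tilde                     # GRU-style convex interpolation
```

Unrolling this block over the $K$ source frames (with $h_0$ initialized, e.g., to a learned embedding) yields the recurrent state that conditions the decoder.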

The decoder $D$ (a lightweight cross-attention–self-attention transformer) reconstructs pixels $z \in \mathbb{R}^{H\times W\times 3}$ from the masked patches, conditioned on the recurrent core output $o_t \equiv h_t$.
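
The decoder can be pictured as a stack of blocks like the sketch below; the block name, decoder width, and the placement of the pixel head are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Hypothetical cross-attn / self-attn decoder block reading from the recurrent state."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, target_tokens, state):
        # target_tokens: (B, N+1, D_dec) target-frame tokens (mostly [MASK] embeddings).
        # state:         (B, S, D_dec)  recurrent-core output o_t, projected to decoder width.
        q = self.n1(target_tokens)
        x = target_tokens + self.cross_attn(q, state, state)[0]   # condition on the recurrent state
        y = self.n2(x)
        x = x + self.self_attn(y, y, y)[0]                        # refine among target tokens
        x = x + self.mlp(self.n3(x))
        return x                                                  # a linear head maps tokens to P*P*3 pixels
```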

1.2 Asymmetric Masking Strategy

During training, the $K$ source frames are left unmasked, while target frames at random future offsets have 95% of their spatial tokens masked. The encoder processes unmasked source frames and masked target frames; the decoder receives all target token positions (with masked locations replaced by a learned $[\mathrm{MASK}]$ embedding) plus positional encodings, and attends over the $K$ recurrent state outputs to reconstruct pixels.
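
The target-side masking can be sketched as follows. The 95% ratio comes from the text above; the function name and per-sample random permutation are illustrative choices.

```python
import torch

def mask_target_tokens(tokens, mask_token, mask_ratio=0.95):
    """Replace a random 95% of a target frame's patch tokens with the learned [MASK] embedding.

    tokens:     (B, N, D) patch tokens of one future frame (CLS token excluded).
    mask_token: (D,) learned [MASK] embedding.
    Returns the decoder inputs plus the boolean mask of hidden positions.
    """
    B, N, D = tokens.shape
    num_masked = int(mask_ratio * N)
    noise = torch.rand(B, N, device=tokens.device)     # per-sample random scores
    ids = noise.argsort(dim=1)                         # random permutation of patch indices
    mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
    batch_idx = torch.arange(B, device=tokens.device).unsqueeze(1)
    mask[batch_idx, ids[:, :num_masked]] = True        # first num_masked indices are hidden
    decoder_in = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), tokens)
    return decoder_in, mask
```

Positional encodings are then added to `decoder_in`, and the decoder cross-attends over the recurrent state to predict the hidden pixels.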

2. Training Objective

RVM uses a standard mean squared error over all reconstructed target pixels:

$$
L = \frac{1}{T} \sum_{t=1}^{T} \bigl\lVert x_{K+t} - z_{K+t} \bigr\rVert_F^2
$$

where $T$ is the number of target (future) frames and $\lVert\cdot\rVert_F^2$ is the pixelwise Frobenius norm. No per-patch normalization is applied; the loss is averaged over both time and spatial dimensions.
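
In code the objective reduces to a plain mean squared error; the sketch below assumes predictions and targets are laid out as (batch, time, height, width, channels).

```python
import torch.nn.functional as F

def rvm_loss(pred, target):
    """Pixel-space MSE over all reconstructed target frames.

    pred, target: (B, T, H, W, 3) predicted and ground-truth future frames.
    No per-patch normalization; the mean runs over time and all spatial positions,
    matching (1/T) * sum_t ||x_{K+t} - z_{K+t}||_F^2 up to a constant pixel count.
    """
    return F.mse_loss(pred, target)
```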

3. Computational and Parameter Efficiency

3.1 Temporal Complexity

RVM offers linear scaling with respect to the temporal length $K$:

  • Spatio-temporal transformers (e.g., VideoMAE): full self-attention over the $KN$ tokens of a clip costs $O\bigl((KN)^2\bigr)$ per layer.
  • RVM recurrent core: each frame step attends only between the $N$ queries of the current frame and the $N$ keys/values of the previous state, an $O(N^2)$ operation repeated over $K$ frames, totaling $O(KN^2)$, i.e., $O(K)$ temporal scaling (see the back-of-the-envelope comparison below).
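
A rough comparison of attention cost, counting query–key pairs per layer, makes the scaling concrete; the token counts in the example are illustrative.

```python
def attention_pairs(K, N):
    """Rough query-key pair counts per layer for a K-frame clip with N patches per frame."""
    joint = (K * N) ** 2          # full spatio-temporal self-attention over K*N tokens
    recurrent = K * (N + 1) ** 2  # K recurrent steps, each attending over ~N tokens
    return joint, recurrent

# Example: 16 frames of 14 x 14 = 196 patches (224px frames, 16px patches).
joint, recurrent = attention_pairs(K=16, N=196)
print(f"joint: {joint:,}  recurrent: {recurrent:,}  ratio: {joint / recurrent:.1f}x")
# Doubling K quadruples the joint cost but only doubles the recurrent cost.
```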

3.2 Model Size and Efficiency

A summary of parameter counts (in millions):

Model      Small (S)   Base (B)   Large (L)   Huge (H/g)
RVM        34          117        375         743
VideoMAE   —           87         305         1013
V-JEPA     —           —          307         —
DINOv2     —           —          303         1135

RVM-S (34M) matches or surpasses VideoMAE-B (87M) and distilled 4DS-B (91M) without knowledge distillation, corresponding to up to 30× parameter efficiency.

4. Empirical Performance

4.1 Video-Level Tasks

RVM-L and RVM-H achieve state-of-the-art results on standard video benchmarks:

  • Something–Something v2 Top-1: RVM-L 66.7% (VideoMAE-L 62.7%, V-JEPA-L 66.0%)
  • Kinetics-700: RVM-L 57.3% (VideoMAE-L 52.5%)
  • Perception Test point tracking: RVM-L 77.3 (VideoMAE-L 78.3, comparable)
  • Small models: RVM-S 59.7% SSv2 (SiamMAE-S 56.0%, 4DS-S 39.9%)

4.2 Dense Spatial and Geometric Tasks

RVM leads the "generalist" Pareto frontier across geometry and dense correspondence tasks:

  • ScanNet AbsRel (depth): RVM-L 0.91 (DINOv2-L 1.02, VideoMAE-L 1.10)
  • DAVIS J & F (segmentation): RVM-L 66.0% (DINOv2-L 61.7%, VideoMAE-L 54.3%)
  • VIP mIoU: RVM-L 38.0% (DINOv2-L 40.6%, VideoMAE-L 18.9%)

Average normalized performance for RVM-L/H is approximately 95% of each task's best model, compared to 82% for DINOv2 or VideoMAE large variants.

4.3 Small-Model Regime

RVM-S/B deliver strong performance without distillation, outperforming or matching models up to 30× larger (SiamMAE, 4DS) on the average normalized metric.

5. Qualitative Feature Analyses

5.1 Feature Visualizations

Unsupervised visualizations (PCA/RGB mapping and k-means clustering) demonstrate:

  • RVM feature embeddings align with semantically coherent, temporally stable object structures (foreground and background).
  • Competing models like VideoMAE and DINOv2 display temporal "flicker" and diminished object coherence under the same analysis.

5.2 Long-Horizon Feature Propagation

Label propagation experiments on DAVIS-2017 for sequences >80 frames show RVM features exhibit superior long-term segmentation accuracy retention compared to full-attention video models or frame-independent image models, indicating durable temporal information in the recurrent state.
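For context, DAVIS-style label propagation is commonly evaluated with a simple k-nearest-neighbour copy of labels in feature space. The sketch below is a generic version of that protocol (the temperature, k, and function name are assumptions), not the paper's exact evaluation code.

```python
import torch
import torch.nn.functional as F

def propagate_labels(feat_ref, labels_ref, feat_query, topk=7, temperature=0.1):
    """Copy segmentation labels from reference locations to query locations by feature similarity.

    feat_ref:   (N_ref, D) features at annotated reference locations (earlier frames).
    labels_ref: (N_ref, C) one-hot labels for those locations.
    feat_query: (N_q, D)  features of the current frame.
    """
    sim = F.normalize(feat_query, dim=-1) @ F.normalize(feat_ref, dim=-1).T   # cosine similarity (N_q, N_ref)
    vals, idx = sim.topk(topk, dim=-1)                                        # k most similar references
    weights = (vals / temperature).softmax(dim=-1)                            # (N_q, k)
    neighbour_labels = labels_ref[idx]                                        # (N_q, k, C)
    return (weights.unsqueeze(-1) * neighbour_labels).sum(dim=1)              # soft labels (N_q, C)
```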

6. Discussion and Directions

6.1 Notable Contributions

  • Recurrent video masked-autoencoding with transformer-based GRU aggregation of per-frame tokens.
  • Asymmetric masking: only history frames are available to the encoder; prediction is made for highly masked future frames, enforcing causal structure.
  • A "generalist" encoder suitable for both video-level tasks (e.g., action recognition, tracking) and image-level spatial tasks (e.g., depth estimation, correspondence).
  • Small-model parameter efficiency (up to 30× over VideoMAE) achieved without knowledge distillation.
  • Linear memory/computation scaling and emergent stability over long temporal horizons.

6.2 Limitations

  • For short clips, recurrence may be less efficient than tube-based attention approaches, since no joint spatio-temporal patching occurs.
  • Backpropagation through time across full ViT encoder steps incurs high memory usage.
  • No saturation in data scaling observed up to 2 billion video training clips.

6.3 Future Research

  • Formalizing compute–data scaling laws for optimal resource allocation.
  • Extending RVM to multi-modal video (e.g., with audio or language) and embodied control.
  • Incorporating sparse or compressive recurrent updates for ultra-long contexts.
  • Combining RVM with generative or contrastive objectives to enhance representation quality.

RVM demonstrates that transformer-based recurrence, in combination with pixel-level reconstruction and an asymmetric masking scheme, can deliver both highly efficient and highly general visual representations from large-scale unlabeled video data (Zoran et al., 15 Dec 2025).

References

  • Zoran et al., "Recurrent Video Masked-Autoencoders," 15 December 2025.