CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders

Published 8 Feb 2025 in cs.CV | (2502.07811v1)

Abstract: Current video-based Masked Autoencoders (MAEs) primarily focus on learning effective spatiotemporal representations from a visual perspective, which may lead the model to prioritize general spatial-temporal patterns but often overlook nuanced semantic attributes like specific interactions or sequences that define actions - such as action-specific features that align more closely with human cognition for space-time correspondence. This can limit the model's ability to capture the essence of certain actions that are contextually rich and continuous. Humans are capable of mapping visual concepts, object view invariance, and semantic attributes available in static instances to comprehend natural dynamic scenes or videos. Existing MAEs for videos and static images rely on separate datasets for videos and images, which may lack the rich semantic attributes necessary for fully understanding the learned concepts, especially when compared to using video and corresponding sampled frame images together. To this end, we propose CrossVideoMAE an end-to-end self-supervised cross-modal contrastive learning MAE that effectively learns both video-level and frame-level rich spatiotemporal representations and semantic attributes. Our method integrates mutual spatiotemporal information from videos with spatial information from sampled frames within a feature-invariant space, while encouraging invariance to augmentations within the video domain. This objective is achieved through jointly embedding features of visible tokens and combining feature correspondence within and across modalities, which is critical for acquiring rich, label-free guiding signals from both video and frame image modalities in a self-supervised manner. Extensive experiments demonstrate that our approach surpasses previous state-of-the-art methods and ablation studies validate the effectiveness of our approach.

Abstract PDF Upgrade to Chat

Authors (6)

Summary

The paper proposes a dual-branch MAE framework that integrates video and image modalities using both intra-modal and cross-modal contrastive learning.
It employs high masking ratios (90–95% for video, 75–90% for image) and semantic distillation to capture nuanced spatiotemporal and contextual features.
Empirical results show state-of-the-art performance on benchmarks like SSv2, K400, UCF101, and HMDB51, highlighting its efficacy for action recognition.

CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders

Motivation and Context

CrossVideoMAE addresses a critical limitation in video-based Masked Autoencoders (MAEs): the inability to fully capture nuanced semantic attributes and spatiotemporal correspondences necessary for robust action recognition. Existing MAEs tend to focus on generic spatial-temporal patterns, often missing contextually rich and continuous action-specific features. The proposed framework leverages both video sequences and sampled static frames, integrating mutual spatiotemporal information and semantic knowledge in a feature-invariant space. This joint intra-modal and cross-modal contrastive learning paradigm is designed to enforce invariance to augmentations and facilitate effective semantic knowledge distillation, thereby enhancing the quality of learned representations for downstream tasks.

Methodology

Architecture Overview

CrossVideoMAE employs a dual-branch architecture: a video branch and an image branch, both based on ViT-B/16. The video branch utilizes SpatioTemporalMAE for intra-modal pre-training, while the image branch leverages a pre-trained MAE for cross-modal semantic distillation. Both branches operate with high masking ratios (90–95% for video, 75–90% for image), which is shown to be optimal for representation learning.

Figure 1: The CrossVideoMAE framework with video and image branches, illustrating intra-modal and cross-modal contrastive learning objectives at both video and frame levels.

The model is pre-trained jointly across video and image domains using a combination of intra-modal and cross-modal contrastive learning objectives at both the video and frame levels. For downstream tasks, only the video branch encoder is retained.

Intra-modal contrastive learning is applied at both video and frame levels to enforce view-invariant representations. For each video, two views (raw and augmented) are generated via spatiotemporal and spatial augmentations. The encoder projects these views into a feature-invariant space, and the NT-Xent loss is used to maximize similarity between positive pairs and minimize similarity with negatives within the batch. Frame-level embeddings are sampled from video-level embeddings to capture temporal variations.

Cross-modal contrastive learning aligns feature embeddings between video sequences and their corresponding sampled frames. The image encoder projects sampled frames into the invariant space, and the model maximizes cosine similarity between video and frame-level features. This strategy compels the model to learn from more challenging positive and negative samples, enhancing representation capability beyond intra-modal alignment.

Reconstruction and MSE Losses

Reconstruction and MSE losses are employed to ensure fidelity in the decoded outputs and to improve the feature embedding of visible tokens. The MSE loss is computed between the decoder outputs of original and augmented videos, while the reconstruction loss is the mean squared error between the predicted and target representations for both views.

Joint Objective

The overall pre-training objective is a weighted sum of intra-modal and cross-modal contrastive losses, reconstruction loss, and MSE loss:

$\mathcal{L} = \lambda_c \times (\mathcal{L}_{\text{intra}} + \mathcal{L}_{\text{cross}}) + \mathcal{L}_{\text{rl}} + \mathcal{L}_{\text{mse}}$

Implementation Details

Backbone: ViT-B/16, shared weights, ~87M parameters.
Input: 16 frames per video, 224×224 resolution, patch size 2×3×16×16, yielding 1568 tokens.
Masking: Random masking, optimal at 90–95% for video, 75–90% for image.
Augmentations: Random resizing, cropping, horizontal flipping, random erasing, mixup, cutmix.
Optimization: AdamW, batch size 32, 8 GPUs, decoder depth 4.
Test-Time Adaptation (TTA): 20 gradient updates per batch during inference, reducing GPU memory and pre-training time.

Empirical Results

CrossVideoMAE achieves state-of-the-art performance on UCF101, HMDB51, Kinetics-400 (K400), and Something-Something V2 (SSv2) datasets. Notably, it outperforms previous methods in both full fine-tuning and linear evaluation settings, with the most significant gains on SSv2 due to effective semantic alignment from sampled frames.

SSv2 Acc@1: 73.7% (best in class)
K400 Acc@1: 83.2% (best in class)
UCF101 Acc@1: 97.6% (best in class)
HMDB51 Acc@1: 78.4% (best in class)

Ablation studies confirm that joint intra-modal and cross-modal contrastive learning, high masking ratios, and random masking strategies are critical for optimal performance. Increasing decoder depth up to four blocks improves accuracy, with diminishing returns beyond that.

Figure 2: Self-attention maps visualization of CrossVideoMAE, demonstrating effective learning of spatiotemporal and semantic representations across original, masked, and reconstructed frames.

Figure 3: Example self-attention maps on K400, highlighting the model's focus on action-relevant regions.

Figure 4: Example self-attention maps on SSv2, showing robust semantic attribute extraction under high masking ratios.

Qualitative Analysis

Self-attention map visualizations reveal that CrossVideoMAE consistently attends to semantically relevant regions, such as hands and objects involved in actions, even under aggressive masking. The model demonstrates strong invariance to augmentations and effective semantic knowledge transfer from sampled frames to videos.

Practical Implications and Future Directions

CrossVideoMAE's architecture and training paradigm enable efficient, label-free video representation learning with reduced computational resources. The framework is highly transferable to downstream tasks such as action recognition, video retrieval, and potentially video captioning. The reliance on sampled frames for semantic distillation suggests future work could explore automated frame selection strategies or extend the approach to other modalities (e.g., audio, text).

The demonstrated gains in representation quality and resource efficiency position CrossVideoMAE as a robust foundation for scalable video understanding systems. Further research may investigate its integration with multimodal transformers, domain adaptation, and real-time inference in resource-constrained environments.

Conclusion

CrossVideoMAE introduces a principled approach to self-supervised image-video representation learning, leveraging joint intra-modal and cross-modal contrastive objectives, high masking ratios, and semantic knowledge distillation from sampled frames. The method achieves superior performance across multiple benchmarks, validating its effectiveness for spatiotemporal and semantic representation learning. The framework's design and empirical results suggest promising avenues for future research in efficient, scalable video understanding.