Masked Autoencoders as Spatiotemporal Learners
This paper extends Masked Autoencoders (MAE) to spatiotemporal representation learning from video. The authors mask random spacetime patches of a video and train an autoencoder to reconstruct them. The approach builds in almost no spacetime-specific inductive bias beyond patch and positional embeddings, and is therefore largely agnostic to the structure of video data.
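To make this minimal-bias design concrete, the sketch below embeds a video clip as a plain sequence of spacetime patch tokens plus learned positional embeddings. This is not the authors' code: the clip length, resolution, patch size, embedding width, and the `SpacetimePatchEmbed` name are illustrative assumptions.

```python
# Minimal sketch (assumed sizes): a 16-frame 224x224 clip cut into 2x16x16
# spacetime patches, each projected to a 768-d token, plus positional embeddings.
import torch
import torch.nn as nn

class SpacetimePatchEmbed(nn.Module):
    def __init__(self, frames=16, size=224, patch_t=2, patch_hw=16, dim=768):
        super().__init__()
        # A 3D convolution with stride equal to kernel size cuts the clip into
        # non-overlapping t x h x w patches and projects each to `dim` channels.
        self.proj = nn.Conv3d(3, dim,
                              kernel_size=(patch_t, patch_hw, patch_hw),
                              stride=(patch_t, patch_hw, patch_hw))
        num_patches = (frames // patch_t) * (size // patch_hw) ** 2
        # Learned positional embeddings are the only spacetime-specific bias added.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, video):             # video: (B, 3, T, H, W)
        x = self.proj(video)              # (B, dim, T', H', W')
        x = x.flatten(2).transpose(1, 2)  # (B, N, dim) token sequence
        return x + self.pos_embed

tokens = SpacetimePatchEmbed()(torch.randn(2, 3, 16, 224, 224))
print(tokens.shape)  # torch.Size([2, 1568, 768])
```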
Key Observations and Methodology
- High Masking Ratio: The optimal masking ratio for video is found to be 90%, higher than the 75% used for images. The authors attribute this to the greater information redundancy of video, whose frames tend to be temporally coherent.
- Efficiency Gains: The high masking ratio translates directly into computational savings: because the encoder operates only on the visible patches, its computational load drops by over 90%, yielding more than a 4x wall-clock speed-up during training (a minimal sketch of this mechanism follows the list).
- Competitive Performance: The approach achieves competitive performance across several challenging video datasets, such as Kinetics-400. Notably, the MAE method surpasses supervised pre-training by significant margins, underlining its efficacy in extracting meaningful representations without heavy reliance on domain-specific knowledge.
- Real-World Application: The paper also reports successful training on uncurated real-world data from Instagram, demonstrating that the MAE approach scales effectively to diverse and uncurated datasets, offering promising potential for generalized video representation learning.
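The sketch below illustrates the masking-plus-efficiency mechanism referenced above: with a 90% masking ratio, the encoder only ever sees roughly 10% of the tokens, which is where most of the savings come from. The `random_spacetime_mask` helper and the token shapes (carried over from the earlier sketch) are assumptions for illustration, not the paper's implementation.

```python
import torch

def random_spacetime_mask(tokens, mask_ratio=0.9):
    """Keep a random (1 - mask_ratio) subset of tokens, sampled uniformly over spacetime."""
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                        # one random score per token
    keep_idx = noise.argsort(dim=1)[:, :num_keep]   # tokens with the lowest scores survive
    visible = torch.gather(tokens, 1,
                           keep_idx.unsqueeze(-1).expand(-1, -1, D))
    # keep_idx is returned so a decoder could later restore the full token order.
    return visible, keep_idx

tokens = torch.randn(2, 1568, 768)                  # token sequence as in the earlier sketch
visible, keep_idx = random_spacetime_mask(tokens)
print(visible.shape)                                # torch.Size([2, 156, 768]) -- ~10% of tokens
```

Only these visible tokens are fed to the encoder, so the expensive transformer blocks process about a tenth of the sequence; mask tokens are reintroduced only for the lightweight decoder.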
Experimental Analysis
- Masking Strategy: Ablations show that unstructured, spacetime-agnostic random masking outperforms spatial-only and temporal-only masking strategies (the three variants are sketched after this list). This supports treating video tokens uniformly rather than masking along space or time separately.
- Decoder Architecture: Unlike its image-based counterpart, the video variant benefits from a slightly heavier decoder to adequately model the added complexity of video data.
- Pre-training and Transfer Learning: MAE pre-training on video yields substantial gains on downstream tasks, outperforming supervised pre-training even when the latter uses large labeled datasets. The gains hold across a range of architectures, showcasing the method's versatility.
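To make the masking-strategy comparison concrete, the sketch below generates the three kinds of masks contrasted in the ablation on a time-by-space token grid: unstructured random spacetime masking, space-only ("tube") masking that removes the same spatial locations in every frame, and time-only masking that removes whole frames. The grid size and the `build_mask` helper are illustrative assumptions, not the paper's code.

```python
import torch

def build_mask(t, s, ratio=0.9, strategy="spacetime"):
    """Return a boolean (t, s) grid where True marks a masked token."""
    mask = torch.zeros(t, s, dtype=torch.bool)
    if strategy == "spacetime":                        # mask individual tokens independently
        flat = mask.view(-1)
        drop = torch.rand(t * s).argsort()[: int(t * s * ratio)]
        flat[drop] = True
    elif strategy == "space":                          # mask whole spatial "tubes" across time
        cols = torch.rand(s).argsort()[: int(s * ratio)]
        mask[:, cols] = True
    elif strategy == "time":                           # mask whole frames
        rows = torch.rand(t).argsort()[: int(t * ratio)]
        mask[rows, :] = True
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return mask

for name in ("spacetime", "space", "time"):
    m = build_mask(8, 196, ratio=0.9, strategy=name)   # 8 temporal x 196 spatial tokens
    print(name, round(m.float().mean().item(), 3))     # fraction masked, close to 0.9 in each case
```

All three variants mask a similar fraction of tokens; what differs is the structure of what remains, and the unstructured spacetime variant leaves the least exploitable redundancy for the reconstruction task.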
Implications and Future Directions
The research presented demonstrates that MAE can serve as a unifying methodology across modalities, reflecting a broader trend toward generalized learning frameworks across data types, such as language and vision. This approach's ability to handle videos with minimal inductive bias suggests significant implications for unsupervised and self-supervised learning paradigms. By minimizing the dependency on domain-specific architectures, this work opens avenues for creating more flexible and universal models capable of tackling diverse data types without extensive specialization.
Future exploration could involve extending this methodology to other spatiotemporal data representations beyond video, such as 3D point clouds and medical imaging, to continue the push towards general-purpose AI frameworks. Additionally, there is potential in scaling this method with ever-larger datasets, akin to developments in the language domain, which could yield insights into the latent structures within spatiotemporal data.