VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Transformers have revolutionized several areas of computer science, most visibly NLP and computer vision (CV). Applied to video, transformer models show substantial promise, but their effectiveness is often limited by the need for large-scale datasets and heavy computational resources. This paper introduces VideoMAE, a self-supervised video pre-training method that leverages masked autoencoders to achieve data-efficient learning for video transformers.
Key Contributions and Findings
The authors highlight several noteworthy insights about VideoMAE:
- High Masking Ratio: VideoMAE performs well even with an extremely high masking ratio (90%-95%). The strong temporal redundancy in video lets the model tolerate such aggressive masking, in contrast to images, where lower ratios (around 75% in image MAE) are typical; a quick token-count sketch appears after this list.
- Small Dataset Efficiency: The model learns effectively from very little data. Even on small datasets (roughly 3k-4k videos) and without any extra data, VideoMAE benefits from the challenging video reconstruction task, which forces the model to capture high-level structure.
- Emphasis on Data Quality: The experiments show that data quality matters more than quantity for self-supervised video pre-training, and that domain shift between the pre-training and target datasets is a critical factor in how well the learned representations transfer.
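To make these masking ratios concrete, the back-of-the-envelope calculation below counts tokens for the clip configuration most often reported in the paper (16 frames at 224x224 with 2x16x16 cubes); the snippet is an illustration of the arithmetic, not code from the paper.

```python
# Token counts for a 16-frame, 224x224 clip with 2x16x16 cube embedding.
frames, height, width = 16, 224, 224
cube_t, cube_h, cube_w = 2, 16, 16

num_tokens = (frames // cube_t) * (height // cube_h) * (width // cube_w)  # 1568
for ratio in (0.75, 0.90, 0.95):
    visible = int(num_tokens * (1 - ratio))
    print(f"mask ratio {ratio:.0%}: encoder sees {visible} of {num_tokens} tokens")
# 75%: 392 tokens, 90%: 156 tokens, 95%: 78 tokens
```

Because the encoder only processes visible tokens, a 90%-95% ratio shrinks its input by roughly an order of magnitude, which is where much of the pre-training speed-up comes from.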
Methodological Innovations
VideoMAE introduces strategic modifications to the masked autoencoder approach tailored specifically for video data:
- Temporal Downsampling: Consecutive video frames are highly redundant, with semantics that change slowly over time. VideoMAE therefore uses strided temporal sampling, keeping only a sparse set of frames per clip to make pre-training more efficient; a minimal sampling sketch follows this list.
- Cube Embedding: Instead of 2D image patches, VideoMAE uses joint space-time cube embedding, where each token represents a small spatiotemporal cube of the video, giving the encoder a natural spatiotemporal token grid; see the embedding sketch below.
- Tube Masking Strategy: To prevent temporal correlation from leaking information during reconstruction, VideoMAE masks the same spatial positions across all frames (tube masking), so a masked cube cannot be recovered by copying it from a neighboring time step and the transformer must reconstruct from higher-level semantics; a small masking sketch is also given below.
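The sketches below illustrate these three components; they are minimal illustrations under stated assumptions, not the authors' implementation. First, strided temporal sampling: the function picks 16 frame indices with a fixed stride (stride 4 is the Kinetics setting reported in the paper); the random start position and the clamping fallback for short videos are assumptions for illustration.

```python
import numpy as np

def sample_clip(num_total_frames: int, clip_len: int = 16, stride: int = 4) -> np.ndarray:
    """Pick `clip_len` frame indices spaced `stride` frames apart."""
    span = clip_len * stride
    # Random start so the strided window fits inside the video when possible.
    start = np.random.randint(0, max(num_total_frames - span, 0) + 1)
    indices = start + np.arange(clip_len) * stride
    # For very short videos, clamp indices to the last available frame.
    return np.minimum(indices, num_total_frames - 1)

print(sample_clip(300))  # 16 indices covering a 64-frame window
```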
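Second, a minimal cube-embedding module, assuming the paper's 2x16x16 cube size and a ViT-Base embedding width of 768; this kind of joint space-time embedding is commonly realized as a single 3D convolution whose kernel and stride both equal the cube size.

```python
import torch
import torch.nn as nn

class CubeEmbed(nn.Module):
    """Map each 2x16x16 space-time cube of a video to one token."""

    def __init__(self, in_chans: int = 3, embed_dim: int = 768, cube=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(in_chans, embed_dim, kernel_size=cube, stride=cube)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, 3, T, H, W) -> (B, embed_dim, T/2, H/16, W/16)
        x = self.proj(video)
        # Flatten the space-time grid into a token sequence: (B, num_tokens, embed_dim)
        return x.flatten(2).transpose(1, 2)

tokens = CubeEmbed()(torch.randn(1, 3, 16, 224, 224))
print(tokens.shape)  # torch.Size([1, 1568, 768])
```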
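Third, tube masking as a standalone function: a single spatial mask is drawn at the target ratio and repeated over every temporal position, so the same spatial cells are hidden in every frame group. The temporal-major token ordering and the 8x14x14 token grid are assumptions that match the embedding sketch above.

```python
import torch

def tube_mask(num_temporal: int = 8, num_spatial: int = 14 * 14,
              mask_ratio: float = 0.9) -> torch.Tensor:
    """Return a boolean mask of shape (num_temporal * num_spatial,), True = masked."""
    num_masked = int(mask_ratio * num_spatial)
    noise = torch.rand(num_spatial)              # one random score per spatial cell
    spatial_mask = torch.zeros(num_spatial, dtype=torch.bool)
    spatial_mask[noise.argsort()[:num_masked]] = True
    # Repeat the same spatial pattern at every temporal position (the "tube").
    return spatial_mask.repeat(num_temporal)

mask = tube_mask()
print(mask.shape, mask.float().mean())  # torch.Size([1568]) ~0.90
```

Masking whole tubes rather than independent cubes per time step is the key design choice: with per-frame random masking, a hidden patch would often be visible at the same location in an adjacent time step, reducing reconstruction to a near-copy task.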
Experimental Results
Empirical evaluations across multiple datasets demonstrate VideoMAE's efficacy:
- Performance on Small Datasets: The model achieves 91.3% and 62.6% top-1 accuracy on UCF101 and HMDB51, respectively, two datasets with far fewer samples than large-scale benchmarks such as Kinetics-400.
- Transfer Learning: VideoMAE's transferability is validated by impressive performance when pre-trained on Kinetics-400 and fine-tuned on UCF101 (96.1%) and HMDB51 (73.3%).
- Action Detection Task: On the AVA action detection benchmark, VideoMAE reaches up to 39.5 mAP with a ViT-Huge model pre-trained on Kinetics-400, further underlining its generalization ability.
Implications and Future Directions
The research carries both theoretical and practical implications. Theoretically, VideoMAE shows that exploiting the temporal redundancy and correlation in video enables efficient self-supervised learning even with transformer architectures that traditionally rely on large-scale data. Practically, the findings support applying video transformers in domains with limited data, potentially democratizing high-performance video analysis.
Future research could explore several avenues to extend the findings of this paper:
- Larger Models and Datasets: Scaling up to larger models (e.g., ViT-G) and incorporating larger datasets could further enhance the representational capacity and performance of VideoMAE.
- Multimodal Integration: Incorporating additional modalities such as audio or text alongside video could enrich the self-supervised learning signal and improve performance on tasks requiring multimodal understanding.
- Enhanced Masking Strategies: Investigating alternate or adaptive masking strategies tailored for various forms of video content could further optimize model performance and generalizability.
Conclusion
In summary, the presented VideoMAE approach exemplifies a data-efficient strategy for pre-training video transformers via self-supervised learning, leveraging high masking ratios and tube masking to exploit temporal redundancies and correlations effectively. This research opens new pathways for training robust video analysis models with limited data, fostering advancements in both academia and industry applications.