An Insightful Overview of "VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking"
The paper "VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking" proposes an innovative approach for scaling video masked autoencoders (VideoMAE) aimed at building effective video foundation models. This work is significant due to its emphasis on both computational efficiency and model scalability, introducing a dual masking strategy designed to enhance pre-training efficacy on large-scale video data.
Core Contributions
The authors extend the original VideoMAE with a dual masking scheme: masking is applied to both the encoder and the decoder, reducing computational cost and memory consumption without compromising performance. The encoder operates only on the small subset of visible video tokens, while the decoder reconstructs just a subset of the masked tokens rather than the full video, which balances the workload between the two components.
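The following is a minimal PyTorch-style sketch of one dual-masked training step, assuming a generic ViT encoder and a lightweight decoder; the module signatures, masking ratios, and the use of token embeddings (rather than normalized pixel patches) as reconstruction targets are simplifications for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dual_masked_forward(encoder: nn.Module,
                        decoder: nn.Module,
                        mask_token: torch.Tensor,      # learnable (1, 1, C) [MASK] embedding
                        tokens: torch.Tensor,          # (B, N, C) video patch tokens
                        encoder_mask: torch.Tensor,    # (B, N) bool, True = visible to the encoder
                        decoder_mask: torch.Tensor):   # (B, N) bool, True = reconstructed by the decoder
    """Sketch of one dual-masked training step (illustrative, not the authors' code)."""
    B, N, C = tokens.shape

    # The encoder runs only on the small visible subset (tube masking keeps
    # roughly 10% of tokens), which is where most of the compute saving lives.
    visible = tokens[encoder_mask].reshape(B, -1, C)
    latent = encoder(visible)

    # The decoder is masked as well: it reconstructs only the tokens picked by
    # the decoder mask (running cell masking) instead of every masked token,
    # cutting decoder FLOPs and activation memory.
    num_rec = int(decoder_mask[0].sum())
    dec_in = torch.cat([latent, mask_token.expand(B, num_rec, -1)], dim=1)
    pred = decoder(dec_in)[:, -num_rec:]               # predictions at reconstructed positions

    # Reconstruction loss on the decoder-selected subset only. (In the actual
    # method the targets are pixel patches; token embeddings are used here
    # purely to keep the sketch short.)
    target = tokens[decoder_mask].reshape(B, -1, C)
    return F.mse_loss(pred, target)
```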
Technical Innovations
Key technical contributions include:
- Dual Masking Strategy: By masking the inputs to both the encoder and the decoder, the authors cut computational overhead: tube masking is used for the encoder and running cell masking for the decoder. This design substantially reduces memory consumption during pre-training (see the masking sketch after this list).
- Progressive Training Paradigm: The research introduces a progressive training scheme: the model is first pre-trained with masked autoencoding on a diverse, multi-source unlabeled video dataset, then post-pre-trained on a mixed labeled dataset before task-specific fine-tuning, which improves its adaptability to varied downstream tasks (a schematic of the schedule follows this list).
- Scalability: The paper scales VideoMAE to a billion-parameter model (ViT-g), which the authors report as a first for video transformers. Scaling both the model and the data yields strong performance across diverse video understanding benchmarks.
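To make the two masking patterns concrete, here is an illustrative sketch of how they could be generated; the cell size, the specific alternating pattern in the running cell mask, and the default 90% encoder masking ratio are assumptions for illustration rather than the paper's exact specification.

```python
import torch

def tube_encoder_mask(T: int, H: int, W: int, mask_ratio: float = 0.9) -> torch.Tensor:
    """Tube masking for the encoder: one random spatial pattern shared by all
    T frames, so a masked patch stays hidden across time.
    Returns a (T, H, W) bool grid, True = visible to the encoder."""
    num_patches = H * W
    num_visible = num_patches - int(mask_ratio * num_patches)
    visible = torch.zeros(num_patches, dtype=torch.bool)
    visible[torch.randperm(num_patches)[:num_visible]] = True
    return visible.reshape(1, H, W).expand(T, H, W).clone()


def running_cell_decoder_mask(T: int, H: int, W: int, cell: int = 2) -> torch.Tensor:
    """Running cell masking for the decoder: within each cell x cell spatial
    block, the reconstructed positions cycle ("run") from frame to frame, so
    every location is eventually reconstructed while only a fraction is
    decoded per frame. The diagonal pattern below is an assumed example.
    Returns a (T, H, W) bool grid, True = reconstructed by the decoder."""
    mask = torch.zeros(T, H, W, dtype=torch.bool)
    for t in range(T):
        for i in range(H):
            for j in range(W):
                mask[t, i, j] = (i % cell + j % cell + t) % cell == 0
    return mask
```

Flattening these (T, H, W) grids to per-token masks and stacking them over the batch gives the `encoder_mask` and `decoder_mask` tensors used in the forward-pass sketch above. (In practice the decoder set would also exclude the tokens the encoder already sees.)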
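For orientation, the progressive pipeline can also be summarized in a config-style sketch; the stage ordering follows the paper's description, while the dataset descriptions and field names are illustrative placeholders.

```python
# Config-style summary of the progressive training pipeline; dataset
# descriptions and field names are placeholders, not values from the paper.
PROGRESSIVE_SCHEDULE = [
    {"stage": "pre-training",
     "objective": "masked reconstruction (dual masking)",
     "data": "multi-source unlabeled videos"},
    {"stage": "post-pre-training",
     "objective": "supervised classification",
     "data": "mixed labeled videos"},
    {"stage": "fine-tuning",
     "objective": "supervised classification",
     "data": "target benchmark (e.g. Kinetics-400)"},
]
```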
Numerical Results and Implications
The authors present strong numerical evidence of state-of-the-art performance on several datasets. For instance, VideoMAE V2 reaches 90.0% top-1 accuracy on Kinetics-400 and 77.0% on Something-Something V2, a clear improvement over previous models.
These results have both practical and theoretical implications. Practically, the proposed framework makes model training feasible at scales previously considered prohibitive. Theoretically, it opens new avenues in scalable architecture design for video models, especially around learning from masked video data.
Future Directions
The work sets a precedent for future video foundation models by demonstrating an efficient path to larger data and model sizes. Given current computational constraints, future research could explore training techniques or architectures that shrink the computational footprint even further. Extending the approach to multimodal tasks could also improve cross-modal learning efficiency.
In conclusion, the paper by Wang et al. makes significant strides in the field of video masked autoencoders. By refining the VideoMAE framework with dual masking and progressive training strategies, they pave the way for future explorations in scalable video representation learning. This contribution is poised to influence both academic research and practical applications, especially in areas requiring robust video analysis and understanding.