An Overview of BEVT: BERT Pretraining of Video Transformers
The paper, "BEVT: BERT Pretraining of Video Transformers," introduces a novel approach for pretraining video transformers using a BERT-inspired framework. The central innovation of BEVT is its decoupled design, which separately focuses on spatial representation learning and temporal dynamics learning for video data. This strategy effectively leverages the success of BERT pretraining from NLP and image transformers, enhancing video understanding tasks.
Key Insights and Methodology
BEVT builds on the observation that spatial priors learned from image datasets can ease the training of computationally intensive video transformers, which have traditionally relied on large-scale datasets when trained from scratch. Moreover, the spatial and temporal information required for accurate video recognition varies significantly across different videos due to large intra-class and inter-class variations.
The BEVT framework operates with a two-stream design:
- Image Stream: This stream applies masked image modeling (MIM) to images, predicting the latent codes of masked patches produced by a pretrained visual tokenizer. It thereby learns spatial representations that are especially useful for recognizing videos dominated by static content.
- Video Stream: Initialized with the spatial priors from image pretraining, this stream performs masked video modeling (MVM) on video clips, trained jointly with continued masked image modeling, to capture the temporal dynamics needed to understand motion across frames (a minimal sketch of this two-stream layout follows the list).
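The snippet below is a minimal, illustrative sketch of this two-stream layout in PyTorch. It is not the paper's implementation: BEVT uses a Video Swin backbone and a pretrained visual tokenizer, whereas here a vanilla transformer encoder stands in for the shared backbone, and the class name, patch/tube sizes, and `vocab_size` are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class TwoStreamBEVTSketch(nn.Module):
    """Illustrative two-stream masked-modeling model (not the paper's code).

    Both streams share one transformer encoder; only the input embeddings
    and prediction heads are stream-specific. Predictions are made over the
    vocabulary of a frozen visual tokenizer (e.g., a discrete VAE).
    Positional embeddings are omitted for brevity.
    """

    def __init__(self, dim: int = 256, vocab_size: int = 8192):
        super().__init__()
        # Shared encoder (stand-in for the Video Swin transformer used in BEVT).
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Stream-specific inputs: 2D image patches vs. 3D video tubes.
        self.patch_embed_2d = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.patch_embed_3d = nn.Conv3d(3, dim, kernel_size=(2, 16, 16), stride=(2, 16, 16))
        # Stream-specific heads predicting visual-tokenizer codes.
        self.mim_head = nn.Linear(dim, vocab_size)
        self.mvm_head = nn.Linear(dim, vocab_size)
        # Learnable embedding that replaces masked patches/tubes.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def _encode(self, tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # mask: (B, N) boolean, True where the input position is masked.
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        return self.encoder(tokens)

    def image_stream(self, images: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) -> patch tokens (B, N, dim) -> code logits (B, N, vocab)
        x = self.patch_embed_2d(images).flatten(2).transpose(1, 2)
        return self.mim_head(self._encode(x, mask))

    def video_stream(self, clips: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # clips: (B, 3, T, H, W) -> tube tokens (B, N, dim) -> code logits (B, N, vocab)
        x = self.patch_embed_3d(clips).flatten(2).transpose(1, 2)
        return self.mvm_head(self._encode(x, mask))
```

Sharing the encoder while keeping the embeddings and heads separate is what lets spatial knowledge acquired from images transfer directly into the video stream.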
The training process begins by pretraining the image stream on ImageNet to acquire spatial features. The resulting weights then initialize the video stream, enabling efficient and effective learning of video representations. The two streams share model weights for most layers, excluding stream-specific components, which balances the extraction of spatial and temporal cues.
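This staged schedule can be summarized as one masked-prediction objective applied in two phases. The sketch below is a hedged illustration under BEiT-style assumptions (cross-entropy against frozen-tokenizer codes at masked positions); the function name and the loss weight `lambda_video` are hypothetical, and the exact balancing used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def masked_code_loss(logits: torch.Tensor, codes: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against visual-tokenizer codes, at masked positions only.

    logits: (B, N, vocab) stream predictions; codes: (B, N) frozen-tokenizer codes;
    mask:   (B, N) boolean, True where the patch/tube was masked.
    """
    return F.cross_entropy(logits[mask], codes[mask])

# Stage 1 (image pretraining): only the image stream is optimized.
#   loss = masked_code_loss(mim_logits, image_codes, image_mask)
#
# Stage 2 (joint pretraining on images and videos, encoder weights shared):
#   loss = masked_code_loss(mim_logits, image_codes, image_mask) \
#        + lambda_video * masked_code_loss(mvm_logits, video_codes, video_mask)
```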
Experimental Validation and Results
BEVT was empirically tested on three challenging video recognition datasets: Kinetics-400, Something-Something V2, and Diving-48. The experiments demonstrate its efficacy across diverse scenarios:
- On the Kinetics-400 dataset, BEVT matched state-of-the-art supervised baselines with 81.1% Top-1 accuracy, indicating that spatial priors largely suffice for datasets that depend mainly on spatial information.
- For the Something-Something V2 and Diving-48 datasets, which demand greater temporal comprehension, BEVT outperformed alternative approaches with 71.4% and 87.2% Top-1 accuracy, respectively.
The paper highlights BEVT's versatility, showing that it learns robustly from both images and videos and that addressing spatial and temporal learning requirements together notably improves video understanding.
Implications and Future Directions
BEVT's decoupled learning paradigm for video transformers lays a strong foundation for forthcoming research, which could explore:
- Extending BEVT to multimodal datasets, incorporating textual and visual elements to enrich video understanding.
- Refinements in visual tokenization techniques, further bridging the gap between linguistic and visual pretraining frameworks.
- Reducing computational overhead while maintaining performance. Techniques such as dropping masked tokens from the encoder's computation (sketched below) might further streamline the pretraining process.
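As a concrete illustration of that last point, the sketch below shows the kind of speed-up popularized by masked autoencoders: the encoder processes only the visible tokens and skips masked ones entirely. This is not BEVT's implementation; the function name and the fixed-mask-count assumption are mine, and the snippet only illustrates the technique referred to above.

```python
import torch
import torch.nn as nn

def encode_visible_only(encoder: nn.Module, tokens: torch.Tensor, mask: torch.Tensor):
    """Run the encoder on unmasked tokens only (a MAE-style efficiency trick).

    tokens: (B, N, D) patch/tube embeddings; mask: (B, N) boolean, True = masked.
    Assumes every sample in the batch masks the same number of positions.
    """
    B, _, D = tokens.shape
    keep = ~mask
    n_visible = int(keep[0].sum())
    # Indices of the visible positions, reshaped to (B, N_visible).
    idx = keep.nonzero(as_tuple=False)[:, 1].view(B, n_visible)
    visible = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    return encoder(visible), idx  # features for visible tokens + their original positions

# With a 75% mask ratio, the encoder sees only a quarter of the tokens.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)
tokens = torch.randn(2, 196, 256)
mask = torch.zeros(2, 196, dtype=torch.bool)
mask[:, :147] = True                      # mask the same 147 of 196 positions per sample
features, positions = encode_visible_only(encoder, tokens, mask)
print(features.shape)                     # torch.Size([2, 49, 256])
```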
In conclusion, BEVT represents a significant stride in bringing NLP's BERT objectives to video transformer pretraining. By decoupling the learning of spatial and temporal dimensions, it makes the learning of complex video representations both more efficient and more effective. Future work will likely refine and extend BEVT's capabilities, driving further advances in video understanding.