UniFormerV2: Enhancing Spatiotemporal Learning with Video-Aware Vision Transformers
The paper introduces UniFormerV2, a notable advance in video understanding that builds a capable family of video networks by equipping pretrained Vision Transformers (ViTs) with efficient spatiotemporal components. The approach combines the robustness and generalization of ViTs pretrained on large image datasets with the video-oriented architectural elements of the UniFormer framework.
Core Contributions
UniFormerV2 addresses several limitations of earlier video models. Vision Transformers capture long-range dependencies well through self-attention, but they handle the heavy local redundancy of video data inefficiently. The original UniFormer alleviated this by unifying convolution-style local aggregation with global self-attention, yet its architecture differs from a standard ViT, so it could not reuse openly available pretrained ViT weights and required its own costly image pretraining before transferring to video, which limited its practicality. UniFormerV2 mitigates these issues with a design that integrates local and global relation aggregators directly into pretrained ViTs, yielding superior performance on multiple video benchmarks.
Numerical Performance on Benchmarks
UniFormerV2 reports state-of-the-art results across eight standard benchmarks. On Kinetics-400, it reaches 90% top-1 accuracy, reported as the first model to hit that mark on this dataset. It also performs strongly on Kinetics-600/700, Moments in Time, Something-Something V1/V2, ActivityNet, and HACS, while maintaining a favorable trade-off between accuracy and floating-point operations (FLOPs) rather than relying on excessive computational cost.
Methodological Innovations
UniFormerV2 builds local and global UniBlocks on top of the pretrained ViT and fuses their outputs across multiple stages, enabling the model to learn both fine-grained and holistic spatiotemporal representations. The local UniBlock inserts a local temporal Multi-Head Relation Aggregator (MHRA) before the standard ViT block, cheaply reducing temporal redundancy while reusing the pretrained spatial attention unchanged. The global UniBlock performs full spatiotemporal modeling through cross-attention with a learnable query, aggregating the tokens of all frames into a condensed video representation at a cost that scales linearly with the number of tokens. Multi-stage fusion then combines the video tokens produced at different stages, letting the model handle large spatiotemporal inputs with modest computational overhead.
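The sketch below is a minimal PyTorch illustration of the two block types described above, not the authors' implementation: a local UniBlock that prepends a depthwise temporal aggregator to a pretrained ViT block, and a global UniBlock that cross-attends from a learnable query to pool all spatiotemporal tokens. Class names, the (batch, frames, patches, channels) tensor layout, and the depthwise-convolution approximation of the local temporal MHRA are assumptions made for illustration.

```python
# Illustrative sketch of UniFormerV2-style blocks (not the official code).
# Assumed token layout: (batch B, frames T, patches N, channels C).
import torch
import torch.nn as nn


class LocalTemporalMHRA(nn.Module):
    """Local temporal relation aggregator, approximated here by a
    depthwise 1-D convolution over the frame axis with a residual."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.temporal_conv = nn.Conv1d(
            dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)

    def forward(self, x):                      # x: (B, T, N, C)
        b, t, n, c = x.shape
        y = self.norm(x)
        y = y.permute(0, 2, 3, 1).reshape(b * n, c, t)   # fold patches into batch
        y = self.temporal_conv(y)                          # mix along time only
        y = y.reshape(b, n, c, t).permute(0, 3, 1, 2)
        return x + y


class LocalUniBlock(nn.Module):
    """Local temporal MHRA inserted before a pretrained ViT block,
    which is reused as-is for per-frame spatial modeling."""
    def __init__(self, vit_block, dim):
        super().__init__()
        self.temporal = LocalTemporalMHRA(dim)
        self.vit_block = vit_block              # pretrained spatial attention + FFN

    def forward(self, x):                       # x: (B, T, N, C)
        x = self.temporal(x)
        b, t, n, c = x.shape
        x = self.vit_block(x.reshape(b * t, n, c))
        return x.reshape(b, t, n, c)


class GlobalUniBlock(nn.Module):
    """Cross-attention from a single learnable query over all
    spatiotemporal tokens, yielding one condensed video token."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
            nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                       # x: (B, T, N, C)
        b, t, n, c = x.shape
        kv = self.norm_kv(x.reshape(b, t * n, c))
        q = self.norm_q(self.query.expand(b, -1, -1))
        video_token, _ = self.cross_attn(q, kv, kv)   # (B, 1, C)
        return video_token + self.ffn(video_token)
```

In the full model, the condensed video tokens produced by global UniBlocks at several stages are fused with the final class token to form the prediction; that multi-stage fusion step is omitted here for brevity.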
Broader Implications
UniFormerV2's architecture reflects a forward-thinking approach in AI, particularly in video-processing efficiency. By marrying existing robust ViT models with new efficient video-specific designs, it sets a precedent for future research on scalable video understanding systems. The framework's design principles could generalize to other domains where data redundancy presents computational challenges, such as real-time video streaming or extensive surveillance systems.
Speculations on Future Work
Future research could explore scaling UniFormerV2 with larger and more diverse datasets or integrate multi-modal data for enhanced context comprehension. Moreover, its modular design invites exploration of other neural architectures that can further optimize or specialize subcomponents for diverse applications, such as anomaly detection or autonomous vehicle navigation.
In summary, UniFormerV2 leverages the strengths of pretrained ViTs and efficient video-centric architectural designs to substantially advance the capabilities of spatiotemporal video representation learning. It stands as a pivotal development in video analytics, offering a scalable, performant framework for both contemporary and future challenges in the field.