UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning (2201.04676v3)

Published 12 Jan 2022 in cs.CV

Abstract: It is a challenging task to learn rich and multi-scale spatiotemporal semantics from high-dimensional videos, due to large local redundancy and complex global dependency between video frames. The recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers. Although 3D convolution can efficiently aggregate local context to suppress local redundancy from a small 3D neighborhood, it lacks the capability to capture global dependency because of the limited receptive field. Alternatively, vision transformers can effectively capture long-range dependency by self-attention mechanism, while having the limitation on reducing local redundancy with blind similarity comparison among all the tokens in each layer. Based on these observations, we propose a novel Unified transFormer (UniFormer) which seamlessly integrates merits of 3D convolution and spatiotemporal self-attention in a concise transformer format, and achieves a preferable balance between computation and accuracy. Different from traditional transformers, our relation aggregator can tackle both spatiotemporal redundancy and dependency, by learning local and global token affinity respectively in shallow and deep layers. We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2. With only ImageNet-1K pretraining, our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other state-of-the-art methods. For Something-Something V1 and V2, our UniFormer achieves new state-of-the-art performances of 60.9% and 71.2% top-1 accuracy respectively. Code is available at https://github.com/Sense-X/UniFormer.

Overview of UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning

The paper introduces UniFormer, a framework designed to advance spatiotemporal representation learning for video understanding. The work addresses the two key challenges inherent in video data, local redundancy and complex global dependency, by unifying 3D convolution and spatiotemporal self-attention within a single transformer-style architecture.

Key Contributions

  1. Integration of 3D Convolution and Transformers: UniFormer leverages the complementary strengths of 3D convolution, which efficiently suppresses local redundancy within small neighborhoods, and self-attention, which captures long-range global dependencies. This combination aims to improve both efficiency and accuracy in video processing.
  2. Dynamic Position Embedding (DPE) and Multi-Head Relation Aggregator (MHRA): The UniFormer architecture comprises three key components:
    • DPE: Dynamically encodes spatiotemporal position information into the tokens, preserving temporal and spatial order across the video.
    • MHRA: A generalization of self-attention that learns token affinity. It uses local affinity within a small 3D neighborhood in shallow layers to suppress redundancy, and global affinity across all tokens in deeper layers to capture long-range dependencies.
  3. Hierarchical Structure: The paper proposes a hierarchy wherein local MHRA is used in the initial stages to save computation, while deeper stages use global MHRA to learn long-range token relationships (a minimal code sketch of this block design follows the list).
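
At a high level, every UniFormer block applies DPE, then MHRA, then a feed-forward network, each with a residual connection. Below is a minimal PyTorch sketch of that structure, not the authors' released implementation: the tensor layout (batch, channels, T, H, W), kernel sizes, normalization choices, and module names are illustrative assumptions, with local MHRA treated as a depthwise 3D convolution over a small spatiotemporal neighborhood and global MHRA as standard self-attention over all tokens. The official code is available at the repository linked in the abstract.

```python
# Illustrative sketch of a UniFormer-style block; hyperparameters and module
# names are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn


class LocalMHRA(nn.Module):
    """Local relation aggregator: token affinity over a small 3D neighborhood,
    realized here as a depthwise 3D convolution (used in shallow layers)."""
    def __init__(self, dim, kernel=(3, 5, 5)):
        super().__init__()
        pad = tuple(k // 2 for k in kernel)
        self.value = nn.Conv3d(dim, dim, kernel_size=1)   # linear value projection
        self.affinity = nn.Conv3d(dim, dim, kernel, padding=pad, groups=dim)
        self.proj = nn.Conv3d(dim, dim, kernel_size=1)

    def forward(self, x):                                  # x: (B, C, T, H, W)
        return self.proj(self.affinity(self.value(x)))


class GlobalMHRA(nn.Module):
    """Global relation aggregator: spatiotemporal self-attention over all
    tokens (used in deep layers)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                  # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)              # (B, T*H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.transpose(1, 2).reshape(b, c, t, h, w)


class UniFormerBlock(nn.Module):
    """DPE -> MHRA -> FFN, each with a residual connection."""
    def __init__(self, dim, use_global=False):
        super().__init__()
        # Dynamic Position Embedding: depthwise 3D conv injects spatiotemporal order.
        self.dpe = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm1 = nn.BatchNorm3d(dim)
        self.mhra = GlobalMHRA(dim) if use_global else LocalMHRA(dim)
        self.norm2 = nn.BatchNorm3d(dim)
        self.ffn = nn.Sequential(
            nn.Conv3d(dim, dim * 4, 1), nn.GELU(), nn.Conv3d(dim * 4, dim, 1)
        )

    def forward(self, x):
        x = x + self.dpe(x)
        x = x + self.mhra(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x


# Usage: shallow stages use local MHRA, deeper stages switch to global MHRA.
video = torch.randn(2, 64, 8, 14, 14)                      # (B, C, T, H, W)
shallow = UniFormerBlock(64, use_global=False)
deep = UniFormerBlock(64, use_global=True)
out = deep(shallow(video))
print(out.shape)  # torch.Size([2, 64, 8, 14, 14])
```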

Experimental Validation

Extensive experiments conducted on benchmark datasets such as Kinetics-400, Kinetics-600, and Something-Something V1&V2 demonstrate UniFormer’s capabilities:

  • On Kinetics-400, UniFormer achieved 82.9% top-1 accuracy with only ImageNet-1K pretraining, while requiring roughly 10x fewer GFLOPs than comparable state-of-the-art models.
  • On Something-Something V1 and V2, it reached 60.9% and 71.2% top-1 accuracy respectively, setting new state-of-the-art results.

Implications

Practical Implications

The architecture’s reduced computational cost, roughly 10 times fewer GFLOPs than comparable state-of-the-art models, suggests its potential for real-world applications where computational resources and efficiency are pivotal.

Theoretical Implications

UniFormer contributes to the broader field of video understanding by proposing a framework that effectively unifies two traditionally separate methodologies — convolutional operations and transformer-based attention mechanisms. This integration could spur further research into combinatory models for other complex visual tasks.

Future Directions

The research prompts several avenues for future exploration:

  • Enhanced Efficiency: Continued efforts in optimizing MHRA for even greater efficiency could further expand UniFormer’s applicability in resource-constrained environments.
  • Broader Applications: Extending the framework’s application beyond video classification to tasks like video generation or real-time analytics could be fruitful.
  • Comparative Studies: Systematic comparisons of the spatiotemporal representations learned by UniFormer against newer video models would help clarify where its unified design retains an advantage.

In conclusion, UniFormer’s innovative approach effectively balances computational efficiency and representational power, marking a substantial contribution to the field of video understanding and offering a robust platform for future research and development. Its demonstration of superior performance on standard benchmarks strengthens its relevance and potential for broader application in AI tasks.

Authors (7)
  1. Yali Wang (78 papers)
  2. Peng Gao (401 papers)
  3. Guanglu Song (45 papers)
  4. Yu Liu (784 papers)
  5. Hongsheng Li (340 papers)
  6. Yu Qiao (563 papers)
  7. KunChang Li (43 papers)
Citations (207)