
Token Shift Transformer for Video Classification (2108.02432v1)

Published 5 Aug 2021 in cs.CV and cs.MM

Abstract: Transformer achieves remarkable successes in understanding 1 and 2-dimensional signals (e.g., NLP and Image Content Understanding). As a potential alternative to convolutional neural networks, it shares merits of strong interpretability, high discriminative power on hyper-scale data, and flexibility in processing varying length inputs. However, its encoders naturally contain computational intensive operations such as pair-wise self-attention, incurring heavy computational burden when being applied on the complex 3-dimensional video signals. This paper presents Token Shift Module (i.e., TokShift), a novel, zero-parameter, zero-FLOPs operator, for modeling temporal relations within each transformer encoder. Specifically, the TokShift barely temporally shifts partial [Class] token features back-and-forth across adjacent frames. Then, we densely plug the module into each encoder of a plain 2D vision transformer for learning 3D video representation. It is worth noticing that our TokShift transformer is a pure convolutional-free video transformer pilot with computational efficiency for video understanding. Experiments on standard benchmarks verify its robustness, effectiveness, and efficiency. Particularly, with input clips of 8/12 frames, the TokShift transformer achieves SOTA precision: 79.83%/80.40% on the Kinetics-400, 66.56% on EGTEA-Gaze+, and 96.80% on UCF-101 datasets, comparable or better than existing SOTA convolutional counterparts. Our code is open-sourced in: https://github.com/VideoNetworks/TokShift-Transformer.

Authors (3)
  1. Hao Zhang (948 papers)
  2. Yanbin Hao (31 papers)
  3. Chong-Wah Ngo (55 papers)
Citations (103)

Summary

  • The paper presents TokShift, a zero-parameter, zero-FLOPs module that shifts partial [Class] token features across adjacent frames to capture video dynamics.
  • Plugged into every transformer encoder, the shift adds temporal modeling without the cost of full spatio-temporal self-attention on high-dimensional video data.
  • The approach achieves competitive results on benchmarks such as Kinetics-400, reaching 79.83% precision with 8-frame inputs (80.40% with 12 frames).

Token Shift Transformer for Video Classification

The paper "Token Shift Transformer for Video Classification" introduces a novel approach to video classification using transformers effectively adapted for 3D video data. The paper presents the Token Shift Module (TokShift), a zero-parameter, zero-FLOPs operator, that enhances the ability of transformers to process the temporal dynamics inherent in video sequences without the computational burden typically imposed by self-attention mechanisms in standard transformer architectures.

Overview of the Approach

The authors begin by acknowledging the challenges in extending the success of transformers, which have been remarkable in NLP and image understanding, to complex, high-dimensional video signals. The standard transformer architecture relies on computationally intensive operations such as pair-wise self-attention, which scale poorly on video because a clip unrolls into far longer token sequences than a single image.
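To make that scaling concrete, the short sketch below (a back-of-the-envelope estimate of ours, not a figure from the paper) compares the cost of joint spatio-temporal self-attention with frame-wise attention; the frame count (8), patch count (196, i.e. a 224x224 frame split into 16x16 patches), and embedding dimension (768) are assumed ViT-B-style values.

```python
# Back-of-the-envelope cost of pairwise self-attention: for L tokens of width d,
# computing Q K^T and the attention-weighted values each take roughly L^2 * d
# multiply-adds, so the total scales as O(L^2 * d).
def attention_flops(num_tokens: int, dim: int) -> int:
    return 2 * num_tokens * num_tokens * dim

frames, patches, dim = 8, 196, 768   # assumed ViT-B-style clip: 8 frames, 14x14 patches, d=768

joint = attention_flops(frames * patches, dim)       # attention over all frames' tokens at once
per_frame = frames * attention_flops(patches, dim)   # 2D attention run independently per frame

print(f"joint / per-frame attention cost: ~{joint / per_frame:.0f}x")  # grows linearly with frames (8x here)
```

The gap widens with every added frame, which is why reusing per-frame 2D attention and handling time with a cheap shift is attractive.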

TokShift temporally shifts a fraction of each frame's [Class] token features back and forth across adjacent frames within every transformer encoder. Because only the temporal placement of these global features changes, the operation adds no parameters or FLOPs while still capturing the dynamic content critical to video analysis. The method thereby introduces temporal modeling purely within the transformer, without any convolutional processing.
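That description maps onto a very small tensor operation. The following is a minimal PyTorch-style sketch based only on the paper's description; the shift ratio, the even forward/backward split, and zero-padding at clip boundaries are our assumptions, and the authors' released code at https://github.com/VideoNetworks/TokShift-Transformer is authoritative.

```python
import torch

def tok_shift(cls_tokens: torch.Tensor, shift_ratio: float = 0.25) -> torch.Tensor:
    """Illustrative TokShift-style shift of [Class] token features across frames.

    cls_tokens: [B, T, C] -- the [Class] token of each of the T frames in a clip.
    A fraction of the channels is shifted to the next frame, an equal fraction to
    the previous frame, and the remaining channels are left untouched. The shift
    has no learnable parameters and costs no FLOPs (it is pure memory movement).
    """
    B, T, C = cls_tokens.shape
    fold = int(C * shift_ratio) // 2                    # channels shifted in each direction (assumed split)
    out = torch.zeros_like(cls_tokens)                  # boundaries are zero-padded (assumption)
    out[:, 1:, :fold] = cls_tokens[:, :-1, :fold]                   # forward shift: frame t gets features from t-1
    out[:, :-1, fold:2 * fold] = cls_tokens[:, 1:, fold:2 * fold]   # backward shift: frame t gets features from t+1
    out[:, :, 2 * fold:] = cls_tokens[:, :, 2 * fold:]              # remaining channels stay in place
    return out

# Tiny usage example: a batch of 2 clips, 8 frames each, ViT-B width 768.
y = tok_shift(torch.randn(2, 8, 768))
```

In the full model, a shift like this is densely plugged into every encoder block of a plain 2D vision transformer, so spatial self-attention still runs independently per frame while the [Class] tokens exchange information across time.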

Experimentation & Results

The paper provides empirical evidence of TokShift's efficacy on standard video datasets such as Kinetics-400, EGTEA-Gaze+, and UCF-101. The results demonstrate TokShift's robustness and efficiency, with accuracy comparable to or better than state-of-the-art convolutional models. Notably, the TokShift transformer reaches 79.83% precision on Kinetics-400 with 8-frame inputs (80.40% with 12 frames), showcasing the proposed model's competitiveness.

Implications and Future Directions

This paper carries significant implications regarding the future of video classification through transformers. The adoption of TokShift in transformers prompts further exploration into the complete bypass of convolutional operations in video understanding tasks. Moreover, the paper leaves room for future research focusing on computational optimizations specific to transformers dealing with long-length and high-dimensional video data, potentially impacting areas like real-time video analysis and interpretation.

Theoretically, the TokShift technique treats a video as a sequence of global frame-level representations, much as language transformers treat a sentence as a sequence of tokens. The insights offered by this approach could pave the way for new methodologies not only in video classification but also in other fields where temporal dynamics are essential.

In summary, TokShift's results resonate with the continual pursuit of efficient deep learning frameworks in multimedia analysis, extending the capabilities of 2D image transformers to dynamic video environments and marking a step forward in scalable digital content understanding.
