Overview of TEINet: Towards an Efficient Architecture for Video Recognition
The paper "TEINet: Towards an Efficient Architecture for Video Recognition" presents a novel approach to enhancing the efficacy and efficiency of video action recognition models. The proposed architecture, TEINet, is designed to leverage the strengths of 2D CNNs while incorporating an innovative temporal enhancement-and-interaction mechanism to address the inefficiencies associated with 3D convolutional networks.
Key Contributions
This research introduces the Temporal Enhancement-and-Interaction (TEI) module, a flexible temporal modeling component that can be inserted into existing 2D CNNs. The TEI module consists of two sequential parts:
- Motion Enhanced Module (MEM): MEM emphasizes motion-related features by using temporal differences between adjacent frames to generate channel-level attention weights. These weights amplify salient motion cues while suppressing static background, sharpening the model's focus on moving objects or subjects within the video frames.
- Temporal Interaction Module (TIM): After motion enhancement, TIM captures local temporal interactions with a channel-wise (depthwise) temporal convolution. This module supplies the temporal evolution and contextual information across consecutive frames that effective video action recognition requires; a sketch of both parts follows this list.
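To make the two parts concrete, here is a minimal PyTorch sketch of a TEI-style block. It assumes features of shape (N, T, C, H, W); the class name `TEIModule`, the `reduction` ratio, and the exact placement of the sigmoid are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class TEIModule(nn.Module):
    """Sketch of a TEI block: MEM (motion-difference channel attention)
    followed by TIM (channel-wise temporal convolution).
    Input and output shape: (N, T, C, H, W)."""

    def __init__(self, channels: int, reduction: int = 8):  # reduction ratio is an assumed hyperparameter
        super().__init__()
        mid = channels // reduction
        # MEM: squeeze channels, difference adjacent frames, restore channels
        self.reduce = nn.Conv1d(channels, mid, kernel_size=1)
        self.expand = nn.Conv1d(mid, channels, kernel_size=1)
        # TIM: depthwise temporal conv (groups=channels), so each channel
        # mixes information only across its own neighbouring frames
        self.tim = nn.Conv1d(channels, channels, kernel_size=3,
                             padding=1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, t, c, h, w = x.shape
        # ---- MEM: enhance motion-salient channels ----
        pooled = x.mean(dim=(3, 4)).transpose(1, 2)    # spatial GAP -> (N, C, T)
        z = self.reduce(pooled)                        # (N, C/r, T)
        diff = torch.zeros_like(z)
        diff[:, :, :-1] = z[:, :, 1:] - z[:, :, :-1]   # frame-to-frame differences
        attn = torch.sigmoid(self.expand(diff))        # channel attention in (0, 1)
        x = x * attn.transpose(1, 2).reshape(n, t, c, 1, 1)
        # ---- TIM: local temporal interaction per channel ----
        y = x.permute(0, 3, 4, 2, 1).reshape(n * h * w, c, t)
        y = self.tim(y)
        return y.reshape(n, h, w, c, t).permute(0, 4, 3, 1, 2)


# Quick shape check: 2 clips of 8 frames with 64-channel feature maps.
feats = torch.randn(2, 8, 64, 14, 14)
print(TEIModule(64)(feats).shape)  # torch.Size([2, 8, 64, 14, 14])
```

The decoupled layout mirrors the paper's design intent: MEM only rescales channels (a cheap, attention-style operation), while TIM is the sole place where temporal mixing happens, and its depthwise form keeps parameter and FLOP cost far below a full 3D convolution.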
Experimental Evaluation
TEINet is evaluated on several benchmarks: Something-Something V1 and V2, Kinetics, UCF101, and HMDB51. The paper shows that TEINet achieves higher action recognition accuracy than existing methods such as TSM (Temporal Shift Module) and I3D (Inflated 3D ConvNet) while keeping computation low. Notably, on the Something-Something datasets, where motion matters more than static appearance, TEINet delivers state-of-the-art performance.
Further, the research explores inserting the TEI module at different stages of a ResNet-50 backbone, offering insight into the trade-off between computational cost and accuracy. The ablation indicates that placing TEI modules in the later stages of ResNet-50 provides a notable performance boost with minimal extra computational overhead; one plausible wiring is sketched below.
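As an illustration of that placement, here is one plausible way to wire the `TEIModule` sketched above into torchvision's ResNet-50. The `TEI2D` adapter, the `add_tei_to_stage` helper, and the choice of `layer3`/`layer4` (torchvision's names for the later stages) are assumptions for demonstration, not the paper's released integration code. Frames are stacked along the batch axis, the common 2D-CNN convention for video.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50
# TEIModule: the class defined in the earlier sketch is assumed in scope.


class TEI2D(nn.Module):
    """Adapter so TEIModule fits a 2D ResNet: frames arrive stacked on
    the batch axis as (N*T, C, H, W), are unfolded to (N, T, C, H, W)
    for the TEI block, then folded back."""

    def __init__(self, channels: int, n_frames: int):
        super().__init__()
        self.n_frames = n_frames
        self.tei = TEIModule(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nt, c, h, w = x.shape
        x = x.reshape(nt // self.n_frames, self.n_frames, c, h, w)
        return self.tei(x).reshape(nt, c, h, w)


def add_tei_to_stage(stage: nn.Sequential, n_frames: int) -> None:
    """Prepend a TEI block to every residual block in one ResNet stage;
    the channel width is read off each block's first convolution."""
    for i, block in enumerate(list(stage)):
        stage[i] = nn.Sequential(TEI2D(block.conv1.in_channels, n_frames), block)


# Insert TEI only into the two later stages, where feature maps are
# small, so the added temporal modeling costs little extra compute.
net = resnet50(weights=None)
for stage in (net.layer3, net.layer4):
    add_tei_to_stage(stage, n_frames=8)

frames = torch.randn(2 * 8, 3, 224, 224)  # 2 clips x 8 frames each
print(net(frames).shape)                  # per-frame logits: (16, 1000)
```

In practice the per-frame logits would still be averaged over time (TSN-style consensus) before the loss, but the snippet is enough to show why later-stage insertion is cheap: at `layer3`/`layer4` resolution the depthwise temporal convolutions run on 28x28 and smaller feature maps.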
Theoretical and Practical Implications
The implications of this work are twofold. Practically, TEINet supports efficient deployment in real-world applications requiring video processing with limited computational resources, such as mobile and embedded devices. Theoretically, the decoupled design of the TEI module serves as a modular paradigm that can be extended or refined for other temporal information processing tasks, providing a foundation for further research in efficient video understanding systems.
Future Directions
Future research may focus on expanding the application of TEINet to other domains of video analytics, such as temporal sequence prediction and anomaly detection. Additionally, exploring the integration of TEINet with emerging neural architectures could yield further improvements in both performance and efficiency, potentially guiding the development of next-generation video recognition systems.
TEINet illustrates a successful fusion of 2D CNN efficiency with the dynamic temporal modeling needed for video understanding, showcasing a significant step forward in the field of video action recognition.