- The paper presents the TEA block that integrates motion excitation and multiple temporal aggregation to capture both short- and long-range temporal dynamics in video data.
- The motion excitation module leverages feature-level temporal differences to enhance motion-sensitive channels without relying on expensive optical flow computations.
- The multiple temporal aggregation module employs a hierarchical residual framework to enlarge the temporal receptive field, and the resulting TEA network achieves state-of-the-art performance on several action recognition benchmarks.
Temporal Excitation and Aggregation for Action Recognition
In this paper, the authors present the Temporal Excitation and Aggregation (TEA) block, a novel architectural component for video action recognition that captures both short- and long-range temporal dynamics. The proposed block consists of two primary modules: the Motion Excitation (ME) module and the Multiple Temporal Aggregation (MTA) module. The efficacy of the TEA block is validated on multiple action recognition benchmarks, including Kinetics, Something-Something, HMDB51, and UCF101, achieving competitive accuracy at low computational cost.
Motion Excitation Module
The ME module addresses short-range motion dynamics by integrating motion modeling directly into the spatiotemporal feature learning pipeline. Traditional approaches to action recognition have relied on hand-crafted motion descriptors such as optical flow, which increase computational cost and keep motion modeling separate from spatial feature learning. In contrast, the ME module computes feature-level temporal differences between adjacent frames to identify motion-sensitive channels in the feature maps. The differences are spatially pooled and transformed into channel-wise modulation weights that excite the motion-relevant channels, while a residual connection ensures that background scene information is preserved.
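As a concrete illustration, below is a minimal PyTorch-style sketch of such an excitation path, assuming an (N*T, C, H, W) tensor layout in which the T frames of each clip are stacked along the batch axis; the class name, the reduction ratio, and the depthwise 3x3 transform are illustrative choices rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    """Sketch of a motion-excitation path: feature-level temporal differences
    are turned into channel-wise modulation weights (interpretation of the
    paper's ME module, not its exact implementation)."""

    def __init__(self, channels, n_segments, reduction=16):
        super().__init__()
        self.n_segments = n_segments
        reduced = channels // reduction
        self.squeeze = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)
        # Depthwise 3x3 transform applied to the "next" frame before differencing.
        self.transform = nn.Conv2d(reduced, reduced, kernel_size=3, padding=1,
                                   groups=reduced, bias=False)
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                      # x: (N*T, C, H, W)
        nt, c, h, w = x.shape
        t = self.n_segments
        n = nt // t
        r = self.squeeze(x).view(n, t, -1, h, w)           # (N, T, C/r, H, W)

        # Feature-level temporal differences between adjacent frames.
        nxt = self.transform(r[:, 1:].reshape(-1, r.size(2), h, w))
        nxt = nxt.view(n, t - 1, -1, h, w)
        diff = nxt - r[:, :-1]                              # (N, T-1, C/r, H, W)
        diff = torch.cat([diff, diff.new_zeros(n, 1, r.size(2), h, w)], dim=1)

        # Spatial pooling -> channel-wise modulation weights.
        pooled = diff.mean(dim=[3, 4], keepdim=True)        # (N, T, C/r, 1, 1)
        attn = self.sigmoid(self.expand(pooled.view(nt, -1, 1, 1)))

        # Residual excitation keeps background information intact.
        return x + x * attn
```

The residual formulation `x + x * attn` is what preserves background channels even when their modulation weights are small, matching the paper's motivation for not suppressing scene information.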
Multiple Temporal Aggregation Module
The MTA module enhances long-range temporal modeling by performing multiple stages of temporal aggregation with negligible additional computation. Instead of employing deep stacks of local convolutions to approximate long-range dependencies, the MTA module reformulates the standard convolution as a group of sub-convolutions arranged in a hierarchical residual framework. This arrangement enlarges the equivalent temporal receptive field, allowing the model to capture interactions between distant frames more effectively. The hierarchical structure also eases optimization by providing shorter paths for gradient backpropagation.
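The following PyTorch-style sketch illustrates the hierarchical residual aggregation under the same (N*T, C, H, W) layout assumption; the split into four channel fragments and the (2+1)D sub-convolutions (a depthwise temporal convolution followed by a spatial convolution) are assumptions made for illustration, not the paper's exact design choices.

```python
import torch
import torch.nn as nn

class MultipleTemporalAggregation(nn.Module):
    """Sketch of hierarchical residual temporal aggregation: channels are split
    into fragments processed by cascaded sub-convolutions so that later
    fragments see a progressively larger temporal receptive field
    (interpretation of the paper's MTA module)."""

    def __init__(self, channels, n_segments, n_fragments=4):
        super().__init__()
        assert channels % n_fragments == 0
        self.n_segments = n_segments
        self.n_fragments = n_fragments
        width = channels // n_fragments
        # One local (2+1)D sub-convolution per fragment except the first.
        self.temporal = nn.ModuleList([
            nn.Conv1d(width, width, kernel_size=3, padding=1,
                      groups=width, bias=False)
            for _ in range(n_fragments - 1)])
        self.spatial = nn.ModuleList([
            nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
            for _ in range(n_fragments - 1)])

    def _sub_conv(self, frag, i):              # frag: (N*T, C/4, H, W)
        nt, c, h, w = frag.shape
        t = self.n_segments
        n = nt // t
        # Depthwise 1D convolution along the temporal axis.
        y = frag.reshape(n, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(-1, c, t)
        y = self.temporal[i](y)
        y = y.reshape(n, h, w, c, t).permute(0, 4, 3, 1, 2).reshape(nt, c, h, w)
        # Followed by a 2D spatial convolution.
        return self.spatial[i](y)

    def forward(self, x):                      # x: (N*T, C, H, W)
        frags = torch.chunk(x, self.n_fragments, dim=1)
        outs = [frags[0]]                      # first fragment: identity path
        prev = None
        for i in range(1, self.n_fragments):
            # Hierarchical residual: each fragment also sees the previous output.
            inp = frags[i] if prev is None else frags[i] + prev
            prev = self._sub_conv(inp, i - 1)
            outs.append(prev)
        return torch.cat(outs, dim=1)
```

Because each sub-convolution also consumes the output of the previous fragment, the last fragment effectively passes through three stacked temporal convolutions, which is how the block enlarges the receptive field without adding extra layers.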
Comprehensive Evaluation and Implications
Experimentally, the TEA network is shown to outperform several state-of-the-art techniques on action recognition datasets while remaining efficient, with only a moderate increase in FLOPs over standard 2D ResNet baselines. For instance, on the Something-Something V1 dataset, the TEA block yields a significant accuracy gain over traditional 2D CNNs without incurring the heavy computational cost of 3D convolutions.
From a theoretical standpoint, the TEA block provides a compelling framework that integrates motion dynamics directly into network architectures, eschewing the traditional divide between spatial and temporal feature modeling. Practically, this translates to more resilient models capable of understanding complex temporal structures in video data while maintaining computational efficiency, critical for real-time applications such as surveillance or autonomous driving.
Future Directions
The promising results achieved by the TEA block imply several potential future research directions. One avenue is the exploration of its integration with transformer-based architectures, which have recently shown efficacy in various computer vision tasks due to their capacity for capturing long-range dependencies. Another promising direction is adapting the TEA block for multi-modal video tasks, such as combining audio and text-based data streams, to further enrich action understanding in diverse application scenarios.
In summary, the TEA block presents a sophisticated and computationally efficient solution for enhancing temporal modeling in video action recognition, bridging existing gaps between spatial and temporal feature integration. Its innovative design and empirical success underscore its potential as a foundational module in future video recognition systems.