Temporal Difference Networks for Efficient Action Recognition
The paper "Temporal Difference Networks for Efficient Action Recognition" introduces a novel architecture, the Temporal Difference Network (TDN), aimed at improving action recognition from video through better temporal modeling. At its core is a temporal difference module (TDM) that captures both short-term and long-term motion information, improving the efficiency and accuracy of existing convolutional neural networks (CNNs) on video analysis tasks.
Core Contributions
The primary contribution of TDN is the integration of a temporal difference operator into a unified framework that enhances temporal motion modeling at minimal computational cost. TDN employs a two-level difference modeling paradigm:
- Short-Term Motion Modeling: Utilizes temporal differences over consecutive frames to enrich a 2D CNN with fine-grained motion patterns.
- Long-Term Motion Modeling: Integrates temporal differences across video segments to capture extended motion structures and improve motion feature excitation.
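The two-level paradigm above can be illustrated with plain array operations. This is a minimal sketch, not the paper's actual implementation: the function names, the raw-frame/feature-vector shapes, and the simple subtraction stand in for the CNN-integrated TDMs described in the paper.

```python
import numpy as np

def short_term_difference(frames):
    """Approximate short-term motion by differencing consecutive frames.

    frames: array of shape (T, H, W, C), a short clip of sampled frames.
    Returns (T-1, H, W, C) frame differences, a cheap motion cue that a
    2D CNN can consume alongside the RGB frames (illustrative stand-in
    for the paper's short-term TDM).
    """
    return frames[1:] - frames[:-1]

def long_term_difference(segment_features):
    """Approximate long-term motion as differences across video segments.

    segment_features: array of shape (S, D), one feature vector per segment.
    Returns (S-1, D) cross-segment differences, which the paper uses to
    excite motion-sensitive feature channels (again, a simplified sketch).
    """
    return segment_features[1:] - segment_features[:-1]
```

The key idea both levels share is that subtraction is far cheaper than 3D convolution or optical flow, yet still exposes where the signal changes over time.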
Strong Numerical Results
The paper demonstrates TDN's effectiveness on several benchmark datasets. Notably, TDN achieves state-of-the-art results on the Something-Something V1 and V2 datasets, which emphasize motion-centric action recognition, and matches leading performance on the Kinetics-400 dataset, which focuses on scene-based actions. The reported implementation details indicate only a modest computational overhead (around a 9% increase in FLOPs) compared to baseline models without TDM.
Design and Advantages
TDN's design is modular, making it adaptable to existing CNN architectures. For instance, in a ResNet backbone, short-term TDMs are inserted into the early stages to capture local motion, while long-term TDMs are placed in later stages to model broader temporal relationships. This enables a move away from traditional methods that rely heavily on computationally expensive 3D convolutions or optical flow computation.
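The stage-wise placement described above can be sketched as a simple composition. Everything here is an illustrative assumption: the backbone is abstracted as a list of per-stage functions, the `split` index deciding where short-term modules give way to long-term ones is hypothetical, and the residual-style addition is one plausible fusion choice, not necessarily the paper's.

```python
import numpy as np

def make_tdn_backbone(stages, short_term_tdm, long_term_tdm, split=2):
    """Compose a TDN-style backbone from per-stage functions.

    stages: list of callables, one per backbone stage (e.g. ResNet stages).
    short_term_tdm / long_term_tdm: callables producing motion features
        with the same shape as the stage output (illustrative assumption).
    split: index before which short-term TDMs apply; later stages use
        long-term TDMs, mirroring the early/late placement in the paper.
    """
    def forward(x):
        for i, stage in enumerate(stages):
            tdm = short_term_tdm if i < split else long_term_tdm
            # Residual-style fusion: add the motion cue to the stage output.
            x = stage(x) + tdm(x)
        return x
    return forward
```

Because the TDMs are injected per stage rather than replacing the backbone, the same wrapper pattern could in principle be applied to other CNN families, which is what makes the design modular.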
The authors provide comprehensive ablation studies detailing the effectiveness of temporal difference operations and the complementary nature of short-term and long-term modules in enhancing action recognition performance. The analysis highlights TDM's superiority over existing temporal modeling techniques like TEINet and TSM by more effectively capturing motion dynamics with fewer FLOPs.
Implications and Future Directions
The work holds significant implications for the future of video action recognition. By providing a framework that improves motion modeling efficiency without substantially increasing computational overhead, TDN represents a practical alternative to current state-of-the-art methodologies reliant on massive data and computation resources. Its modular nature also suggests potential for broader application, from real-time systems to large-scale video content analysis.
Future developments in this domain may explore further optimization of TDMs, integration with other video processing innovations, and application to video domains beyond action recognition. The proposed TDN architecture points toward a promising balance between accuracy and computational efficiency in video analysis tasks.