
ActionVLAD: Learning spatio-temporal aggregation for action classification (1704.02895v1)

Published 10 Apr 2017 in cs.CV

Abstract: In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks with learnable spatio-temporal feature aggregation. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and combining signals from the different streams. We find that: (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation outperforms the two-stream base architecture by a large margin (13% relative) as well as outperforms other baselines with comparable base architectures on HMDB51, UCF101, and Charades video classification benchmarks.

Authors (5)
  1. Rohit Girdhar (43 papers)
  2. Deva Ramanan (152 papers)
  3. Abhinav Gupta (178 papers)
  4. Josef Sivic (78 papers)
  5. Bryan Russell (36 papers)
Citations (441)

Summary

ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification

The paper presents ActionVLAD, a novel approach to representing video data for action classification. It introduces an end-to-end trainable architecture that aggregates local convolutional features across the entire spatio-temporal extent of a video. By integrating two-stream convolutional neural networks (CNNs) with learnable spatio-temporal feature aggregation, the authors address the challenge of modeling long-term temporal structure in video.

Methodology

ActionVLAD builds on the two-stream network architecture, which processes appearance (RGB) and motion (optical flow) information in separate streams. The key contribution is a spatio-temporal extension of the NetVLAD aggregation layer, dubbed ActionVLAD, which pools CNN features across the entire spatial and temporal extent of the video. The authors explore various strategies for spatial and temporal pooling and for fusing the motion and appearance streams, finding that joint spatio-temporal pooling is advantageous, but that motion and appearance are best aggregated into separate representations.
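To make the aggregation step concrete, the following is a minimal PyTorch sketch of an ActionVLAD-style layer, i.e. NetVLAD-style pooling applied jointly over space and time. The class name, hyperparameter values (K = 64 anchors, D = 512-dimensional features), and the 1x1-convolution soft assignment are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of an ActionVLAD-style aggregation layer (assumed
# hyperparameters: K=64 anchor points, D=512 feature dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionVLAD(nn.Module):
    def __init__(self, num_clusters: int = 64, dim: int = 512):
        super().__init__()
        self.num_clusters = num_clusters
        self.dim = dim
        # Learnable anchor points (cluster centres) in feature space.
        self.centers = nn.Parameter(torch.randn(num_clusters, dim) * 0.01)
        # 1x1 convolution producing per-location soft-assignment logits.
        self.assign = nn.Conv2d(dim, num_clusters, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, D, H, W) convolutional features from T sampled frames.
        T, D, H, W = feats.shape
        # Soft-assign every spatio-temporal location to the K anchors.
        a = F.softmax(self.assign(feats), dim=1)            # (T, K, H, W)
        x = feats.view(T, 1, D, H * W)                      # (T, 1, D, HW)
        a = a.view(T, self.num_clusters, 1, H * W)          # (T, K, 1, HW)
        c = self.centers.view(1, self.num_clusters, D, 1)   # (1, K, D, 1)
        # Aggregate residuals (x - c_k) jointly over space AND time.
        vlad = (a * (x - c)).sum(dim=(0, 3))                # (K, D)
        # Intra-normalise per cluster, flatten, then L2-normalise.
        vlad = F.normalize(vlad, dim=1).flatten()
        return F.normalize(vlad, dim=0)
```

In the two-stream setting described above, one such layer would sit on top of each stream, and the two resulting descriptors would be fused late (for example, concatenated before the classifier), consistent with the finding that appearance and motion are best kept in separate representations.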

Strong Numerical Results

The proposed model significantly outperforms the baseline two-stream networks and other architectures built on comparable base networks. On the HMDB51, UCF101, and Charades action classification benchmarks, ActionVLAD achieves a 13% relative improvement over the two-stream base architecture. These gains are attributed to the effective aggregation of information across space and time, which enriches the feature representation without substantially complicating the model.

Implications

ActionVLAD has both practical and theoretical implications. Practically, the model could improve systems that depend on accurate action recognition, such as video editing, sports analytics, and human-robot interaction. Theoretically, the paper advances the understanding of spatio-temporal feature modeling, particularly for CNN architectures applied to video data.

Future Developments

Future work may extend beyond appearance and motion streams by incorporating additional modalities such as depth or audio. As datasets continue to grow in size and diversity, there will also be greater opportunity to fully exploit the trainable components of such architectures with richer data.

In conclusion, ActionVLAD provides a robust framework for action classification, leveraging spatio-temporal aggregation to significantly outperform conventional methods. Its end-to-end trainability positions it as a promising tool for continued advancements in video action recognition.