- The paper introduces a method that uses graph convolutional networks to capture both intra-action and inter-action context across action proposals in a video.
- The paper employs two separate GCNs for action classification and boundary refinement, achieving an mAP of 49.1% at tIoU 0.5 on THUMOS14.
- The paper demonstrates practical gains in handling untrimmed videos and sets the stage for future work on scalable, real-time graph-based action recognition.
Graph Convolutional Networks for Temporal Action Localization
The paper introduces an approach that leverages Graph Convolutional Networks (GCNs) to improve temporal action localization in videos. Unlike conventional methods that process action proposals independently, this work models the relations between proposals with GCNs, capturing contextual information that is crucial for recognizing and localizing actions accurately.
Methodology Overview
The authors construct a proposal graph in which nodes are the action proposals within a video and edges encode relations between them. Two types of edges are defined: contextual edges, which connect proposals that overlap in time, and surrounding edges, which link disjoint but temporally nearby proposals. This dual structure allows the network to capture both intra-action and inter-action context.
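To make this concrete, below is a minimal sketch of how such a proposal graph could be assembled from (start, end) proposals. The overlap and proximity thresholds (`ctx_thresh`, `sur_thresh`) and the normalized center-distance test are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def tiou(p, q):
    """Temporal IoU between two (start, end) proposals."""
    inter = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
    union = max(p[1], q[1]) - min(p[0], q[0])
    return inter / union if union > 0 else 0.0

def build_proposal_graph(proposals, ctx_thresh=0.7, sur_thresh=0.4):
    """Adjacency matrix with contextual edges (overlapping proposals)
    and surrounding edges (disjoint but temporally close proposals).
    Thresholds and the distance measure are illustrative placeholders."""
    n = len(proposals)
    adj = np.eye(n, dtype=np.float32)  # self-loops keep each node's own feature
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            iou = tiou(proposals[i], proposals[j])
            if iou > ctx_thresh:
                adj[i, j] = 1.0        # contextual edge: heavy temporal overlap
            elif iou == 0.0:
                # surrounding edge: no overlap, but centers are close
                ci = 0.5 * (proposals[i][0] + proposals[i][1])
                cj = 0.5 * (proposals[j][0] + proposals[j][1])
                span = max(proposals[i][1], proposals[j][1]) - min(proposals[i][0], proposals[j][0])
                if span > 0 and abs(ci - cj) / span < sur_thresh:
                    adj[i, j] = 1.0
    return adj
```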
Graph convolutions are then applied to aggregate information across the proposal graph. Two separate GCNs are employed: one for action classification and one for refining proposal boundaries. This split reflects the observation that classification and localization, although closely related, benefit from tailored processing.
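The sketch below illustrates this two-branch idea in PyTorch under simplifying assumptions: two mean-aggregation GCN layers per branch and arbitrary layer widths, rather than the paper's exact architecture or edge-weighting scheme.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: average features over neighbors
    (row-normalized adjacency), then apply a linear map."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, adj, feats):
        # adj: (N, N) adjacency with self-loops; feats: (N, in_dim)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        agg = (adj / deg) @ feats
        return torch.relu(self.linear(agg))

class TwoBranchGCN(nn.Module):
    """Two parallel GCN stacks over the same proposal graph: one
    produces class scores, the other regresses boundary offsets."""
    def __init__(self, feat_dim=1024, hidden=512, num_classes=20):
        super().__init__()
        self.cls_gcn1, self.cls_gcn2 = GCNLayer(feat_dim, hidden), GCNLayer(hidden, hidden)
        self.reg_gcn1, self.reg_gcn2 = GCNLayer(feat_dim, hidden), GCNLayer(hidden, hidden)
        self.cls_head = nn.Linear(hidden, num_classes + 1)  # +1 for background
        self.reg_head = nn.Linear(hidden, 2)                # start / end offsets

    def forward(self, adj, feats):
        h_cls = self.cls_gcn2(adj, self.cls_gcn1(adj, feats))
        h_reg = self.reg_gcn2(adj, self.reg_gcn1(adj, feats))
        return self.cls_head(h_cls), self.reg_head(h_reg)

# Example: 8 proposals with 1024-d pooled features and the adjacency built above
# adj = torch.from_numpy(build_proposal_graph(proposals))
# scores, offsets = TwoBranchGCN()(adj, torch.randn(8, 1024))
```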
Numerical Results and Impact
On the THUMOS14 dataset, the proposed approach achieves a mean Average Precision (mAP) of 49.1% at a tIoU threshold of 0.5, a notable improvement over the previous state of the art of 42.8%. Experiments on the ActivityNet dataset corroborate the model's efficacy, with clear mAP gains over baseline results.
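For reference, the sketch below shows how per-class average precision at a single tIoU threshold is commonly computed; it is a simplified, generic implementation, not the official THUMOS14 or ActivityNet evaluation code.

```python
import numpy as np

def average_precision(preds, gts, tiou_thresh=0.5):
    """AP for one action class at a single tIoU threshold.
    preds: list of (start, end, score); gts: list of (start, end).
    Predictions are ranked by score; each ground-truth segment
    can be matched at most once."""
    if not preds or not gts:
        return 0.0
    preds = sorted(preds, key=lambda p: -p[2])
    matched = [False] * len(gts)
    tp = np.zeros(len(preds))
    for i, (s, e, _) in enumerate(preds):
        best, best_iou = -1, tiou_thresh
        for j, (gs, ge) in enumerate(gts):
            inter = max(0.0, min(e, ge) - max(s, gs))
            union = max(e, ge) - min(s, gs)
            iou = inter / union if union > 0 else 0.0
            if iou >= best_iou and not matched[j]:
                best, best_iou = j, iou
        if best >= 0:
            matched[best] = True
            tp[i] = 1.0
    rec = np.cumsum(tp) / len(gts)
    prec = np.cumsum(tp) / np.arange(1, len(preds) + 1)
    # area under the step-interpolated precision-recall curve
    ap, prev_r = 0.0, 0.0
    for r, p in zip(rec, prec):
        ap += (r - prev_r) * p
        prev_r = r
    return float(ap)

# mAP is the mean of this value over all action classes.
```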
Theoretical and Practical Implications
The findings underscore the importance of modeling proposal-proposal interactions in temporal action localization. The GCNs strengthen the learned representations by integrating contextual cues that would otherwise be ignored. Practically, the approach offers a more precise way to handle untrimmed video, where background segments and overlapping actions routinely challenge existing models.
Future Directions
Building on this work, future research could focus on improving the scalability of the approach, given the computational cost introduced by graph operations. Potential avenues include dynamic graph structures that adapt to varying video contexts and integration with real-time processing. Investigating how the insights of this paper generalize to other action recognition datasets, or to analogous tasks in other domains, also offers fertile ground for follow-up work.
Overall, the paper contributes a significant methodological advancement in the domain of temporal action localization, promising improved accuracy and robustness in action detection systems.