
Multi-granularity Generator for Temporal Action Proposal (1811.11524v2)

Published 28 Nov 2018 in cs.CV

Abstract: Temporal action proposal generation is an important task, aiming to localize the video segments containing human actions in an untrimmed video. In this paper, we propose a multi-granularity generator (MGG) to perform the temporal action proposal from different granularity perspectives, relying on the video visual features equipped with the position embedding information. First, we propose to use a bilinear matching model to exploit the rich local information within the video sequence. Afterwards, two components, namely segment proposal producer (SPP) and frame actionness producer (FAP), are combined to perform the task of temporal action proposal at two distinct granularities. SPP considers the whole video in the form of feature pyramid and generates segment proposals from one coarse perspective, while FAP carries out a finer actionness evaluation for each video frame. Our proposed MGG can be trained in an end-to-end fashion. By temporally adjusting the segment proposals with fine-grained frame actionness information, MGG achieves the superior performance over state-of-the-art methods on the public THUMOS-14 and ActivityNet-1.3 datasets. Moreover, we employ existing action classifiers to perform the classification of the proposals generated by MGG, leading to significant improvements compared against the competing methods for the video detection task.

Authors (5)
  1. Yuan Liu (342 papers)
  2. Lin Ma (206 papers)
  3. Yifeng Zhang (27 papers)
  4. Wei Liu (1135 papers)
  5. Shih-Fu Chang (131 papers)
Citations (191)

Summary

The paper "Multi-granularity Generator for Temporal Action Proposal" presents a novel approach to the temporal action proposal task, focusing on the precise localization of human action segments in untrimmed videos. The importance of this task is highlighted by its applications in video analysis, including action recognition, summarization, grounding, and captioning. The authors introduce the Multi-granularity Generator (MGG) that elegantly combines segment proposal and frame actionness evaluations, thereby enhancing proposal precision and recall.

The MGG is partitioned into two major components: the Segment Proposal Producer (SPP) and the Frame Actionness Producer (FAP). The SPP operates on the whole-video feature sequence and coarsely generates proposals of varied temporal durations, using a U-shaped architecture akin to feature pyramid networks, with lateral connections enriching the semantic features across pyramid levels. However, the authors observe that segment-level proposals may suffer from imprecise boundaries. To counteract this, the FAP evaluates each frame individually, estimating the probability that it is the start, end, or middle of an action instance, which enables refined boundary precision.
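The FAP's frame-level output can be sketched as follows; the head shape and function names here are illustrative assumptions, not the paper's exact layer definitions:

```python
import numpy as np

def frame_actionness(frame_logits):
    """Toy frame actionness producer (FAP) head.

    Assumes per-frame logits of shape (T, 3) for (start, middle, end).
    Each category gets an independent sigmoid, since a frame can plausibly
    be, e.g., the "middle" of one action and the "start" of an overlapping one.
    """
    probs = 1.0 / (1.0 + np.exp(-np.asarray(frame_logits, dtype=float)))
    return {"start": probs[:, 0], "middle": probs[:, 1], "end": probs[:, 2]}

# Tiny example: strong start evidence at frame 0, strong end evidence at frame 4.
logits = np.zeros((5, 3))
logits[0, 0] = 4.0
logits[4, 2] = 4.0
scores = frame_actionness(logits)
```

Treating the three categories as independent binary predictions (rather than a softmax over them) is the design choice that lets overlapping action evidence coexist at a single frame.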

Notable numerical results include MGG's superior performance over state-of-the-art methods on the public THUMOS-14 and ActivityNet-1.3 datasets. On THUMOS-14, MGG achieves an AR@1000 of 64.06% with two-stream features, outperforming the Boundary Sensitive Network (BSN), a leading contemporary approach. On ActivityNet-1.3, MGG achieves an AUC of 66.43% and an AR@100 of 74.54% on the validation set, surpassing previously reported results.
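The AR@k metric used above measures the recall of ground-truth action instances when only the top-k proposals are kept, averaged over a range of temporal IoU thresholds. A minimal single-video sketch (the standard benchmark averages over many videos and a fixed threshold grid):

```python
def tiou(a, b):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def average_recall_at_k(proposals, ground_truth, k, thresholds):
    """AR@k for one video: fraction of ground-truth instances covered by
    at least one of the top-k proposals, averaged over tIoU thresholds.
    Proposals are assumed pre-sorted by confidence."""
    top = proposals[:k]
    recalls = []
    for t in thresholds:
        hits = sum(1 for g in ground_truth if any(tiou(p, g) >= t for p in top))
        recalls.append(hits / len(ground_truth))
    return sum(recalls) / len(recalls)
```

The ActivityNet AUC figure is the area under the AR versus average-number-of-proposals curve built from this same recall computation.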

The paper further introduces a temporal boundary adjustment (TBA) module, leveraging FAP's fine-grained actionness scores to refine SPP's segment proposals. The iterative refinement achieved through the two stages of TBA plays a crucial role in enhancing boundary precision and generating high-quality proposals.

In conclusion, this paper presents MGG as a substantial advancement in temporal action proposal generation, providing significant improvements in both recall and precision. The integration of multi-granular analysis and the ability for end-to-end training underscore the model's robustness. Given the increasing demand for more effective video analysis tools, the implications of this research could be broadly applied in enhancing the quality of action detection systems. Future developments may explore embedding this multi-granularity framework in real-time systems or extending it to address challenges in action prediction and real-time event detection in diverse video datasets.