- The paper introduces TURN, a method that jointly predicts action proposals and refines boundaries via temporal coordinate regression.
- It reuses video unit features and builds clip pyramids for efficient processing, enabling speeds over 880 FPS on modern GPUs.
- Evaluated with the proposed AR-F metric, TURN outperforms prior proposal methods, and feeding its proposals into an existing localization pipeline raises mAP on THUMOS-14 from 19% to 25.6% at tIoU 0.5.
Temporal Unit Regression Network for Temporal Action Proposals
This paper introduces the Temporal Unit Regression Network (TURN), a method designed to improve Temporal Action Proposal (TAP) generation in untrimmed videos. Accurate, efficiently generated proposals feed directly into downstream temporal action localization, and TURN delivers significant gains over prior methods in both recall and speed.
Key Contributions
The TURN model makes two main contributions: joint prediction of action proposals with boundary refinement via temporal coordinate regression, and computational efficiency through the reuse of unit-level features. Together these allow TURN to significantly outperform existing methods on benchmark datasets such as THUMOS-14 and ActivityNet while processing video at over 880 FPS on modern GPUs.
Methodology
TURN decomposes a video into short, equal-length segments termed "video units," which serve as the basic building blocks for temporal proposals. Because each unit's features are computed once and reused by every proposal that covers it, processing remains efficient. At each anchor unit, a multi-scale clip pyramid is built; for each clip, TURN outputs a confidence score indicating whether it contains an action instance, together with regression offsets that refine its temporal boundaries.
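The following is a minimal sketch of the unit-feature-reuse idea: precomputed unit features are pooled into multi-scale clips around an anchor unit, with surrounding units supplying temporal context. The function name, pyramid scales, pooling choice, and context size are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def build_clip_pyramid(unit_features, anchor_idx, clip_sizes=(1, 2, 4, 8), ctx_units=4):
    """Build a multi-scale clip pyramid around one anchor unit (illustrative sketch).

    unit_features: (num_units, feat_dim) array of precomputed unit-level features.
    anchor_idx:    index of the unit the pyramid is centered on.
    clip_sizes:    pyramid scales in units (hypothetical values).
    ctx_units:     number of surrounding units used as temporal context.
    Returns a list of (clip_feature, start_unit, end_unit) tuples.
    """
    num_units = unit_features.shape[0]
    clips = []
    for size in clip_sizes:
        start = max(0, anchor_idx - size // 2)
        end = min(num_units, start + size)
        # Mean-pool the internal units; the cached unit features are reused by
        # every clip that covers them, which is what keeps the pipeline fast.
        internal = unit_features[start:end].mean(axis=0)
        left_ctx = unit_features[max(0, start - ctx_units):start]
        right_ctx = unit_features[end:min(num_units, end + ctx_units)]
        left = left_ctx.mean(axis=0) if len(left_ctx) else np.zeros_like(internal)
        right = right_ctx.mean(axis=0) if len(right_ctx) else np.zeros_like(internal)
        # Concatenate context + internal + context into one clip-level feature.
        clips.append((np.concatenate([left, internal, right]), start, end))
    return clips
```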
The network goes beyond binary classification by performing temporal coordinate regression at the unit level. Unlike frame-level regression, this matches the granularity of unit-based features and yields more reliable boundary localization.
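As a rough illustration of how unit-level offsets might be applied, the sketch below shifts a clip's boundaries by the predicted offsets in unit coordinates and converts the result back to frames. The function name, the (delta_start, delta_end) output convention, and the 16-frame unit length are assumptions for illustration only.

```python
def refine_boundaries(start_unit, end_unit, offsets, unit_len_frames=16, num_units=None):
    """Apply unit-level regression offsets to a clip's boundaries (illustrative sketch).

    offsets: (delta_start, delta_end) predicted by the network, expressed in
             units rather than frames (assumed output convention).
    Returns the refined (start_frame, end_frame) of the proposal.
    """
    delta_start, delta_end = offsets
    new_start = start_unit + delta_start
    new_end = end_unit + delta_end
    if num_units is not None:
        # Clamp to the video extent and keep start <= end.
        new_start = max(0.0, min(new_start, num_units))
        new_end = max(new_start, min(new_end, num_units))
    # Convert unit coordinates back to frame coordinates.
    return new_start * unit_len_frames, new_end * unit_len_frames
```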
Evaluation Metrics and Results
To evaluate TAP systems, a novel metric, Average Recall vs. Frequency of retrieved proposals (AR-F), is proposed. This metric demonstrates superior correlation with localization performance compared to traditional metrics like AR-AN and AR-N, notably maintaining consistency across varying video lengths and datasets.
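A rough sketch of what computing AR-F could look like: for each video, keep the top round(F × duration) ranked proposals and average recall over a range of tIoU thresholds. The input format, threshold range, and function names are assumptions made for this sketch; the metric itself is defined in the paper, not by this code.

```python
import numpy as np

def temporal_iou(p, g):
    """tIoU between a proposal p=(start, end) and a ground-truth segment g=(start, end)."""
    inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
    union = max(p[1], g[1]) - min(p[0], g[0])
    return inter / union if union > 0 else 0.0

def average_recall_at_frequency(videos, freq, tious=np.arange(0.5, 1.0, 0.05)):
    """AR-F sketch: recall averaged over tIoU thresholds and videos, keeping the
    top round(freq * duration) proposals per video.

    videos: list of dicts with 'duration' (seconds), ranked 'proposals', and
            'gt' segments, all in seconds (assumed input format).
    """
    recalls = []
    for v in videos:
        k = max(1, int(round(freq * v['duration'])))
        props = v['proposals'][:k]  # proposals are assumed to be pre-ranked by score
        for t in tious:
            hit = sum(any(temporal_iou(p, g) >= t for p in props) for g in v['gt'])
            recalls.append(hit / max(1, len(v['gt'])))
    return float(np.mean(recalls))
```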
Experimental results show that TURN generalizes across datasets without fine-tuning. The model outperforms established methods in both AR and mAP, and achieves state-of-the-art temporal action localization results when integrated into existing pipelines; for instance, it raises mAP on THUMOS-14 from 19% to 25.6% at tIoU 0.5 purely through improved proposals.
Implications and Future Directions
TURN’s advancements suggest significant potential for more effective and scalable action recognition systems. Its architecture, integrating regression-based boundary refinement with efficient computation, represents a key step toward practical deployment in large-scale video analysis.
Future research could explore integrating TURN with more advanced visual feature models or extending its framework to other complex video tasks such as joint spatio-temporal action detection. The proposed metric, AR-F, could also stimulate the development of new evaluation standards in video comprehension tasks. Overall, TURN’s methodological contributions and empirical successes make it a valuable reference in the temporal action domain.