Learning Salient Boundary Feature for Anchor-free Temporal Action Localization
Temporal Action Localization (TAL) is a fundamental task in video understanding that aims to identify both the categories of actions and their precise temporal boundaries within long, untrimmed videos. Traditional methods predominantly follow two paradigms, anchor-based models and actionness-guided models, both of which suffer in scalability and efficiency because they depend on pre-defined anchors or exhaustive proposal generation. This paper introduces a purely anchor-free model for TAL that focuses on learning salient boundary features, enabling more efficient and flexible action localization.
Contributions and Methodology
The paper presents an anchor-free model, the Anchor-Free Saliency-based Detector (AFSD), built from several novel components:
- Basic Predictor: A backbone processes the input video into a temporal feature pyramid, from which the model produces coarse proposals consisting of action class scores and boundary predictions at every temporal location. Being anchor-free, this stage removes the dependence on pre-set anchor scales and the hyper-parameter tuning that traditional methods require (see the prediction-head sketch after this list).
- Saliency-Based Refinement: A boundary pooling mechanism refines each coarse proposal by extracting salient, moment-level features around its predicted start and end. The refinement draws on the feature pyramid together with an additional frame-level feature, giving finer temporal granularity for boundary detection (a simplified boundary-pooling sketch follows this list).
- Boundary Consistency Learning (BCL): BCL guides the refinement through two strategies: Activation Guided Learning, which generates boundary activation maps, and Boundary Contrastive Learning, which improves feature discrimination by contrasting action segments against background clips. Together they ensure that the pooled boundary features genuinely represent action boundaries, enhancing detection precision (a hedged contrastive-loss sketch appears after this list).
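To make the anchor-free basic predictor concrete, the following is a minimal PyTorch sketch of a prediction head over a 1D feature pyramid. The layer sizes, kernel sizes, and class names are illustrative assumptions, not the authors' exact configuration; the point is that each temporal location directly regresses distances to an action's start and end instead of adjusting pre-defined anchors.

```python
# Hypothetical anchor-free head: per-location class logits plus start/end distances.
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    def __init__(self, channels: int = 512, num_classes: int = 20):
        super().__init__()
        # Shared 1D convolutions applied to every pyramid level.
        self.cls_conv = nn.Conv1d(channels, num_classes, kernel_size=3, padding=1)
        # Two regression targets per temporal location: distances to the
        # action's start and end, kept non-negative via ReLU.
        self.reg_conv = nn.Conv1d(channels, 2, kernel_size=3, padding=1)

    def forward(self, pyramid_feats):
        coarse = []
        for feat in pyramid_feats:                          # each: (B, C, T_level)
            cls_logits = self.cls_conv(feat)                # (B, num_classes, T_level)
            boundaries = torch.relu(self.reg_conv(feat))    # (B, 2, T_level)
            coarse.append((cls_logits, boundaries))
        return coarse

# Usage: three pyramid levels with decreasing temporal resolution.
feats = [torch.randn(2, 512, t) for t in (64, 32, 16)]
coarse_proposals = AnchorFreeHead()(feats)
```

Because every temporal location yields a proposal directly, no anchor scales or aspect-like ratios need to be enumerated or tuned.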
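The boundary pooling idea can be illustrated with a rough sketch: for each coarse boundary, keep the strongest frame-level response inside a small window centred on that boundary. The window size, clamping, and loop-based implementation are simplifying assumptions made here for readability.

```python
# Hypothetical boundary max-pooling over frame-level features.
import torch

def boundary_max_pool(frame_feats, boundaries, window: int = 4):
    """frame_feats: (B, C, T) frame-level features.
    boundaries: (B, N) boundary positions in frame units (floats allowed).
    Returns pooled salient boundary features of shape (B, C, N)."""
    B, C, T = frame_feats.shape
    pooled = []
    for b in range(B):
        per_boundary = []
        for t in boundaries[b].clamp(0, T - 1):
            lo = max(0, int(t) - window)
            hi = min(T, int(t) + window + 1)
            # Keep the strongest response inside the window around the boundary.
            per_boundary.append(frame_feats[b, :, lo:hi].max(dim=-1).values)
        pooled.append(torch.stack(per_boundary, dim=-1))    # (C, N)
    return torch.stack(pooled, dim=0)                       # (B, C, N)

# Usage: pool features around two predicted start boundaries per video.
feats = torch.randn(2, 512, 128)
starts = torch.tensor([[10.3, 57.8], [4.0, 90.5]])
print(boundary_max_pool(feats, starts).shape)  # torch.Size([2, 512, 2])
```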
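Finally, a hedged sketch of a contrastive objective in the spirit of Boundary Contrastive Learning: pull a pooled boundary feature towards features from its own action segment and push it away from background clips. The InfoNCE-style formulation and the temperature value are assumptions and do not reproduce the paper's exact loss.

```python
# Hypothetical contrastive loss between a boundary feature, action features,
# and background features (InfoNCE-style; not the authors' exact formulation).
import torch
import torch.nn.functional as F

def boundary_contrastive_loss(boundary_feat, action_feats, background_feats,
                              temperature: float = 0.1):
    """boundary_feat: (C,); action_feats: (P, C) positives from the same action;
    background_feats: (N, C) negatives from background clips."""
    q = F.normalize(boundary_feat, dim=0)
    pos = F.normalize(action_feats, dim=1)
    neg = F.normalize(background_feats, dim=1)
    logits = torch.cat([pos @ q, neg @ q]) / temperature   # cosine similarities
    log_prob = F.log_softmax(logits, dim=0)
    # Maximize the probability mass assigned to the positive (action) features.
    return -log_prob[: pos.shape[0]].mean()

# Usage with random 512-dimensional features.
loss = boundary_contrastive_loss(torch.randn(512), torch.randn(8, 512), torch.randn(32, 512))
```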
Performance and Results
The AFSD model shows strong performance on standard benchmarks, outperforming traditional anchor-based and actionness-guided methods. On THUMOS14 it achieves state-of-the-art results, with clear gains in mean Average Precision (mAP) across a range of temporal Intersection over Union (tIoU) thresholds. On ActivityNet v1.3 it is likewise competitive, demonstrating that the approach generalizes across datasets and evaluation settings.
Implications and Future Work
This work opens up new horizons in the TAL field by demonstrating the advantages of anchor-free methodologies. The reduction in computational overhead and hyper-parameter complexity renders this approach particularly appealing for real-time and large-scale video analysis applications. Furthermore, the integration of sophisticated feature refinement and learning strategies like BCL represents a substantial step towards more accurate and efficient TAL systems.
Looking forward, the paradigm introduced here could be extended to incorporate multi-modal data, potentially improving robustness and applicability in complex, real-world video analytics scenarios. Further exploration of contrastive learning for temporal localization could also lead to even more precise action detection frameworks.
In conclusion, this paper contributes significantly to the TAL domain by providing a comprehensive framework that challenges the conventional norms of anchor dependency, ultimately promoting the development of more streamlined and effective action localization systems.