AutoLoc: Weakly-supervised Temporal Action Localization in Untrimmed Videos
The paper "AutoLoc: Weakly-supervised Temporal Action Localization in Untrimmed Videos" addresses the significant challenge of temporal action localization (TAL) in lengthy, untrimmed video sequences, where annotating the segment-level ground truth for training is expensive and labor-intensive. Traditional approaches require fully-supervised learning with detailed annotations that include both action class and temporal boundaries, which may not always be feasible, particularly with larger datasets. This paper proposes AutoLoc, a novel framework that leverages weak supervision — specifically, only video-level annotations — to predict the temporal boundaries of action instances more directly and effectively.
Framework and Methodology
One of the central innovations of the AutoLoc framework is the Outer-Inner-Contrastive (OIC) loss. This loss function derives segment-level supervision automatically from video-level annotations: for a hypothesized action segment, it contrasts the classification activations inside the segment against those in the surrounding outer area. Minimizing the OIC loss maximizes this inner-versus-outer activation contrast, driving the predicted temporal boundaries toward the true extent of the action.
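A minimal NumPy sketch of this idea follows; it is not the authors' implementation. It scores one hypothesized segment against a one-dimensional class activation sequence, and the `inflation` fraction used to extend the segment into its outer area is an assumed hyperparameter chosen here for illustration.

```python
import numpy as np

def oic_loss(cas, start, end, inflation=0.25):
    """Outer-Inner-Contrastive loss for one candidate segment.

    cas       : 1-D class activation sequence (one score per temporal position).
    start,end : inner boundaries of the candidate segment (inclusive indices).
    inflation : fraction of the inner length added on each side to form the
                outer area (an assumed value, for illustration only).
    """
    length = end - start + 1
    pad = max(1, int(round(inflation * length)))
    outer_start = max(0, start - pad)
    outer_end = min(len(cas) - 1, end + pad)

    inner = cas[start:end + 1]
    # Outer area = inflated region minus the inner segment itself.
    outer = np.concatenate([cas[outer_start:start], cas[end + 1:outer_end + 1]])

    # Loss = mean outer activation minus mean inner activation; minimizing it
    # pushes boundaries toward high inner / low surrounding activation.
    if outer.size == 0:
        return -inner.mean()
    return outer.mean() - inner.mean()
```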
Building on this loss, the authors train a segment boundary prediction model without any annotated temporal boundaries, a departure from traditional TAL methods. AutoLoc uses a multi-anchor regression strategy: at each temporal position, anchors of several scales are refined to predict the center and duration, and thus the precise boundaries, of candidate action instances (sketched below). The boundary prediction mechanism is class-agnostic during training, which makes it adaptable to unseen action classes.
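The following sketch shows one common way such multi-anchor boundary regression can be decoded into segments. The parameterization (center offset as a fraction of anchor width, log-scale width adjustment) follows standard anchor-based detection conventions and is an assumption, not necessarily the paper's exact scheme.

```python
import numpy as np

def decode_anchors(dc, dw, anchor_scales=(1, 2, 4, 8, 16)):
    """Decode per-position boundary regressions into candidate segments.

    dc, dw : arrays of shape (T, K) holding the predicted center offset and
             log-width adjustment for each of K anchor scales at each of
             T temporal positions.
    Returns an array of shape (T, K, 2) holding [start, end] boundaries.
    """
    T, K = dc.shape
    centers = np.arange(T, dtype=float)[:, None]           # (T, 1)
    widths = np.asarray(anchor_scales, dtype=float)[None]  # (1, K)

    new_center = centers + dc * widths   # shift center by a fraction of width
    new_width = widths * np.exp(dw)      # scale width multiplicatively
    starts = new_center - new_width / 2
    ends = new_center + new_width / 2
    return np.stack([starts, ends], axis=-1)
```

Each decoded candidate segment can then be scored with the OIC loss above, so that boundary regression is trained end-to-end from video-level labels alone.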
Empirical Results
The effectiveness of AutoLoc is demonstrated through extensive experiments on two standard benchmarks: THUMOS'14 and ActivityNet. At an IoU threshold of 0.5, AutoLoc improves mean Average Precision (mAP) from 13.7% to 21.2% on THUMOS'14 and from 7.4% to 27.3% on ActivityNet, relative gains of 54.7% and 268.9% over the prior weakly-supervised state of the art. Moreover, AutoLoc is competitive with some fully-supervised approaches, making it a practical alternative when full annotation is impractical.
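As a sanity check, the relative gains quoted above follow directly from the absolute mAP numbers:

```python
# Relative mAP gains recomputed from the reported absolute numbers.
for name, before, after in [("THUMOS'14", 13.7, 21.2), ("ActivityNet", 7.4, 27.3)]:
    print(f"{name}: {(after - before) / before:.1%} relative gain")
# THUMOS'14: 54.7% relative gain
# ActivityNet: 268.9% relative gain
```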
Implications and Future Directions
AutoLoc's emergence as a robust weakly-supervised method for temporal action localization marks a shift in how such models can be trained: away from dense segment-level annotation and toward cheap video-level labels. By using video-level labels to learn segment boundaries through the proposed OIC loss, AutoLoc broadens the applicability of TAL frameworks to domains where detailed annotation is scarce.
Future work may refine the regression model to predict still more precise temporal boundaries, and adapt AutoLoc's underlying principles to spatio-temporal action detection and other tasks that benefit from reduced annotation requirements. It would be particularly interesting to explore extensions to weakly-supervised object detection in static images or to fine-grained action recognition, where the same contrastive principle could extract localization cues from weak labels. AutoLoc thus opens several promising research avenues within the broader context of video understanding in artificial intelligence.