
SSCAP: Self-supervised Co-occurrence Action Parsing for Unsupervised Temporal Action Segmentation (2105.14158v3)

Published 29 May 2021 in cs.CV

Abstract: Temporal action segmentation is the task of classifying each frame of a video with an action label. However, annotating every frame in a large corpus of videos to construct a comprehensive supervised training dataset is expensive. In this work we therefore propose an unsupervised method, SSCAP, which operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos. SSCAP leverages self-supervised learning to extract distinguishable features and then applies a novel Co-occurrence Action Parsing algorithm that not only captures the correlation among sub-actions underlying the structure of activities, but also estimates the temporal path of the sub-actions in an accurate and general way. We evaluate on both classic datasets (Breakfast, 50Salads) and the emerging fine-grained action dataset FineGym, which features more complex activity structures and similar sub-actions. Results show that SSCAP achieves state-of-the-art performance on all datasets and can even outperform some weakly-supervised approaches, demonstrating its effectiveness and generalizability.
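As a minimal illustration of the task definition in the abstract (not the paper's method): a temporal action segmenter assigns one action label per frame, and contiguous runs of the same label form the predicted temporal segments. The labels and helper name below are purely hypothetical.

```python
from itertools import groupby

def frames_to_segments(frame_labels):
    """Collapse per-frame action labels into (label, start, end) segments,
    with end exclusive. This is just the generic task formulation, not SSCAP."""
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments

# Example: a short clip whose frames were labeled with three sub-actions
labels = ["pour", "pour", "stir", "stir", "stir", "serve"]
print(frames_to_segments(labels))
# → [('pour', 0, 2), ('stir', 2, 5), ('serve', 5, 6)]
```

An unsupervised method such as SSCAP must produce these per-frame assignments without any ground-truth labels, which is why the feature quality and the parsing of sub-action order matter.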

Authors (7)
  1. Zhe Wang (574 papers)
  2. Hao Chen (1006 papers)
  3. Xinyu Li (136 papers)
  4. Chunhui Liu (23 papers)
  5. Yuanjun Xiong (52 papers)
  6. Joseph Tighe (30 papers)
  7. Charless Fowlkes (35 papers)
Citations (17)