Few-Shot Transformation of Common Actions into Time and Space (2104.02439v1)

Published 6 Apr 2021 in cs.CV

Abstract: This paper introduces the task of few-shot common action localization in time and space. Given a few trimmed support videos containing the same but unknown action, we strive for spatio-temporal localization of that action in a long untrimmed query video. We do not require any class labels, interval bounds, or bounding boxes. To address this challenging task, we introduce a novel few-shot transformer architecture with a dedicated encoder-decoder structure optimized for joint commonality learning and localization prediction, without the need for proposals. Experiments on our reorganizations of the AVA and UCF101-24 datasets show the effectiveness of our approach for few-shot common action localization, even when the support videos are noisy. Although our approach is not specifically designed for common localization in time only, it also compares favorably against the few-shot and one-shot state-of-the-art in this setting. Lastly, we demonstrate that the few-shot transformer is easily extended to common action localization per pixel.

Authors (3)

Pengwan Yang
Pascal Mettes
Cees G. M. Snoek

Citations (9)
