VPN: Learning Video-Pose Embedding for Activities of Daily Living (2007.03056v1)

Published 6 Jul 2020 in cs.CV

Abstract: In this paper, we focus on the spatio-temporal aspect of recognizing Activities of Daily Living (ADL). ADL have two specific properties: (i) subtle spatio-temporal patterns and (ii) similar visual patterns varying with time. Therefore, ADL may look very similar, and distinguishing them often requires looking at their fine-grained details. Because recent spatio-temporal 3D ConvNets are too rigid to capture the subtle visual patterns across an action, we propose a novel Video-Pose Network: VPN. The two key components of VPN are a spatial embedding and an attention network. The spatial embedding projects the 3D poses and RGB cues into a common semantic space. This enables the action recognition framework to learn better spatio-temporal features by exploiting both modalities. To discriminate similar actions, the attention network provides two functionalities: (i) an end-to-end learnable pose backbone exploiting the topology of the human body, and (ii) a coupler that provides joint spatio-temporal attention weights across a video. Experiments show that VPN outperforms state-of-the-art results for action classification on a large-scale human activity dataset, NTU-RGB+D 120; its subset NTU-RGB+D 60; a challenging real-world human activity dataset, Toyota Smarthome; and a small-scale human-object interaction dataset, Northwestern UCLA.
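
The idea of joint spatio-temporal attention weights modulating RGB features can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how the coupler combines its inputs, so the outer product of two softmax distributions used here is an assumption, and the function names (`couple`, `attend`) and the toy scores standing in for a pose branch's output are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def couple(temporal_scores, spatial_scores):
    """Combine per-frame and per-location scores into a T x S weight grid.

    Assumption: the coupling is an outer product of two softmax
    distributions; the abstract does not state the actual operation.
    """
    t_w = softmax(temporal_scores)
    s_w = softmax(spatial_scores)
    return [[tw * sw for sw in s_w] for tw in t_w]

def attend(features, weights):
    """Attention-weighted sum of features[t][s], each a list of C channels."""
    channels = len(features[0][0])
    pooled = [0.0] * channels
    for t, frame in enumerate(features):
        for s, vec in enumerate(frame):
            for c in range(channels):
                pooled[c] += weights[t][s] * vec[c]
    return pooled

# Toy input: 2 frames, 3 spatial locations, 2 channels of RGB features.
rgb_features = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]],
]
# Hypothetical scores that a pose branch might produce.
temporal_scores = [0.0, 1.0]
spatial_scores = [0.5, 0.5, 2.0]

weights = couple(temporal_scores, spatial_scores)
pooled = attend(rgb_features, weights)
```

Because both factors are normalized, the coupled weights sum to one over all frames and locations, so the pooled vector is a convex combination of the RGB features that favors the frame and location with the highest scores.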


Authors (5)

Srijan Das (35 papers)
Saurav Sharma (10 papers)
Rui Dai (28 papers)
Francois Bremond (114 papers)
Monique Thonnat (9 papers)

Citations (104)

