Overview of Pose-conditioned Spatio-Temporal Attention for Human Action Recognition
The paper "Pose-conditioned Spatio-Temporal Attention for Human Action Recognition," authored by Fabien Baradel, Christian Wolf, and Julien Mille, presents an approach to human action recognition that leverages multi-modal video data, specifically articulated pose and RGB frames. The method introduces a novel two-stream architecture that merges the strengths of pose data and RGB imagery through convolutional and recurrent networks, augmented with attention mechanisms.
The methodology is divided into two primary components:
- Convolutional Pose Stream: The pose stream processes sequences of 3D joint data with a convolutional neural network that captures spatio-temporal dynamics. The authors introduce an ordering of the joints that follows the human body's topology, so that successive convolutional layers aggregate information from neighboring joints into increasingly abstract features. This hierarchical representation allows efficient learning without recurrent structures, relying on convolution alone to discern temporal patterns in human motion.
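The idea of the pose stream can be sketched as follows: joints are laid out along the body's topology so that adjacent rows of the resulting (joint, time) grid correspond to anatomically connected joints, and a convolution over that grid mixes neighboring joints and frames. This is a minimal illustrative sketch, not the authors' implementation; the skeleton layout, `build_pose_grid`, `conv2d_valid`, and the kernel values are all assumptions made for demonstration.

```python
# A hypothetical skeleton: joints listed in a depth-first walk of the body
# tree, so adjacent entries are anatomically connected (illustrative order).
TOPOLOGICAL_ORDER = [
    "head", "neck", "right_shoulder", "right_elbow", "right_hand",
    "left_shoulder", "left_elbow", "left_hand",
    "spine", "right_hip", "right_knee", "right_foot",
    "left_hip", "left_knee", "left_foot",
]

def build_pose_grid(frames):
    """Stack per-frame joint coordinates into a (num_joints, num_frames) grid.

    `frames` is a list of dicts mapping joint name -> (x, y, z).
    Rows follow TOPOLOGICAL_ORDER, so a spatial convolution over rows
    mixes features of physically neighboring joints.
    """
    return [[frame[j] for frame in frames] for j in TOPOLOGICAL_ORDER]

def conv2d_valid(grid, kernel):
    """Naive 'valid' 2D convolution over the (joint, time) axes.

    The x, y, z coordinates of each cell are summed before filtering,
    mimicking a single output channel of the pose stream's first layer.
    """
    kh, kw = len(kernel), len(kernel[0])
    nj, nt = len(grid), len(grid[0])
    out = []
    for i in range(nj - kh + 1):
        row = []
        for t in range(nt - kw + 1):
            acc = 0.0
            for di in range(kh):
                for dt in range(kw):
                    acc += kernel[di][dt] * sum(grid[i + di][t + dt])
            row.append(acc)
        out.append(row)
    return out
```

Stacking such layers widens the receptive field over both joints and frames, which is how the paper's pose stream builds increasingly abstract spatio-temporal features without recurrence.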
- RGB Stream with Pose-conditioned Attention: This stream applies a spatio-temporal soft attention mechanism to RGB video data, conditioned on features drawn from the pose stream. The attention system is structured around an LSTM network that receives image crops (glimpses) at specified locations, concretely the hands of the persons involved in the captured interactions. By conditioning the spatial attention on pose descriptors, the model can adjust its focus dynamically, which enhances its ability to detect subtle actions involving hand motions and associated object interactions. The temporal aggregation of features is further refined with a temporal attention model that adaptively pools outputs over the sequence.
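The two attention steps above can be sketched in isolation: a pose descriptor scores a set of glimpse features (e.g. crops around each hand) and a softmax over those scores produces a weighted sum, while a second softmax over per-frame scores adaptively pools the sequence. This is a simplified sketch under assumed dot-product scoring; the real model learns its scoring functions jointly with the LSTM, and all vectors here are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_attention(pose_feat, glimpse_feats):
    """Weight glimpse features by their dot product with the pose descriptor.

    Returns the attention-pooled feature and the soft weights, so glimpses
    that agree with the pose context contribute more to the pooled vector.
    """
    scores = [sum(p * g for p, g in zip(pose_feat, gf)) for gf in glimpse_feats]
    weights = softmax(scores)
    dim = len(glimpse_feats[0])
    pooled = [sum(w * gf[d] for w, gf in zip(weights, glimpse_feats))
              for d in range(dim)]
    return pooled, weights

def temporal_attention(frame_feats, frame_scores):
    """Adaptively pool per-frame features using softmax-normalised scores."""
    weights = softmax(frame_scores)
    dim = len(frame_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, frame_feats))
            for d in range(dim)]
```

The design choice worth noting is the conditioning: because the spatial scores depend on the pose descriptor rather than on the RGB features alone, the attended locations track the skeleton even when the appearance of the hands changes across frames.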
Experimental Validation and Results
The effectiveness of the approach is validated on three datasets: NTU-RGB+D, SBU Kinect Interaction, and MSR Daily Activity 3D. Notably, the model achieved state-of-the-art results on the NTU-RGB+D and SBU datasets and showed competitive performance on MSR Daily Activity 3D. Strong performance was observed in both cross-subject and cross-view scenarios, emphasizing the model's robustness to variations in the data.
Numerical results highlight the practical benefits of integrating pose-conditioned attention. The pose stream alone achieves strong accuracy, surpassing many contemporary pose-based methods. The complete two-stream model, which fuses RGB and pose data, further improves performance, demonstrating the value of multi-modal integration.
Implications and Future Directions
The proposed method contributes significantly to the domain of human activity recognition by efficiently combining pose and RGB data. The introduction of a pose-conditioned attention mechanism marks a methodological advancement, offering a more contextually aware analysis of human activities. The explicit focus on hand-related features using attention mechanisms potentially opens new pathways for applications in detailed action recognition scenarios, such as interaction-level activity understanding in robotics or advanced human-computer interaction systems.
Future research directions may include extending this method to handle additional modalities, such as depth data, or improving computational efficiency for real-time applications. Furthermore, the exploration of transfer learning demonstrated here invites additional investigation into cross-domain or cross-task learning scenarios, expanding the method's applicability in resource-constrained settings.
This paper underlines the utility of a structured approach to action recognition that selectively utilizes both spatial and temporal cues, setting a precedent for future work on context-rich activity recognition systems.