Human Activity Recognition Through Glimpse Clouds
The paper "Glimpse Clouds: Human Activity Recognition from Unstructured Feature Points" presents a novel framework for human activity recognition (HAR) using RGB data, eschewing reliance on pose information during both training and testing. This approach diverges from traditional methods that often incorporate articulated poses or depth data as a primary modality for activity recognition. Instead, the authors leverage a visual attention module to autonomously predict sequences of glimpses—dynamic points of interest—across video frames without enforcing spatial coherence.
Methodological Approach
The core innovation of this research is the transition from structured pose information to an unstructured glimpse-based methodology. The process involves two primary stages:
- Visual Attention and Glimpse Prediction: The visual attention module predicts a sequence of glimpses in each video frame. These glimpses are interest points that the attention mechanism determines, on the fly, to be relevant for recognizing the activity portrayed in the video. Notably, the method imposes no spatial or temporal constraints on glimpse locations, leaving the model free to adaptively focus on different points across frames (a minimal sketch of such a differentiable glimpse crop follows this list).
- Distributed Tracking and Recognition: To interpret the resulting unstructured sequences, the authors deploy recurrent tracking/recognition workers. These distributed entities process the glimpses, jointly performing motion tracking and activity prediction. Glimpses are soft-assigned to workers through an external memory module, which encourages coherence of the assignments across spatial, temporal, and feature spaces. Importantly, the allocation is non-discrete: each glimpse can contribute to multiple workers in varying degrees (see the assignment sketch after this list).
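To make the glimpse mechanism concrete, here is a minimal PyTorch sketch of a differentiable glimpse crop in the spirit of the attention module: a small head predicts a location and scale from a frame's feature map, and an affine grid samples the corresponding patch, spatial-transformer style. The class name `GlimpseSampler`, the pooled-feature location head, and all shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlimpseSampler(nn.Module):
    """Hypothetical module: predicts a glimpse (x, y, scale) from a frame's
    feature map and extracts the patch with a differentiable affine crop."""
    def __init__(self, feat_channels: int, glimpse_size: int = 7):
        super().__init__()
        self.glimpse_size = glimpse_size
        self.loc_head = nn.Linear(feat_channels, 3)  # -> (x, y, scale)

    def forward(self, feat_map: torch.Tensor) -> torch.Tensor:
        # feat_map: (B, C, H, W) convolutional features of one frame.
        B, C, _, _ = feat_map.shape
        pooled = feat_map.mean(dim=(2, 3))       # (B, C) global context
        params = self.loc_head(pooled)
        xy = torch.tanh(params[:, :2])           # glimpse centre in [-1, 1]
        scale = torch.sigmoid(params[:, 2])      # zoom factor in (0, 1)
        # Affine matrix for a scaled, translated crop around (x, y).
        theta = torch.zeros(B, 2, 3, device=feat_map.device)
        theta[:, 0, 0] = scale
        theta[:, 1, 1] = scale
        theta[:, :, 2] = xy
        size = (B, C, self.glimpse_size, self.glimpse_size)
        grid = F.affine_grid(theta, size, align_corners=False)
        # Bilinear sampling keeps the crop differentiable w.r.t. (x, y, scale).
        return F.grid_sample(feat_map, grid, align_corners=False)  # (B, C, g, g)
```

Because the crop is produced by bilinear sampling, gradients flow back into the location head, so where to look can be learned end to end from the recognition loss.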
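The soft-assignment stage can be sketched in the same spirit. Below, each worker derives a key from its recurrent state, similarities between keys and glimpse features yield a softmax distribution over workers for every glimpse, and each worker updates on its weighted mixture of the frame's glimpses. The paper's external memory module is simplified away here (the worker's own hidden state stands in for it), so this is an assumption-laden illustration of the routing idea, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftWorkerAssignment(nn.Module):
    """Hypothetical module: routes each glimpse feature to every recurrent
    worker with a soft weight derived from glimpse/worker-state similarity."""
    def __init__(self, feat_dim: int, hidden_dim: int, num_workers: int = 3):
        super().__init__()
        self.key = nn.Linear(hidden_dim, feat_dim)  # worker state -> key
        self.workers = nn.ModuleList(
            [nn.GRUCell(feat_dim, hidden_dim) for _ in range(num_workers)]
        )

    def forward(self, glimpses: torch.Tensor, states: torch.Tensor) -> torch.Tensor:
        # glimpses: (B, G, D) features of the G glimpses in the current frame.
        # states:   (W, B, H) hidden states of the W workers.
        keys = self.key(states)                             # (W, B, D)
        sim = torch.einsum('wbd,bgd->wbg', keys, glimpses)  # (W, B, G)
        # Soft, non-discrete assignment: each glimpse spreads over workers.
        weights = F.softmax(sim, dim=0)                     # sums to 1 over W
        new_states = []
        for w, cell in enumerate(self.workers):
            # Worker w consumes its weighted mix of the frame's glimpses.
            inp = (weights[w].unsqueeze(-1) * glimpses).sum(dim=1)  # (B, D)
            new_states.append(cell(inp, states[w]))
        return torch.stack(new_states)                      # (W, B, H)
```

Because the softmax runs over workers, the weights for any single glimpse sum to one, matching the non-discrete allocation described above: a glimpse can contribute to several workers at once, and the assignment remains differentiable.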
Key Results and Contributions
The proposed method was evaluated on the NTU RGB+D dataset, at the time the largest available dataset for 3D human activity recognition, and on the smaller Northwestern-UCLA Multiview Action 3D dataset. On both benchmarks, the method outperformed state-of-the-art approaches, including those that use articulated pose or depth modalities at test time, highlighting the efficacy of the unstructured, pose-free approach.
Significant contributions of the paper include:
- Demonstrating a HAR approach that relies solely on raw RGB data at test time, with no pose input required.
- Introducing a framework in which awareness of human structure emerges from training: the attention process learns to focus on key human features without explicit supervision of where to attend.
- Proposing a soft-tracking mechanism in which an external memory module supports the assignment of glimpses to trackers, enhancing the model's flexibility and adaptability in dynamic scenes.
Implications and Future Directions
The implications of this work are twofold. Theoretically, it challenges the assumption that pose data is necessary in HAR systems, urging a shift toward more adaptable attention-based strategies. Practically, it benefits settings where depth sensors or reliable pose estimates are unavailable, such as mobile robotics or other resource-constrained deployments.
Looking forward, further exploration of how the workers' decisions are fused, and of whether workers could be encouraged to specialize, could enhance the robustness and granularity of activity recognition. Integrating the methodology with additional data modalities, or testing how it scales to more complex environments, are likewise promising directions for future research.
The paper represents a substantive advancement toward efficient and adaptable human activity recognition, broadening the potential applications of HAR systems in diverse and dynamic real-world settings.