- The paper introduces a novel H-LSTCM framework that models individual actions with Single-Person LSTMs and aggregates them using a Concurrent LSTM unit.
- It outperforms existing methods on benchmark datasets, reaching up to 98.33% accuracy on UT and demonstrating robust performance across varied interaction scenarios.
- The framework offers significant practical insights for surveillance, gesture recognition, and interactive systems by capturing long-term, inter-related dynamics.
An Overview of Hierarchical Long Short-Term Concurrent Memory for Human Interaction Recognition
The paper "Hierarchical Long Short-Term Concurrent Memory for Human Interaction Recognition" presents a novel framework designed to enhance the recognition of human interactions within video sequences. This approach capitalizes on the limitations of existing RNN models, which typically fail to account for the intricate inter-related dynamics among multiple individuals interacting within a given scene. The authors propose a new model architecture, the Hierarchical Long Short-Term Concurrent Memory (H-LSTCM), which aims to address these deficiencies by leveraging the strengths of Long Short-Term Memory (LSTM) networks.
Methodology
The paper highlights a limitation of standard LSTM-based methods: they treat a multi-person interaction either as a single holistic activity or as a set of independent individual actions, without exploiting the dynamic interdependencies inherent in multi-person interactions. The H-LSTCM framework introduces an architecture with two major components. First, the model employs Single-Person LSTM units to capture the temporal dynamics of each individual's actions within a scene, so that each interacting person's specific motion dynamics are learned in isolation.
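As a minimal sketch of this first stage, assuming a PyTorch implementation (the class and variable names below are illustrative, not from the paper), each tracked person's per-frame feature sequence would be encoded by its own LSTM before any aggregation:

```python
import torch
import torch.nn as nn

class SinglePersonEncoder(nn.Module):
    """Encodes one person's per-frame features with an LSTM."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, person_features: torch.Tensor) -> torch.Tensor:
        # person_features: (batch, time, feat_dim) for a single person.
        hidden_states, _ = self.lstm(person_features)
        return hidden_states  # (batch, time, hidden_dim)

# Each of the K people in a scene gets an independent temporal encoding:
#   encodings = [encoder(seq) for seq in per_person_sequences]
```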
Subsequently, these per-person dynamics are processed by a newly designed Concurrent LSTM (Co-LSTM) unit. This component is where the paper breaks new ground: it consists of multiple sub-memory units, each capturing one person's motion information, while the Co-LSTM aggregates these individual streams to model their concurrent, inter-related dynamics. Together, the sub-memory units and the co-memory cell capture how an interaction evolves over time, something a conventional single-chain LSTM does not model.
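The following is a schematic sketch of such a cell under our own simplifying assumptions; the paper's exact gating equations differ, and names like `CoLSTMCell` are hypothetical. Each person k has a sub-memory gated jointly by that person's hidden state and the shared hidden state, and the co-memory aggregates the sub-memories:

```python
import torch
import torch.nn as nn

class CoLSTMCell(nn.Module):
    """Schematic Co-LSTM cell: per-person sub-memories plus a shared co-memory."""

    def __init__(self, person_dim: int, hidden_dim: int, num_persons: int):
        super().__init__()
        # Per-person gates for the sub-memory units (shared weights here
        # for brevity; per-person weights are another plausible choice).
        self.input_gate = nn.Linear(person_dim + hidden_dim, hidden_dim)
        self.forget_gate = nn.Linear(person_dim + hidden_dim, hidden_dim)
        self.cell_proj = nn.Linear(person_dim + hidden_dim, hidden_dim)
        # Output gate reads all person states plus the shared hidden state.
        self.output_gate = nn.Linear(num_persons * person_dim + hidden_dim, hidden_dim)

    def forward(self, person_hiddens, h, c_subs):
        # person_hiddens: list of K tensors (batch, person_dim),
        #   one per Single-Person LSTM at this time step.
        # h: shared hidden state (batch, hidden_dim).
        # c_subs: list of K sub-memory cells (batch, hidden_dim).
        new_subs = []
        for h_k, c_k in zip(person_hiddens, c_subs):
            z = torch.cat([h_k, h], dim=-1)
            i = torch.sigmoid(self.input_gate(z))
            f = torch.sigmoid(self.forget_gate(z))
            g = torch.tanh(self.cell_proj(z))
            new_subs.append(f * c_k + i * g)  # update person k's sub-memory
        # Co-memory: aggregate the gated sub-memories across persons.
        c = torch.stack(new_subs, dim=0).sum(dim=0)
        o = torch.sigmoid(self.output_gate(torch.cat(person_hiddens + [h], dim=-1)))
        h_new = o * torch.tanh(c)  # shared hidden state for the group
        return h_new, new_subs
```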
Key Results
The efficacy of H-LSTCM is validated through extensive experiments on four public datasets: BIT-Interaction (BIT), UT-Interaction (UT), the Collective Activity Dataset (CAD), and the Volleyball Dataset (VD). These datasets span a range of interaction settings, from simple two-person interactions to more complex group activities. The numerical results show that H-LSTCM outperforms existing methods and competitive baselines, including traditional CNN-LSTM hybrids and state-of-the-art RNN-based approaches.
On the BIT dataset, H-LSTCM achieves an accuracy of 94.03%, surpassing prior techniques such as those of Donahue et al. and Ke et al., which report notably lower accuracies. Similarly, on the UT dataset the model reaches 98.33%, significantly outperforming earlier methods built on handcrafted features or traditional machine learning. Results on the group-activity datasets, CAD and VD, show consistent improvements as well, demonstrating the model's ability to scale to more complex interaction scenarios.
Theoretical and Practical Implications
The theoretical implications of the H-LSTCM framework are significant: it provides a structured approach to capturing long-term dependencies and inter-related dynamics in multi-person interaction scenarios. This enables more accurate modeling of interactions, which is crucial for surveillance, video analysis, and human-computer interaction applications. The Co-LSTM unit is a meaningful extension of standard LSTM capabilities and opens a pathway for future research into more sophisticated memory-based structures that handle multi-agent dynamics efficiently.
Practically, the H-LSTCM can be readily incorporated into systems where understanding human behavior and interactions is essential, such as smart video surveillance, gesture recognition for interactive systems, or advanced driver-assistance systems in which pedestrian interactions must be modeled.
Future Directions
Future developments in this area might include optimizing the Co-LSTM for real-time applications, integrating advanced tracking algorithms, and addressing challenges such as occlusion and varying environmental conditions. An especially intriguing avenue is integrating attention mechanisms into the Co-LSTM to further enhance its discriminative power; one possible form of this idea is sketched below.
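As a purely hypothetical illustration (this extension is not part of the paper), attention could replace the uniform aggregation of sub-memories with learned weights, so that more informative participants contribute more to the co-memory:

```python
import torch
import torch.nn as nn

class AttentiveAggregation(nn.Module):
    """Hypothetical attention-weighted pooling of Co-LSTM sub-memories."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, sub_memories: torch.Tensor) -> torch.Tensor:
        # sub_memories: (num_persons, batch, hidden_dim).
        # Softmax over the person axis yields per-person attention weights.
        weights = torch.softmax(self.score(sub_memories), dim=0)
        return (weights * sub_memories).sum(dim=0)  # (batch, hidden_dim)
```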
In conclusion, the paper offers substantial contributions to the field of activity recognition by addressing key limitations with an innovative hierarchical LSTM-based network architecture. The paper effectively demonstrates the model's advantages through robust empirical evaluations, paving the way for continued advancements in human interaction recognition.