
Unstructured Human Activity Detection from RGBD Images (1107.0169v2)

Published 1 Jul 2011 in cs.RO and cs.CV

Abstract: Being able to detect and recognize human activities is essential for several applications, including personal assistive robotics. In this paper, we perform detection and recognition of unstructured human activity in unstructured environments. We use a RGBD sensor (Microsoft Kinect) as the input sensor, and compute a set of features based on human pose and motion, as well as based on image and pointcloud information. Our algorithm is based on a hierarchical maximum entropy Markov model (MEMM), which considers a person's activity as composed of a set of sub-activities. We infer the two-layered graph structure using a dynamic programming approach. We test our algorithm on detecting and recognizing twelve different activities performed by four people in different environments, such as a kitchen, a living room, an office, etc., and achieve good performance even when the person was not seen before in the training set.

Citations (565)

Summary

  • The paper introduces a hierarchical MEMM to segment and recognize sub-activities in cluttered settings using a two-layered graph structure.
  • It leverages RGB and depth cues from Kinect sensors to extract pose, motion, and HOG features for robust human activity detection.
  • Experimental results show 84.7% precision on seen subjects and promising generalization to new individuals in realistic environments.

Unstructured Human Activity Detection from RGBD Images

The paper, "Unstructured Human Activity Detection from RGBD Images," introduces a novel approach for detecting and recognizing human activities using RGBD data from a Microsoft Kinect sensor. This work primarily addresses the challenges of interpreting human actions in unstructured and cluttered environments, which are typical of daily settings like homes and offices. The proposed method is particularly relevant for applications in personal assistive robotics, where understanding human activities can significantly enhance interaction with humans.

Methodology

The authors employ a hierarchical maximum entropy Markov model (MEMM) to model human activities, which inherently involve a sequence of sub-activities. This approach captures the natural hierarchical nature of tasks such as cooking or brushing teeth, which are composed of discrete, identifiable actions. The MEMM is adapted to consider the dynamics of an activity through a two-layered graph structure, allowing for on-the-fly adjustments to variations in task speed and style using dynamic programming.
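The core of an MEMM is a locally normalized conditional model: the next sub-activity distribution is a maximum-entropy (softmax) function of the previous sub-activity and the current observation. The sketch below illustrates that local model only; the sub-activity labels, feature dimensions, and weights are toy placeholders, not the paper's actual parameterization.

```python
# Illustrative MEMM local transition model (not the paper's exact features):
# P(s_t | s_{t-1}, o_t) is a softmax over linear scores of a one-hot previous
# state concatenated with the current observation vector.
import numpy as np

SUB_ACTIVITIES = ["reach", "pour", "drink"]  # toy sub-activity labels

def memm_transition_probs(prev_idx, obs, weights):
    """Return P(s_t | s_{t-1}=prev_idx, o_t=obs) for all candidate next states.

    weights: (n_states, n_states + obs_dim); each row scores one next state.
    """
    n = len(SUB_ACTIVITIES)
    prev_onehot = np.eye(n)[prev_idx]
    x = np.concatenate([prev_onehot, obs])
    scores = weights @ x
    scores -= scores.max()            # numerical stability before exp
    p = np.exp(scores)
    return p / p.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3 + 4))       # 3 states, 4-dim toy observation
probs = memm_transition_probs(0, rng.normal(size=4), W)
```

In the hierarchical variant described in the paper, a second layer above these sub-activity nodes models the high-level activity, with the two layers connected in a graph whose structure is itself inferred.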

Features and Inference

The input data come from the Kinect sensor, which provides both RGB images and depth maps. The model extracts features related to human pose and motion, including body-pose features, hand positions, motion descriptors, and Histogram of Oriented Gradients (HOG) features computed on both the image and the depth data. Together, these features let the MEMM robustly distinguish between activities.
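As a rough illustration of the HOG component, the following minimal sketch bins gradient orientations of a single grayscale or depth patch into a normalized histogram. It omits the cell/block structure and normalization scheme of full HOG pipelines; the bin count and helper name are assumptions for illustration.

```python
# Minimal HOG-style descriptor for one patch (simplified: no cells/blocks).
# Works identically on an intensity patch or a depth patch.
import numpy as np

def hog_descriptor(patch, n_bins=9):
    """Normalized histogram of unsigned gradient orientations, weighted by
    gradient magnitude."""
    gy, gx = np.gradient(patch.astype(float))      # image gradients
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)        # unsigned angle in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# Usage: a patch with a vertical intensity ramp concentrates mass in one bin.
ramp = np.outer(np.arange(8.0), np.ones(8))
h = hog_descriptor(ramp)
```

Computing the same descriptor on the depth map gives geometry-sensitive features that complement the appearance-based RGB version.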

In constructing the MEMM, the paper models the probability distributions of activities and their corresponding sub-activities. The inference process utilizes a dynamic programming framework to select the optimal graph structure, which adapts to the variability in human actions.
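The decoding step in such locally normalized sequence models is typically a Viterbi-style dynamic program. The sketch below shows that generic recurrence over per-step conditional log-probability tables; it is a standard Viterbi decode, not the paper's full two-layer structure search, and the table shapes are assumptions.

```python
# Viterbi-style dynamic programming over per-step conditionals
# P(s_t | s_{t-1}, o_t), supplied as log-probability tables.
import numpy as np

def viterbi_decode(log_trans, log_init):
    """Return the most likely state sequence.

    log_trans: (T-1, n, n); log_trans[t, i, j] = log P(s_{t+1}=j | s_t=i, o_{t+1})
    log_init:  (n,) log P(s_0 | o_0)
    """
    T = log_trans.shape[0] + 1
    n = log_init.shape[0]
    dp = np.empty((T, n))
    back = np.zeros((T, n), dtype=int)
    dp[0] = log_init
    for t in range(1, T):
        scores = dp[t - 1][:, None] + log_trans[t - 1]  # (prev, next)
        back[t] = scores.argmax(axis=0)                 # best predecessor
        dp[t] = scores.max(axis=0)
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):                       # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

In the paper's two-layered setting, an outer dynamic program additionally chooses how sub-activity segments attach to activity nodes, which is what lets the graph structure adapt to variation in task speed and style.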

Experimental Results

The paper evaluates the model across twelve different activities performed by four individuals in settings such as kitchens and offices. The results demonstrate a precision/recall of 84.7%/83.2% for scenarios where the person had been seen before during training ("have seen") and 67.9%/55.5% for new subjects ("new person"). These outcomes highlight the model’s capability to generalize across different individuals, albeit with expected reductions in performance when applied to unseen subjects.

Implications and Future Directions

This research contributes substantially to activity recognition in robotics and smart environments, particularly because it relies on affordable sensor technology. The hierarchical MEMM and dynamic graph structuring provide a foundation for future work on adaptive activity recognition, including better handling of occlusion and the integration of contextual object information for improved accuracy.

Further developments could explore the integration of additional sensory modalities or enhance the interpretation of environmental context to improve the system's robustness and applicability in more complex and diverse human settings. As assistive robots become more prevalent, such adaptive and scalable approaches will become increasingly critical.