- The paper introduces a joint model, a Markov Random Field (MRF) trained with a structural SVM (SSVM), that simultaneously detects human sub-activities and object affordances in RGB-D videos.
- The method temporally segments each video and represents objects and sub-activities as nodes, linking them with spatial and temporal edges to capture detailed interactions.
- Evaluations on CAD-60 and CAD-120 show robust performance, reaching 79.4% accuracy in affordance detection and 63.4% in sub-activity labeling on CAD-120.
Learning Human Activities and Object Affordances from RGB-D Videos
Understanding the nuances of human activities and object affordances significantly benefits personal robots operating in human environments. Koppula et al. of Cornell University contribute to this domain by proposing a method that extracts descriptive labels of sub-activities and of their interactions with objects from RGB-D videos. The research models human activities and object affordances jointly as a Markov Random Field (MRF), in which nodes represent objects and sub-activities and edges capture their relationships and their evolution over time.
Methodology
The authors’ approach rests on the premise that human activities are understood better when they are detected jointly with object affordances. To this end, each RGB-D video is segmented temporally; in each segment, nodes are created for the sub-activity and for the objects involved, and these nodes interact through edges representing both spatial and temporal relationships. The resulting labeled graphs, over multiple candidate segmentations, are used to train a structural Support Vector Machine (SSVM).
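In schematic form, the model scores each candidate joint labeling y of the object and sub-activity nodes with a function that decomposes over nodes and the edge types listed under Key Components. The notation below is illustrative shorthand (ours, not the paper's exact symbols):

```latex
E_w(x, y) \;=\; \sum_{i \in \mathcal{V}_{\mathrm{obj}}} w_{o}^{y_i} \cdot \phi_{o}(x_i)
          \;+\; \sum_{j \in \mathcal{V}_{\mathrm{act}}} w_{a}^{y_j} \cdot \phi_{a}(x_j)
          \;+\; \sum_{t \in \mathcal{T}} \; \sum_{(i,j) \in \mathcal{E}_{t}} w_{t}^{(y_i, y_j)} \cdot \phi_{t}(x_i, x_j)
```

Here V_obj and V_act are the object and sub-activity nodes, T ranges over the edge types (object-object, object-sub-activity, and temporal), the phi terms are extracted features, and the weights w are what the SSVM learns; inference selects the labeling y that maximizes this score.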
Key Components:
- MRF Construction: Each temporal segment is represented by a sub-activity node and one node per object (see the graph-construction sketch after this list), while edges model:
- Object-Object interactions,
- Object-Sub-Activity relationships,
- Temporal changes in object affordances and sub-activities.
- Feature Extraction: Features for object nodes include the centroid location, the 2D bounding box, a transformation matrix estimated from SIFT feature matches across consecutive frames, and movement characteristics. Sub-activity features are derived from skeleton tracking in the RGB-D frames, capturing joint positions and body-posture dynamics.
- Temporal Segmentation: Because human activities vary in pace and can overlap, multiple segmentations are hypothesized over each RGB-D sequence, for example uniform segmentations of different sizes and graph-based segmentations driven by Euclidean-distance and rate-of-change cues (a minimal segmentation sketch follows this list). The best segmentation hypothesis is then determined by learning over these multiple segmentations.
- Learning and Inference: Model parameters are learned with the SSVM so that the highest-scoring labeling assigns the correct sub-activities and affordances. End-to-end inference employs mixed-integer programming and graph-cut algorithms to label the segments efficiently (a toy sketch of the objective appears after this list).
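As referenced in the MRF Construction item, the sketch below shows the kind of spatio-temporal graph being described: one sub-activity node and one node per object in every temporal segment, object-object and object-sub-activity edges within a segment, and temporal edges linking consecutive segments. The data layout and identifiers are assumptions made for illustration, not the paper's implementation.

```python
from itertools import combinations

def build_segment_graph(segments):
    """Build a spatio-temporal graph over temporal segments.

    `segments` is a list of dicts like {"activity": feats, "objects": {obj_id: feats}};
    this structure is illustrative, not the paper's data format.
    """
    nodes, edges = [], []
    for t, seg in enumerate(segments):
        act = ("activity", t)
        nodes.append(act)
        obj_nodes = [("object", t, oid) for oid in seg["objects"]]
        nodes.extend(obj_nodes)

        # Spatial edges within the segment: object-object and object-sub-activity.
        edges.extend((u, v, "object-object") for u, v in combinations(obj_nodes, 2))
        edges.extend((o, act, "object-activity") for o in obj_nodes)

        # Temporal edges to the previous segment: sub-activity to sub-activity,
        # and the same tracked object across time.
        if t > 0:
            edges.append((("activity", t - 1), act, "temporal-activity"))
            for oid in seg["objects"]:
                if oid in segments[t - 1]["objects"]:
                    edges.append((("object", t - 1, oid),
                                  ("object", t, oid), "temporal-object"))
    return nodes, edges

# Toy usage: two segments with two tracked objects.
segs = [{"activity": None, "objects": {1: None, 2: None}},
        {"activity": None, "objects": {1: None, 2: None}}]
nodes, edges = build_segment_graph(segs)
print(len(nodes), "nodes,", len(edges), "edges")
```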
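For the Temporal Segmentation item, the minimal sketch below conveys the idea of generating several segmentation hypotheses: uniform splits of different sizes plus a simple change-point rule on frame-to-frame skeleton motion. The threshold, noise scale, and feature choice are assumptions for the toy example, not the paper's settings.

```python
import numpy as np

def uniform_segments(n_frames, size):
    """Split [0, n_frames) into contiguous chunks of roughly `size` frames."""
    bounds = list(range(0, n_frames, size)) + [n_frames]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

def motion_based_segments(joint_positions, threshold=0.3):
    """Start a new segment whenever the summed joint displacement between
    consecutive frames exceeds `threshold` (an illustrative change-point rule)."""
    motion = np.linalg.norm(np.diff(joint_positions, axis=0), axis=2).sum(axis=1)
    cuts = {0, len(joint_positions)}
    cuts.update(t + 1 for t, m in enumerate(motion) if m > threshold)
    cuts = sorted(cuts)
    return list(zip(cuts[:-1], cuts[1:]))

# Multiple segmentation hypotheses over a toy 100-frame skeleton track (15 joints, 3D).
rng = np.random.default_rng(0)
skeleton = np.cumsum(rng.normal(scale=0.01, size=(100, 15, 3)), axis=0)
hypotheses = [uniform_segments(100, s) for s in (10, 20)]
hypotheses.append(motion_based_segments(skeleton))
print("segments per hypothesis:", [len(h) for h in hypotheses])
```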
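Finally, for the Learning and Inference item: the paper relies on mixed-integer programming and graph-cut methods; the toy sketch below only illustrates the objective being optimized, by exhaustively scoring every joint labeling of a two-node graph with node and edge potentials. The labels and potential values are invented for illustration and would come from the learned weights and extracted features in the real model.

```python
from itertools import product

# Tiny example: one sub-activity node and one object node joined by a single edge.
activity_labels = ["reaching", "placing"]
affordance_labels = ["reachable", "placeable"]

# Node potentials (in the learned model these would be w . phi(features)).
node_score = {
    ("activity", "reaching"): 1.2, ("activity", "placing"): 0.4,
    ("object", "reachable"): 0.9,  ("object", "placeable"): 0.3,
}
# Edge potentials for the object-sub-activity edge (compatibility of label pairs).
edge_score = {
    ("reaching", "reachable"): 1.0, ("reaching", "placeable"): -0.5,
    ("placing", "reachable"): -0.2, ("placing", "placeable"): 0.8,
}

def total_score(act_label, obj_label):
    """Sum of node potentials plus the pairwise edge potential."""
    return (node_score[("activity", act_label)]
            + node_score[("object", obj_label)]
            + edge_score[(act_label, obj_label)])

# Exhaustive MAP inference over the toy label space (the paper's MIP/graph-cut
# machinery replaces this brute-force search at realistic problem sizes).
best = max(product(activity_labels, affordance_labels), key=lambda y: total_score(*y))
print("best joint labeling:", best, "score:", total_score(*best))
```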
Results
The proposed methodology was evaluated on the Cornell Activity Dataset (CAD-60) and on CAD-120, a new dataset collected for this research comprising 120 RGB-D video sequences of human activities. On CAD-120, the method achieved:
- 79.4% accuracy in affordance detection,
- 63.4% in sub-activity detection,
- 75.0% in high-level activity classification,
demonstrating robust performance even with subjects not seen in the training data.
Strong Numerical Results:
- CAD-60 Evaluation: The authors' method significantly surpassed baseline performance, demonstrating the superiority of joint modeling of activities and affordances.
- Detailed Analyses: Breakdown of performance underscored the impact of each component (node features, edge features, temporal interactions) on overall accuracy, validating the importance of integrated learning.
Implications and Future Work
The joint modeling of activities and affordances proves to be a practical approach for assistive robotics. By understanding the interactions and transitions between sub-activities and object use, a robot can perform complementary and supportive tasks in a human environment more efficiently. This capability is exemplified in the robotic experiments, where a PR2 robot performed tasks such as cleaning and object manipulation by inferring context from human actions.
Theoretical Implications:
The method advances the integration of context-based learning into robot perception, laying the groundwork for adaptive learning models in dynamic environments.
Future Directions:
Improvements in object tracking, human pose estimation, and task planning would bolster the robustness of the technique. Extensions of this research could involve deploying robots in more complex, unstructured environments and integrating adaptive learning mechanisms that update activity and affordance models in real time from continual robotic experience.
In conclusion, the paper by Koppula et al. represents a substantial step towards intelligent assistive robots that leverage contextual understanding of human activities and affordances acquired from RGB-D videos. The advances in jointly modeling these elements have the potential to significantly improve human-robot interaction in practical applications.