
First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations (1704.02463v2)

Published 8 Apr 2017 in cs.CV

Abstract: In this work we study the use of 3D hand poses to recognize first-person dynamic hand actions interacting with 3D objects. Towards this goal, we collected RGB-D video sequences comprised of more than 100K frames of 45 daily hand action categories, involving 26 different objects in several hand configurations. To obtain hand pose annotations, we used our own mo-cap system that automatically infers the 3D location of each of the 21 joints of a hand model via 6 magnetic sensors and inverse kinematics. Additionally, we recorded the 6D object poses and provide 3D object models for a subset of hand-object interaction sequences. To the best of our knowledge, this is the first benchmark that enables the study of first-person hand actions with the use of 3D hand poses. We present an extensive experimental evaluation of RGB-D and pose-based action recognition by 18 baselines/state-of-the-art approaches. The impact of using appearance features, poses, and their combinations are measured, and the different training/testing protocols are evaluated. Finally, we assess how ready the 3D hand pose estimation field is when hands are severely occluded by objects in egocentric views and its influence on action recognition. From the results, we see clear benefits of using hand pose as a cue for action recognition compared to other data modalities. Our dataset and experiments can be of interest to communities of 3D hand pose estimation, 6D object pose, and robotics as well as action recognition.

First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations

This paper explores the intersection of 3D hand pose estimation and first-person action recognition, proposing a comprehensive benchmark of over 100,000 RGB-D frames. The dataset covers 45 daily hand action categories involving 26 different objects, meticulously annotated with 3D hand poses and, for a subset of sequences, 6D object poses. The annotations were obtained with a motion capture system combining six magnetic sensors and inverse kinematics, a pioneering effort in providing high-quality, real-world hand pose annotations for recognizing first-person dynamic hand actions.
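
To make the annotation format concrete, the minimal Python sketch below shows one plausible in-memory representation of a single annotated frame: an RGB image, a depth map, the 21 3D joint locations, and an optional 6D object pose. The class name, field names, and image sizes are hypothetical illustrations, not the dataset's actual file layout.

```python
import numpy as np

# Hypothetical container for one annotated frame; field names and image
# sizes are illustrative, not the dataset's actual file format.
class FPHAFrame:
    def __init__(self, rgb, depth, hand_pose, object_pose=None):
        self.rgb = rgb                  # (H, W, 3) uint8 color image
        self.depth = depth              # (H, W) uint16 depth map
        self.hand_pose = hand_pose      # (21, 3) float32, 3D joint locations
        self.object_pose = object_pose  # (4, 4) rigid transform encoding the
                                        # 6D object pose, when available

# Example: one frame carrying a 21-joint hand annotation.
frame = FPHAFrame(
    rgb=np.zeros((480, 640, 3), dtype=np.uint8),
    depth=np.zeros((480, 640), dtype=np.uint16),
    hand_pose=np.random.rand(21, 3).astype(np.float32),
)
assert frame.hand_pose.shape == (21, 3)
```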

Contributions

The paper's contributions are three-fold:

  1. Dataset Introduction: A richly annotated dataset is developed, facilitating the study of egocentric hand-object actions and bridging the fields of 3D hand pose estimation and action recognition. This initiative fills a notable gap where accurately annotated real-world hand pose data was scarce.
  2. Action Recognition Evaluation: The authors evaluate 18 baseline and state-of-the-art approaches on the dataset, including RGB-D and pose-based action recognition methods. The approaches are carefully selected to cover a breadth of methodologies and data modalities (a minimal pose-based sketch follows this list).
  3. Hand Pose Exploration: The research assesses state-of-the-art hand pose estimation techniques under the severe occlusions typical of egocentric hand-object interaction and evaluates their influence on action recognition.
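
Recurrent models over per-frame hand poses are among the pose-based baselines the paper evaluates; the sketch below shows a minimal classifier of that flavor. The layer sizes, sequence length, and module name are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

# A minimal pose-based action classifier in the spirit of the recurrent
# baselines evaluated in the paper; sizes are illustrative choices.
class PoseActionLSTM(nn.Module):
    def __init__(self, num_classes=45, num_joints=21, hidden=128):
        super().__init__()
        # Each frame is the flattened 21-joint 3D pose (63 values).
        self.lstm = nn.LSTM(num_joints * 3, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, poses):           # poses: (batch, frames, 63)
        _, (h_n, _) = self.lstm(poses)  # h_n: (1, batch, hidden)
        return self.head(h_n[-1])       # per-sequence class logits

# Example: classify a batch of four 100-frame pose sequences.
model = PoseActionLSTM()
logits = model(torch.randn(4, 100, 63))  # shape: (4, 45)
```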

Experimental Evaluation

The evaluation reveals interesting insights. Notably, hand pose features emerge as a critical cue for action recognition, outperforming both the RGB and depth modalities in accuracy. The paper confirms that combining data modalities improves action recognition, although hand poses offer a distinct advantage in dynamic hand-object manipulation scenarios.

The dataset also provides 6D object poses for a subset of sequences, allowing further investigation into how hand and object pose estimates can jointly inform action recognition. An accompanying analysis emphasizes the complementary nature of hand and object cues, suggesting that combining these features can improve action recognition accuracy.
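
One common way to exploit such complementary cues is late fusion of per-modality classifier scores. The sketch below is a generic illustration of that idea, not the paper's specific combination strategy; the function name and example scores are hypothetical.

```python
import numpy as np

def late_fuse(score_dicts, weights=None):
    """Weighted average of per-modality class scores (a late-fusion sketch;
    the paper's exact combination strategy may differ)."""
    names = list(score_dicts)
    weights = weights or {n: 1.0 / len(names) for n in names}
    fused = sum(weights[n] * np.asarray(score_dicts[n]) for n in names)
    return int(np.argmax(fused))  # index of the fused best class

# Example: hand-pose scores dominate, object cues refine the decision.
pred = late_fuse({
    "hand_pose": [0.1, 0.7, 0.2],    # hypothetical softmax scores
    "object_pose": [0.2, 0.5, 0.3],
})
print(pred)  # -> 1
```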

Implications and Future Directions

This benchmark stands to significantly influence both practical applications and theoretical developments:

  • Practical Implications: Robotics, virtual/augmented reality, hand rehabilitation, and teleoperation systems could leverage these insights to improve interaction precision and system responsiveness by utilizing hand pose data as a key information source.
  • Theoretical Implications: The dataset offers fertile ground for developing new algorithms capable of exploiting the nuanced information captured in hand dynamics during object manipulations, advancing the state of the art in several research areas.
  • Future Developments: Researchers may explore joint hand-object tracking and develop robust models that can better handle occlusion and varied object interaction types, potentially drawing from both machine learning techniques and sensor fusion strategies.

Conclusion

This work advances the study of first-person dynamic hand actions, providing a robust dataset and strong numerical insights through comprehensive method evaluations. The demonstrated benefit of hand pose information for action recognition lays foundations for future explorations into more sophisticated models combining hand, object, and environmental context, and offers a pathway toward more nuanced human-computer interaction across technologically driven domains.

Authors: Guillermo Garcia-Hernando, Shanxin Yuan, Seungryul Baek, Tae-Kyun Kim
Citations: 445