Co-speech Gestures for Human-Robot Collaboration
Introduction
The paper "Co-speech Gestures for Human-Robot Collaboration" presents a detailed paper on enhancing human-robot interaction using a multi-modal communication model. Authored by Akif Ekrekli, Alexandre Angleraud, Gaurang Sharma, and Roel Pieters from Tampere University, this research addresses the limitations of single-modal communication tools in industrial contexts. The primary focus is on achieving more effective task assignment and coordination through the integration of co-speech gestures.
Methodological Contributions
The researchers propose a co-speech gesture model that combines natural human speech, hand gestures, and object detection to facilitate human-robot collaboration. They offer several contributions:
- Development of perception methods for understanding human speech and hand gestures.
- Establishment of a multi-modal model combining speech and gestures for task assignment.
- Comprehensive evaluation of the co-speech model within an industrial use case scenario.
Perception Tools and Multi-modal Methods
The paper emphasizes the limitations of single-modal tools such as speech recognition and gesture detection. Speech can convey complex commands but suffers from recognition latency; gestures, by contrast, can be detected quickly but carry limited information. To overcome these trade-offs, the authors integrate multiple perception tools, including:
- Lightweight OpenPose for detecting human skeletons and gestures.
- Vosk for recognizing a predefined set of speech commands and phrases (a minimal recognition sketch follows this list).
- Detectron2 for object detection within the industrial environment.
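Because Vosk can be constrained to a fixed grammar, the sketch below shows how a small set of predefined phrases might be recognized from a microphone stream. The model path, the example phrase list, and the audio parameters are illustrative assumptions, not the authors' actual configuration.

```python
import json

import pyaudio
from vosk import Model, KaldiRecognizer

# Illustrative command vocabulary; the paper's actual phrases may differ.
COMMAND_PHRASES = ["stop", "continue", "move home", "pick", "place", "hand over", "[unk]"]

# Assumed local path to a downloaded Vosk model.
model = Model("path/to/vosk-model-small-en-us")
# Restricting the recognizer to a small grammar keeps recognition fast and robust.
recognizer = KaldiRecognizer(model, 16000, json.dumps(COMMAND_PHRASES))

audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=16000,
                    input=True, frames_per_buffer=8000)
stream.start_stream()

while True:
    data = stream.read(4000, exception_on_overflow=False)
    if recognizer.AcceptWaveform(data):
        text = json.loads(recognizer.Result()).get("text", "")
        if text and text != "[unk]":
            print("recognized command:", text)  # hand off to the robot control layer
```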
The implementation involves sensor redundancy, sensor multi-modality, and sensor fusion. Emphasis is placed on sensor fusion, where different sensor modalities are combined to trigger a single robot action: for instance, the robot can be instructed to pick a specific object through a speech command complemented by a pointing gesture.
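To make the fusion idea concrete, the following is a minimal geometric sketch: a pointing ray is formed from the detected elbow and wrist keypoints, and the detected object closest to that ray becomes the target of the spoken verb. The function names, the 15 cm distance threshold, and the example coordinates are assumptions for illustration, not the authors' exact fusion method.

```python
import numpy as np

def point_to_ray_distance(origin, direction, point):
    """Perpendicular distance from a 3D point to a ray (origin + t*direction, t >= 0)."""
    d = direction / np.linalg.norm(direction)
    v = point - origin
    t = max(np.dot(v, d), 0.0)           # clamp so only the forward half-line counts
    closest = origin + t * d
    return np.linalg.norm(point - closest)

def fuse_speech_and_pointing(verb, elbow, wrist, objects, max_dist=0.15):
    """Combine a spoken verb with a pointing gesture: the elbow->wrist segment defines
    a pointing ray, and the detected object closest to that ray becomes the target."""
    ray_origin = np.asarray(wrist, dtype=float)
    ray_dir = np.asarray(wrist, dtype=float) - np.asarray(elbow, dtype=float)
    best_name, best_dist = None, float("inf")
    for name, centroid in objects.items():
        dist = point_to_ray_distance(ray_origin, ray_dir, np.asarray(centroid, dtype=float))
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_name is None or best_dist > max_dist:
        return None                       # no object close enough to the pointing ray
    return {"action": verb, "target": best_name}

# Example: "pick" spoken while pointing roughly at the bolt (coordinates in metres)
objects = {"bolt": (0.60, 0.10, 0.02), "bracket": (0.40, -0.25, 0.02)}
command = fuse_speech_and_pointing("pick", elbow=(0.10, 0.05, 0.30),
                                   wrist=(0.25, 0.06, 0.22), objects=objects)
print(command)   # {'action': 'pick', 'target': 'bolt'}
```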
Experimental Validation
For validation, the authors designed an experiment replicating an industrial assembly task. The setup included two Intel RealSense D435 cameras for visual perception and a microphone for speech recognition, with computation handled by an NVIDIA GTX 1080 Ti GPU. The robot used in the experiments was a Franka Emika arm, controlled through ROS.
The experiments covered three interaction types (a hypothetical dispatch sketch follows the list):
- Single command gestures for stopping and continuing robot motions.
- Speech phrases for commanding robot actions like moving to specific locations.
- Co-speech gestures combining speech and pointing to direct robots to perform tasks like picking and placing objects or handing them over to a human.
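One way to picture how these three interaction types fit together is a small dispatch layer that maps the currently recognized gesture and speech phrase to a single robot command; in the reported setup such commands would then be executed by the Franka Emika through ROS. The gesture labels, phrases, action names, and priority rules below are hypothetical placeholders, not the vocabulary used in the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Command:
    action: str                    # e.g. "stop", "continue", "move_home", "pick", "hand_over"
    target: Optional[str] = None   # object name for pick/place/handover commands

def dispatch(gesture: Optional[str], speech: Optional[str],
             pointed_object: Optional[str]) -> Optional[Command]:
    """Map the currently recognized gesture/speech pair to a single robot command.
    Safety-relevant gestures take priority; co-speech commands need both a spoken
    verb and a resolved pointing target."""
    # 1. Single-gesture commands: immediate stop/continue, no speech required.
    if gesture == "open_palm":
        return Command("stop")
    if gesture == "thumbs_up":
        return Command("continue")

    # 2. Speech-only phrases: motions that need no object reference.
    if speech in ("move home", "move to delivery"):
        return Command(speech.replace(" ", "_"))

    # 3. Co-speech commands: a verb plus a pointing gesture resolved to an object.
    if speech in ("pick", "place", "hand over") and pointed_object is not None:
        return Command(speech.replace(" ", "_"), target=pointed_object)

    return None  # nothing actionable recognized in this cycle

# Example: the operator says "pick" while pointing at the detected bracket
print(dispatch(gesture="pointing", speech="pick", pointed_object="bracket"))
```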
Results showed that the co-speech gesture model effectively facilitated human-robot coordination, with the relevant commands executed reliably and efficiently.
Performance Metrics
The detailed performance assessment revealed:
- Detection accuracy of around 90% for both wrist and object detection.
- Consistent speech recognition performance, albeit with some latency, particularly for non-native English speakers.
- Real-time operation with a slight performance drop when combining all perception tools, maintaining a frame rate of approximately 24 FPS for skeleton detection and 4.5 FPS for object detection.
Discussion of Limitations
While the model demonstrated efficacy, some limitations were noted, including:
- Reliance on precise human hand motions for accurate wrist detection.
- Necessity for careful sensor calibration to ensure accurate correspondence between gesture and object detection.
- Latency in speech recognition, which could be further reduced.
Future Directions
The paper points to several practical and theoretical implications:
- Enhanced multi-modal models could significantly improve human-robot collaboration in various industrial applications.
- Future research may focus on reducing latency in speech recognition and exploring alternative gesture recognition models to enhance detection accuracy and inference speed.
- Further integration of depth perception and image fusion techniques to improve object detection precision and robustness.
Conclusion
The paper convincingly demonstrates the potential for multi-modal perception systems in industrial human-robot collaboration. By effectively merging speech, gestures, and object detection, the proposed co-speech gesture model offers a robust framework for more intuitive and efficient human-robot interactions. Practical deployment of such models can pave the way for more sophisticated collaborative robots capable of seamless integration into human-centric workflows.