This paper introduces a sentiment-based engagement strategy for human-robot interaction, leveraging emotion detection and attention analysis to enable robots to select context-aware behaviors. The core idea involves determining a person's inclination towards interaction by analyzing their emotional state and focus of attention, allowing the robot to adjust its behavior accordingly. The goal is to promote more intuitive interactions by aligning the robot's actions with the human's mood and expectations.
Here's a breakdown:
- Problem Statement: The paper addresses the need for robots to proactively approach or evade people depending on context, including a person's mood, attitude, and sentiment towards robots. Estimating willingness to interact is challenging because the relevant social signals are vague. The paper aims to bridge the gap between robots that show no proactive behavior at all and approaches that attempt full-fledged mental-state prediction.
- Proposed Method: The sentiment analysis combines head pose estimation (to gauge visual attention) with emotion estimation. From these features, the system categorizes the human's state and uses that category to select a robot behavior pattern, going beyond binary engage/disengage strategies.
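As a concrete illustration, the categorized state can be represented as a pair of discrete attention and emotion labels. The following Python sketch is hypothetical; the class names and the category granularity are assumptions, not the paper's actual discretization.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Attention(Enum):
    TOWARD_ROBOT = auto()   # head pose roughly frontal to the robot
    AWAY = auto()           # person is looking elsewhere

class Emotion(Enum):
    POSITIVE = auto()
    NEUTRAL = auto()
    NEGATIVE = auto()

@dataclass
class SentimentState:
    """Combined per-person state that drives behavior selection (hypothetical)."""
    attention: Attention
    emotion: Emotion
```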
- Robot Platform: The method is implemented on a TIAGo robot from PAL Robotics, equipped with an RGB-D camera, a manipulator arm, a liftable torso, and speakers. Key robot operations include (a hedged control sketch follows the list):
- Head Following: Tracks a moving person, indicating attentiveness.
- Body Following: Rotates the robot base to keep track of the person.
- Torso Lifting: A subtle cue, similar to standing up while greeting.
- Speech: Explicit communication, used when the person is expected to welcome interaction.
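A minimal rospy sketch of one such operation, the torso-lift cue, is shown below. The topic name, joint name, and target height are assumptions based on common TIAGo conventions, not interfaces confirmed by the paper.

```python
import rospy
from trajectory_msgs.msg import JointTrajectory, JointTrajectoryPoint

def lift_torso(height_m: float = 0.25, duration_s: float = 2.0) -> None:
    """Raise the torso as a subtle greeting cue (topic/joint names assumed)."""
    pub = rospy.Publisher('/torso_controller/command', JointTrajectory,
                          queue_size=1, latch=True)
    traj = JointTrajectory()
    traj.joint_names = ['torso_lift_joint']
    point = JointTrajectoryPoint()
    point.positions = [height_m]
    point.time_from_start = rospy.Duration(duration_s)
    traj.points = [point]
    pub.publish(traj)

if __name__ == '__main__':
    rospy.init_node('torso_lift_demo')
    lift_torso()
    rospy.sleep(1.0)  # keep the node alive long enough for message delivery
```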
- Visual Attention Estimation: Instead of gaze estimation, which is error-prone, the method relies on head pose estimation: an ultra-light SSD face detector locates the face, and 6DRepNet predicts its yaw, pitch, and roll angles. To ensure real-time processing, the original 6DRepNet backbone is replaced with MobileNetV3-Small, which maintains 90% of the accuracy with only one-tenth of the parameters.
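For illustration, the predicted angles could be thresholded into a binary attention label as follows; the angle limits are illustrative assumptions, not values reported in the paper.

```python
YAW_LIMIT_DEG = 25.0    # assumed tolerance for left/right head rotation
PITCH_LIMIT_DEG = 20.0  # assumed tolerance for up/down head rotation

def is_attending(yaw_deg: float, pitch_deg: float) -> bool:
    """Treat a roughly frontal head pose as visual attention toward the robot."""
    return abs(yaw_deg) <= YAW_LIMIT_DEG and abs(pitch_deg) <= PITCH_LIMIT_DEG
```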
- Emotion Estimation: The method uses automatic facial expression recognition to gather information about the human interaction partner, employing a multitask network trained on the AffectNet database. The network is trained to jointly predict basic emotion categories and continuous valence-arousal (VA) values.
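The training objective is not spelled out here, but a typical multitask formulation for such networks (a plausible reconstruction, not necessarily the paper's exact loss) combines a classification and a regression term:

$$\mathcal{L} = \mathcal{L}_{\text{emo}} + \lambda\,\mathcal{L}_{\text{VA}}$$

where $\mathcal{L}_{\text{emo}}$ is a cross-entropy loss over the basic emotion classes, $\mathcal{L}_{\text{VA}}$ is a regression loss on the predicted valence-arousal values, and $\lambda$ is an assumed task-weighting hyperparameter.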
- Sentiment Analysis and Approaching Strategy: By combining the estimated emotion and attention states, the system selects one of four behaviors: Engage, Attract, Avoid, or Ignore.
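The resulting decision logic can be sketched as a lookup table. The specific emotion/attention pairings below are illustrative guesses; only the four behavior names come from the paper.

```python
# (emotion, attending): behavior -- an illustrative mapping, not the paper's table.
BEHAVIOR_TABLE = {
    ('positive', True):  'Engage',   # receptive and already attentive
    ('positive', False): 'Attract',  # receptive, but attention is elsewhere
    ('negative', True):  'Avoid',    # attentive, but visibly displeased
    ('negative', False): 'Ignore',   # neither attentive nor receptive
}

def select_behavior(emotion: str, attending: bool) -> str:
    return BEHAVIOR_TABLE.get((emotion, attending), 'Ignore')
```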
- Implementation Details: The system is implemented on top of the Robot Operating System (ROS) middleware. Face detection, head pose estimation, and emotion estimation run in separate nodes, and the neural networks are executed on a separate notebook with a Quadro RTX 4000 GPU.
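The node decoupling could look like the following rospy skeleton; the topic names and the String placeholder message type are assumptions, since the paper's concrete interfaces are not described here.

```python
import rospy
from std_msgs.msg import String  # placeholder message type

class SentimentNode:
    """Skeleton of a node that fuses head-pose and emotion estimates."""
    def __init__(self):
        self.behavior_pub = rospy.Publisher('/behavior', String, queue_size=1)
        rospy.Subscriber('/head_pose', String, self.on_head_pose)

    def on_head_pose(self, msg):
        # In the real system, this would be combined with the latest emotion
        # estimate before publishing Engage / Attract / Avoid / Ignore.
        self.behavior_pub.publish(String(data='Engage'))

if __name__ == '__main__':
    rospy.init_node('sentiment_analysis')
    SentimentNode()
    rospy.spin()
```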
- Experimental Results: Initial experiments in a laboratory setting showed that the models provide robust predictions, but the system relies on single-shot detections, which can lead to abrupt state changes.
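One common mitigation for such flicker, not part of the paper's system, is to smooth the per-frame predictions over a sliding window, e.g. by majority vote:

```python
from collections import deque, Counter

class MajorityVoteFilter:
    """Smooth per-frame state predictions over a sliding window to damp the
    abrupt state changes caused by single-shot detections (a possible
    mitigation, not described in the paper)."""
    def __init__(self, window: int = 10):
        self.history = deque(maxlen=window)

    def update(self, state: str) -> str:
        self.history.append(state)
        return Counter(self.history).most_common(1)[0][0]

f = MajorityVoteFilter(window=5)
print([f.update(s) for s in ['Engage', 'Ignore', 'Engage', 'Engage']])
# -> ['Engage', 'Engage', 'Engage', 'Engage']: the brief 'Ignore' flicker is damped
```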