MindEye-OmniAssist: A Gaze-Driven LLM-Enhanced Assistive Robot System for Implicit Intention Recognition and Task Execution (2503.13250v1)

Published 17 Mar 2025 in cs.RO and cs.HC

Abstract: Gaze-based control is a promising and effective form of human-robot interaction for assistive robotic systems. However, current gaze-based assistive systems mainly help users with basic grasping actions and offer limited support. Moreover, restricted intent recognition capability constrains the range of assistance functions such systems can provide. In this paper, we propose an open implicit intention recognition framework powered by an LLM and a Vision Foundation Model (VFM), which processes gaze input and recognizes user intents that are not confined to predefined or specific scenarios. Furthermore, we implement a gaze-driven LLM-enhanced assistive robot system (MindEye-OmniAssist) that recognizes the user's intentions through gaze and assists in completing tasks. To achieve this, the system uses an open-vocabulary object detector, an intention recognition network, and an LLM to infer the user's full intention. By integrating eye-movement feedback and the LLM, it generates action sequences that assist the user in completing tasks. Real-world experiments were conducted on assistive tasks, and the system achieved an overall success rate of 41/55 across various undefined tasks. Preliminary results show that the proposed method has the potential to provide a more user-friendly human-computer interaction interface and to significantly enhance the versatility and effectiveness of assistive systems by supporting more complex and diverse tasks.

Summary

  • The paper introduces MindEye-OmniAssist, a gaze-driven, LLM-enhanced assistive robot system that uses implicit intention recognition to interpret user commands for task execution.
  • The system architecture integrates gaze tracking with computer vision (YOLO-World) and an LLM (DeepSeek-R1) to understand user intent from eye movements, featuring a gaze-based confirmation step.
  • Experimental results show high success rates in gaze recognition and LLM-based planning, with task failures primarily attributed to limitations in robotic arm execution accuracy and environmental interaction.

The paper "MindEye-OmniAssist: A Gaze-Driven LLM-Enhanced Assistive Robot System for Implicit Intention Recognition and Task Execution" (2503.13250) introduces a novel assistive robotic system that leverages gaze tracking, LLMs, and VFMs to enable more intuitive and versatile human-robot interaction. The core innovation lies in its open implicit intention recognition framework, which moves beyond predefined scenarios to interpret user intents directly from gaze input.

System Architecture and Components

MindEye-OmniAssist integrates a head-mounted eye tracker for gaze input, collaborative robotic arms for task execution, and cameras for visual scene understanding. The software architecture is modular, comprising three key components:

  • Gaze-based Implicit Intention Recognition: This module employs YOLO-World, an open-vocabulary object detector, to identify objects within the user's field of view. A custom intention recognition network, built from multi-scale convolutional layers, positional encoders, Transformer encoders, and fully connected layers, processes gaze data to classify the user's intention to interact with each detected object (a sketch of such a network appears after this list). The objects of interest are then passed to an LLM (DeepSeek-R1) to infer the comprehensive user intention. This stage is critical for translating low-level gaze data into high-level semantic understanding.
  • User Intention Confirmation: To ensure accuracy and user control, the system adds a confirmation step: it vocalizes its interpretation of the user's intention, and the user responds by shifting their gaze to designated "Agree" or "Reject" areas overlaid on the scene image captured by the eye tracker (see the confirmation sketch after this list). This feedback loop improves the robustness of the intention recognition process.
  • Motion Planning: Once the user's intention is confirmed, the system uses the LLM to generate the sequence of robotic arm actions required to fulfill the task. The sequence is composed of primitives such as object localization, grasping, placing, moving, and pouring, which are executed through custom-developed APIs that interface with the robotic arm controllers (see the planning sketch after this list).
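
To make the intention recognition network concrete, the following PyTorch sketch assembles the components named above (multi-scale convolutions, positional encoding, a Transformer encoder, and a fully connected head) into a per-object gaze classifier. The layer sizes, input features, and sequence handling are illustrative assumptions, not the authors' configuration.

```python
# Hypothetical sketch of the gaze intention classifier; sizes and feature
# layout are assumptions, not the paper's actual configuration.
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        return x + self.pe[:, : x.size(1)]

class GazeIntentionNet(nn.Module):
    """Multi-scale conv + Transformer encoder over a gaze feature sequence.

    Input: per-timestep features such as gaze (x, y) relative to a detected
    object's bounding box plus dwell/velocity cues -> (batch, seq, in_dim).
    Output: probability that the user intends to interact with that object.
    """
    def __init__(self, in_dim=4, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        # Parallel convolutions with different kernel sizes capture short
        # fixations and longer dwell patterns at multiple temporal scales.
        self.branches = nn.ModuleList([
            nn.Conv1d(in_dim, d_model // 2, kernel_size=k, padding=k // 2)
            for k in (3, 7)
        ])
        self.pos_enc = PositionalEncoding(d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.head = nn.Sequential(nn.Linear(d_model, 32), nn.ReLU(),
                                  nn.Linear(32, 1))

    def forward(self, gaze_seq):                # (batch, seq, in_dim)
        x = gaze_seq.transpose(1, 2)            # (batch, in_dim, seq)
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        feats = feats.transpose(1, 2)           # (batch, seq, d_model)
        feats = self.encoder(self.pos_enc(feats))
        return torch.sigmoid(self.head(feats.mean(dim=1)))  # (batch, 1)

# Example: score a 60-sample gaze window against one detected object.
net = GazeIntentionNet()
p = net(torch.randn(1, 60, 4))                  # tensor of shape (1, 1) in [0, 1]
```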
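
The confirmation step can be approximated as a dwell-based check on the eye tracker's scene-image coordinates. The overlay geometry, dwell threshold, and get_gaze_point callback below are hypothetical placeholders, not the system's actual interface.

```python
# Hypothetical gaze confirmation loop; box positions and thresholds are assumed.
import time

AGREE_BOX = (0, 0, 200, 150)          # (x_min, y_min, x_max, y_max) in scene image
REJECT_BOX = (1720, 0, 1920, 150)

def _inside(point, box):
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def confirm_intention(get_gaze_point, dwell_s=1.0, timeout_s=10.0):
    """Return True if the user dwells on 'Agree', False on 'Reject',
    or None if neither area is selected before the timeout."""
    dwell_start = {"agree": None, "reject": None}
    t_end = time.time() + timeout_s
    while time.time() < t_end:
        gaze = get_gaze_point()        # latest eye-tracker sample, or None
        for name, box in (("agree", AGREE_BOX), ("reject", REJECT_BOX)):
            if gaze is not None and _inside(gaze, box):
                if dwell_start[name] is None:
                    dwell_start[name] = time.time()
                elif time.time() - dwell_start[name] >= dwell_s:
                    return name == "agree"
            else:
                dwell_start[name] = None
        time.sleep(0.02)
    return None
```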
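
The motion-planning stage can be pictured as prompting the LLM for a structured action sequence and dispatching each step to the arm's primitive-action APIs. The prompt wording, JSON schema, and arm.* method names below are assumptions standing in for the paper's custom interfaces and its DeepSeek-R1 integration.

```python
# Hypothetical planning/dispatch sketch; prompt, schema, and arm API are assumed.
import json

PLAN_PROMPT = """You control an assistive robot arm.
User intention: {intention}
Detected objects: {objects}
Reply with a JSON list of steps, each of the form
{{"action": "locate" | "grasp" | "place" | "move" | "pour",
  "target": "<object>", "destination": "<object or null>"}}."""

def plan_actions(llm_complete, intention, objects):
    """Ask the LLM for an action sequence and parse it (assumes raw JSON reply)."""
    reply = llm_complete(PLAN_PROMPT.format(intention=intention,
                                            objects=", ".join(objects)))
    return json.loads(reply)

def execute_plan(arm, plan):
    """Map each planned step onto the arm's primitive-action API."""
    dispatch = {
        "locate": lambda s: arm.locate(s["target"]),
        "grasp":  lambda s: arm.grasp(s["target"]),
        "place":  lambda s: arm.place(s["target"], s.get("destination")),
        "move":   lambda s: arm.move(s.get("destination")),
        "pour":   lambda s: arm.pour(s["target"], s.get("destination")),
    }
    for step in plan:
        dispatch[step["action"]](step)
```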

Experimental Evaluation and Results

The system was evaluated in real-world experiments involving a variety of assistive tasks, including fetching objects, manipulating objects (e.g., placing one object inside another), watering plants, toggling switches, and pouring liquids. The experiments demonstrated an overall task success rate of 41 out of 55 attempts. Further analysis revealed high success rates in the individual recognition (52/55) and planning (50/52) stages, highlighting the effectiveness of the gaze-based intention recognition and LLM-driven motion planning components.

Limitations and Future Directions

The paper identifies limitations primarily related to the robotic arm action execution. Task failures were often attributed to issues such as the robotic arm's inability to avoid obstacles and limitations in execution accuracy. The authors also note that the system currently lacks real-time environmental information integration, which is crucial for detailed action planning and robust obstacle avoidance. Another limitation stems from the behavior cloning approach used to train the basic actions of the system, which restricts the diversity of the action set. Future work will focus on incorporating real-time environmental feedback, enhancing the robotic arm's dexterity and control precision, and expanding the range of available actions to improve the system's overall reliability and versatility.
