
Gaze-Guided Robotic Manipulation

Updated 19 March 2026
  • Gaze-guided robotic manipulation refers to systems that use eye gaze to drive robotic actions, relying on precise calibration and gaze-to-world mapping.
  • The approach integrates direct gaze-to-motion mapping with vision-language models and imitation learning to segment tasks and enhance precision.
  • Advances focus on reducing calibration drift, addressing depth ambiguity, and incorporating multimodal feedback to improve usability and accuracy.

Gaze-guided robotic manipulation refers to systems in which eye gaze is the principal input modality, used either as a direct teleoperation command or as an intent-rich signal for user-robot interaction, autonomous policy learning, shared control, or skill segmentation. These systems use real-time gaze tracking to drive robotic arms in tasks ranging from assistive pick-and-place to high-precision assembly and dexterous manipulation, and to segment demonstrations for imitation learning. Approaches span direct gaze-to-motion mapping, fused vision-language intent inference, transformer-based imitation frameworks, and active vision with learned gaze policies.

1. Gaze Sensing, Calibration, and Gaze-to-World Mapping

Gaze-guided manipulation systems depend on the accuracy, latency, and robustness of both the eye tracking itself and the mapping from gaze to robot-relevant coordinates. Key methods include:

  • Wearable Eye Trackers and MR Headsets: Modern systems deploy devices like Tobii Pro Glasses (Pro 2/3), Meta Aria glasses, or HTC Focus MR headsets. Sampling rates range from 50 to 120 Hz, with angular accuracy often better than 1°, and spatial precision on the order of 1–2 cm (Baptista et al., 6 May 2025, Shahid et al., 25 Jul 2025, Lin et al., 25 Oct 2025).
  • Calibration and Homography: A typical calibration sequence aligns the 2D gaze point with known planar targets or AprilTags, deriving a homography $H$ that warps raw gaze to the interface or workspace. With points written in homogeneous coordinates,

$$p'_i = H\,p_i, \quad H \in \mathbb{R}^{3 \times 3},$$

maps each point $p_i$ from the camera/image frame to the physical interface or robot base frame. For 3D tasks, depth sensors (e.g., RealSense D435i) are used to lift 2D gaze to a 3D point-of-regard (Shafti et al., 2018, Wang et al., 2018).

  • Error Sources and Correction: Drift, partial marker loss, and resolution limitations are frequent error sources. Best practice includes multi-fiducial calibration, on-the-fly recalibration, and fusion with pose tracking (Baptista et al., 6 May 2025, Sardinha et al., 2024).
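
The homography calibration described above can be illustrated with a minimal sketch. It assumes matched point pairs are already available (for example, fiducial corners seen in the scene camera and their known workspace positions) and estimates $H$ with the standard direct linear transform. The function names `estimate_homography` and `map_gaze` and the sample coordinates are hypothetical, and the cited systems may use a different estimator.

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate a 3x3 H with dst ~ H @ src via the direct linear transform.

    src, dst: sequences of N >= 4 matched 2D points (no three collinear).
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    # Remove the arbitrary scale (assumes H[2, 2] is not zero).
    return H / H[2, 2]

def map_gaze(H, point):
    """Apply H to one 2D gaze point using homogeneous coordinates."""
    x, y = point
    u, v, w = H @ np.array([x, y, 1.0])
    return np.array([u / w, v / w])

# Hypothetical calibration data: marker corners in image pixels and the same
# corners in workspace metres.
image_pts = [(100, 100), (500, 120), (480, 400), (90, 380)]
world_pts = [(0.0, 0.0), (0.4, 0.0), (0.4, 0.3), (0.0, 0.3)]
H = estimate_homography(image_pts, world_pts)
gaze_in_workspace = map_gaze(H, (300, 250))
```

With more than four correspondences the same code returns a least-squares fit, which is one way multi-fiducial calibration can reduce the effect of a single badly detected marker.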

Typical performance:

2. Gaze-to-Action Pipelines: Architectures and Control Strategies

System architectures for gaze-guided manipulation can be broadly categorized:

Motion planning is often hybrid: gaze selects a target, while fine alignment and collision-free trajectories are planned via standard motion-planners (MoveIt!, RRT-Connect, etc.) (Lin et al., 25 Oct 2025, Shahid et al., 25 Jul 2025).
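
The target-selection half of this hybrid scheme can be sketched as follows, under stated assumptions: the gaze pixel is lifted to 3D with a pinhole camera model and a depth reading, and the nearest known object centroid within a tolerance is chosen as the target. The intrinsics `fx`, `fy`, `cx`, `cy`, the `max_dist_m` tolerance, and the object dictionary are all hypothetical, and the hand-off to a motion planner is deliberately left out.

```python
import math

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Lift a gaze pixel (u, v) with depth in metres to a 3D point in the
    camera frame, using the pinhole model."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

def select_target(point, objects, max_dist_m=0.05):
    """Return the name of the object centroid nearest to `point`, or None
    if no centroid lies within max_dist_m.

    objects: dict mapping name -> (x, y, z), in the same frame as `point`.
    """
    best_name, best_dist = None, max_dist_m
    for name, centroid in objects.items():
        d = math.dist(point, centroid)
        if d <= best_dist:
            best_name, best_dist = name, d
    return best_name

# Hypothetical scene: gaze at the image centre, 0.5 m away.
por = backproject(320, 240, 0.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
scene = {"cup": (0.01, 0.0, 0.5), "bottle": (0.20, 0.0, 0.5)}
target = select_target(por, scene)
```

In a full system the selected target would still need to be transformed into the robot base frame before being passed to the planner.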

3. Gaze-Guided Imitation and Skill Learning

Gaze forms the backbone of several contemporary imitation learning and task decomposition pipelines:

  • Task Decomposition via Gaze Transitions: Detecting transitions in fixated landmarks—using median filtering, position and CLIP-feature change detection—yields subtask boundaries. This enables per-subtask policy learning and uniform skill segmentation (Takizawa et al., 25 Jan 2025).
  • Dual-Resolution and Foveated Imitation Learning: Human gaze is used to extract high-resolution “foveal” crops for fine actions, while low-resolution peripheral vision supports coarse reaching. Neural architectures such as dual-branch convolutional nets or foveated transformers model this fast/slow dichotomy, reducing computation while improving data efficiency and precision (Kim et al., 2021, Chuang et al., 21 Jul 2025, Kerr et al., 12 Jun 2025).
  • Gaze Prediction as Robotics Memory: Transformer-based sequential models learn to recall past-gaze-driven representations, enabling robots to attend correctly in multi-object tasks with spatial memory demands (Kim et al., 2022).
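
The gaze-transition idea in the first item above can be sketched in a simplified form. The example below median-filters a 2D gaze trace to suppress isolated outliers and then marks a boundary wherever the filtered gaze jumps by more than a threshold. It covers only the position-change part; the CLIP-feature change detection is omitted. The parameters `window`, `jump_px`, and `min_gap` are hypothetical, and this is not the cited authors' implementation.

```python
from statistics import median

def median_filter(values, window=5):
    """Sliding median with an odd window; the edges use a truncated window."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

def subtask_boundaries(gaze_xy, window=5, jump_px=80.0, min_gap=10):
    """Return frame indices where the filtered gaze moves more than jump_px
    between consecutive frames, keeping boundaries at least min_gap apart."""
    xs = median_filter([p[0] for p in gaze_xy], window)
    ys = median_filter([p[1] for p in gaze_xy], window)
    boundaries = []
    for i in range(1, len(xs)):
        jump = ((xs[i] - xs[i - 1]) ** 2 + (ys[i] - ys[i - 1]) ** 2) ** 0.5
        if jump > jump_px and (not boundaries or i - boundaries[-1] >= min_gap):
            boundaries.append(i)
    return boundaries

# Hypothetical trace: 20 frames on one landmark, then 20 on another,
# with a single tracking glitch at frame 5.
trace = [(100, 100)] * 20 + [(400, 300)] * 20
trace[5] = (900, 900)
print(subtask_boundaries(trace))  # prints [20]
```

The glitch at frame 5 is removed by the median filter, so only the genuine landmark change at frame 20 is reported. Each resulting segment could then be used to train a per-subtask policy.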

Key empirical results:

4. Intent Inference, Activity Recognition, and Shared Control

A core capability is the prompt and accurate inference of user intent based on gaze and other behavioral signals:

  • Probabilistic and HMM Models: Gaussian HMMs incorporating area-of-interest (AOI) gaze likelihoods, hand trajectory features, and grasp-trigger signals can predict intended pick/place targets with >80% F1 score and >70% earliness (typically ~1 s before action) (Belardinelli et al., 2022).
  • Vision-Language Semantic Grounding: Integration of VLMs with gaze enables zero-shot intent inference and skill sequencing, outperforming panel-based or gaze-only baseline interfaces in task speed and user effort (Tay et al., 8 Jan 2026).
  • Context-Aware Grammar Parsing: FSMs use gaze-recognized object selections and context to parse user intent into sequences such as Reach–Grasp–Place or more complex grammars (e.g., Pour) (Shafti et al., 2018).
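
The grammar-parsing item above can be illustrated with a small finite-state sketch, assuming each gaze selection has already been resolved to a named object and each object is labelled either "graspable" or "location". The two-state machine, the labels, and the action tuples are hypothetical simplifications; richer grammars such as Pour would need additional states.

```python
from enum import Enum

class State(Enum):
    IDLE = "idle"        # gripper empty
    HOLDING = "holding"  # gripper holds an object

def parse_intent(selections, object_kinds):
    """Turn a sequence of gaze-selected names into action primitives.

    selections: names chosen by gaze, in order.
    object_kinds: dict mapping name -> "graspable" or "location".
    Selections that are not valid in the current state are ignored.
    """
    state, held, actions = State.IDLE, None, []
    for name in selections:
        kind = object_kinds.get(name)
        if state is State.IDLE and kind == "graspable":
            actions += [("reach", name), ("grasp", name)]
            state, held = State.HOLDING, name
        elif state is State.HOLDING and kind == "location":
            actions.append(("place", held, name))
            state, held = State.IDLE, None
    return actions

kinds = {"cup": "graspable", "tray": "location"}
plan = parse_intent(["tray", "cup", "cup", "tray"], kinds)
# plan == [("reach", "cup"), ("grasp", "cup"), ("place", "cup", "tray")]
```

Ignoring selections that do not fit the current state is one simple way to use context: looking at the tray with an empty gripper does nothing, while the same glance after a grasp triggers a place.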

Notable features:

5. System Performance, Usability, and Human Factors

Performance metrics and human studies in the literature report:

Representative Table: System Latency and Performance

| System | Latency (excl. dwell) | Grasp/Intent Accuracy | Gaze Error |
|---|---|---|---|
| MIHRaGe (Baptista et al., 6 May 2025) | 100–150 ms | 80–95% | 1.3–2.1 cm |
| RaycastGrasp (Lin et al., 25 Oct 2025) | 200–300 ms post-dwell | 88–94% | 5 cm |
| GazeGrasp (Tokmurziyev et al., 13 Jan 2025) | ~50 ms per frame | N/A (RTM) | N/A |
| NG-EG (Wang et al., 2022) | ~3 ms gaze | 90–95% | ≪ 1 cm* |

*Reported "low error," but not explicitly quantified.

6. Limitations, Challenges, and Best Practices

The following system-level limitations and recommendations recur across the literature:

  • Calibration Drift and Depth Ambiguity: Vision-only or 2D projection approaches are limited by calibration stability, absence of depth feedback, and lack of closed-loop servoing for final alignment (Baptista et al., 6 May 2025, Lin et al., 25 Oct 2025). Best practice includes continuous recalibration, depth or stereo integration, and feedback overlays.
  • Visual Disambiguation and Clutter Handling: Systems reliant on bounding-box inclusion or centroid-mapping struggle in cluttered scenes. Instance segmentation (Lin et al., 25 Oct 2025) and vision-language reasoning (Tay et al., 8 Jan 2026) are promising, though computational cost and inference latency remain open challenges.
  • Expressiveness and Interaction Bandwidth: Dwell-based selection, while robust, can limit interaction rate and user comfort. Extension to multimodal interfaces (haptic, EMG, blinking) may enhance expressiveness (Tokmurziyev et al., 13 Jan 2025, Sardinha et al., 2024).
  • Physical GUI Scalability: Physical or diegetic button interfaces are directly intuitive but scale poorly to high-DOF control and risk occlusion (Sardinha et al., 2024).

Recommended architectural best practices include closed-loop feedback, adaptive parameters (dwell time, fixation radius), hybrid planning, and robust continuous calibration (Baptista et al., 6 May 2025, Sardinha et al., 2024).
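
The two adaptive parameters named above, dwell time and fixation radius, can be made concrete with a minimal dwell detector. This is a sketch under assumptions: gaze samples arrive as `(timestamp_s, x, y)` tuples in time order, and the defaults `dwell_s` and `radius_px` are hypothetical values that a real interface would tune per user and per task.

```python
import math

def detect_dwell(samples, dwell_s=0.8, radius_px=40.0):
    """Return (t, x, y) for the first dwell, or None if there is none.

    A dwell fires once gaze has stayed within radius_px of the point where
    the current fixation began for at least dwell_s seconds. The returned
    t is the time the dwell fired; x, y is where the fixation began.
    """
    anchor = None  # (t, x, y) at the start of the candidate fixation
    for t, x, y in samples:
        moved = anchor is None or math.hypot(x - anchor[1], y - anchor[2]) > radius_px
        if moved:
            anchor = (t, x, y)  # gaze left the radius: restart the timer
        elif t - anchor[0] >= dwell_s:
            return (t, anchor[1], anchor[2])
    return None

# Hypothetical 8 Hz stream: a short glance at (0, 0), then a steady
# fixation near (200, 200) with one pixel of jitter.
stream = [(i * 0.125, 0, 0) for i in range(4)]
stream += [(i * 0.125, 200 + (i % 2), 200) for i in range(4, 14)]
print(detect_dwell(stream, dwell_s=0.75))  # prints (1.25, 200, 200)
```

Shortening `dwell_s` raises the interaction rate at the cost of more unintended selections, which is the trade-off behind the recommendation to adapt it.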

7. Future Directions

Emerging frontiers and active research topics in gaze-guided robotic manipulation include:

  • Robotic Self-Gaze and Active Vision: Systems where the robot actively chooses its own gaze fixations in service of manipulation (via RL/BC, e.g., EyeRobot) are yielding emergent hand–eye coordination and superior coverage of large or panoptic workspaces with foveal policies (Kerr et al., 12 Jun 2025, Chuang et al., 21 Jul 2025).
  • Foundation Models and Semantic Reasoning: Integration of VLMs or foundation models for zero-shot intent inference from gaze, coupled with dynamic skill composition, is demonstrating generalizable, natural, and scalable interfaces, albeit with high inference latency and unresolved issues in fine-grained geometry (Tay et al., 8 Jan 2026).
  • Generalization and Robustness: Bottleneck-based architectures, out-of-distribution testing, and foveated transformer policies are pushing reusability, precision, and robustness, especially for complex bimanual, high-DOF, and memory-based manipulation (Takizawa et al., 25 Feb 2025, Chuang et al., 21 Jul 2025, Kim et al., 2022).
  • Assistive and Accessibility Applications: Ongoing efforts are focused on accessibility, low-cost deployment, and the adaptation of gaze-guided interfaces for populations with severe motor impairments, integrating AR, physical GUIs, and multimodal feedback for daily living assistance (Sharma et al., 2020, Sardinha et al., 2024).

This field is rapidly evolving, with a trend toward combining user-intent-rich gaze signals, real-time visual perception, high-level semantic reasoning, and robust closed-loop control to enable natural, generalizable human–robot interaction.
