Gaze-Guided Robotic Manipulation

Updated 19 March 2026

Gaze-guided robotic manipulation uses eye gaze to drive robot actions, relying on precise calibration and gaze-to-world mapping.
The approach integrates direct gaze-to-motion mapping with vision-language models and imitation learning to segment tasks and enhance precision.
Advances focus on reducing calibration drift, addressing depth ambiguity, and incorporating multimodal feedback to improve usability and accuracy.

Gaze-guided robotic manipulation refers to systems in which eye gaze is the principal input modality, either as a direct teleoperation command or as an intent-rich signal for user-robot interaction, autonomous policy learning, shared control, or skill segmentation. These systems use real-time gaze tracking to drive robotic arms in tasks ranging from assistive pick-and-place to high-precision assembly and dexterous manipulation, and to segment demonstrations for imitation learning. Approaches span direct gaze-to-motion mapping, fused vision-language intent inference, transformer-based imitation frameworks, and active vision with learned gaze policies.

1. Gaze Sensing, Calibration, and Gaze-to-World Mapping

Gaze-guided manipulation systems depend critically on the accuracy, latency, and robustness of both eye tracking and the mapping from gaze to robot-relevant coordinates. Key methods include:

Wearable Eye Trackers and MR Headsets: Modern systems deploy devices like Tobii Pro Glasses (Pro 2/3), Meta Aria glasses, or HTC Focus MR headsets. Sampling rates range from 50 to 120 Hz, with angular accuracy often better than 1°, and spatial precision on the order of 1–2 cm (Baptista et al., 6 May 2025, Shahid et al., 25 Jul 2025, Lin et al., 25 Oct 2025).
Calibration and Homography: A typical calibration sequence aligns 2D gaze points with known planar targets or AprilTags, deriving a homography $H$ that warps raw gaze to the interface or workspace. In homogeneous coordinates (equality holds up to a scale factor), the mapping

$p'_i = H\,p_i, \quad H \in \mathbb{R}^{3 \times 3}\,,$

takes points from the camera/image frame to the physical interface or robot base frame, provided the targets lie on a plane. For 3D tasks, depth sensors (e.g., RealSense D435i) are used to lift 2D gaze to a 3D point-of-regard (Shafti et al., 2018, Wang et al., 2018).
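The two mapping steps above can be illustrated with a minimal sketch: applying a homography to a 2D gaze point, then lifting the result to 3D with a pinhole camera model. The matrix `H`, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the depth value are hypothetical numbers chosen for illustration; they are not taken from any of the cited systems, and a real pipeline would estimate `H` from calibration targets and read depth from the sensor.

```python
import numpy as np

def apply_homography(H, p):
    """Map a 2D point p = (x, y) through a 3x3 homography H.

    The point is lifted to homogeneous coordinates, multiplied by H,
    and normalised by the third component.
    """
    x, y, w = H @ np.array([p[0], p[1], 1.0])
    return np.array([x / w, y / w])

def backproject(pixel, depth_m, fx, fy, cx, cy):
    """Lift a pixel (u, v) with metric depth to a 3D point in the camera frame.

    Assumes an ideal pinhole camera without lens distortion.
    """
    u, v = pixel
    return np.array([(u - cx) * depth_m / fx,
                     (v - cy) * depth_m / fy,
                     depth_m])

# Hypothetical calibration result: scale by 2 and shift by (10, 20) pixels.
H = np.array([[2.0, 0.0, 10.0],
              [0.0, 2.0, 20.0],
              [0.0, 0.0, 1.0]])

gaze_px = apply_homography(H, (100.0, 50.0))
point_3d = backproject(gaze_px, depth_m=0.8,
                       fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

The resulting `point_3d` is expressed in the depth camera frame; a further rigid transform (not shown) would be needed to express it in the robot base frame.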

Error Sources and Correction: Drift, partial marker loss, and resolution limitations are frequent error sources. Best practice includes multi-fiducial calibration, on-the-fly recalibration, and fusion with pose tracking (Baptista et al., 6 May 2025, Sardinha et al., 2024).

Typical performance:

Mean fixation error: 1.3–4.7 cm depending on system and calibration (Baptista et al., 6 May 2025, Shafti et al., 2018).
End-to-end latency (excluding dwell): 20–300 ms (Lin et al., 25 Oct 2025, Tokmurziyev et al., 13 Jan 2025, Wang et al., 2022).

2. Gaze-to-Action Pipelines: Architectures and Control Strategies

System architectures for gaze-guided manipulation can be broadly categorized:

Direct Gaze-to-Motion Mapping: A gaze fixation is mapped to a 2D/3D point that triggers robot motion (e.g., Cartesian reaching), possibly augmented with dwell-based selection or diegetic GUIs (Sardinha et al., 2024, Lin et al., 25 Oct 2025).
Object-Aware Intent Recognition: Gaze is mapped to detected object bounding boxes or segments; the selection is confirmed via dwell time, and a manipulation primitive (pick, place) is then executed (Shahid et al., 25 Jul 2025, Lin et al., 25 Oct 2025). Advanced variants use MR passthrough with raycasting for precise selection (Lin et al., 25 Oct 2025).
Semantic Reasoning with Foundation Models: A fixation selects one or more regions in the scene, which are processed by a vision-language model (VLM, e.g., Gemini Pro) to infer high-level intent, from which a sequence of parameterized robot skills is composed (Tay et al., 8 Jan 2026).
Feedback and Assistive Mechanisms: Most systems overlay gaze cursors and robot intent indicators in MR or video streams; audio or haptic feedback may be used to confirm selections and reduce ambiguity (Baptista et al., 6 May 2025, Shahid et al., 25 Jul 2025).

Motion planning is often hybrid: gaze selects a target, while fine alignment and collision-free trajectories are computed with standard motion-planning tools (e.g., MoveIt, RRT-Connect) (Lin et al., 25 Oct 2025, Shahid et al., 25 Jul 2025).

3. Gaze-Guided Imitation and Skill Learning

Gaze forms the backbone of several contemporary imitation learning and task decomposition pipelines:

Task Decomposition via Gaze Transitions: Detecting transitions in fixated landmarks, using median filtering and change detection on gaze position and CLIP features, yields subtask boundaries. This enables per-subtask policy learning and consistent skill segmentation (Takizawa et al., 25 Jan 2025).
Dual-Resolution and Foveated Imitation Learning: Human gaze is used to extract high-resolution “foveal” crops for fine actions, while low-resolution peripheral vision supports coarse reaching. Neural architectures such as dual-branch convolutional nets or foveated transformers model this fast/slow dichotomy, reducing computation while improving data efficiency and precision (Kim et al., 2021, Chuang et al., 21 Jul 2025, Kerr et al., 12 Jun 2025).
Gaze Prediction as Robotics Memory: Transformer-based sequential models learn to recall past-gaze-driven representations, enabling robots to attend correctly in multi-object tasks with spatial memory demands (Kim et al., 2022).
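As a rough illustration of gaze-transition segmentation, the sketch below median-filters a 2D gaze trace and marks a subtask boundary wherever the smoothed gaze jumps by more than a threshold. It covers only the position-change part of the idea; the CLIP-feature comparison is omitted, and the window size `k`, the threshold `jump_px`, and the function names are assumptions made for this example rather than values from the cited work.

```python
import numpy as np

def median_filter(gaze, k=5):
    """Per-axis sliding median over a (T, 2) gaze trace; k must be odd."""
    pad = k // 2
    # Repeat the first/last sample so the output has the same length as the input.
    padded = np.pad(gaze, ((pad, pad), (0, 0)), mode="edge")
    windows = np.stack([padded[i:i + len(gaze)] for i in range(k)], axis=0)
    return np.median(windows, axis=0)

def segment_boundaries(gaze, jump_px=50.0, k=5):
    """Return frame indices at which the smoothed gaze jumps to a new location."""
    smooth = median_filter(np.asarray(gaze, dtype=float), k)
    step = np.linalg.norm(np.diff(smooth, axis=0), axis=1)
    # step[i] is the displacement between frames i and i + 1.
    return (np.flatnonzero(step > jump_px) + 1).tolist()

# Synthetic trace: 20 frames on one landmark, 20 on another, plus one outlier.
trace = np.array([[100.0, 100.0]] * 20 + [[400.0, 300.0]] * 20)
trace[10] = [900.0, 900.0]  # single-frame tracking glitch
boundaries = segment_boundaries(trace)
```

The median filter suppresses the single-frame glitch, so only the genuine gaze shift at frame 20 is reported as a boundary. Each resulting segment could then be used to train a separate subtask policy.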

Key empirical results:

Perfect or near-perfect subtask segmentation after simple gaze-based processing (Takizawa et al., 25 Jan 2025).
Foveated ViTs achieve >90% success in high-precision tasks and a 7× training speedup compared to uniform ViTs (Chuang et al., 21 Jul 2025).
Gaze-based decomposition and foveation enable out-of-distribution generalization, outperforming non-gaze approaches (Takizawa et al., 25 Feb 2025).
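The dual-resolution input described earlier in this section can be sketched as follows: a full-resolution crop centred on the gaze point serves as the foveal view, and a strided subsample of the whole frame serves as the peripheral view. The crop size, the stride, and the function name are hypothetical choices for illustration, and the sketch assumes the image is at least as large as the crop; the cited architectures use their own, more elaborate foveation schemes.

```python
import numpy as np

def dual_resolution_views(image, gaze_xy, fovea=64, stride=4):
    """Split an (H, W, C) image into a foveal crop and a coarse peripheral view."""
    h, w = image.shape[:2]
    half = fovea // 2
    # Clamp the crop centre so the window stays fully inside the image.
    cx = int(np.clip(gaze_xy[0], half, w - half))
    cy = int(np.clip(gaze_xy[1], half, h - half))
    foveal = image[cy - half:cy + half, cx - half:cx + half]
    # Naive subsampling; a real system would low-pass filter before decimating.
    peripheral = image[::stride, ::stride]
    return foveal, peripheral

frame = np.zeros((240, 320, 3), dtype=np.uint8)
foveal, peripheral = dual_resolution_views(frame, gaze_xy=(310, 5))
```

With these settings the policy network processes 64 × 64 plus 60 × 80 pixels instead of the full 240 × 320 frame, which is the source of the computational saving such designs aim for.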

4. Intent Inference, Activity Recognition, and Shared Control

A core capability is the prompt and accurate inference of user intent based on gaze and other behavioral signals:

Probabilistic and HMM Models: Gaussian HMMs incorporating area-of-interest (AOI) gaze likelihoods, hand trajectory features, and grasp-trigger signals can predict intended pick/place targets with >80% F1 score and >70% earliness (typically ~1 s before action) (Belardinelli et al., 2022).
Vision-Language Semantic Grounding: Integration of VLMs with gaze enables zero-shot intent inference and skill sequencing, outperforming panel-based or gaze-only baseline interfaces in task speed and user effort (Tay et al., 8 Jan 2026).
Context-Aware Grammar Parsing: FSMs use gaze-recognized object selections and context to parse user intent into sequences such as Reach–Grasp–Place or more complex grammars (e.g., Pour) (Shafti et al., 2018).
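The finite-state parsing of gaze selections into Reach–Grasp–Place sequences can be illustrated with a toy two-state machine. The state names, the "object"/"location" selection kinds, and the transition table are assumptions made for this sketch and are much simpler than the grammars in the cited work.

```python
class PickPlaceFSM:
    """Toy parser: object selection -> reach + grasp, then location -> reach + place."""

    # (current state, selection kind) -> (next state, primitives to emit)
    TRANSITIONS = {
        ("idle", "object"): ("holding", ["reach", "grasp"]),
        ("holding", "location"): ("idle", ["reach", "place"]),
    }

    def __init__(self):
        self.state = "idle"

    def on_selection(self, kind, name):
        """Consume one confirmed gaze selection and return the primitives to run."""
        key = (self.state, kind)
        if key not in self.TRANSITIONS:
            return []  # selection is not valid in the current context; ignore it
        self.state, primitives = self.TRANSITIONS[key]
        return [(p, name) for p in primitives]

fsm = PickPlaceFSM()
plan = []
for kind, name in [("object", "cup"), ("object", "bowl"), ("location", "tray")]:
    plan += fsm.on_selection(kind, name)
```

Here the second object selection is ignored because the gripper is already holding the cup, which shows how context constrains the interpretation of a gaze selection.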

Notable features:

Early-stage prediction is dominated by gaze fixations; motion features enhance performance but are less discriminative on their own (Belardinelli et al., 2022).
Visual feedback and multimodal confirmation reduce cognitive load and unpredictability (Baptista et al., 6 May 2025, Sardinha et al., 2024).

5. System Performance, Usability, and Human Factors

Performance metrics and human studies in the literature report:

Accuracy and Task Completion: Gaze-based systems routinely achieve pick/place or grasp/intent recognition rates of roughly 85–95%, even for users with impairments (Baptista et al., 6 May 2025, Lin et al., 25 Oct 2025, Wang et al., 2022). Placement and dynamic manipulation are more error-prone because 2D-only systems lack depth estimation (Baptista et al., 6 May 2025).
Usability Metrics: System Usability Scores (SUS) in the range 75–81 and NASA-TLX workload measures ~45 indicate good usability and moderate input effort, with minimal prior training required (Sardinha et al., 2024, Shahid et al., 25 Jul 2025).
Latency and User Fatigue: Dwell times (1–3 s) are standard to ensure intentional selection but can lead to perceptual delay and eye fatigue; adaptive dwell or gesture-based alternatives (e.g., blink, EMG) are proposed as remedies (Baptista et al., 6 May 2025, Tokmurziyev et al., 13 Jan 2025).
Participant Demographics: Studies involve up to 30 able-bodied participants (Shahid et al., 25 Jul 2025, Sardinha et al., 2024) and smaller groups (N=4–13) in assistive or impairment contexts (Baptista et al., 6 May 2025, Tokmurziyev et al., 13 Jan 2025), including users with severe speech and motor impairment (SSMI) (Sharma et al., 2020).

Representative Table: System Latency and Performance

System	Latency (excl. dwell)	Grasp/Intent Acc.	Gaze Error
MIHRaGe (Baptista et al., 6 May 2025)	100–150 ms	80–95%	1.3–2.1 cm
RaycastGrasp (Lin et al., 25 Oct 2025)	200–300 ms post-dwell	88–94%	5 cm
GazeGrasp (Tokmurziyev et al., 13 Jan 2025)	~50 ms per frame	N/A (RTM)	N/A
NG-EG (Wang et al., 2022)	~3 ms gaze	90–95%	≪ 1 cm*

*Reported "low error," but not explicitly quantified.

6. Limitations, Challenges, and Best Practices

The following system-level limitations and recommendations recur across the literature:

Calibration Drift and Depth Ambiguity: Vision-only or 2D projection approaches are limited by calibration stability, absence of depth feedback, and lack of closed-loop servoing for final alignment (Baptista et al., 6 May 2025, Lin et al., 25 Oct 2025). Best practice includes continuous recalibration, depth or stereo integration, and feedback overlays.
Visual Disambiguation and Clutter Handling: Systems reliant on bounding-box inclusion or centroid-mapping struggle in cluttered scenes. Instance segmentation (Lin et al., 25 Oct 2025) and vision-language reasoning (Tay et al., 8 Jan 2026) are promising, though computational cost and inference latency remain open challenges.
Expressiveness and Interaction Bandwidth: Dwell-based selection, while robust, can limit interaction rate and user comfort. Extension to multimodal interfaces (haptic, EMG, blinking) may enhance expressiveness (Tokmurziyev et al., 13 Jan 2025, Sardinha et al., 2024).
Physical GUI Scalability: Physical or diegetic button interfaces are directly intuitive but scale poorly to high-DOF control and risk occlusion (Sardinha et al., 2024).

Recommended architectural best practices include closed-loop feedback, adaptive parameters (dwell time, fixation radius), hybrid planning, and robust continuous calibration (Baptista et al., 6 May 2025, Sardinha et al., 2024).

7. Future Directions

Emerging frontiers and active research topics in gaze-guided robotic manipulation include:

Robotic Self-Gaze and Active Vision: Systems where the robot actively chooses its own gaze fixations in service of manipulation (via RL/BC, e.g., EyeRobot) are yielding emergent hand–eye coordination and superior coverage of large or panoptic workspaces with foveal policies (Kerr et al., 12 Jun 2025, Chuang et al., 21 Jul 2025).
Foundation Models and Semantic Reasoning: Integration of VLMs or foundation models for zero-shot intent inference from gaze, coupled with dynamic skill composition, is demonstrating generalizable, natural, and scalable interfaces, albeit with high inference latency and unresolved issues in fine-grained geometry (Tay et al., 8 Jan 2026).
Generalization and Robustness: Bottleneck-based architectures, out-of-distribution testing, and foveated transformer policies are pushing reusability, precision, and robustness, especially for complex bimanual, high-DOF, and memory-based manipulation (Takizawa et al., 25 Feb 2025, Chuang et al., 21 Jul 2025, Kim et al., 2022).
Assistive and Accessibility Applications: Ongoing efforts are focused on accessibility, low-cost deployment, and the adaptation of gaze-guided interfaces for populations with severe motor impairments, integrating AR, physical GUIs, and multimodal feedback for daily living assistance (Sharma et al., 2020, Sardinha et al., 2024).

This field is rapidly evolving, with a trend toward combining user-intent-rich gaze signals, real-time visual perception, high-level semantic reasoning, and robust closed-loop control to enable natural, generalizable human–robot interaction.