Gaze-Based Controls in HCI
- Gaze-based controls are interaction modalities that use eye movements to indicate user intent and facilitate selection in digital and physical environments.
- Modern systems integrate dedicated and webcam-based eye tracking, advanced calibration, and filtering techniques to achieve sub-degree accuracy and real-time responsiveness.
- Applications span assistive technology, augmented/virtual reality, robotics, aviation, and medical imaging, providing hands-free control and reducing user fatigue.
Gaze-based controls are a class of human-computer interaction (HCI) modalities in which eye gaze is used as an explicit or implicit input to direct, select, or manipulate digital or physical systems. The approach leverages the human visual system's natural ability to indicate attention, intention, or selection targets, integrating eye-tracking hardware and software into diverse user interfaces ranging from ambient intelligence environments and assistive robotics to augmented/virtual reality (AR/VR, XR), flight controls, and interactive content generation. The following sections summarize the foundational concepts, principal algorithms, user interface paradigms, application domains, technical challenges, and future research opportunities in gaze-based controls, as synthesized from the primary research literature.
1. Core Principles and Modalities of Gaze-Based Control
Gaze-based interaction modalities can be broadly classified into explicit and implicit controls. In explicit gaze-based control, the user's point of regard directly determines the system's input, such as pointing, selection, or command activation through gaze fixations, dwell time, crossing gestures, or multimodal gaze-combined triggers (0708.3505, Rajanna et al., 2018, Tokmurziyev et al., 12 Jul 2024). Implicit gaze-based control infers the user's internal cognitive or affective state, intent, or error occurrence from natural gaze behavior, enabling context-aware or predictive system adaptations (Sendhilnathan et al., 22 May 2024).
Two additional distinctions clarify gaze-driven input schemes:
- Primary control: Gaze acts as the main or only input modality, suitable for users with motor or speech impairments or for hands-free scenarios (Sharma et al., 2020, Tokmurziyev et al., 13 Jan 2025).
- Supplementary control: Gaze disambiguates or refines multimodal commands, as in gaze-speech systems where gaze spatially resolves deixis ("that", "there") in spoken utterances (0708.3505, Rajanna et al., 2018).
Table 1: Summary of Gaze Control Approaches

| Approach   | Mechanism                    | Example Use                |
|------------|------------------------------|----------------------------|
| Dwell-time | Fixating for T > threshold   | Map zoom, text entry       |
| Crossing   | Gaze moves across boundary   | Circular menus, selection  |
| Gaze+Foot  | Gaze = pointer + foot acts   | Point-and-click, typing    |
| Gesture    | Saccade/scanpath patterns    | Authentication, shortcuts  |
| Implicit   | Unconscious gaze metrics     | Error detection, workload  |
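As a concrete illustration of the dwell-time row in Table 1, the following minimal Python sketch accumulates fixation time on a target and fires a selection once a configurable threshold is exceeded. The `GazeSample` structure, the 40-pixel tolerance radius, and the 0.8 s threshold are illustrative assumptions, not values prescribed by any of the cited systems.

```python
import math
from dataclasses import dataclass

# Illustrative parameters; real systems tune these per user and display.
DWELL_THRESHOLD_S = 0.8   # dwell time required to trigger a selection
TOLERANCE_PX = 40         # radius within which gaze counts as "on target"

@dataclass
class GazeSample:
    x: float          # screen x in pixels
    y: float          # screen y in pixels
    timestamp: float  # seconds

class DwellSelector:
    """Fires a selection when gaze stays within TOLERANCE_PX of a target
    for at least DWELL_THRESHOLD_S seconds."""

    def __init__(self, target_xy):
        self.target_xy = target_xy
        self.dwell_start = None

    def update(self, sample: GazeSample) -> bool:
        dx = sample.x - self.target_xy[0]
        dy = sample.y - self.target_xy[1]
        on_target = math.hypot(dx, dy) <= TOLERANCE_PX

        if not on_target:
            self.dwell_start = None      # gaze left the target: reset timer
            return False
        if self.dwell_start is None:
            self.dwell_start = sample.timestamp
        # Selection triggers once accumulated dwell exceeds the threshold.
        return (sample.timestamp - self.dwell_start) >= DWELL_THRESHOLD_S
```

Resetting the timer whenever gaze leaves the tolerance radius is what distinguishes dwell selection from a simple cumulative-time counter and is one of the basic defenses against the Midas Touch problem discussed in Section 3.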
2. Sensing, Tracking, and Calibration Techniques
Modern gaze-based control systems deploy either dedicated eye-tracking hardware (infrared, near-infrared stereo, or monocular cameras) or software-based approaches leveraging standard RGB webcams (Sharma et al., 2020, Sharma et al., 2020, Tokmurziyev et al., 13 Jan 2025). Hardware systems may achieve sub-degree accuracy (<0.5° visual angle) with robust calibration, supporting real-time fixation detection without constraining the user's head (0708.3505, Murthy et al., 2020). Webcam-based methods use facial landmark extraction, Histogram of Oriented Gradients (HoG), deep learning (e.g., OpenFace CLNF), or appearance-based CNNs to estimate (x, y) gaze positions with average errors ranging from 1.8–6 cm on consumer hardware (Sharma et al., 2020, Sharma et al., 2020). For 3D gaze estimation, systems integrate monocular or stereo eye tracking with RGB-D or depth sensors, mapping gaze rays into scene coordinates by fusing camera pose (EPnP+RANSAC), head orientation, and scene reconstruction (Wang et al., 2018, Shafti et al., 2018).
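The 3D mapping step described above can be illustrated with a minimal geometric sketch: a gaze ray expressed in the eye tracker's frame is transformed into world coordinates using a camera pose (rotation R and translation t, assumed here to come from an upstream PnP-style estimator) and intersected with a known planar surface. This is a simplified stand-in for the fused pipelines in the cited work; the frame names, the planar-scene assumption, and all numbers are illustrative.

```python
import numpy as np

def gaze_ray_to_world(origin_e, direction_e, R_we, t_we):
    """Transform a gaze ray from the eye-tracker frame (e) to the world frame (w).
    R_we: 3x3 rotation, t_we: 3-vector, assumed obtained from a camera-pose
    estimate (e.g., an EPnP+RANSAC solve, not reproduced here)."""
    origin_w = R_we @ origin_e + t_we
    direction_w = R_we @ direction_e
    return origin_w, direction_w / np.linalg.norm(direction_w)

def intersect_ray_plane(origin, direction, plane_point, plane_normal):
    """Return the 3D point where the ray meets the plane, or None if the ray
    is parallel to the plane or points away from it."""
    denom = float(np.dot(plane_normal, direction))
    if abs(denom) < 1e-9:
        return None
    s = float(np.dot(plane_normal, plane_point - origin)) / denom
    return None if s < 0 else origin + s * direction

# Example: a gaze ray looking slightly forward and downward, intersected with
# a table top at z = 0 (all frames and values are illustrative).
R = np.eye(3)
t = np.array([0.0, 0.0, 1.2])                       # eye ~1.2 m above the table
o_w, d_w = gaze_ray_to_world(np.zeros(3), np.array([0.1, 0.0, -1.0]), R, t)
print(intersect_ray_plane(o_w, d_w, np.zeros(3), np.array([0.0, 0.0, 1.0])))
```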
Calibration methods vary from multi-point fixation routines (9- or 35-point polynomial mapping (Tokmurziyev et al., 13 Jan 2025)) to single-point quick calibration strategies (Zeng et al., 2022). For gaze depth/vergence estimation, parameters such as inter-pupillary distance (IPD) are mapped to real-world depth via calibrated regression functions (Wang et al., 2022).
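A common realization of the multi-point calibration described above is a least-squares fit of a second-order polynomial from raw eye features (for instance pupil-glint vectors or normalized landmark coordinates) to screen coordinates. The sketch below, assuming a 9-point routine and only numpy, illustrates the idea without reproducing any specific cited implementation.

```python
import numpy as np

def poly_features(ex, ey):
    """Second-order polynomial basis of a raw eye feature (ex, ey)."""
    return np.array([1.0, ex, ey, ex * ey, ex**2, ey**2])

def fit_calibration(eye_feats, screen_pts):
    """Least-squares mapping from eye features to screen coordinates.
    eye_feats: (N, 2) raw features collected while fixating N targets.
    screen_pts: (N, 2) known target positions (e.g., a 9-point grid)."""
    A = np.array([poly_features(ex, ey) for ex, ey in eye_feats])
    # One coefficient vector per screen axis, solved jointly.
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(screen_pts, float), rcond=None)
    return coeffs   # shape (6, 2)

def map_gaze(coeffs, ex, ey):
    """Apply the fitted polynomial to a new raw eye feature sample."""
    return poly_features(ex, ey) @ coeffs
```

With a 9-point grid the fit is overdetermined (9 equations against 6 unknowns per axis), so the same least-squares solve also averages out per-fixation noise; denser routines such as 35-point grids trade longer calibration sessions for a better-conditioned mapping.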
Gaze sample quality and rate (typically 60–240 Hz), filtering (Kalman, median, or Bézier smoothing), and dwell/crossing event detection are critical for real-time user feedback and control stability (0708.3505, Tokmurziyev et al., 13 Jan 2025, Sardinha et al., 8 Jan 2024).
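Dwell and crossing events are usually derived from detected fixations rather than raw samples. Below is a simplified dispersion-threshold (I-DT style) fixation detector operating on smoothed gaze streams; it is a sketch under the assumption of equal-length numpy input arrays, and the 50 px / 150 ms thresholds are illustrative rather than taken from the cited systems.

```python
import numpy as np

def idt_fixations(xs, ys, ts, dispersion_px=50.0, min_duration_s=0.15):
    """Dispersion-threshold (I-DT style) fixation detection.
    xs, ys, ts: equal-length 1-D numpy arrays of gaze coordinates (px)
    and timestamps (s). Returns (start_t, end_t, centroid_x, centroid_y) tuples."""
    fixations, start, n = [], 0, len(ts)
    while start < n:
        end = start
        # Grow the window while its bounding-box dispersion stays small.
        while end + 1 < n:
            wx, wy = xs[start:end + 2], ys[start:end + 2]
            dispersion = (wx.max() - wx.min()) + (wy.max() - wy.min())
            if dispersion > dispersion_px:
                break
            end += 1
        if ts[end] - ts[start] >= min_duration_s:
            fixations.append((float(ts[start]), float(ts[end]),
                              float(xs[start:end + 1].mean()),
                              float(ys[start:end + 1].mean())))
            start = end + 1   # continue after the detected fixation
        else:
            start += 1        # too short to count as a fixation
    return fixations
```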
3. Gaze-Based User Interface Paradigms
Gaze-based UI paradigms address challenges including:
- The Midas Touch problem: Mitigating unintended activations from casual gaze fixations (0708.3505, Rajanna et al., 2018).
- Disambiguation and precision: Distinguishing between searching (exploratory) and selecting (intentional) gaze, and supporting fine motor control.
Key interface mechanisms include:
- Dwell-time selection: Users select items by maintaining fixation beyond a set threshold (e.g., 500 ms–2 s) (0708.3505, Zeng et al., 2022). Optimization of dwell time (~0.8–1.0 s) balances speed and false detection rate (Zeng et al., 2022).
- Crossing-based selection: Command activation upon gaze crossing out of an item’s boundary, leveraging the natural saccadic eye movement for faster, less fatiguing interactions. Crossing is particularly efficient for experienced users and mitigates dwell-related fatigue (Riou et al., 7 Feb 2025).
- Hierarchical/clustered menus: Used in public kiosks and vending interfaces to spatially separate actionable regions, reducing accidental selections (Zeng et al., 2022).
- Gaze-contingent displays: Dynamic adjustment of display resolution or focus based on gaze (e.g., foveated rendering with high-resolution regions tied to gaze coordinates) (0708.3505).
- Multimodal integration: Separating pointing (gaze) from activation (e.g., foot press) significantly reduces false positives and approximates the precision of mouse interactions (Rajanna et al., 2018).
- Gesture recognition: Scanpath-based gestures (sequences of fixations/saccades) for authentication and shortcut command triggering, distinguished from random eye movement by pattern-matching algorithms or dynamic time warping (DTW; see the sketch after this list) (Rajanna et al., 2018).
- Diegetic GUIs: Embedding fiducial-marked, physical interface elements in the robot workspace for direct gaze activation without context switching to screens (Sardinha et al., 8 Jan 2024).
- Gaze-vergence and depth: Depth-aware gaze interaction in AR for controlling see-through windows or focusing on occluded layers (Wang et al., 2022).
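To make the gesture-recognition item above concrete, the sketch below compares a recorded fixation sequence against a gesture template with a plain dynamic time warping (DTW) distance. The "Z"-shaped template, the length normalization, and the acceptance threshold are illustrative assumptions, not a reproduction of the cited systems.

```python
import numpy as np

def dtw_distance(path_a, path_b):
    """Dynamic time warping distance between two scanpaths, each an
    (N, 2) sequence of fixation centroids in screen coordinates."""
    a, b = np.asarray(path_a, float), np.asarray(path_b, float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # skip a fixation in a
                                 cost[i, j - 1],      # skip a fixation in b
                                 cost[i - 1, j - 1])  # match both fixations
    return cost[n, m] / (n + m)   # length-normalized distance

# Example: a rough "Z"-shaped gesture template vs. a noisy observation.
template = [(100, 100), (500, 100), (100, 400), (500, 400)]
observed = [(110, 95), (480, 120), (130, 390), (510, 410)]
ACCEPT_THRESHOLD = 30.0   # illustrative; tuned per display and gesture set
print(dtw_distance(template, observed) < ACCEPT_THRESHOLD)
```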
4. Applications and Empirical Results
Gaze-based control systems are applied in diverse domains:
- Assistive Technologies: Gaze-driven robotic arms using point-and-fixate interfaces enable functional independence for users with motor disabilities, with reported 3D gaze estimation errors of 1.28–4.68 cm and pick-and-place success rates of up to 96–100% depending on system architecture (Wang et al., 2018, Shafti et al., 2018, Tokmurziyev et al., 13 Jan 2025, Baptista et al., 6 May 2025). Magnetic snapping (cursor snapping to detected object centers; see the sketch after this list) reduces average gaze alignment times by 31% (Tokmurziyev et al., 13 Jan 2025). Webcam-based systems reduce input hardware costs, enabling median action selection times of 2–4 s for users with severe speech and motor impairment (SSMI) (Sharma et al., 2020).
- Ambient Intelligence & VR Displays: Gaze augments or substitutes pointing for large digital maps, immersive reality caves, or public kiosks, with real-time dispersion-based fixation detection providing robust user interaction (0708.3505, Zeng et al., 2022). Gaze-contingent displays with <70 ms reaction times dynamically adjust foveal regions, optimizing bandwidth and user focus (0708.3505).
- Collaborative and Mobile Robotics: AR-enabled, gaze-based robot navigation employing admittance control allows hands-free, efficient path planning, with time-to-destination varying according to virtual damping and mass parameters (Lee et al., 2023). Diegetic graphical interfaces (screenless, marker-based interaction) yield YCB protocol scores averaging 13.71/16 and a mean SUS of 75.36 (Sardinha et al., 8 Jan 2024).
- Military Aviation: Eye gaze controls achieve selection times below 2 s and error rates <5% up to +3G acceleration, with adaptive, IMU-based head-movement compensation outperforming joystick TDS by roughly 30% in mean selection time (Murthy et al., 2020).
- Interactive Content Generation: Lightweight models (e.g., DFT Gaze with 281K params) facilitate real-time personalized gaze-driven editing of images and videos, with mean angular gaze error of 2.14–7.82° and sub-500 ms latency on embedded systems (Hsieh et al., 7 Nov 2024).
- Medical Imaging: Diffusion models conditioned on both radiomic features and radiologist gaze patterns produce synthetic X-rays with improved anatomical realism and diagnostic utility, outperforming text-only T2I synthesis in both FID/SSIM and downstream disease classification AUC (Bhattacharya et al., 1 Oct 2024).
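A minimal sketch of the magnetic-snapping idea mentioned in the assistive-technology item above: the raw gaze point is pulled to the nearest detected object center whenever it falls inside a snap radius. Object detections are assumed to come from an upstream detector and are hard-coded here; the 60 px radius is an illustrative assumption.

```python
import math

SNAP_RADIUS_PX = 60   # illustrative; tuned to the tracker's angular accuracy

def snap_gaze(gaze_xy, object_centers):
    """Return the center of the nearest detected object if it lies within
    SNAP_RADIUS_PX of the raw gaze point, otherwise the raw gaze point."""
    gx, gy = gaze_xy
    best, best_dist = None, float("inf")
    for cx, cy in object_centers:
        d = math.hypot(cx - gx, cy - gy)
        if d < best_dist:
            best, best_dist = (cx, cy), d
    return best if best is not None and best_dist <= SNAP_RADIUS_PX else gaze_xy

# Example: raw gaze lands near a detected object, so the cursor snaps to it.
detections = [(320, 240), (800, 450)]      # object centers from a detector
print(snap_gaze((335, 255), detections))   # -> (320, 240)
```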
5. Technical Limitations and Mitigation Strategies
Salient challenges and practical constraints include:
- Robustness to user variability: Eye-tracker calibration, head-pose compensation, and hardware-agnostic architectures are necessary for adaptation to different users and scenarios (Sharma et al., 2020, Sardinha et al., 8 Jan 2024).
- Fatigue and visual stress: Prolonged dwell-based interaction leads to fatigue, which is partly alleviated by hierarchical menus, crossing interactions, and multimodal control schemes (Zeng et al., 2022, Riou et al., 7 Feb 2025).
- Spatial accuracy limits: Hardware errors (e.g., >5° visual angle at high G-loads or under challenging lighting) necessitate design adaptations: wider vertical FOV for displays, clustering to offset selection jitter, and minimum target sizing derived from the angular accuracy, e.g., minimum size ≈ 2·d·tan(θ/2) for viewing distance d and angular accuracy θ (a worked computation follows this list) (Murthy et al., 2020, Riou et al., 7 Feb 2025).
- Mitigating Midas Touch: Reliable selection mechanisms (gaze+foot trigger, crossing, dwell filtering, dwell time/warning cues) reduce unwanted activations (0708.3505, Rajanna et al., 2018, Riou et al., 7 Feb 2025). Snapping and object detection further disambiguate intent (Tokmurziyev et al., 13 Jan 2025).
- Feedback and accessibility: Real-time MR overlays, confirmatory cursors, and gaze-projected menus provide users with action feedback, reducing uncertainty and improving trust (Baptista et al., 6 May 2025, Zeng et al., 2022).
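A worked example of the target-sizing constraint referenced above: the minimum on-screen target extent follows from the tracker's angular accuracy and the viewing distance via simple trigonometry. The sample numbers below are illustrative, not drawn from the cited studies.

```python
import math

def min_target_size(viewing_distance_m, angular_accuracy_deg):
    """Smallest target extent (in metres) that a tracker with the given
    angular accuracy can reliably hit at the given viewing distance:
    size = 2 * d * tan(theta / 2)."""
    theta = math.radians(angular_accuracy_deg)
    return 2.0 * viewing_distance_m * math.tan(theta / 2.0)

# Example: 1.0 degree accuracy at a 65 cm desktop viewing distance requires
# targets of roughly 1.1 cm, plus margin for jitter and calibration drift.
print(round(min_target_size(0.65, 1.0) * 100, 2), "cm")
```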
6. Future Directions and Research Opportunities
Emerging directions in gaze-based controls include:
- Implicit gaze utilization: Advanced XR systems are moving toward leveraging spontaneous gaze data for intent inference, error detection within 600 ms, and adaptive UI generation, rather than explicit selection alone (Sendhilnathan et al., 22 May 2024).
- Personalization and adaptability: Knowledge distillation and lightweight gaze models provide fast real-time inference with minimal user calibration, supporting broad deployment on commodity edge devices (Hsieh et al., 7 Nov 2024).
- Multimodal systems: Continued research is examining combinations of gaze with speech, gestures, foot input, and EMG signals to create robust, disambiguated, and context-sensitive interfaces (Rajanna et al., 2018, Sardinha et al., 8 Jan 2024).
- Generalization across domains: Cross-platform, hardware-agnostic solutions are extending gaze-based control to domains including gaming, collaborative robotics, industrial automation, teleoperation, and medical imaging—with new application-specific requirements for latency, accuracy, and hybrid feedback (Sharma et al., 2020, Bhattacharya et al., 1 Oct 2024).
- Ethical and accessibility considerations: Inclusive UI design, fatigue mitigation, and secure authentication via scan-paths (resistant to shoulder-surfing) remain active concerns (Rajanna et al., 2018).
7. Empirical Performance and Impact
Quantitative evaluations consistently report that gaze-based controls, when properly configured and combined with auxiliary input or robust error mitigation, match or exceed the efficiency of traditional modalities (joystick, mouse) in time-sensitive tasks, with high user satisfaction (SUS), low perceived workload (NASA-TLX), and error rates within operationally acceptable ranges (Rajanna et al., 2018, Wang et al., 2018, Sardinha et al., 8 Jan 2024, Lee et al., 2023, Zavichi et al., 2 Apr 2025). For example, gaze-based robot steering in collaborative AR achieves usability comparable to joystick control (SUS 73–80), with time-to-target further reduced under optimized dynamics and control parameters (Lee et al., 2023). In public kiosk scenarios, tuning dwell thresholds keeps selection times under 7 seconds while minimizing error rates (Zeng et al., 2022). For motor-impaired populations, hands-free interfaces demonstrate comparable performance gains and substantial accessibility improvements. Implicit gaze use and advanced multimodal paradigms are expected to further reduce error, cognitive load, and fatigue, driving the adoption of gaze-based controls as a standard interaction paradigm within complex digital and physical systems.