
Gaze-guided Reasoning Navigator (GRN)

Updated 4 December 2025
  • GRN is a computational framework that leverages gaze data to decode intentions and guide actions across navigation and diagnostic tasks.
  • It integrates deep vision models, gaze-to-region mapping, and multi-stage decision policies to achieve robust performance in real-time settings.
  • Empirical evaluations demonstrate improved obstacle avoidance, reduced navigation errors, and enhanced diagnostic accuracy across assistive, robotic, and medical domains.

The Gaze-guided Reasoning Navigator (GRN) is a class of computational modules that leverage overt or simulated visual attention for goal-directed navigation and visual reasoning. Originally arising in assistive mobility platforms, GRN frameworks have evolved into a family of highly structured behavioral and decision modules. These systems utilize gaze data or gaze proxies to decode intentions, plan actions in complex spatial environments, and structure visual reasoning analogous to expert human strategies. Notable instantiations of GRN span autonomous wheelchair interfaces (Subramanian et al., 2021), foundation-model-based robotic navigation (Zhu et al., 12 Jul 2024), and gaze-driven medical diagnostics (Zhu et al., 27 Nov 2025).

1. Foundational Architectures and Design Paradigms

GRN implementations share core architectural elements: extraction of perceptual features (typically via deep vision models), mapping of explicit or inferred gaze signals to candidate goals or regions, classification or scoring of intention or saliency, and navigation or reasoning policies guided by these results.

In early systems for mobility platforms, the input is a real-time egocentric RGB or RGB-D video stream, synchronized with high-frequency binocular gaze estimates from calibrated eye trackers. Object detectors (e.g., YOLOv3/Darknet-53 backbones) annotate visual fields with classes and bounding boxes after pre-processing, non-maximum suppression, and temporal smoothing using sliding windows for robustness to head movement (Subramanian et al., 2021).
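A minimal sketch of the temporal-smoothing step over per-frame detections is given below; the window length and the majority-vote rule are illustrative assumptions, not taken from the paper, which only states that detections are smoothed over sliding windows for robustness to head movement.

```python
from collections import Counter, deque

class DetectionSmoother:
    """Illustrative sliding-window smoother for per-frame object detections."""

    def __init__(self, window_size: int = 15):
        # Fixed-length window of recent frames (length is an assumption).
        self.window = deque(maxlen=window_size)

    def update(self, frame_labels: list[str]) -> list[str]:
        """Add one frame's detected class labels and return the labels that
        appear in a majority of the frames currently in the window."""
        self.window.append(set(frame_labels))
        counts = Counter(label for frame in self.window for label in frame)
        majority = len(self.window) / 2
        return [label for label, n in counts.items() if n > majority]
```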

In medical settings or VLM-based navigation applications, GRN expands into multi-stage architectures. MedEyes (Zhu et al., 27 Nov 2025) decomposes the visual focus process into a lightweight navigator maintaining a ternary attention state $\psi_t = (\mathcal{R}_t, \mathcal{C}_t, \mathcal{F}_t)$, where $\mathcal{R}_t$ are region proposals, $\mathcal{C}_t$ are confidences, and $\mathcal{F}_t \in \{\text{global}, \text{local}\}$ is the exploration mode. The logical flow consists of region proposal, confidence estimation, and mode control, serializing structured reasoning steps.
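A minimal data-structure sketch of this navigator state follows; the field names and the (x1, y1, x2, y2) box convention are assumptions made for illustration, and only the three components of $\psi_t$ come from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum

class ExplorationMode(Enum):
    GLOBAL = "global"  # broad scanning over the whole image
    LOCAL = "local"    # focused drilling into a candidate region

@dataclass
class NavigatorState:
    """Sketch of the ternary attention state psi_t = (R_t, C_t, F_t)."""
    regions: list[tuple[float, float, float, float]] = field(default_factory=list)  # R_t
    confidences: list[float] = field(default_factory=list)                           # C_t
    mode: ExplorationMode = ExplorationMode.GLOBAL                                   # F_t
```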

2. Gaze-to-Region and Intention Decoding

The central function of GRN is mapping gaze or its proxy onto candidate objects/regions and decoding whether attention reflects genuine goal intent or incidental observation.

In mobility platforms, normalized gaze coordinates are mapped onto bounding boxes, yielding a 2-D feature inside each object region: $g_i^{\mathrm{norm}} = (g_x^{\mathrm{norm}}, g_y^{\mathrm{norm}})$. Intention is then classified using Fine Gaussian SVMs (TV) or weighted KNNs (laptop, chair), distinguishing between non-interactive and interactive (motor-imagery) fixations with cross-validation accuracies exceeding 84% across object classes (Subramanian et al., 2021). A temporal smoothing ring buffer ($N = 40$) robustly aggregates frame-level predictions.
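A minimal sketch of the gaze-to-box test and the $N = 40$ ring-buffer aggregation is shown below; the majority-vote aggregation rule is an assumption of this sketch rather than a detail from the paper.

```python
from collections import deque

def gaze_in_box(gaze_norm, box_norm) -> bool:
    """Return True if a normalized gaze point (gx, gy) falls inside a
    normalized bounding box (x1, y1, x2, y2); all values in [0, 1]."""
    gx, gy = gaze_norm
    x1, y1, x2, y2 = box_norm
    return x1 <= gx <= x2 and y1 <= gy <= y2

class IntentionSmoother:
    """Ring buffer (N = 40) aggregating frame-level intention predictions."""

    def __init__(self, n: int = 40):
        self.buffer = deque(maxlen=n)

    def update(self, is_interactive: bool) -> bool:
        self.buffer.append(is_interactive)
        # Declare "interactive" only when most recent frames agree (assumed rule).
        return sum(self.buffer) > len(self.buffer) / 2
```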

In MedEyes (Zhu et al., 27 Nov 2025), gaze is instantiated via region-wise scanpaths from human experts. These scanpaths are quantized as bounding boxes with associated confidence, serialized for storage in an off-policy replay buffer. The GRN alternates between proposing and refining regions, with transitions governed by the normalized confidence gain $\Delta c = \frac{c_{t+1}(r_i) - c_t(r_i)}{c_t(r_i) + \epsilon}$, where a threshold $\delta = 0.15$ determines mode switching. This dual-mode regime models the alternation between broad visual search (scanning) and focused analysis (drilling).
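A short sketch of the confidence-gain computation and threshold-based scan/drill switching; the switching behavior beyond the $\delta$ test is an assumption of this sketch.

```python
def confidence_gain(c_prev: float, c_next: float, eps: float = 1e-6) -> float:
    """Normalized confidence gain Delta_c for a tracked region."""
    return (c_next - c_prev) / (c_prev + eps)

def next_mode(c_prev: float, c_next: float, delta: float = 0.15) -> str:
    """Illustrative scan/drill rule: drill into a region while its confidence
    rises by more than delta, otherwise fall back to global scanning."""
    if confidence_gain(c_prev, c_next) > delta:
        return "local"   # focused drilling
    return "global"      # broad scanning
```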

3. Goal Transfer to Navigation and Reasoning Stacks

Once intention is decoded or the target region is identified, GRN transfers this goal information to the navigation or reasoning stack.

In the wheelchair use case, once an "interactive" gaze is detected and a voluntary confirmation gesture (wink) is received, the gaze position is lifted to 3-D coordinates and issued to the navigation stack. Navigation operates via a conventionally layered architecture: 2D-SLAM (e.g., GMapping), global path planning via Dijkstra on occupancy grids, local planning via the Dynamic Window Approach, and low-level velocity command fusion (Lidar, RGB-D) at ~10 Hz (Subramanian et al., 2021).
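As an illustration of the lift-to-3-D step, the sketch below back-projects the gaze pixel through a standard pinhole camera model using the RGB-D depth channel; the paper's exact lifting procedure and frame conventions are not reproduced here, so this is a generic stand-in.

```python
import numpy as np

def gaze_pixel_to_3d(u: int, v: int, depth_m: np.ndarray,
                     fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-project a gaze pixel (u, v) to a 3-D point in the camera frame
    using the depth image (metres) and pinhole intrinsics (fx, fy, cx, cy)."""
    z = float(depth_m[v, u])          # depth at the gaze pixel
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])        # goal handed to the navigation stack
```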

In vision-LLM-driven navigation (Navi2Gaze/GRN (Zhu et al., 12 Jul 2024)), the system constructs candidate goal regions via geometric analysis of point clouds and grid discretization around a target. GPT-4V is prompted to recursively score pose candidates, integrating semantic common sense with geometric feasibility. The final target pose $(x^*, y^*, \theta^*)$ is selected by maximizing $s(R_i; q)$, the VLM's candidate-specific score, subject to spatial constraints and a collision margin. Navigation proceeds via A* planning to $(x^*, y^*)$, followed by orientation and gaze alignment.
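A sketch of the final pose-selection step, assuming an occupancy-distance query and a fixed collision margin (both illustrative stand-ins for the system's actual geometric checks):

```python
def select_target_pose(candidates, scores, clearance, collision_margin=0.3):
    """Pick the highest-scoring collision-free pose candidate.

    candidates: list of (x, y, theta) poses proposed around the target object.
    scores: VLM-assigned scores s(R_i; q), one per candidate.
    clearance: callable (x, y) -> distance to nearest obstacle in metres
               (assumed interface to an occupancy-grid query).
    """
    feasible = [
        (s, pose) for s, pose in zip(scores, candidates)
        if clearance(pose[0], pose[1]) >= collision_margin
    ]
    if not feasible:
        raise ValueError("no collision-free candidate pose")
    _, (x, y, theta) = max(feasible, key=lambda t: t[0])
    return x, y, theta  # fed to A* planning, then orientation/gaze alignment
```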

In MedEyes, GRN provides structured, multi-round trajectories for reinforcement learning. These sequences serve as off-policy expert data in the GRPO (Group Relative Policy Optimization) surrogate objective, enabling integration of on-policy and expert-derived advantage computations.

4. Learning with Gaze-Guided Trajectories and Policy Optimization

GRN shapes policy learning by structuring, serializing, and leveraging gaze-guided or region-guided trajectories.

MedEyes (Zhu et al., 27 Nov 2025) builds a mixed-policy optimization framework in which each gaze-guided trajectory $\tau^{\text{expert}}$ is assigned a composite, verifiable reward

$$R(\tau) = \lambda_{\text{acc}}\, r_{\text{acc}}(\tau) + \lambda_{\text{grammar}}\, r_{\text{grammar}}(\tau) + \lambda_{\text{div}}\, r_{\text{div}}(\tau),$$

which incorporates prediction accuracy, reasoning grammar, and region/chain diversity. Off-policy contributions are weighted by the importance ratio $\rho_i^\theta$ of the current policy's trajectory probability to that of the expert policy. Advantage normalization is performed for both on- and off-policy sources, and learning proceeds akin to PPO, but with GRN's structure present only in trajectory generation.
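A minimal sketch of the composite reward and a group-style advantage normalization follows; the $\lambda$ weights are placeholders, and the full GRPO surrogate objective is not reproduced here.

```python
def composite_reward(r_acc: float, r_grammar: float, r_div: float,
                     lam_acc: float = 1.0, lam_grammar: float = 0.5,
                     lam_div: float = 0.5) -> float:
    """Composite trajectory reward R(tau); lambda weights are placeholders."""
    return lam_acc * r_acc + lam_grammar * r_grammar + lam_div * r_div

def normalized_advantages(rewards: list[float]) -> list[float]:
    """Group-style advantage normalization (mean/std over a batch of
    trajectory rewards), applied to both on- and off-policy sources."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance batch
    return [(r - mean) / std for r in rewards]
```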

The Confidence Value Sampler (CVS) further diversifies training by performing nucleus sampling ($p_0 = 0.9$) on region confidences. It terminates trajectory extensions on threshold crossing ($\xi = 0.85$) or at a step cap ($T_{\max} = 4$).
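A sketch of the CVS nucleus-sampling step over region confidences ($p_0 = 0.9$); the normalization and tie-breaking details are assumptions of this sketch.

```python
import random

def nucleus_sample(confidences: list[float], p0: float = 0.9) -> int:
    """Nucleus (top-p) sampling over normalized region confidences: keep the
    smallest set of regions whose cumulative probability reaches p0, then
    sample one region index from that set in proportion to its probability."""
    total = sum(confidences)
    probs = sorted(((c / total, i) for i, c in enumerate(confidences)), reverse=True)
    nucleus, cum = [], 0.0
    for p, i in probs:
        nucleus.append((p, i))
        cum += p
        if cum >= p0:
            break
    weights = [p for p, _ in nucleus]
    return random.choices([i for _, i in nucleus], weights=weights, k=1)[0]
```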

5. Evaluation and Empirical Results

GRN-enabled systems demonstrate clear quantitative and qualitative improvements across divergent domains.

On mobility platforms, GRN delivers navigation with static obstacle avoidance success rates of $95\% \pm 5$, dynamic obstacle avoidance of $90\% \pm 6.1$, a mean goal stopping error of $24 \pm 9.8$ cm, and near-zero emergency stops (Subramanian et al., 2021). Real-time decoder validation shows interactive intention decoding accuracy of ~80% and non-interactive accuracy of ~95%.

For open-vocabulary robotic navigation, Navi2Gaze (GRN) (Zhu et al., 12 Jul 2024) achieves a $0.57$ success rate (SR), $0.29$ m mean distance to goal (DTG), and $19^\circ$ orientation to goal (OTG), all substantially surpassing prior methods. In challenging side-proximal starts, it retains SR and halves path length (SPL) and DTG compared to competitors. Ablations show a drop of $\sim 9$ pp in SR when core task-space reconstruction and VLM scoring are removed.

MedEyes (Zhu et al., 27 Nov 2025), with full GRN, attains 71.5% average accuracy on the main medical VQA tasks (compared to 62.8% without GRN), an ablation drop of 8.7 pp. Single-mode variants perform 4.9–5.9 pp lower, establishing that only the dual-mode, confidence-switching design realizes the full gains. Training curves further show accelerated reward convergence and more compact reasoning chains only when guided off-policy GRN trajectories are included.

| Domain (Implementation) | Core GRN Mechanism | Key Quantitative Result |
|---|---|---|
| Assistive navigation (Subramanian et al., 2021) | Gaze-to-object intent decoding | 95% static, 90% dynamic obstacle avoidance |
| Open-vocabulary robotics (Zhu et al., 12 Jul 2024) | VLM-guided pose/gaze ranking | $0.57$ SR, $0.29$ m DTG |
| Medical reasoning (Zhu et al., 27 Nov 2025) | Dual-mode scan/drill, expert replay | 8.7 pp accuracy drop if GRN ablated |

6. Future Directions and Extensions

GRN frameworks are modular, supporting extensibility across modalities and application domains. Retraining the object detector, expanding the action or object vocabulary, or extending the intention decoder with more advanced models (Gaussian processes, deep networks) can enhance flexibility (Subramanian et al., 2021). Integration of alternative 3-D gaze estimators, region-proposal techniques, and policy architectures is straightforward.

In the medical domain, simulation of expert scanpaths and their incorporation into multitask RL frameworks demonstrates the generality of gaze-driven navigation for both real and simulated attention. A plausible implication is that future systems will escalate in complexity and domain generality as larger behavioral datasets and more powerful foundation models proliferate.

7. Significance and Cross-Domain Synthesis

The Gaze-guided Reasoning Navigator paradigm unifies overt eye-tracking, simulated attention, language-driven goal inference, and reinforcement learning into a reproducible, modular policy structure. Its guiding principle is that goal-directed attention, whether overt or inferred, optimally constrains reasoning and action in ambiguous or high-dimensional task spaces. GRN bridges cognitive-level interfaces (requiring only the communication of intention), geometric control, and semantically informed action selection, with empirical validation across assistive navigation, open-vocabulary robotic navigation, and interpretable medical diagnosis (Subramanian et al., 2021, Zhu et al., 12 Jul 2024, Zhu et al., 27 Nov 2025).
