
Gaze-Guided Action Anticipation

Updated 8 January 2026
  • Gaze-guided action anticipation is a computational methodology that leverages human gaze data alongside movement cues to predict actions early.
  • It fuses high-frequency gaze, head, and body signals using RNNs, encoder-decoder models, and graph neural networks for robust temporal prediction.
  • This approach is pivotal in applications like collaborative robotics and video understanding, significantly improving prediction speed and accuracy.

Gaze-guided action anticipation refers to computational methodologies and experimental frameworks in which human gaze—whether measured directly via eye tracking or approximated from head orientation—is used as an explicit contextual signal for the early prediction of human actions in collaborative environments, video understanding, and human–robot interaction. Eye gaze provides critical temporal and spatial cues about the agent's focus, goals, and latent intention, supplementing traditional cues from body movement and object interaction. Recent research has demonstrated substantial improvements in anticipation speed and accuracy when gaze cues are incorporated into deep learning, graph neural network, and generative probabilistic models.

1. Gaze Data Acquisition and Representation

Gaze signals are typically captured using high-frequency eye trackers (e.g., Pupil Labs at 60 Hz or Tobii Spectrum at up to 1200 Hz) synchronized with body kinematics (OptiTrack motion capture, RGB video) and object annotations (Canuto et al., 2019, Duarte et al., 2018, Ozdel et al., 2024). In experimental setups where direct gaze measurement is not feasible, head orientation estimated via skeleton keypoints (e.g., OpenPose facial landmarks) can be used as a proxy (Canuto et al., 2019).

For each time frame, features are extracted and concatenated. A typical vector includes:

  • Gaze: 2D or 3D gaze point, expressed in image pixel coordinates or world coordinates.
  • Head: Facial keypoints as gaze proxy (usually 10D).
  • Body and Hands: Upper-limb joint positions (e.g., 14D for shoulders/arms), hand centroids (4D).
  • Object context: Segmented object centroid (e.g., ball position, 2D).

Each is treated as a raw feature map and normalized during preprocessing (Canuto et al., 2019, Schydlo et al., 2018). In video-centric systems, gaze fixations are used to crop spatial patches for further visual embedding with CLIP (Ozdel et al., 2024). The pipeline discards or imputes missing keypoints as required.
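As an illustration of this per-frame feature assembly, the following is a minimal sketch; the exact dimensions, normalization scheme, and imputation strategy are assumptions derived from the feature list above, not the cited papers' released code.

```python
import numpy as np

# Minimal sketch of per-frame feature assembly, assuming the dimensions quoted
# above (2D gaze, 10D facial keypoints, 14D upper-limb joints, 4D hand
# centroids, 2D object centroid); names and normalization are illustrative.
def assemble_frame_features(gaze_xy, face_kpts, body_joints, hand_centroids,
                            obj_centroid, frame_size=(1280, 720)):
    """Concatenate one frame's cues into a single normalized feature vector."""
    w, h = frame_size
    scale = np.array([w, h], dtype=np.float32)

    def norm_points(points_2d):
        # Normalize pixel coordinates to [0, 1]; impute missing points as 0.
        pts = np.asarray(points_2d, dtype=np.float32).reshape(-1, 2)
        return np.nan_to_num(pts / scale, nan=0.0).ravel()

    features = np.concatenate([
        norm_points(gaze_xy),          # 2D gaze point
        norm_points(face_kpts),        # 10D head/gaze proxy (5 facial keypoints)
        norm_points(body_joints),      # 14D upper-limb joints (7 keypoints)
        norm_points(hand_centroids),   # 4D hand centroids (2 points)
        norm_points(obj_centroid),     # 2D object centroid
    ])
    return features                    # 32D vector for this frame


# Example: one frame with dummy pixel coordinates.
frame_vec = assemble_frame_features(
    gaze_xy=[[640, 360]],
    face_kpts=np.random.rand(5, 2) * [1280, 720],
    body_joints=np.random.rand(7, 2) * [1280, 720],
    hand_centroids=np.random.rand(2, 2) * [1280, 720],
    obj_centroid=[[300, 500]],
)
print(frame_vec.shape)  # (32,)
```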

2. Model Architectures and Fusion Strategies

Gaze signals are fused with movement and object cues in multi-stream architectures. Three main paradigms have been established:

A. Recurrent Neural Networks (RNNs, LSTMs)

Multi-stream spatial embeddings are constructed for each modality, passed through nonlinear layers (typically ReLU), and concatenated. A two-layer LSTM (hidden size 64 or 20) then temporally encodes the joint vector x_t (Canuto et al., 2019, Schydlo et al., 2018).

\begin{aligned}
e_h &= \text{ReLU}(W_h^T v_h + b_h) \\
e_o &= \text{ReLU}(W_o^T v_o + b_o) \\
e_m &= \text{ReLU}(W_m^T v_m + b_m) \\
e_c &= \text{ReLU}(W_c^T [e_h; e_o] + b_c) \\
x_t &= [e_c; e_m] \\
h^{(1)}_t &= \text{LSTM}_1(x_t) \\
h^{(2)}_t &= \text{LSTM}_2(h^{(1)}_t)
\end{aligned}

Final classification is performed via a fully connected layer with softmax over possible action classes (Canuto et al., 2019).
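A minimal PyTorch sketch of this fusion scheme is shown below; the embedding width, hidden size, class count, and modality dimensions are illustrative assumptions rather than the exact hyperparameters of Canuto et al. (2019) or Schydlo et al. (2018).

```python
import torch
import torch.nn as nn

# Minimal sketch of the multi-stream LSTM fusion above; layer sizes
# (32-dim embeddings, hidden size 64, 12 classes) are assumptions.
class GazeFusionLSTM(nn.Module):
    def __init__(self, dim_head=10, dim_obj=2, dim_move=20,
                 dim_embed=32, hidden=64, n_classes=12):
        super().__init__()
        self.embed_head = nn.Linear(dim_head, dim_embed)       # e_h
        self.embed_obj = nn.Linear(dim_obj, dim_embed)          # e_o
        self.embed_move = nn.Linear(dim_move, dim_embed)        # e_m
        self.embed_ctx = nn.Linear(2 * dim_embed, dim_embed)    # e_c from [e_h; e_o]
        self.lstm = nn.LSTM(2 * dim_embed, hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, v_head, v_obj, v_move):
        # Each input: (batch, time, dim_modality)
        e_h = torch.relu(self.embed_head(v_head))
        e_o = torch.relu(self.embed_obj(v_obj))
        e_m = torch.relu(self.embed_move(v_move))
        e_c = torch.relu(self.embed_ctx(torch.cat([e_h, e_o], dim=-1)))
        x = torch.cat([e_c, e_m], dim=-1)          # x_t = [e_c; e_m]
        h, _ = self.lstm(x)                        # two stacked LSTM layers
        return self.classifier(h[:, -1])           # logits; softmax at inference


model = GazeFusionLSTM()
logits = model(torch.randn(8, 30, 10), torch.randn(8, 30, 2), torch.randn(8, 30, 20))
probs = torch.softmax(logits, dim=-1)   # action-class probabilities
print(probs.shape)                      # torch.Size([8, 12])
```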

B. Encoder–Decoder Sequential Models

An encoder network processes the feature sequence, with the encoded context passed to a decoder LSTM to generate variable-length future action sequences. Beam search enables multiple trajectory sampling for stochastic reward estimation (Schydlo et al., 2018).
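The sketch below illustrates the encoder–decoder idea with a simple beam search over discrete action tokens; the vocabulary, the special start/end tokens, and the layer sizes are assumptions, and the reward-estimation step of Schydlo et al. (2018) is omitted.

```python
import torch
import torch.nn as nn

# Hedged sketch of an encoder-decoder action-sequence predictor with beam
# search; vocabulary size, hidden size, and the <sos>/<eos> convention are assumed.
class Seq2SeqAnticipator(nn.Module):
    def __init__(self, feat_dim=32, hidden=64, n_actions=12):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(n_actions + 2, hidden)   # + <sos>, <eos>
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_actions + 2)

    def beam_search(self, obs_seq, beam=3, max_len=5, sos=0, eos=1):
        _, state = self.encoder(obs_seq)                   # encode observed features
        beams = [([sos], 0.0, state)]                      # (tokens, log-prob, state)
        for _ in range(max_len):
            candidates = []
            for tokens, score, st in beams:
                if tokens[-1] == eos:                      # finished hypothesis
                    candidates.append((tokens, score, st))
                    continue
                last = torch.tensor([[tokens[-1]]])
                h, st_new = self.decoder(self.embed(last), st)
                logp = torch.log_softmax(self.out(h[:, -1]), dim=-1).squeeze(0)
                topv, topi = logp.topk(beam)
                for lp, idx in zip(topv.tolist(), topi.tolist()):
                    candidates.append((tokens + [idx], score + lp, st_new))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
        return [(t[1:], s) for t, s, _ in beams]           # drop <sos>


model = Seq2SeqAnticipator()
hypotheses = model.beam_search(torch.randn(1, 30, 32))
print(hypotheses[0])    # most probable future action-token sequence and its score
```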

C. Graph Neural Networks (GNNs)

Visual–semantic graphs are constructed from gaze-centered image crops. Nodes correspond to fixation patch embeddings, edges are defined by temporal transitions enriched with semantic features (object class embeddings). Edge-Conditioned Convolution (ECC) propagates information, with output pooled for intention recognition and hierarchically conditioned action sequence decoding (Ozdel et al., 2024).
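A hedged sketch of an edge-conditioned convolution over a gaze-fixation graph, using PyTorch Geometric's NNConv as the ECC operator, is given below; the node and edge feature sizes, the pooling choice, and the intention head are assumptions, and the hierarchical action decoder of Ozdel et al. (2024) is not reproduced.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import NNConv, global_mean_pool
from torch_geometric.data import Data

# Illustrative ECC over a gaze-fixation graph; node/edge feature sizes
# (512-D CLIP patch embeddings, 16-D semantic edge features) are assumptions.
class GazeGraphEncoder(nn.Module):
    def __init__(self, node_dim=512, edge_dim=16, hidden=128, n_intentions=18):
        super().__init__()
        # Edge network maps each edge feature to a per-edge weight matrix.
        edge_net = nn.Sequential(nn.Linear(edge_dim, node_dim * hidden))
        self.ecc = NNConv(node_dim, hidden, edge_net, aggr='mean')
        self.intention_head = nn.Linear(hidden, n_intentions)

    def forward(self, data):
        x = torch.relu(self.ecc(data.x, data.edge_index, data.edge_attr))
        pooled = global_mean_pool(x, data.batch)      # graph-level embedding
        return self.intention_head(pooled)            # intention logits


# Toy graph: 4 fixation patches connected by 3 temporal-transition edges.
graph = Data(
    x=torch.randn(4, 512),                            # CLIP patch embeddings
    edge_index=torch.tensor([[0, 1, 2], [1, 2, 3]]),  # temporal order
    edge_attr=torch.randn(3, 16),                     # semantic edge features
    batch=torch.zeros(4, dtype=torch.long),
)
print(GazeGraphEncoder()(graph).shape)                # torch.Size([1, 18])
```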

3. Uncertainty Modeling in Action Anticipation

Standard deterministic criteria rely solely on the softmax score for action class selection. Recent advances incorporate prediction uncertainty via Bayesian RNNs:

  • MC Dropout: Random neuron masking at each time step [Gal & Ghahramani 2016].
  • Variational Dropout: Learnable per-unit dropout rates [Kingma et al. 2015].
  • Bayes-By-Backprop: Posterior weight distributions [Blundell et al. 2015].

Uncertainty U is computed as

U = -\sum_{c=1}^{d} m_c \log m_c + \frac{1}{S} \sum_{s=1}^{S} \sum_{c=1}^{d} s_t^{(s)}[c] \log s_t^{(s)}[c]

where m_c is the predictive mean over S Monte Carlo samples and s_t^{(s)} denotes the softmax output of sample s at time t. Anticipation triggers when U < u, using a threshold tuned for the accuracy–earliness trade-off (Canuto et al., 2019).
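As a concrete illustration, the following sketch computes U with MC dropout and applies the threshold test; the number of samples, the threshold value, and the model interface are assumptions.

```python
import torch

# Minimal sketch of the uncertainty criterion above using MC dropout: keep
# dropout active at test time, draw S stochastic softmax samples, and trigger
# anticipation once U falls below a threshold u (S and u are assumptions).
def mc_dropout_uncertainty(model, x, n_samples=20):
    model.train()                                   # keep dropout layers active
    with torch.no_grad():
        samples = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
        )                                           # (S, batch, classes)
    mean = samples.mean(dim=0)                      # m_c: predictive mean
    eps = 1e-12
    entropy_of_mean = -(mean * (mean + eps).log()).sum(dim=-1)
    mean_of_entropy = -(samples * (samples + eps).log()).sum(dim=-1).mean(dim=0)
    U = entropy_of_mean - mean_of_entropy           # matches the formula for U
    return mean, U


# Anticipation rule: commit to the arg-max class only when U < u.
# mean, U = mc_dropout_uncertainty(bayesian_lstm, features_so_far)
# if U.item() < 0.5:
#     predicted_action = mean.argmax(dim=-1)
```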

4. Impact of Gaze Cues: Quantitative and Experimental Findings

Experimental ablations establish that gaze substantially accelerates and refines action prediction:

| Model Variant | Recognition (%) | Anticipation Ratio | Observation Fraction (%) | Dataset |
| --- | --- | --- | --- | --- |
| Movement Only | 95.2 | 85 | 50 | Acticipate, 6-class |
| Movement + Head | 100 | 95.2 | 40 | Acticipate, 6-class |
| Movement + Head + Obj | 100 | 95.4 | 19 | Acticipate, 12-class |
| BLSTM-MC (u = 0.5) | 98.8 | 98.8 | 25 | Acticipate, 12-class |
| GazeGNN (full) | 61 | – | – | VirtualHome, 18-class |

In human gating studies, gaze-only cues supported ~50% accuracy (vs. 16.7% chance) on six-way action discrimination and ~85% on spatial goal direction; adding head and arm cues improved full-action discrimination to near ceiling (Duarte et al., 2018). Similar trends are observed in GNN-based models, where gaze-guided node selection boosts intention and action prediction by 7% compared to non-gaze baselines (Ozdel et al., 2024).

5. Evaluation Protocols and Metrics

Standard metrics for action anticipation include:

  • Accuracy vs. Observation Ratio (ACC(r)): Correct identification at fraction r of the trajectory.
  • Anticipation Accuracy (ACC_act): Accuracy at first non-null prediction.
  • Expected Observation Ratio (E_obs): Fraction of sequence required for correct prediction (Canuto et al., 2019).
  • Set Intersection over Union (IoU): For unordered action sequences (Ozdel et al., 2024).
  • Normalized Levenshtein Distance: Order-sensitive sequence similarity (Ozdel et al., 2024).
  • End-to-End Success Rate (SR): Execution-based metric for completed tasks in simulation environments (Ozdel et al., 2024).
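For the two sequence-level metrics listed above, a minimal computation sketch is shown below; the normalization convention (dividing the edit distance by the longer sequence length) is an assumption and may differ from the cited implementations.

```python
# Hedged sketch of set IoU and normalized Levenshtein distance over action sequences.
def set_iou(pred_actions, true_actions):
    """Intersection-over-union of the unordered action sets."""
    p, t = set(pred_actions), set(true_actions)
    return len(p & t) / len(p | t) if (p | t) else 1.0


def normalized_levenshtein(pred_seq, true_seq):
    """Edit distance divided by the longer sequence length (order-sensitive)."""
    m, n = len(pred_seq), len(true_seq)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred_seq[i - 1] == true_seq[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(m, n, 1)


print(set_iou(["grab cup", "pour"], ["pour", "drink"]))             # 0.333...
print(normalized_levenshtein(["grab", "pour"], ["pour", "drink"]))  # 1.0
```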

6. Practical Integration in Collaborative and Robotic Systems

In robotic systems, gaze-guided anticipation is used both for intention inference and to produce legible actions. Generative models fuse gaze and kinematic information for real-time decision making: the system triggers robot motion based on early gaze-identified goals, with continuous monitoring enabling dynamic replanning (Duarte et al., 2018). Controllers coordinate arm and gaze actuation to preserve human-like legibility in execution, validated in robot–human interaction studies.
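As a schematic of this triggering-and-replanning behavior (not the controllers of Duarte et al., 2018), consider the loop below; the anticipator interface, robot API, and threshold value are hypothetical.

```python
# Illustrative control-loop sketch: the robot commits to an early, gaze-inferred
# goal once prediction uncertainty is low, and replans whenever the anticipated
# goal changes; all names are hypothetical placeholders.
def anticipation_control_loop(stream, anticipator, robot, u_threshold=0.5):
    committed_goal = None
    for features in stream:                     # per-frame gaze + kinematic cues
        probs, uncertainty = anticipator(features)
        if uncertainty >= u_threshold:
            continue                            # not confident enough to act yet
        goal = int(probs.argmax())
        if goal != committed_goal:              # new or revised prediction
            committed_goal = goal
            robot.replan(goal)                  # trigger / re-trigger motion early
    return committed_goal
```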

Key design guidelines from recent research emphasize:

  • Treating gaze as contextual input, not a motor signal.
  • Fusing gaze and pose in embedding layers of comparable dimension.
  • Using variable-length RNN architectures for temporal flexibility.
  • Explicit uncertainty thresholding for robust early prediction.
  • Comprehensive evaluation across accuracy, earliness, and reliability metrics (Canuto et al., 2019).

7. Limitations and Future Research Directions

Current approaches depend on high-quality, directly measured gaze data—limiting real-world deployment when eye tracking is unavailable or unreliable (Ozdel et al., 2024). Domain adaptation from synthetic to real environmental settings remains an open challenge. Integration of predicted gaze (rather than measured), extension to egocentric or moving-camera scenarios, and joint modeling with language and tactile modalities are cited as key future directions.

A plausible implication is that gaze-guided models will remain critical to collaborative robotics, assistive manufacturing, and embodied video understanding tasks, assuming ongoing advances in gaze estimation and multimodal fusion. Further work is required to statistically model gaze sequence distributions and to generalize to multi-actor, multi-object environments.


For additional pipelines and implementation details, see:

  • "Action Anticipation for Collaborative Environments: The Impact of Contextual Information and Uncertainty-Based Prediction" (Canuto et al., 2019)
  • "Action Anticipation: Reading the Intentions of Humans and Robots" (Duarte et al., 2018)
  • "Anticipation in Human-Robot Cooperation: A Recurrent Neural Network Approach for Multiple Action Sequences Prediction" (Schydlo et al., 2018)
  • "Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention" (Ozdel et al., 2024)
