Gaze-Augmented Transformers
- Gaze-Augmented Transformers are models that incorporate human gaze data to modulate attention, tokenization, and information routing for focused processing.
- They employ dual-branch designs, gaze-biased self-attention, and foveated patch tokenization to boost efficiency and performance across vision, IR, and robotics tasks.
- Empirical results demonstrate quantifiable gains—in metrics like mAP, mIoU, and AUC—while reducing computational costs through targeted resource allocation.
Gaze-augmented Transformers are transformer architectures in which human gaze data, simulated gaze processes, or computational gaze proxies are used to explicitly modulate attention, tokenization, or information routing. This approach exploits the biological principle that human visual attention is allocated non-uniformly, with rapid glances for global scene layout and prolonged gaze for local detail or object-level semantics. Gaze augmentation has been operationalized in a variety of domains including vision, vision-language, information retrieval, and robotics. Techniques include architectural dual-branch designs that parallel human glance-and-gaze strategies, direct injection or fusion of gaze features into self-attention, foveated (gaze-centered) patch tokenization, as well as vision-language prompting and cross-attention. Gaze-augmented Transformers have yielded quantifiable improvements in object detection, gaze following, gaze object prediction, and robot policy efficiency.
1. Motivations and Theoretical Principles
The principal motivation for gaze augmentation in transformers is the inefficiency, and sometimes suboptimal inductive bias, of uniform self-attention, especially in settings where spatially local visual context or task-relevant regions are known or can be inferred. Human vision is characterized by sparse, prioritized resource allocation: a rapid global "glance" encodes scene layout, while "gaze" targets fine detail at behaviorally critical locations. Architecturally, this dual-process model yields (1) long-range dependency modeling without quadratic computational cost and (2) enhanced local context for feature discrimination. The "Glance-and-Gaze Vision Transformer" (GG-Transformer) operationalizes these principles via two parallel branches (Yu et al., 2021): the glance branch uses adaptively dilated, partitioned self-attention for O(N) global context, while the gaze branch uses depth-wise convolution to recover the local context lost to partitioning.
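The block below is a minimal PyTorch sketch of this dual-branch idea, not the reference GG-Transformer implementation: it substitutes plain non-overlapping window attention for the paper's adaptive dilation, and the module names, dimensions, and fusion-by-summation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GlanceGazeBlock(nn.Module):
    """Illustrative dual-branch block: windowed self-attention ("glance")
    plus a depth-wise convolution ("gaze"), fused by summation."""

    def __init__(self, dim: int, window: int = 7, num_heads: int = 4):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(dim)
        # Glance: self-attention restricted to M^2-token windows -> O(N * M^2) cost.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gaze: depth-wise convolution recovers local context across window borders.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map; H and W are assumed divisible by the window size.
        B, H, W, C = x.shape
        M = self.window
        h = self.norm1(x)

        # Glance branch: non-overlapping M x M windows, attention within each window.
        wins = h.view(B, H // M, M, W // M, M, C).permute(0, 1, 3, 2, 4, 5)
        wins = wins.reshape(B * (H // M) * (W // M), M * M, C)
        glance, _ = self.attn(wins, wins, wins)
        glance = glance.view(B, H // M, W // M, M, M, C).permute(0, 1, 3, 2, 4, 5)
        glance = glance.reshape(B, H, W, C)

        # Gaze branch: cheap depth-wise conv over the full feature map.
        gaze = self.dwconv(h.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)

        x = x + glance + gaze            # fuse the two branches
        x = x + self.mlp(self.norm2(x))  # standard transformer MLP sub-block
        return x

# Example: 56x56 feature map with 96 channels, 7x7 windows.
block = GlanceGazeBlock(dim=96, window=7, num_heads=4)
out = block(torch.randn(2, 56, 56, 96))  # -> (2, 56, 56, 96)
```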
In information retrieval, gaze signals provide access to human cognitive saliency, permitting transformers to more closely align their attention mechanisms with those regions or tokens that inform downstream relevance or comprehension (Dong et al., 2022). In embodied and egocentric learning, human gaze during naturalistic interaction serves as a supervisory signal or structural prior for policy network visual processing (Chuang et al., 21 Jul 2025, Lall et al., 3 Nov 2025).
2. Architectural Strategies for Gaze Integration
Gaze-augmented Transformers employ several architectural patterns:
- Dual-branch modeling: The GG-Transformer (Yu et al., 2021) forms two concurrent branches: glance, using partitioned self-attention for global information, and gaze, using inexpensive depth-wise convolution for local features. The outputs are fused and passed through standard MLP and normalization routines.
- Gaze-biased self-attention: In "Eyes on Target" (Lall et al., 3 Nov 2025), explicit gaze bias terms are injected into the attention logit matrix: the attention logit between tokens $i$ and $j$ is modified as
  $$A_{ij} = \frac{q_i k_j^{\top}}{\sqrt{d}} + \mathrm{GazeBias}(i, j),$$
  where $\mathrm{GazeBias}$ is a continuous function that peaks near the gaze fixation and incorporates gaze direction. This directly biases the transformer to weight gaze-proximal features.
- Foveated patch tokenization: In robot learning, rather than extracting a uniformly dense grid of image patches, the model segments the image into concentric rings around a predicted or human gaze fixation, using fine patches at the center and coarser tokens in the periphery. This dramatically reduces the number of tokens to which self-attention is applied (e.g., ~20 vs. ~324) while retaining detail near the gaze (Chuang et al., 21 Jul 2025); a simplified tokenization sketch follows after this list.
- Gaze-object cross-attention: The "TransGOP" model introduces an object-to-gaze cross-attention mechanism, wherein object-detection tokens attend to gaze regressor tokens through adapter layers, transmitting spatial memory to improve heatmap localization (Wang et al., 2024).
- Token, input, or attention modulation via gaze: In vision-language models (e.g., GazeVLM (Mathew et al., 9 Nov 2025)), gaze cues such as face boxes and gaze regions are encoded in the prompt and support selective execution of detection, regression, and object identification tasks through cross-modal attention fusion.
- Gaze-informed MaxSim and late-fusion: In NLP, human gaze fixation predictions act as priors modulating either the scaled dot-product in late transformer layers or the MaxSim operation for token-wise query-document interaction (Dong et al., 2022).
- Multi-person token integration: Transformer-based gaze architectures for gaze following (e.g., Sharingan (Tafasca et al., 2023)) concatenate per-person gaze tokens (derived from head crops and bounding boxes) with patch tokens, allowing the model to reason about mutual gaze and social attention through full self-attention.
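As a concrete illustration of the foveated tokenization strategy referenced above, the sketch below builds a reduced token set from a fine foveal crop plus coarse peripheral patches; the two-level scheme, patch sizes, and function names are simplifying assumptions rather than the exact procedure of (Chuang et al., 21 Jul 2025).

```python
import torch
import torch.nn.functional as F

def foveated_tokens(img: torch.Tensor, gaze_xy: tuple[int, int],
                    fovea: int = 64, fine: int = 16, coarse: int = 64) -> torch.Tensor:
    """Two-level foveated tokenization (illustrative).

    img:      (C, H, W) image tensor.
    gaze_xy:  (x, y) fixation in pixel coordinates.
    Returns:  (num_tokens, C * fine * fine) flattened patch tokens:
              fine patches inside a fovea-sized crop around the fixation,
              coarse peripheral patches downsampled to the same patch size.
    """
    C, H, W = img.shape
    x, y = gaze_xy
    # Clamp the foveal crop so it stays inside the image.
    x0 = max(0, min(W - fovea, x - fovea // 2))
    y0 = max(0, min(H - fovea, y - fovea // 2))

    # Fine tokens: dense `fine`-sized patches inside the foveal crop.
    crop = img[:, y0:y0 + fovea, x0:x0 + fovea]
    fine_tok = F.unfold(crop.unsqueeze(0), kernel_size=fine, stride=fine)  # (1, C*fine*fine, n)
    fine_tok = fine_tok.squeeze(0).T                                       # (n_fine, C*fine*fine)

    # Coarse tokens: large peripheral patches, each downsampled to fine x fine so
    # every token has the same dimensionality. For simplicity, peripheral patches
    # that overlap the fovea are kept rather than masked out.
    peri = F.unfold(img.unsqueeze(0), kernel_size=coarse, stride=coarse)   # (1, C*coarse*coarse, m)
    m = peri.shape[-1]
    peri = peri.squeeze(0).T.reshape(m, C, coarse, coarse)
    peri = F.interpolate(peri, size=(fine, fine), mode="area").reshape(m, -1)

    return torch.cat([fine_tok, peri], dim=0)

# Example: 256x256 RGB image, fixation near the centre.
tokens = foveated_tokens(torch.randn(3, 256, 256), gaze_xy=(128, 140))
print(tokens.shape)  # 16 fine + 16 coarse = 32 tokens vs. 256 uniform 16x16 patches
```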
3. Mathematical Formulation and Complexity Analysis
A distinguishing property of several gaze-augmented designs is the ability to decouple global and local attention at reduced computational cost. In the GG-Transformer, the glance branch operates on N/M² partitions of size M², with each partition attending over M² tokens, for a cost of
$$\mathcal{O}\!\left(\frac{N}{M^{2}}\,(M^{2})^{2}\,d\right) = \mathcal{O}(N M^{2} d),$$
which is linear in N for a fixed partition size, while the gaze branch adds only the marginal cost of a depth-wise convolution, likewise preserving linear scaling in N (Yu et al., 2021).
In foveated ViTs, reducing the token count from N ≈ 324 to N′ ≈ 20 drives down the self-attention FLOPs quadratically:
$$\frac{\mathrm{FLOPs}_{\mathrm{fov}}}{\mathrm{FLOPs}_{\mathrm{full}}} \approx \left(\frac{N'}{N}\right)^{2} = \left(\frac{20}{324}\right)^{2} \approx 3.8\times 10^{-3},$$
as demonstrated in (Chuang et al., 21 Jul 2025). In practice, a 16× reduction in overall GFLOPs and a 94% reduction in token count were measured.
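This ratio can be checked directly; the snippet below counts only the attention-score and value-aggregation terms, and the embedding dimension (768) is an assumed placeholder that cancels out of the ratio.

```python
def attn_flops(n_tokens: int, dim: int) -> float:
    """Approximate FLOPs of one self-attention layer: QK^T scores plus the
    attention-weighted sum over V, each ~n^2 * d multiply-accumulates."""
    return 2 * n_tokens ** 2 * dim

full, fov, dim = 324, 20, 768          # uniform vs. foveated token counts (see above)
ratio = attn_flops(fov, dim) / attn_flops(full, dim)
print(f"attention FLOPs ratio: {ratio:.2e}")  # ~3.81e-03 for the attention term alone
```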
In gaze-modulated attention (e.g., (Lall et al., 3 Nov 2025, Dong et al., 2022)), the self-attention computation is augmented:
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + B_{\mathrm{gaze}}\right)V,$$
where $B_{\mathrm{gaze}}$ is a spatially structured bias determined by gaze coordinates or predicted fixation scores.
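A minimal sketch of such an additive bias is shown below; the isotropic Gaussian form, its width, and the function names are illustrative assumptions, and the gaze-direction term used in (Lall et al., 3 Nov 2025) is omitted for brevity.

```python
import torch

def gaze_bias(patch_centers: torch.Tensor, fixation: torch.Tensor,
              sigma: float = 0.1) -> torch.Tensor:
    """B_gaze[i, j]: bias added to the attention logit from token i to token j.
    Here the bias depends only on how close token j lies to the fixation
    (an isotropic Gaussian), so every query is nudged toward gaze-proximal keys.

    patch_centers: (N, 2) normalized (x, y) centers of the image patches.
    fixation:      (2,)   normalized gaze point.
    """
    d2 = ((patch_centers - fixation) ** 2).sum(-1)           # (N,) squared distance to fixation
    key_bias = torch.exp(-d2 / (2 * sigma ** 2))              # peaks at the fixation
    return key_bias.unsqueeze(0).expand(len(key_bias), -1)    # broadcast over queries -> (N, N)

def gaze_biased_attention(q, k, v, b_gaze):
    """softmax(QK^T / sqrt(d) + B_gaze) V for a single attention head."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5 + b_gaze
    return torch.softmax(logits, dim=-1) @ v

# Example: 20 tokens of dimension 64, fixation near the upper-left of the image.
centers = torch.rand(20, 2)
q = k = v = torch.randn(20, 64)
out = gaze_biased_attention(q, k, v, gaze_bias(centers, torch.tensor([0.2, 0.3])))
print(out.shape)  # torch.Size([20, 64])
```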
4. Applications and Empirical Results
Gaze-augmented Transformers have been validated in multiple domains:
- Image classification: GG-Transformer achieved +0.8% top-1 accuracy on ImageNet-1K compared with Swin-T for equal parameter and FLOP budgets (Yu et al., 2021).
- Semantic segmentation: GG-T on ADE20K exceeded Swin-T mIoU by 1.9% in single-scale evaluation (Yu et al., 2021).
- Object detection: Incorporating gaze into DETR frameworks ("Eyes on Target") generated consistent mAP and F1 gains on egocentric datasets (e.g., +0.03 accuracy on Ego-CH-Gaze over gaze-agnostic DETR) (Lall et al., 3 Nov 2025).
- Gaze following and gaze object/target detection: Sharingan (Tafasca et al., 2023), Object-aware GOT (Tonini et al., 2023), and TransGOP (Wang et al., 2024) reported state-of-the-art AUC, L2 distance, and AP on GazeFollow, VAT, and GOO benchmarks—e.g. TransGOP improved mSoC by 24.9% on GOO-Synth over prior SOTA (Wang et al., 2024).
- Neural IR and NLP: GazBy outperformed MonoBERT and ColBERT in certain TREC DL settings, with P@10 and nDCG@10 improvements of +1.1% and +2.0% in cross-encoder mode (2019 track) (Dong et al., 2022).
- Robot policy learning: Foveated ViTs trained with human gaze reduced computation per step by 16×, cut policy memory requirements (4.0 GB vs. 20.9 GB), and improved or matched success rates on high-precision and distractor-rich tasks on AV-ALOHA (Chuang et al., 21 Jul 2025).
A table summarizing representative tasks and primary mechanisms:
| Domain | Gaze Mechanism | Notable Result | Reference |
|---|---|---|---|
| Vision | Glance-and-Gaze dual-branch | Linear SA, mIoU↑, AP↑ | (Yu et al., 2021) |
| Gaze Following | Multi-person tokens, fusion | AUC = 0.938, multi-person support | (Tafasca et al., 2023) |
| Object Det. | Gaze-biased attention | mAP/F1 gain on ego datasets | (Lall et al., 3 Nov 2025) |
| IR/NLP | Gaze-injected self-attention | nDCG/P@10 improvement | (Dong et al., 2022) |
| Robotics | Foveated ViT tokenization | 16× compute ↓, SOTA success | (Chuang et al., 21 Jul 2025) |
| VLM | Prompted gaze fusion | SOTA AP_ob, AUC = 0.929 | (Mathew et al., 9 Nov 2025) |
5. Evaluation Metrics and Ablation Insights
Evaluation spans image-level and sequence-level metrics as relevant to the task. Typical image metrics include AUC for gaze heatmap overlap, L2 distance between predicted and ground-truth gaze points, mean/minimal pixel distance, angular error, mean average precision (mAP) at IoU=0.5 for gaze-object localization, and task-specific success rates for reinforcement or imitation learning.
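For the point-based metrics, a minimal sketch follows; the normalized-coordinate convention and the explicit eye-position argument for angular error are assumptions made for illustration.

```python
import numpy as np

def gaze_l2(pred_xy: np.ndarray, gt_xy: np.ndarray) -> float:
    """L2 distance between predicted and ground-truth gaze points,
    both given in image-normalized [0, 1] coordinates."""
    return float(np.linalg.norm(pred_xy - gt_xy))

def angular_error_deg(eye_xy: np.ndarray, pred_xy: np.ndarray, gt_xy: np.ndarray) -> float:
    """Angle (degrees) between predicted and ground-truth gaze directions,
    both measured from the eye position in the image plane."""
    u, v = pred_xy - eye_xy, gt_xy - eye_xy
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Example: prediction 5% of the image width away from the ground truth.
print(gaze_l2(np.array([0.55, 0.40]), np.array([0.50, 0.40])))            # 0.05
print(angular_error_deg(np.array([0.10, 0.10]),
                        np.array([0.55, 0.40]), np.array([0.50, 0.40])))  # a few degrees
```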
Critical ablation findings include:
- In Object-aware GOT, the introduction of gaze-object transformer self-attention with explicit bias delivered a +6 pp AUC improvement over a gaze-agnostic transformer baseline, with further gains from head–object masking (Tonini et al., 2023).
- In GazBy, injection-point ablations showed a strong performance collapse when gaze priors were injected outside the query–document interaction or late-fusion stage (Dong et al., 2022).
- Foveated ViTs consistently reduced token budget and latency without degrading visual recognition performance, especially on tasks requiring precise manipulation or focus (Chuang et al., 21 Jul 2025).
- Cross-attention directionality (object-to-gaze in TransGOP) produced mSoC gains of +13.4% over inverse gaze-to-object schemes (Wang et al., 2024).
6. Broader Implications and Trends
The integration of gaze into transformer architectures serves as a template for broader biologically-inspired structural priors in deep learning. Gaze-augmentation enhances efficiency by concentrating parameter and compute resources near task-salient locations. The strategy generalizes across modalities and tasks, suggesting extensibility to other forms of attention-of-interest (e.g., via language, action intent, or afferent perceptual signals) beyond human eye-tracking.
Vision-language models (e.g., GazeVLM) illustrate that explicit prompt-based gaze annotation can orchestrate unified multi-task reasoning for detection, regression, and classification (Mathew et al., 9 Nov 2025). Vision transformers with foveated or partitioned schemes maintain or improve performance as spatial resolution increases, circumventing O(N²) scaling penalties. The modularity of gaze encoding, whether via features, biases, or tokens, enables seamless integration with new transformer blocks or attention variants.
A plausible implication is that future transformer-based systems for physical interaction, human-robot communication, medical imaging, and AR/VR will routinely exploit both real and simulated gaze signals for inductive bias, interpretability, and sample-efficient adaptation.
7. Open Challenges and Future Directions
Open challenges in gaze-augmented transformers include learning deformable or dynamic partitioning patterns (e.g., learnable adaptive dilation in GG-Transformer), integrating multiple gaze sources (e.g., from multi-agent views), and scaling architectures for real-time and persistent attention tracking in video. In robotics and embodied agents, designing policies that can learn not only to imitate but to optimize their own visual gaze for task performance remains an unsolved research direction (Chuang et al., 21 Jul 2025).
There is active exploration of integrating additional modalities (e.g., depth via HHA encoding (Mathew et al., 9 Nov 2025), pupil diameter, gaze cone modeling) and more advanced forms of prompt-based supervision. Expanding annotated datasets that capture ecological gaze in complex environments will be essential for further progress. The continuing convergence of efficient transformer architectures, biological insights, and high-fidelity behavioral data is expected to drive innovation in both model topology and application breadth for gaze-augmented transformers.