Gaze-Aware Head Importance Metric
- The paper introduces a metric that quantifies ViT attention head alignment with gaze cues by combining baseline importance scores with gaze overlap measurements.
- It details a methodology involving forward pass extraction, patch-level attention averaging, and gaze RoI mapping to compute gaze alignment weights.
- Experimental results on datasets like CUB-GHA show that integrating gaze cues redirects attention dynamics, improving the model's interpretability for object detection.
The Gaze-Aware Attention Head Importance Metric is a quantitative diagnostic tool developed to assess the degree to which individual self-attention heads in Vision Transformers (ViTs) are influenced by human gaze cues. Introduced in the context of gaze-guided object detection for egocentric video, this metric extends the traditional head importance analysis by explicitly incorporating alignment with viewer-attended regions. It enables detailed inspection of how gaze-biased architectural interventions alter the internal dynamics of attention mechanisms, supporting both model interpretability and principled architectural tuning.
1. Motivation and Conceptual Foundations
The central premise is that in many egocentric and task-driven visual contexts, human gaze serves as a privileged supervisory signal, defining regions of primary attentional importance. Standard ViT architectures consist of multiple attention heads per layer, each learning to focus on different spatial patterns—yet which heads become responsive to gaze, and to what degree, remains opaque unless explicitly measured.
The Gaze-Aware Head Importance Metric addresses this opacity by quantifying, for each head $h$, the overlap between its spatial attention allocation and the user’s gaze-defined region-of-interest (RoI). Unlike generic head importance scores, which measure overall activity or dispersion of attention, the gaze-aware variant highlights heads whose focus has become more aligned with human visual intent through gaze-driven model modifications.
2. Mathematical Formalism
Let $\mathcal{D} = \{f_1, \ldots, f_N\}$ denote a dataset of $N$ video frames, with each frame segmented into $L$ non-overlapping patches. In a given self-attention layer with $H$ heads, the standard attention weight is $A_h(q, p, f)$: the amount of attention head $h$ places on patch $p$ from query $q$ in frame $f$. Human gaze is represented by normalized coordinates $(g_x, g_y)$, with a binary mask $M_f(p) \in \{0, 1\}$ indicating whether patch $p$ covers the gaze RoI in frame $f$.
The metric comprises two terms:
- Baseline Head Importance:
  $$I_h^{\text{base}} = \frac{1}{N} \sum_{f=1}^{N} \frac{1}{L} \sum_{p=1}^{L} \bar{A}_h(p, f),$$
  where $\bar{A}_h(p, f)$ is the attention head $h$ assigns to patch $p$ in frame $f$, averaged over queries. This reflects the average activity level of head $h$ over all frames and patches, following Li et al. [doesattentionworkvision].
- Gaze Alignment Weight:
  $$G_h = \frac{1}{N} \sum_{f=1}^{N} g_h(f),$$
  where
  $$g_h(f) = \sum_{p \,:\, M_f(p) = 1} \bar{A}_h(p, f).$$
  This term quantifies the proportion of attention mass from head $h$ that falls within gaze regions.
The Gaze-Aware Head Importance is then given by:
$$I_h^{\text{gaze}} = \alpha \, I_h^{\text{base}} + \beta \, G_h,$$
with $\alpha$ and $\beta$ controlling the weighting of overall activity vs. gaze alignment (both tuned empirically on CUB-GHA for optimal IoU alignment).
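As a concrete sanity check, a minimal worked instance of the scoring formula; the numbers here are hypothetical, not results from the paper:

```python
# Worked illustration of I_gaze = alpha * I_base + beta * G.
# All values are hypothetical, not reported in the paper.
alpha, beta = 0.5, 0.5      # balance weights (the paper tunes these on CUB-GHA)
I_base = 0.50               # baseline importance of one head
G = 0.80                    # that head's average gaze alignment

I_gaze = alpha * I_base + beta * G
print(I_gaze)               # 0.65 -> strong gaze alignment lifts the head's rank
```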
3. Practical Computation Procedure
Computation is executed at inference time, structured as follows:
- Forward Pass Extraction: For each frame $f$, perform a forward pass through the gaze-integrated ViT and extract attention tensors for all heads and layers of interest.
- Patch-Level Attention Averaging: Compute $\bar{A}_h(p, f)$ for each patch $p$: the mean attention assigned by head $h$, averaged over queries.
- Gaze RoI Mapping: Identify patches overlapping the gaze-centered RoI to produce $M_f(p)$ for all patches in every frame.
- Per-Frame Gaze Alignment: Calculate $g_h(f)$ as the sum of $\bar{A}_h(p, f)$ over gaze-overlapping patches.
- Dataset Aggregation: Compute $G_h = \frac{1}{N} \sum_f g_h(f)$, yielding the head’s average gaze overlap.
- Baseline Importance Calculation: In parallel, compute $I_h^{\text{base}}$ as the mean attention activity over all patches and frames.
- Final Head Importance Score: Combine $I_h^{\text{base}}$ and $G_h$ using the pre-tuned $\alpha$ and $\beta$ to produce $I_h^{\text{gaze}}$ for each head.
All normalizations by $1/L$ and $1/N$ standardize the comparisons across heads, layers, datasets, and model variants.
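A minimal NumPy sketch of this procedure, assuming the query-averaged attention maps have already been extracted from the forward pass (array names and shapes are illustrative, not the paper's code):

```python
import numpy as np

def gaze_aware_head_importance(attn, gaze_mask, alpha=0.5, beta=0.5):
    """Compute per-head gaze-aware importance scores I_h^gaze.

    attn:      (N, H, L) array of query-averaged attention, i.e. the mean
               attention head h assigns to patch p in frame f.
    gaze_mask: (N, L) binary array, M_f(p) = 1 if patch p overlaps the
               gaze RoI in frame f.
    Returns:   (H,) array of gaze-aware importance scores.
    """
    # Baseline importance: mean attention activity over all frames and patches.
    I_base = attn.mean(axis=(0, 2))                      # (H,)

    # Per-frame gaze alignment g_h(f): attention mass inside the gaze RoI.
    g = (attn * gaze_mask[:, None, :]).sum(axis=2)       # (N, H)

    # Dataset aggregation: average gaze overlap G_h per head.
    G = g.mean(axis=0)                                   # (H,)

    # Final score: weighted combination of activity and gaze alignment.
    return alpha * I_base + beta * G

# Shape-only usage example; real inputs come from the ViT forward pass.
attn = np.random.rand(8, 12, 196)                        # 8 frames, 12 heads, 14x14 patches
attn /= attn.sum(axis=2, keepdims=True)                  # normalize attention per head/frame
gaze_mask = (np.random.rand(8, 196) < 0.1).astype(attn.dtype)
scores = gaze_aware_head_importance(attn, gaze_mask)
```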
4. Interpreting Attention Head Dynamics Under Gaze-Bias
Applying the Gaze-Aware Head Importance Metric uncovers which ViT attention heads are most susceptible to gaze-guided modulation. In “Eyes on Target,” comparison of head importance before and after gaze integration at the second attention layer of a DETR variant revealed the following:
| Head Index | $I_h$ (baseline) | $I_h$ (with gaze) |
|---|---|---|
| 1 | 0.56 | 0.68 |
| 5 | 0.41 | 0.54 |
| 6 | 0.65 | 0.62 |
This shift indicates selective amplification: heads focused on user-attended regions (e.g., instrument panels, gauges) increased in importance, while heads attending to less eye-fixated features (e.g., scene contours) decreased. Visualizations of the top 40% attention mass for each head further corroborate that gaze-aware tuning steers some heads toward high-gaze interface elements.
This suggests that the metric can serve not only as a retrospective analytic tool but also as a means to identify and prune or repurpose less gaze-responsive heads in model architectures.
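The top-40% overlays referenced above can be reproduced by thresholding a head's patch attention at a cumulative-mass cutoff. A minimal sketch of that thresholding step (plotting omitted; names are illustrative):

```python
import numpy as np

def top_mass_mask(patch_attn, mass=0.4):
    """Binary mask over patches covering the top `mass` fraction of one
    head's attention in one frame, as used for overlay visualizations.

    patch_attn: (L,) nonnegative attention of a single head over L patches.
    """
    order = np.argsort(patch_attn)[::-1]             # most-attended patches first
    cum = np.cumsum(patch_attn[order]) / patch_attn.sum()
    keep = order[: np.searchsorted(cum, mass) + 1]   # smallest set reaching `mass`
    mask = np.zeros_like(patch_attn, dtype=bool)
    mask[keep] = True
    return mask
```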
5. Experimental and Hyperparameter Results
On the CUB-GHA dataset (fine-grained bird images with ground-truth fixation maps), hyperparameter sweeps over $\alpha$ and $\beta$ maximized the intersection-over-union between aggregated head attention and gaze. The optimal $(\alpha, \beta)$ balance achieved mean IoU = 0.71, indicating that the metric’s calibration is sensitive and consistent with human-labeled attentional focus.
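A sketch of one way such a sweep can be run, reusing `gaze_aware_head_importance` from the earlier sketch. The binarization and head-aggregation protocol below is an assumption for illustration; the paper's exact sweep procedure is not specified in this section:

```python
import numpy as np
from itertools import product

def sweep_alpha_beta(attn, gaze_mask, grid=np.linspace(0, 1, 11), top_k=3):
    """Grid-search (alpha, beta) by the IoU between the attention of the
    top-k scored heads and the gaze RoI. Illustrative protocol only."""
    best = (None, None, -1.0)
    for alpha, beta in product(grid, grid):
        scores = gaze_aware_head_importance(attn, gaze_mask, alpha, beta)
        heads = np.argsort(scores)[-top_k:]              # most important heads
        agg = attn[:, heads, :].mean(axis=1)             # (N, L) aggregated attention
        pred = agg > agg.mean(axis=1, keepdims=True)     # binarize at the per-frame mean
        gt = gaze_mask.astype(bool)
        inter = (pred & gt).sum(axis=1)
        union = (pred | gt).sum(axis=1)
        iou = np.mean(inter / np.maximum(union, 1))      # guard empty unions
        if iou > best[2]:
            best = (alpha, beta, iou)
    return best                                          # (alpha*, beta*, best IoU)
```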
Quantitatively, heads 1 and 5 exhibited gains of +0.12 and +0.13 points, respectively, in gaze-aware importance, demonstrating that gaze cues selectively enhance heads aligned with critical task regions.
Qualitative overlays in the maritime simulator scenario show that gaze integration, as reflected in the metric, correlates with systematic redirection of attention maps toward functionally relevant interface components.
6. Applications and Implications
The Gaze-Aware Head Importance Metric is directly applicable as a diagnostic and evaluation instrument in vision transformer models subjected to human-attention steering. Its concise scoring system enables identification of the most and least gaze-responsive heads, guiding architectural refinement and fostering interpretability in environments where user visual attention is pivotal.
By revealing the head-level effects of gaze-driven modifications, the metric can inform pruning, head reallocation, or targeted enhancement strategies in model design. Furthermore, its generic formulation—normalizing to patch and sample count—permits head importance comparisons across architectures, layers, and datasets, lending itself to cross-benchmark evaluations and ablations.
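As one illustration of the pruning use case, a minimal sketch of ranking heads by their scores and flagging the least gaze-responsive ones; the `keep_fraction` cutoff is a hypothetical policy knob, not a value prescribed by the paper:

```python
import numpy as np

def select_prune_candidates(scores, keep_fraction=0.75):
    """Flag the least gaze-responsive heads as pruning candidates.

    scores: (H,) gaze-aware importance scores, one per head.
    keep_fraction: hypothetical policy, not a value from the paper.
    """
    n_heads = len(scores)
    n_keep = max(1, int(round(keep_fraction * n_heads)))
    order = np.argsort(scores)                 # ascending: least important first
    return sorted(order[: n_heads - n_keep].tolist())
```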
A plausible implication is that future architectures for human-in-the-loop or safety-critical perception tasks can incorporate metric-guided feedback to maintain or amplify attention alignment with user intent.
7. Limitations and Future Considerations
While the metric offers transparency into per-head gaze alignment, it rests on the assumption that human gaze is optimal or desirable for the target task; this assumption may not generalize to all domains. Additionally, the binary RoI mask may oversimplify smooth or peripheral fixations. Extension to probabilistic or temporally integrated gaze models could address such limitations.
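A sketch of one such extension, replacing the binary mask $M_f(p)$ with a normalized fixation density; this is an illustrative variant, not part of the published metric:

```python
import numpy as np

def soft_gaze_alignment(attn, gaze_density):
    """Soft variant of the gaze alignment term G_h: weight attention by a
    normalized per-patch fixation density instead of a binary RoI mask.

    attn:         (N, H, L) query-averaged attention per head and patch.
    gaze_density: (N, L) nonnegative fixation density per patch.
    """
    d = gaze_density / gaze_density.sum(axis=1, keepdims=True)  # normalize per frame
    return (attn * d[:, None, :]).sum(axis=2).mean(axis=0)      # (H,) soft G_h
```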
Continued investigation into the cross-task and cross-domain robustness of the metric, as well as its suitability for dynamically adaptive transformer heads, remains a substantive future research direction. Its integration with richer saliency models may further broaden its applicability to diverse real-world tasks involving human visual behavior.