Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention (2303.15274v3)

Published 27 Mar 2023 in cs.CV

Abstract: Predicting human gaze is important in Human-Computer Interaction (HCI). However, to practically serve HCI applications, gaze prediction models must be scalable, fast, and accurate in their spatial and temporal gaze predictions. Recent scanpath prediction models focus on goal-directed attention (search). Such models are limited in their application due to a common approach relying on trained target detectors for all possible objects, and the availability of human gaze data for their training (both not scalable). In response, we pose a new task called ZeroGaze, a new variant of zero-shot learning where gaze is predicted for never-before-searched objects, and we develop a novel model, Gazeformer, to solve the ZeroGaze problem. In contrast to existing methods using object detector modules, Gazeformer encodes the target using a natural language model, thus leveraging semantic similarities in scanpath prediction. We use a transformer-based encoder-decoder architecture because transformers are particularly useful for generating contextual representations. Gazeformer surpasses other models by a large margin on the ZeroGaze setting. It also outperforms existing target-detection models on standard gaze prediction for both target-present and target-absent search tasks. In addition to its improved performance, Gazeformer is more than five times faster than the state-of-the-art target-present visual search model.

Authors (6)
  1. Sounak Mondal (6 papers)
  2. Zhibo Yang (43 papers)
  3. Seoyoung Ahn (10 papers)
  4. Dimitris Samaras (125 papers)
  5. Gregory Zelinsky (11 papers)
  6. Minh Hoai (48 papers)
Citations (22)

Summary

  • The paper presents a novel ZeroGaze task that enables zero-shot gaze prediction without relying on pre-trained object detectors.
  • Gazeformer combines ResNet-50 visual features with RoBERTa target embeddings, integrating visual context with semantic information about the search target.
  • Gazeformer outperforms existing models by margins of up to 70% in the ZeroGaze setting and runs more than five times faster than the prior state of the art, making it practical for real-time applications.

Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention

The paper "Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention" tackles the critical problem of predicting human gaze in interactive systems. Specifically, it introduces a novel task, termed ZeroGaze, which is a variant of zero-shot learning for gaze prediction involving objects that the model has never encountered during training. The proposed solution, Gazeformer, is a new model designed to solve this challenging task by leveraging transformer-based architectures and linguistic embeddings to efficiently predict gaze for unseen target categories.

Key Contributions and Model Architecture

The Gazeformer model departs from the common reliance on trained object detectors, which limits scalability and adaptability. Instead, Gazeformer uses a natural language model to encode the search target, harnessing semantic similarities between objects to improve prediction accuracy. The architecture is a transformer-based encoder-decoder comprising:

  • Image Feature Encoding: Image features are extracted with a ResNet-50 backbone and passed through transformer encoder layers to produce contextual image representations.
  • Semantic Feature Encoding: Targets are represented using embeddings from RoBERTa, allowing the model to generalize across unseen categories based on linguistic correlations.
  • Joint Feature Embedding: The visual and semantic features are integrated into a shared multimodal space.
  • Parallel Scanpath Prediction: All fixation locations and durations are predicted in parallel, each modeled by a Gaussian distribution, which offers substantial speed advantages over sequential (autoregressive) prediction; a minimal sketch of the full pipeline follows below.
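
The sketch below shows how these pieces could fit together in PyTorch. It is a minimal reconstruction from the description above, not the authors' implementation: the class name, hidden size, layer counts, number of fixation queries, and the 7-dimensional output head (means and log-variances for x, y, and duration plus a validity logit) are all illustrative assumptions.

```python
# Minimal sketch of a Gazeformer-style pipeline (not the authors' code):
# ResNet-50 image features and a RoBERTa target embedding are projected into a
# shared space, fused by a transformer encoder-decoder, and each decoder query
# emits Gaussian parameters for one fixation (x, y, duration) in parallel.
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import RobertaModel

class GazeformerSketch(nn.Module):
    def __init__(self, d_model=256, max_fixations=7):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])   # keep the spatial feature map
        self.img_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.lang = RobertaModel.from_pretrained("roberta-base")
        self.txt_proj = nn.Linear(self.lang.config.hidden_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=3)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=3)
        self.queries = nn.Embedding(max_fixations, d_model)  # one learnable query per fixation slot
        # Illustrative head: mean/log-variance for x, y, duration + a validity logit = 7 values
        self.head = nn.Linear(d_model, 7)

    def forward(self, image, target_tokens):
        feats = self.img_proj(self.cnn(image))                      # (B, d, H, W)
        feats = feats.flatten(2).transpose(1, 2)                    # (B, HW, d)
        txt = self.lang(**target_tokens).last_hidden_state[:, 0]    # target-name embedding
        txt = self.txt_proj(txt).unsqueeze(1)                       # (B, 1, d)
        memory = self.encoder(torch.cat([feats, txt], dim=1))       # joint multimodal memory
        q = self.queries.weight.unsqueeze(0).expand(image.size(0), -1, -1)
        out = self.decoder(q, memory)                               # all fixations decoded in parallel
        return self.head(out)                                       # (B, max_fixations, 7)
```

Using one learnable query per fixation slot (in the spirit of DETR's object queries) is what lets the whole scanpath be decoded in a single forward pass rather than autoregressively, which is where the reported speed advantage comes from.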

Empirical Evaluation and Performance

Gazeformer achieves significant performance improvements over existing models across several metrics, including Sequence Score (SS) and Fixation Edit Distance (FED), in both the ZeroGaze and traditional gaze prediction tasks. Notably, Gazeformer outpaces competing models by margins of 19% to 70% in the ZeroGaze setup. This is attributed to its efficient handling of semantic and contextual information, which allows it to predict gaze not only for previously seen targets but also for novel targets without target-specific training data.
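
As a rough illustration of the kind of scanpath metric involved, the snippet below computes an edit distance between two fixation sequences by quantizing fixations onto a coarse grid and applying the Levenshtein distance. The grid size, the `quantize` and `levenshtein` helpers, and the example coordinates are assumptions for illustration; the paper's exact FED and Sequence Score protocols may differ.

```python
# Illustrative fixation-string edit distance: quantize each (x, y) fixation to a
# grid-cell symbol, then compute the Levenshtein distance between symbol strings.
from typing import List, Tuple

def quantize(scanpath: List[Tuple[float, float]], img_w: int, img_h: int,
             grid: int = 16) -> List[int]:
    """Map each (x, y) fixation to a grid-cell index."""
    return [int(y * grid / img_h) * grid + int(x * grid / img_w) for x, y in scanpath]

def levenshtein(a: List[int], b: List[int]) -> int:
    """Classic dynamic-programming edit distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, sa in enumerate(a, 1):
        curr = [i]
        for j, sb in enumerate(b, 1):
            cost = 0 if sa == sb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

# Example: compare a predicted scanpath against a human scanpath on a 512x384 image.
pred  = [(100, 80), (250, 120), (300, 200)]
human = [(110, 90), (260, 115), (400, 250), (310, 205)]
print(levenshtein(quantize(pred, 512, 384), quantize(human, 512, 384)))
```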

Gazeformer also excels in terms of inference speed, being over five times faster than prior state-of-the-art models, which is crucial for deployment in real-time interactive systems.

Implications and Future Directions

The results underscore the potential of language embeddings and transformer architectures in gaze prediction tasks, especially in scenarios where scalability and speed are paramount. By adopting a more general representation of targets, Gazeformer represents a step forward in the integration of gaze prediction in interactive systems, like augmented and virtual reality devices, where fast and reliable user attention modeling is required.

Furthermore, the ZeroGaze task broadens the practical applicability of gaze tracking systems and points to future research on adapting the model to broader vision-and-language tasks, including object-referral scenarios. Investigating whether Gazeformer can be extended to handle richer target descriptions expressed as full natural-language phrases is a particularly promising avenue.

In conclusion, Gazeformer marks a notable advancement in goal-directed attention modeling, demonstrating superior scalability, effectiveness, and speed. These qualities have significant implications for its use in varied human-computer interaction contexts and for rapid adaptation to new or rarely searched objects in real-world applications.
