Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention (2303.15274v3)

Published 27 Mar 2023 in cs.CV

Abstract: Predicting human gaze is important in Human-Computer Interaction (HCI). However, to practically serve HCI applications, gaze prediction models must be scalable, fast, and accurate in their spatial and temporal gaze predictions. Recent scanpath prediction models focus on goal-directed attention (search). Such models are limited in their application due to a common approach relying on trained target detectors for all possible objects, and the availability of human gaze data for their training (both not scalable). In response, we pose a new task called ZeroGaze, a new variant of zero-shot learning where gaze is predicted for never-before-searched objects, and we develop a novel model, Gazeformer, to solve the ZeroGaze problem. In contrast to existing methods using object detector modules, Gazeformer encodes the target using a natural language model, thus leveraging semantic similarities in scanpath prediction. We use a transformer-based encoder-decoder architecture because transformers are particularly useful for generating contextual representations. Gazeformer surpasses other models by a large margin on the ZeroGaze setting. It also outperforms existing target-detection models on standard gaze prediction for both target-present and target-absent search tasks. In addition to its improved performance, Gazeformer is more than five times faster than the state-of-the-art target-present visual search model.

Authors (6)
  1. Sounak Mondal (6 papers)
  2. Zhibo Yang (43 papers)
  3. Seoyoung Ahn (10 papers)
  4. Dimitris Samaras (125 papers)
  5. Gregory Zelinsky (11 papers)
  6. Minh Hoai (48 papers)
Citations (22)

Summary

  • The paper presents a novel ZeroGaze task that enables zero-shot gaze prediction without relying on pre-trained object detectors.
  • Gazeformer combines ResNet-50 visual features with RoBERTa target embeddings, integrating visual context with semantic information about the search target.
  • Gazeformer outperforms existing models by margins of up to 70% in the ZeroGaze setting and runs more than five times faster than the prior state of the art, making it practical for real-time applications.

Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention

The paper "Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention" tackles the critical problem of predicting human gaze in interactive systems. Specifically, it introduces a novel task, termed ZeroGaze, which is a variant of zero-shot learning for gaze prediction involving objects that the model has never encountered during training. The proposed solution, Gazeformer, is a new model designed to solve this challenging task by leveraging transformer-based architectures and linguistic embeddings to efficiently predict gaze for unseen target categories.

Key Contributions and Model Architecture

The Gazeformer model departs from the common reliance on trained object detectors, which limits scalability and adaptability. Instead, Gazeformer uses a natural language model to encode the search target, harnessing semantic similarities between objects to improve prediction accuracy. The architecture is a transformer-based encoder-decoder comprising:

  • Image Feature Encoding: Image features are extracted with a ResNet-50 backbone and passed through transformer encoder layers to produce contextual image representations.
  • Semantic Feature Encoding: Targets are represented using embeddings from RoBERTa, allowing the model to generalize across unseen categories based on linguistic correlations.
  • Joint Feature Embedding: The visual and semantic features are integrated into a shared multimodal space.
  • Parallel Scanpath Prediction: All fixation locations and durations are predicted in parallel, each modeled by a Gaussian distribution, which offers substantial speed advantages over sequential (autoregressive) prediction; a minimal sketch of the full pipeline follows below.
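
The sketch below shows how these pieces could fit together in PyTorch. It is a minimal reconstruction from the description above, not the authors' implementation: the class name, hidden size, layer counts, number of fixation queries, and the 7-dimensional output head (means and log-variances for x, y, and duration plus a validity logit) are all illustrative assumptions.

```python
# Minimal sketch of a Gazeformer-style pipeline (not the authors' code):
# ResNet-50 image features and a RoBERTa target embedding are projected into a
# shared space, fused by a transformer encoder-decoder, and each decoder query
# emits Gaussian parameters for one fixation (x, y, duration) in parallel.
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import RobertaModel

class GazeformerSketch(nn.Module):
    def __init__(self, d_model=256, max_fixations=7):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])   # keep the spatial feature map
        self.img_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.lang = RobertaModel.from_pretrained("roberta-base")
        self.txt_proj = nn.Linear(self.lang.config.hidden_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=3)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=3)
        self.queries = nn.Embedding(max_fixations, d_model)  # one learnable query per fixation slot
        # Illustrative head: mean/log-variance for x, y, duration + a validity logit = 7 values
        self.head = nn.Linear(d_model, 7)

    def forward(self, image, target_tokens):
        feats = self.img_proj(self.cnn(image))                      # (B, d, H, W)
        feats = feats.flatten(2).transpose(1, 2)                    # (B, HW, d)
        txt = self.lang(**target_tokens).last_hidden_state[:, 0]    # target-name embedding
        txt = self.txt_proj(txt).unsqueeze(1)                       # (B, 1, d)
        memory = self.encoder(torch.cat([feats, txt], dim=1))       # joint multimodal memory
        q = self.queries.weight.unsqueeze(0).expand(image.size(0), -1, -1)
        out = self.decoder(q, memory)                               # all fixations decoded in parallel
        return self.head(out)                                       # (B, max_fixations, 7)
```

Using one learnable query per fixation slot (in the spirit of DETR's object queries) is what lets the whole scanpath be decoded in a single forward pass rather than autoregressively, which is where the reported speed advantage comes from.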

Empirical Evaluation and Performance

Gazeformer achieves significant performance improvements over existing models across several metrics, including Sequence Score (SS) and Fixation Edit Distance (FED), in both the ZeroGaze and traditional gaze prediction tasks. Notably, Gazeformer outpaces competing models by margins of 19% to 70% in the ZeroGaze setup. This is attributed to its efficient handling of semantic and contextual information, which allows it to predict gaze not only for previously seen targets but also for novel targets without target-specific training data.
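
As a rough illustration of the kind of scanpath metric involved, the snippet below computes an edit distance between two fixation sequences by quantizing fixations onto a coarse grid and applying the Levenshtein distance. The grid size, the `quantize` and `levenshtein` helpers, and the example coordinates are assumptions for illustration; the paper's exact FED and Sequence Score protocols may differ.

```python
# Illustrative fixation-string edit distance: quantize each (x, y) fixation to a
# grid-cell symbol, then compute the Levenshtein distance between symbol strings.
from typing import List, Tuple

def quantize(scanpath: List[Tuple[float, float]], img_w: int, img_h: int,
             grid: int = 16) -> List[int]:
    """Map each (x, y) fixation to a grid-cell index."""
    return [int(y * grid / img_h) * grid + int(x * grid / img_w) for x, y in scanpath]

def levenshtein(a: List[int], b: List[int]) -> int:
    """Classic dynamic-programming edit distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, sa in enumerate(a, 1):
        curr = [i]
        for j, sb in enumerate(b, 1):
            cost = 0 if sa == sb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

# Example: compare a predicted scanpath against a human scanpath on a 512x384 image.
pred  = [(100, 80), (250, 120), (300, 200)]
human = [(110, 90), (260, 115), (400, 250), (310, 205)]
print(levenshtein(quantize(pred, 512, 384), quantize(human, 512, 384)))
```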

Gazeformer also excels in terms of inference speed, being over five times faster than prior state-of-the-art models, which is crucial for deployment in real-time interactive systems.

Implications and Future Directions

The results underscore the potential of language embeddings and transformer architectures in gaze prediction tasks, especially in scenarios where scalability and speed are paramount. By adopting a more general representation of targets, Gazeformer represents a step forward in the integration of gaze prediction in interactive systems, like augmented and virtual reality devices, where fast and reliable user attention modeling is required.

Furthermore, the ZeroGaze task broadens the practical applicability of gaze tracking systems and points to future research on adapting the model to broader vision-and-language tasks, including object-referral scenarios. Investigating whether Gazeformer can be extended to handle richer target descriptions expressed as full natural-language phrases is a particularly promising avenue.

In conclusion, Gazeformer marks a notable advancement in goal-directed attention modeling, demonstrating superior scalability, effectiveness, and speed. These qualities have significant implications for its use in varied human-computer interaction contexts and for rapid adaptation to new or rarely searched objects in real-world applications.
