Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention (2303.15274v3)
Abstract: Predicting human gaze is important in Human-Computer Interaction (HCI). However, to practically serve HCI applications, gaze prediction models must be scalable, fast, and accurate in their spatial and temporal gaze predictions. Recent scanpath prediction models focus on goal-directed attention (search). Such models are limited in application because they commonly rely on trained target detectors for all possible objects and on the availability of human gaze data for training (neither of which is scalable). In response, we pose a new task called ZeroGaze, a variant of zero-shot learning in which gaze is predicted for never-before-searched objects, and we develop a novel model, Gazeformer, to solve the ZeroGaze problem. In contrast to existing methods that use object detector modules, Gazeformer encodes the target using a natural language model, thus leveraging semantic similarities in scanpath prediction. We use a transformer-based encoder-decoder architecture because transformers are particularly useful for generating contextual representations. Gazeformer surpasses other models by a large margin in the ZeroGaze setting. It also outperforms existing target-detection models on standard gaze prediction for both target-present and target-absent search tasks. In addition to its improved performance, Gazeformer is more than five times faster than the state-of-the-art target-present visual search model.
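To make the described architecture concrete, below is a minimal PyTorch sketch of the kind of model the abstract outlines: flattened image features and a language-model embedding of the target's name are fused by a transformer encoder, and a set of learned queries is decoded in parallel into a fixed-length scanpath. This is an illustrative reconstruction from the abstract alone, not the authors' implementation; the class name `GazeformerSketch`, the feature dimensions, the `(x, y, duration)` output head, and the fixed number of fixation queries are all hypothetical stand-ins.

```python
# Hypothetical sketch of a Gazeformer-style encoder-decoder (not the authors' code).
# Assumptions: image features come from a frozen visual backbone, flattened to a
# token sequence; the target name is embedded by a pretrained language model
# (both stood in for by random tensors below); the decoder emits a fixed number
# of fixations as (x, y, duration) triples.
import torch
import torch.nn as nn

class GazeformerSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=3, n_fixations=10,
                 lm_dim=768, img_feat_dim=2048):
        super().__init__()
        # Project backbone image features and the language-model target
        # embedding into a shared d_model space.
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.tgt_proj = nn.Linear(lm_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        # Learned queries, one per predicted fixation (parallel decoding).
        self.fixation_queries = nn.Parameter(torch.randn(n_fixations, d_model))
        # Regress (x, y, duration) for each fixation token.
        self.head = nn.Linear(d_model, 3)

    def forward(self, img_feats, target_embedding):
        # img_feats: (B, HW, img_feat_dim) flattened backbone feature map.
        # target_embedding: (B, lm_dim) embedding of the target's name.
        tgt_tok = self.tgt_proj(target_embedding).unsqueeze(1)       # (B, 1, d)
        tokens = torch.cat([self.img_proj(img_feats), tgt_tok], dim=1)
        memory = self.encoder(tokens)
        queries = self.fixation_queries.unsqueeze(0).expand(img_feats.size(0), -1, -1)
        out = self.decoder(queries, memory)
        return self.head(out)  # (B, n_fixations, 3): x, y, duration

# Toy usage with random stand-ins for backbone and language-model outputs.
model = GazeformerSketch()
scanpath = model(torch.randn(2, 49, 2048), torch.randn(2, 768))
print(scanpath.shape)  # torch.Size([2, 10, 3])
```

Because the target enters only as a language-model embedding, a never-before-searched object name can still be mapped into the shared space at inference time, which is what makes the zero-shot ZeroGaze setting possible without per-object detectors.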
Authors: Sounak Mondal, Zhibo Yang, Seoyoung Ahn, Dimitris Samaras, Gregory Zelinsky, Minh Hoai