
A Transformer-Based Model for the Prediction of Human Gaze Behavior on Videos (2404.07351v1)

Published 10 Apr 2024 in cs.CV, cs.HC, and cs.LG

Abstract: Eye-tracking applications that utilize the human gaze in video understanding tasks have become increasingly important. To effectively automate the process of video analysis based on eye-tracking data, it is important to accurately replicate human gaze behavior. However, this task presents significant challenges due to the inherent complexity and ambiguity of human gaze patterns. In this work, we introduce a novel method for simulating human gaze behavior. Our approach uses a transformer-based reinforcement learning algorithm to train an agent that acts as a human observer, with the primary role of watching videos and simulating human gaze behavior. We employed an eye-tracking dataset gathered from videos generated by the VirtualHome simulator, with a primary focus on activity recognition. Our experimental results demonstrate the effectiveness of our gaze prediction method by highlighting its capability to replicate human gaze behavior and its applicability for downstream tasks where real human gaze is used as input.
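
The abstract describes a transformer-based agent that watches a video and emits gaze behavior step by step, in the spirit of sequence-modeling approaches to reinforcement learning. As a rough, hypothetical illustration of that idea (not the authors' implementation: the class name GazeObserver, the interleaved frame/gaze token layout, the feature dimensions, and the teacher-forced interface are all assumptions), a minimal PyTorch sketch:

```python
# Minimal sketch of a transformer "observer" that reads a video as
# interleaved frame/gaze tokens and autoregressively predicts the next
# gaze point. Illustrative assumptions throughout, not the paper's code.
import torch
import torch.nn as nn

class GazeObserver(nn.Module):
    def __init__(self, feat_dim=512, d_model=128, n_heads=4, n_layers=4, max_len=64):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, d_model)  # per-frame visual features
        self.gaze_proj = nn.Linear(2, d_model)          # past (x, y) gaze points
        self.pos_emb = nn.Embedding(2 * max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 2)               # next gaze point in [0, 1]^2

    def forward(self, frame_feats, past_gaze):
        # frame_feats: (B, T, feat_dim); past_gaze: (B, T, 2), teacher-forced
        B, T, _ = frame_feats.shape
        # Interleave tokens as f_1, g_1, f_2, g_2, ... so each frame token
        # attends to everything before it but not to its own gaze target.
        tokens = torch.stack(
            (self.frame_proj(frame_feats), self.gaze_proj(past_gaze)), dim=2
        ).reshape(B, 2 * T, -1)
        tokens = tokens + self.pos_emb(torch.arange(2 * T, device=tokens.device))
        causal = torch.triu(  # standard causal mask over the 2T-token stream
            torch.full((2 * T, 2 * T), float("-inf"), device=tokens.device),
            diagonal=1,
        )
        h = self.encoder(tokens, mask=causal)
        # Read predictions off the frame-token positions (even indices):
        # the hidden state for frame t predicts gaze point g_t.
        return torch.sigmoid(self.head(h[:, 0::2]))

model = GazeObserver()
feats, gaze = torch.randn(1, 16, 512), torch.rand(1, 16, 2)
pred = model(feats, gaze)  # (1, 16, 2): one predicted fixation per frame
```

At inference time such an agent would run autoregressively, feeding each predicted fixation back in as the next gaze token; at training time, ground-truth gaze can be teacher-forced as shown.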

Authors (6)
  1. Yao Rong (30 papers)
  2. Berat Mert Albaba (4 papers)
  3. Yen-Ling Kuo (22 papers)
  4. Xi Wang (275 papers)
  5. Enkelejda Kasneci (97 papers)
  6. Suleyman Ozdel (6 papers)
Citations (2)
