Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention (2404.07347v1)

Published 10 Apr 2024 in cs.CV, cs.HC, and cs.LG

Abstract: Humans utilize their gaze to concentrate on essential information while perceiving and interpreting intentions in videos. Incorporating human gaze into computational algorithms can significantly enhance model performance in video understanding tasks. In this work, we address a challenging and innovative task in video understanding: predicting the actions of an agent in a video based on a partial video. We introduce the Gaze-guided Action Anticipation algorithm, which establishes a visual-semantic graph from the video input. Our method utilizes a Graph Neural Network to recognize the agent's intention and predict the action sequence to fulfill this intention. To assess the efficiency of our approach, we collect a dataset containing household activities generated in the VirtualHome environment, accompanied by human gaze data of viewing videos. Our method outperforms state-of-the-art techniques, achieving a 7% improvement in accuracy for 18-class intention recognition. This highlights the efficiency of our method in learning important features from human gaze data.
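
To make the described pipeline concrete, below is a minimal sketch of how a visual-semantic graph built from a partial video could be passed to a graph neural network that classifies one of the 18 intentions. This is not the authors' implementation: the class name IntentionGNN, the feature dimensions, the two-layer GCN architecture, and the gaze-weighting step are illustrative assumptions, and it relies on PyTorch and PyTorch Geometric.

    # Minimal sketch (not the paper's code): a gaze-weighted visual-semantic graph
    # fed to a GCN that predicts one of 18 intention classes.
    import torch
    import torch.nn as nn
    from torch_geometric.nn import GCNConv, global_mean_pool

    class IntentionGNN(nn.Module):
        def __init__(self, node_dim=512, hidden_dim=256, num_intentions=18):
            super().__init__()
            self.conv1 = GCNConv(node_dim, hidden_dim)
            self.conv2 = GCNConv(hidden_dim, hidden_dim)
            self.classifier = nn.Linear(hidden_dim, num_intentions)

        def forward(self, x, edge_index, batch):
            # x: node features (e.g., visual/semantic embeddings of objects in the
            #    partial video), optionally scaled by gaze fixation weights beforehand.
            # edge_index: graph connectivity (e.g., spatial or co-occurrence relations).
            h = torch.relu(self.conv1(x, edge_index))
            h = torch.relu(self.conv2(h, edge_index))
            g = global_mean_pool(h, batch)   # one embedding per video graph
            return self.classifier(g)        # logits over intention classes

    # Toy example: 4 object nodes with bidirectional edges, all in one graph.
    x = torch.randn(4, 512)
    edge_index = torch.tensor([[0, 1, 2, 3], [1, 0, 3, 2]])
    batch = torch.zeros(4, dtype=torch.long)
    logits = IntentionGNN()(x, edge_index, batch)
    print(logits.shape)  # torch.Size([1, 18])

A predicted intention from this classifier could then condition a separate decoder that anticipates the remaining action sequence, which is how the abstract frames the full method.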

Authors (6)
  1. Yao Rong (30 papers)
  2. Berat Mert Albaba (4 papers)
  3. Yen-Ling Kuo (22 papers)
  4. Xi Wang (275 papers)
  5. Enkelejda Kasneci (97 papers)
  6. Suleyman Ozdel (6 papers)
Citations (3)