
Spatio-Temporal Attention and Gaussian Processes for Personalized Video Gaze Estimation (2404.05215v2)

Published 8 Apr 2024 in cs.CV

Abstract: Gaze is an essential prompt for analyzing human behavior and attention. Recently, there has been an increasing interest in determining gaze direction from facial videos. However, video gaze estimation faces significant challenges, such as understanding the dynamic evolution of gaze in video sequences, dealing with static backgrounds, and adapting to variations in illumination. To address these challenges, we propose a simple and novel deep learning model designed to estimate gaze from videos, incorporating a specialized attention module. Our method employs a spatial attention mechanism that tracks spatial dynamics within videos. This technique enables accurate gaze direction prediction through a temporal sequence model, adeptly transforming spatial observations into temporal insights, thereby significantly improving gaze estimation accuracy. Additionally, our approach integrates Gaussian processes to include individual-specific traits, facilitating the personalization of our model with just a few labeled samples. Experimental results confirm the efficacy of the proposed approach, demonstrating its success in both within-dataset and cross-dataset settings. Specifically, our proposed approach achieves state-of-the-art performance on the Gaze360 dataset, improving by $2.5^\circ$ without personalization. Further, by personalizing the model with just three samples, we achieved an additional improvement of $0.8^\circ$. The code and pre-trained models are available at \url{https://github.com/jswati31/stage}.
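The abstract's Gaussian-process personalization step can be illustrated with a minimal sketch: fit a GP on the residuals between a base model's gaze predictions and a handful of labeled calibration frames, then apply the GP posterior mean as an additive correction. This is an assumption-laden toy, not the paper's implementation; the feature embeddings, kernel, and noise level here are all hypothetical placeholders.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, variance=1.0):
    # Squared-exponential kernel between the rows of A (n, d) and B (m, d).
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * sq_dists / length_scale ** 2)

def gp_personalize(feats_cal, resid_cal, feats_query, noise=1e-2):
    """GP posterior mean over per-person gaze residuals (toy sketch).

    feats_cal:   (k, d) features of the k labeled calibration frames
    resid_cal:   (k, 2) yaw/pitch residuals (label minus base prediction)
    feats_query: (m, d) features of the frames to correct
    Returns an (m, 2) additive correction for the base model's output.
    """
    K = rbf_kernel(feats_cal, feats_cal) + noise * np.eye(len(feats_cal))
    K_star = rbf_kernel(feats_query, feats_cal)
    return K_star @ np.linalg.solve(K, resid_cal)

# Toy usage: personalize with just three labeled samples, echoing the
# three-sample setting reported in the abstract.
rng = np.random.default_rng(0)
feats_cal = rng.normal(size=(3, 4))            # hypothetical frame embeddings
resid_cal = rng.normal(scale=0.5, size=(3, 2))  # hypothetical yaw/pitch residuals
correction = gp_personalize(feats_cal, resid_cal, feats_cal)
print(correction.shape)  # (3, 2)
```

With a small noise term, the posterior mean nearly interpolates the calibration residuals, so the corrected predictions match the labeled samples while generalizing smoothly to nearby frames.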

Authors (3)
  1. Swati Jindal
  2. Mohit Yadav
  3. Roberto Manduchi
