CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model

Published 8 Mar 2024 in cs.CV (arXiv:2403.05124v1)

Abstract: Gaze estimation methods often suffer significant performance degradation when evaluated across domains, owing to the gap between training and testing data. Existing methods attempt to address this issue with various domain generalization approaches, but with little success because gaze datasets offer limited diversity in factors such as appearance, wearables, and image quality. To overcome these limitations, we propose a novel framework called CLIP-Gaze that leverages the transferable knowledge of a pre-trained vision-language model. Our framework is the first to apply a vision-and-language cross-modality approach to the gaze estimation task. Specifically, we extract gaze-relevant features by pushing them away from gaze-irrelevant features, which can be flexibly constructed via language descriptions. To learn more suitable prompts, we propose a personalized context optimization method for text prompt tuning. Furthermore, we exploit the relationships among gaze samples to refine the distribution of gaze-relevant features, thereby improving the generalization capability of the gaze estimation model. Extensive experiments demonstrate that CLIP-Gaze outperforms existing methods on four cross-domain evaluations.
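The core "push away from gaze-irrelevant features" idea can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `push_away_loss` is hypothetical, and the random vectors stand in for features that would, in practice, come from CLIP's image and text encoders (the text side encoding irrelevant attributes such as "a person wearing glasses"). The sketch simply penalizes positive cosine similarity between an image feature and each irrelevant text embedding.

```python
import numpy as np

def push_away_loss(gaze_feat, irrelevant_feats):
    """Penalize cosine similarity between one image feature and a
    stack of gaze-irrelevant text features (hypothetical sketch)."""
    g = gaze_feat / np.linalg.norm(gaze_feat)
    T = irrelevant_feats / np.linalg.norm(irrelevant_feats, axis=1, keepdims=True)
    sims = T @ g                                   # cosine similarity per description
    return float(np.mean(np.maximum(sims, 0.0)))   # only positive similarity is penalized

# Random stand-ins for CLIP features (512-d, as in the ViT-B/32 variant).
rng = np.random.default_rng(0)
img_feature = rng.standard_normal(512)             # would be the image encoder output
text_features = rng.standard_normal((3, 512))      # would encode irrelevant attributes
print(push_away_loss(img_feature, text_features))
```

Minimizing this term with respect to the image encoder would drive the extracted feature toward directions orthogonal to (or opposite from) the language-described nuisance factors, which is the intuition the abstract states.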
