TransGOP: Transformer-Based Gaze Object Prediction (2402.13578v1)
Abstract: Gaze object prediction (GOP) aims to predict the location and category of the object that a human is looking at. Previous GOP methods use CNN-based object detectors to localize objects. However, we find that Transformer-based object detectors predict more accurate object locations for the dense objects found in retail scenarios. Moreover, the long-range modeling capability of the Transformer helps to build relationships between the human head and the gaze object, which is important for the GOP task. To this end, this paper introduces Transformers into the field of gaze object prediction and proposes an end-to-end Transformer-based gaze object prediction method named TransGOP. Specifically, TransGOP uses an off-the-shelf Transformer-based object detector to localize objects and designs a Transformer-based gaze autoencoder in the gaze regressor to establish long-range gaze relationships. Moreover, to improve gaze heatmap regression, we propose an object-to-gaze cross-attention mechanism that lets the queries of the gaze autoencoder learn global position knowledge from the object detector's memory. Finally, to make the whole framework trainable end-to-end, we propose a Gaze Box loss that jointly optimizes the object detector and the gaze regressor by enhancing the gaze heatmap energy inside the box of the gaze object. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate that TransGOP achieves state-of-the-art performance on all tracks, i.e., object detection, gaze estimation, and gaze object prediction. Our code will be available at https://github.com/chenxi-Guo/TransGOP.git.
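The two components described above can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the cross-attention below is standard scaled dot-product attention with gaze queries attending to a hypothetical detector memory, and `gaze_box_loss` is one plausible reading of "enhancing the gaze heatmap energy in the box of the gaze object" (one minus the fraction of heatmap energy falling inside the box); the exact formulations are given in the paper itself.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def object_to_gaze_cross_attention(gaze_queries, detector_memory):
    """Sketch of object-to-gaze cross-attention.

    gaze_queries:    (N_q, d) queries from the gaze autoencoder.
    detector_memory: (N_m, d) memory tokens from the object detector.
    Returns (N_q, d) query features enriched with detector position knowledge.
    (Projection matrices for Q/K/V are omitted for brevity.)
    """
    d = gaze_queries.shape[-1]
    scores = gaze_queries @ detector_memory.T / np.sqrt(d)   # (N_q, N_m)
    weights = softmax(scores, axis=-1)                       # attend over memory
    return weights @ detector_memory                         # (N_q, d)

def gaze_box_loss(heatmap, box):
    """Hypothetical Gaze Box loss: penalize heatmap energy outside the
    gaze object's box (x1, y1, x2, y2), given in pixel coordinates."""
    x1, y1, x2, y2 = box
    inside = heatmap[y1:y2, x1:x2].sum()
    total = heatmap.sum() + 1e-8   # avoid division by zero
    return 1.0 - inside / total    # 0 when all energy lies inside the box
```

Under this reading, minimizing the loss pulls the regressed heatmap's mass into the detected box of the gaze object, which is what couples the detector and the gaze regressor during joint training.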