Learning from Observer Gaze: Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition (2405.09931v1)
Abstract: Most existing attention prediction research focuses on salient instances such as humans and objects. However, the more complex interaction-oriented attention, which arises as human observers comprehend interactions between instances, remains largely unexplored. It is equally crucial for advancing human-machine interaction and human-centered artificial intelligence. To bridge this gap, we first collect a novel gaze fixation dataset named IG, comprising 530,000 fixation points across 740 diverse interaction categories, capturing visual attention as human observers cognitively process interactions. Second, we introduce the zero-shot interaction-oriented attention prediction task, ZeroIA, which challenges models to predict visual cues for interactions not encountered during training. Third, we present the Interactive Attention model, IA, designed to emulate human observers' cognitive processes in order to tackle the ZeroIA problem. Extensive experiments demonstrate that the proposed IA outperforms other state-of-the-art approaches in both the ZeroIA and fully supervised settings. Finally, we apply interaction-oriented attention to the interaction recognition task itself. Further experimental results demonstrate the promising potential of real human attention data from IG and attention labels generated by IA to enhance the performance and interpretability of existing state-of-the-art human-object interaction (HOI) models.
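To make the ZeroIA protocol concrete, below is a minimal sketch of a zero-shot evaluation loop. It assumes a disjoint split of the 740 interaction categories into seen (training) and unseen (evaluation) sets and scores predicted attention maps against ground-truth fixation density maps with two standard saliency metrics, Pearson's CC and KL divergence. The category names, the 80/20 split ratio, and the `predict_attention`/`load_density_map` helpers are hypothetical stand-ins, not the paper's actual pipeline or protocol.

```python
# Illustrative ZeroIA-style evaluation sketch (assumptions noted above).
import numpy as np

rng = np.random.default_rng(0)

# Disjoint seen/unseen split over interaction categories: the model is
# trained only on seen interactions and evaluated on unseen ones.
categories = [f"interaction_{i}" for i in range(740)]  # 740 categories, per the abstract
perm = rng.permutation(len(categories))
n_seen = int(0.8 * len(categories))                    # split ratio is an assumption
seen = {categories[i] for i in perm[:n_seen]}
unseen = {categories[i] for i in perm[n_seen:]}
assert seen.isdisjoint(unseen)

def pearson_cc(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pearson correlation between predicted and ground-truth maps."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    g = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float((p * g).mean())

def kld(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """KL divergence of the ground-truth distribution from the prediction."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float((g * np.log(g / (p + eps) + eps)).sum())

# Hypothetical stand-ins for a trained attention model and the IG
# fixation density maps; real code would load images and fixations.
def predict_attention(shape=(64, 64)) -> np.ndarray:
    return rng.random(shape)

def load_density_map(shape=(64, 64)) -> np.ndarray:
    return rng.random(shape)

scores = [(pearson_cc(predict_attention(), load_density_map()),
           kld(predict_attention(), load_density_map()))
          for _ in unseen]
cc_mean = np.mean([s[0] for s in scores])
kld_mean = np.mean([s[1] for s in scores])
print(f"unseen categories: {len(unseen)}, CC={cc_mean:.3f}, KLD={kld_mean:.3f}")
```

The key design point is that the category split, not the image split, defines the zero-shot condition: no fixation data from an unseen interaction category may appear during training.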