Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration (2403.07246v1)
Abstract: Human-object interaction (HOI) detection aims to locate human-object pairs in images and identify their interaction categories. Most existing methods focus on supervised learning, which relies on extensive manual HOI annotations. In this paper, we propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates knowledge from vision-language models to improve zero-shot HOI detection. Specifically, a verb feature learning module is designed around visual semantics, employing a verb extraction decoder to convert verb queries into interaction-specific category representations. We develop an additive self-attention mechanism to generate more comprehensive visual representations. Moreover, an interaction representation decoder extracts informative regions by integrating spatial and visual features through a cross-attention mechanism. To handle zero-shot learning in low-data regimes, we leverage prior knowledge from the CLIP text encoder to initialize the linear classifier for enhanced interaction understanding. Extensive experiments on the mainstream HICO-DET and V-COCO benchmarks demonstrate that our model outperforms previous methods under various zero-shot and fully supervised settings.
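The final design choice in the abstract, seeding the interaction classifier with CLIP text-encoder priors, is a widely used zero-shot recipe. Below is a minimal PyTorch sketch of that idea, assuming OpenAI's `clip` package; the verb list, prompt template, and layer setup are illustrative placeholders, not the paper's exact configuration (KI2HOI's actual verb vocabulary comes from HICO-DET).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

# Hypothetical interaction (verb) labels; the paper uses the HICO-DET verb set.
VERBS = ["ride", "hold", "eat", "cut", "throw"]
PROMPT = "a photo of a person {} an object"  # illustrative prompt template

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    # Embed one prompted phrase per verb and L2-normalize the rows.
    tokens = clip.tokenize([PROMPT.format(v) for v in VERBS]).to(device)
    text_emb = model.encode_text(tokens).float()   # (num_verbs, 512)
    text_emb = F.normalize(text_emb, dim=-1)

# Linear classifier over interaction representations, initialized from CLIP
# text priors instead of randomly, so the class weights inherit CLIP's
# semantic structure before any HOI training.
classifier = nn.Linear(text_emb.shape[1], len(VERBS), bias=False).to(device)
with torch.no_grad():
    classifier.weight.copy_(text_emb)

# Usage: score normalized interaction features from the decoder
# (random stand-ins here) against the verb classes.
queries = F.normalize(torch.randn(4, text_emb.shape[1], device=device), dim=-1)
logits = classifier(queries)  # (4, num_verbs)
print(logits.shape)
```

Because both the class weights and the query features are unit-normalized, each logit is a cosine similarity in CLIP's joint embedding space, which is what lets verbs unseen during HOI training still receive meaningful scores.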
Authors: Weiying Xue, Qi Liu, Qiwei Xiong, Yuxiao Wang, Zhenao Wei, Xiaofen Xing, Xiangmin Xu