
Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration (2403.07246v1)

Published 12 Mar 2024 in cs.CV

Abstract: Human-object interaction (HOI) detection aims to locate human-object pairs and identify their interaction categories in images. Most existing methods rely on supervised learning, which requires extensive manual HOI annotations. In this paper, we propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates the knowledge of a vision-language model to improve zero-shot HOI detection. Specifically, the verb feature learning module is designed around visual semantics, employing a verb extraction decoder to convert verb queries into interaction-specific category representations. We develop an effective additive self-attention mechanism to generate more comprehensive visual representations. Moreover, an interaction representation decoder extracts informative regions by integrating spatial and visual feature information through a cross-attention mechanism. To handle zero-shot learning in low-data regimes, we leverage prior knowledge from the CLIP text encoder to initialize the linear classifier for enhanced interaction understanding. Extensive experiments on the mainstream HICO-DET and V-COCO datasets demonstrate that our model outperforms previous methods in various zero-shot and fully-supervised settings.
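
The CLIP-based classifier initialization mentioned in the abstract follows a recipe common to zero-shot detection work. The sketch below is a minimal illustration of that idea under stated assumptions, not the authors' implementation: it assumes the OpenAI `clip` package, and the prompt template and HOI class names are hypothetical placeholders rather than the paper's actual HICO-DET / V-COCO label set.

```python
# Minimal sketch: initializing a linear HOI classifier from CLIP text
# embeddings, assuming the OpenAI `clip` package
# (https://github.com/openai/CLIP). Prompts and classes are illustrative.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical verb-object categories; the real label set comes from
# HICO-DET / V-COCO.
hoi_classes = ["ride bicycle", "hold cup", "kick ball"]
prompts = [f"a photo of a person {c}" for c in hoi_classes]

with torch.no_grad():
    tokens = clip.tokenize(prompts).to(device)
    text_emb = model.encode_text(tokens)        # (C, 512) for ViT-B/32
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Use the normalized text embeddings as the weights of a linear
# interaction classifier (frozen here; it could also be fine-tuned).
classifier = torch.nn.Linear(
    text_emb.shape[1], len(hoi_classes), bias=False
).to(device)
with torch.no_grad():
    classifier.weight.copy_(text_emb.float())

# At inference, interaction features from a decoder (random placeholders
# here) are scored against the class embeddings.
feats = torch.randn(4, text_emb.shape[1], device=device)
feats = feats / feats.norm(dim=-1, keepdim=True)
logits = classifier(feats)                      # (4, C)
```

Because the classifier weights live in CLIP's joint embedding space, unseen interaction categories can be scored at test time simply by embedding new prompts, which is what makes this initialization attractive for the zero-shot setting.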

Authors (7)
  1. Weiying Xue
  2. Qi Liu
  3. Qiwei Xiong
  4. Yuxiao Wang
  5. Zhenao Wei
  6. Xiaofen Xing
  7. Xiangmin Xu
