HOKEM: Human and Object Keypoint-based Extension Module for Human-Object Interaction Detection (2306.14260v1)
Abstract: Human-object interaction (HOI) detection, which captures relationships between humans and objects, is an important task in the semantic understanding of images. When human and object keypoints extracted from an image are processed with a graph convolutional network (GCN) to detect HOI, it is crucial to extract appropriate object keypoints regardless of the object type and to design a GCN that accurately captures the spatial relationships between keypoints. This paper presents the human and object keypoint-based extension module (HOKEM), an easy-to-use extension module that improves the accuracy of conventional detection models. The proposed object keypoint extraction method is simple yet represents the shapes of various objects accurately. Moreover, the proposed human-object adaptive GCN (HO-AGCN), which introduces adaptive graph optimization and an attention mechanism, accurately captures the spatial relationships between keypoints. Experiments on the HOI dataset V-COCO showed that HOKEM boosts the accuracy of an appearance-based model by a large margin.
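To make the abstract's two key ideas concrete, the following is a minimal NumPy sketch of one adaptive graph-convolution step over combined human and object keypoints, followed by a keypoint-wise attention reweighting. This is not the authors' implementation: the keypoint counts, the additive learned adjacency offset, and the toy attention scoring are all illustrative assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 21               # assumption: 17 human keypoints + 4 object keypoints
C_in, C_out = 2, 8   # input = (x, y) coordinates; output feature width

X = rng.standard_normal((K, C_in))        # keypoint features
A = np.eye(K)                             # fixed skeleton adjacency (placeholder topology)
B = 0.01 * rng.standard_normal((K, K))    # learned offset: "adaptive graph optimization"
W = rng.standard_normal((C_in, C_out))    # layer weights

def normalize(adj):
    """Symmetric degree normalization D^(-1/2) A D^(-1/2)."""
    d = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-8))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def agcn_layer(X, A, B, W):
    # Adaptive adjacency = fixed topology + learned correction, then a GCN step.
    A_hat = normalize(np.abs(A + B))
    return np.maximum(A_hat @ X @ W, 0.0)   # ReLU

def attention(H):
    # Toy keypoint-wise attention: softmax over a per-keypoint scalar score.
    scores = H.sum(axis=1)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()
    return H * alpha[:, None]

H = attention(agcn_layer(X, A, B, W))
print(H.shape)   # (21, 8)
```

In practice the fixed adjacency would encode the human skeleton plus human-to-object links, and `B` would be trained jointly with `W`; the sketch only shows how a learned adjacency term and an attention stage slot into a standard GCN layer.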