Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model (2404.12678v3)
Abstract: Human-Object Interaction (HOI) detection aims to localize human-object pairs and comprehend their interactions. Recently, two-stage transformer-based methods have demonstrated competitive performance. However, these methods frequently focus on object appearance features and ignore global contextual information. Moreover, the vision-language model CLIP, which effectively aligns visual and text embeddings, has shown great potential in zero-shot HOI detection. Motivated by these observations, we introduce a novel HOI detector named ISA-HOI, which extensively leverages knowledge from CLIP, aligning interactive semantics between visual and textual features. We first extract the global context of the image and local features of objects to Improve interaction Features (IF) in images. In addition, we propose a Verb Semantic Improvement (VSI) module that enhances the textual features of verb labels via cross-modal fusion. Ultimately, our method achieves competitive results on the HICO-DET and V-COCO benchmarks with far fewer training epochs, and outperforms the state of the art under zero-shot settings.
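The CLIP-style alignment the abstract describes can be illustrated with a minimal sketch: visual features of candidate human-object pairs are scored against verb-label text embeddings by cosine similarity in a shared space. This is a hedged illustration only, not the authors' implementation; the function name `classify_interactions` and the embedding dimensions are hypothetical, and the actual ISA-HOI model additionally fuses global context and verb semantics before this step.

```python
import torch
import torch.nn.functional as F

def classify_interactions(visual_feats: torch.Tensor,
                          verb_text_embeds: torch.Tensor,
                          logit_scale: float = 100.0) -> torch.Tensor:
    """Score each human-object pair against every verb label by
    cosine similarity in a shared embedding space (CLIP-style).

    visual_feats:     (num_pairs, d) pairwise interaction features
    verb_text_embeds: (num_verbs, d) text embeddings of verb labels
    returns:          (num_pairs, num_verbs) scaled similarity logits
    """
    v = F.normalize(visual_feats, dim=-1)       # unit-norm visual features
    t = F.normalize(verb_text_embeds, dim=-1)   # unit-norm text features
    return logit_scale * v @ t.t()              # cosine similarities, scaled

# Toy usage: 3 candidate pairs, 5 verb classes, 512-d embeddings.
pairs = torch.randn(3, 512)
verbs = torch.randn(5, 512)
logits = classify_interactions(pairs, verbs)
print(logits.shape)  # torch.Size([3, 5])
```

Because both sides are L2-normalized, each logit lies in `[-logit_scale, logit_scale]`; the scale (a learned temperature in CLIP) sharpens the softmax over verb classes.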
Authors: Jihao Dong, Renjie Pan, Hua Yang