Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection (2404.06194v2)
Abstract: Open-vocabulary human-object interaction (HOI) detection, the task of detecting novel HOIs described in natural language, is crucial for understanding human-centric scenes. However, prior zero-shot HOI detectors often use the same level of feature maps to model HOIs regardless of their spatial extent, leading to suboptimal performance in scenes containing human-object pairs with a wide range of distances. In addition, these detectors rely primarily on category names and overlook the rich contextual information that language can provide, which is essential for capturing open-vocabulary concepts that are typically rare and not well represented by category names alone. In this paper, we introduce a novel end-to-end open-vocabulary HOI detection framework with conditional multi-level decoding and fine-grained semantic enhancement (CMD-SE), harnessing the potential of vision-language models (VLMs). Specifically, we propose to model human-object pairs at different distances with different levels of feature maps by incorporating a soft constraint into the bipartite matching process. Furthermore, by leveraging large language models (LLMs) such as GPT models, we exploit their extensive world knowledge to generate descriptions of human body part states for various interactions. We then integrate the generalizable, fine-grained semantics of human body parts to improve interaction recognition. Experimental results on two datasets, SWIG-HOI and HICO-DET, demonstrate that our proposed method achieves state-of-the-art results in open-vocabulary HOI detection. The code and models are available at https://github.com/ltttpku/CMD-SE-release.
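The soft constraint on feature-level assignment during bipartite matching can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the linear mapping from normalized human-object distance to a preferred feature level, and the weight `alpha` are all hypothetical, and the classification and box costs are stand-ins for the usual Hungarian matching terms.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def soft_level_cost(pred_levels, gt_distances, num_levels=3):
    """Soft penalty encouraging pairs with larger human-object distance
    to match predictions from coarser (higher-index) feature levels.

    pred_levels:  (P,) feature-level index of each prediction
    gt_distances: (G,) normalized human-object distance of each ground truth
    """
    # Map each ground truth's distance in [0, 1) to a preferred level index.
    preferred = np.clip((gt_distances * num_levels).astype(int), 0, num_levels - 1)
    # Penalize the gap between a prediction's level and the preferred level;
    # broadcasting yields a (P, G) cost matrix.
    return np.abs(pred_levels[:, None] - preferred[None, :]).astype(float)


def hungarian_match(cls_cost, box_cost, pred_levels, gt_distances, alpha=0.5):
    """Standard matching cost plus the soft level constraint."""
    cost = cls_cost + box_cost + alpha * soft_level_cost(pred_levels, gt_distances)
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)]
```

Because the level term enters the cost additively rather than as a hard mask, a prediction from a "wrong" level can still win a match when its classification and box costs are low enough, which is what makes the constraint soft.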
Authors: Ting Lei, Shaofeng Yin, Yang Liu