Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection (2404.06194v2)

Published 9 Apr 2024 in cs.CV

Abstract: Open-vocabulary human-object interaction (HOI) detection, which is concerned with the problem of detecting novel HOIs guided by natural language, is crucial for understanding human-centric scenes. However, prior zero-shot HOI detectors often employ the same levels of feature maps to model HOIs with varying distances, leading to suboptimal performance in scenes containing human-object pairs with a wide range of distances. In addition, these detectors primarily rely on category names and overlook the rich contextual information that language can provide, which is essential for capturing open vocabulary concepts that are typically rare and not well-represented by category names alone. In this paper, we introduce a novel end-to-end open vocabulary HOI detection framework with conditional multi-level decoding and fine-grained semantic enhancement (CMD-SE), harnessing the potential of Vision-Language Models (VLMs). Specifically, we propose to model human-object pairs with different distances with different levels of feature maps by incorporating a soft constraint during the bipartite matching process. Furthermore, by leveraging large language models (LLMs) such as GPT models, we exploit their extensive world knowledge to generate descriptions of human body part states for various interactions. Then we integrate the generalizable and fine-grained semantics of human body parts to improve interaction recognition. Experimental results on two datasets, SWIG-HOI and HICO-DET, demonstrate that our proposed method achieves state-of-the-art results in open vocabulary HOI detection. The code and models are available at https://github.com/ltttpku/CMD-SE-release.

Authors (3)
  1. Ting Lei
  2. Shaofeng Yin
  3. Yang Liu
Citations (7)

Summary

The paper "Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection" proposes a novel framework for open-vocabulary human-object interaction (HOI) detection, with a focus on leveraging large foundation models in the form of Visual-LLMs (VLMs) and LLMs. The goal of open-vocabulary HOI detection is to accurately identify and interpret interactions involving human and object pairs described by arbitrary text inputs, accommodating novel or unseen interactions not encountered during the training phase.

The authors observe two main challenges in existing zero-shot HOI detection methods: (1) the use of the same levels of feature maps for modeling human-object pairs across varying spatial distances, which leads to suboptimal performance, and (2) an overreliance on category names that neglects the rich contextual information natural language can provide, which is needed to capture open-vocabulary concepts that are typically rare and poorly represented by category names alone.

To address these issues, the paper introduces an end-to-end framework with Conditional Multi-level Decoding and fine-grained Semantic Enhancement (CMD-SE). Its core contributions are as follows:

  1. Conditional Multi-level Decoding (CMD): The framework uses different levels of feature maps to model human-object interactions at different spatial distances. A soft constraint added during the bipartite matching process encourages each decoding level to specialize in human-object pairs within a particular range of distances, improving recognition in scenes where pair distances vary widely (a minimal sketch of how such a constraint could enter the matching cost follows this list).
  2. Fine-grained Semantic Enhancement (SE): The authors use LLMs such as GPT models to generate detailed descriptions of human body-part states for various interactions. This linguistic context supplies generalizable and fine-grained semantics, improving both recognition accuracy and the model's ability to distinguish between HOI concepts.
  3. Experimental Validation: The method is evaluated on two benchmarks, SWIG-HOI and HICO-DET, where it achieves state-of-the-art results for open-vocabulary HOI detection, supporting the claim that combining multi-level feature decoding with fine-grained linguistic context is effective for this task.
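To make the multi-level decoding idea concrete, below is a minimal sketch (not the authors' implementation) of how a distance-conditioned soft constraint could be folded into a Hungarian-style bipartite matching cost in a DETR-like detector. The tensor shapes, cost terms, and the `level_weight` parameter are illustrative assumptions.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_with_level_prior(pred_boxes_h, pred_boxes_o, pred_levels,
                           gt_boxes_h, gt_boxes_o, level_weight=1.0):
    """Hungarian matching with a soft penalty that nudges queries decoded from
    a given feature-map level toward ground-truth pairs whose human-object
    distance suits that level. All names and weights here are illustrative.

    pred_boxes_*: (num_queries, 4) predicted human/object boxes (cx, cy, w, h).
    pred_levels:  (num_queries,) feature level of each query, normalized to [0, 1].
    gt_boxes_*:   (num_gt, 4) ground-truth human/object boxes.
    """
    # Base localization cost: L1 distance between predicted and GT box centers.
    cost_h = torch.cdist(pred_boxes_h[:, :2], gt_boxes_h[:, :2], p=1)
    cost_o = torch.cdist(pred_boxes_o[:, :2], gt_boxes_o[:, :2], p=1)

    # Normalized human-object distance of each ground-truth pair.
    gt_dist = (gt_boxes_h[:, :2] - gt_boxes_o[:, :2]).norm(dim=-1)
    gt_dist = gt_dist / (gt_dist.max() + 1e-6)

    # Soft constraint: penalize assigning a query to a pair whose spatial
    # distance does not match the query's feature level.
    cost_level = (pred_levels[:, None] - gt_dist[None, :]).abs()

    cost = cost_h + cost_o + level_weight * cost_level
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return rows, cols
```

Because the prior enters only as a soft term in the assignment cost, queries can still be matched across levels when the localization evidence is strong; the paper's actual formulation may measure the level-distance compatibility differently.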

The paper offers insights into bridging vision and language modalities for HOI detection, proposing a scalable framework that extends beyond predefined interaction categories. The use of multi-level feature maps tailored to interaction distances and the fine-grained body-part descriptions (illustrated by the sketch below) are the main methodological advances, and are particularly relevant for scenarios where interactions are specified by text.
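As an illustration of how LLM-generated body-part descriptions could be combined with category names for open-vocabulary interaction scoring, the sketch below encodes both with the CLIP text encoder and mixes the two similarity scores. The example descriptions, the mixing weight `alpha`, and the helper names are assumptions rather than the paper's exact recipe.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def encode_texts(texts):
    """Encode strings with the CLIP text encoder and L2-normalize them."""
    tokens = clip.tokenize(texts).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens).float()
    return emb / emb.norm(dim=-1, keepdim=True)

# An interaction category plus body-part state descriptions of the kind a GPT
# model might generate (the texts below are illustrative, not from the paper).
category_name = "a person riding a bicycle"
body_part_states = [
    "the hands are gripping the handlebars",
    "the feet are pressing on the pedals",
    "the legs are bent and pedaling",
]

name_emb = encode_texts([category_name])                          # (1, D)
part_emb = encode_texts(body_part_states).mean(0, keepdim=True)   # (1, D)
part_emb = part_emb / part_emb.norm(dim=-1, keepdim=True)

def interaction_score(visual_feat, alpha=0.5):
    """Score an L2-normalized interaction feature (1, D) against both the
    coarse category-name embedding and the fine-grained body-part embedding;
    alpha is an assumed mixing weight."""
    return alpha * (visual_feat @ name_emb.t()) + (1 - alpha) * (visual_feat @ part_emb.t())
```

The design choice here is simply that body-part descriptions act as an auxiliary text signal alongside the category name, so rare interactions can still be recognized from the states of hands, feet, or legs even when the category name itself is a weak cue.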

Overall, the paper demonstrates an effective synergy between vision-language models and LLMs for open-vocabulary detection, with strong numerical results across both benchmark datasets, underscoring the framework's capacity to generalize beyond traditional closed-set methods.
