Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer (2404.15785v1)

Published 24 Apr 2024 in cs.CV

Abstract: Benefiting from strong generalization ability, pre-trained vision-language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not only to classify the salient activity (verb) in the image, but also to detect all semantic roles that participate in the action. This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition. Directly employing class-based prompts with VLMs and grounding models for this task suffers from several limitations, e.g., it struggles to distinguish ambiguous verb concepts, accurately localize roles with fixed verb-centric template input, and achieve context-aware noun predictions. In this paper, we argue that these limitations stem from the model's poor understanding of verb/noun classes. To this end, we introduce a new approach for zero-shot GSR via Language EXplainer (LEX), which significantly boosts the model's comprehensive capabilities through three explainers: 1) verb explainer, which generates general verb-centric descriptions to enhance the discriminability of different verb classes; 2) grounding explainer, which rephrases verb-centric templates for clearer understanding, thereby enhancing precise semantic role localization; and 3) noun explainer, which creates scene-specific noun descriptions to ensure context-aware noun recognition. By equipping each step of the GSR process with an auxiliary explainer, LEX facilitates complex scene understanding in real-world scenarios. Our extensive validations on the SWiG dataset demonstrate LEX's effectiveness and interoperability in zero-shot GSR.
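The three-step pipeline described in the abstract (verb recognition, semantic role grounding, noun recognition), each paired with a language explainer, can be illustrated with the minimal sketch below. This is not the authors' implementation: the CLIP-style encoders are replaced by a toy deterministic embedding so the script runs standalone, the hard-coded description strings stand in for LLM-generated explainer outputs, and all names (toy_embed, recognize_verb, ground_roles, recognize_noun) are hypothetical.

```python
"""Illustrative sketch of a description-augmented zero-shot GSR pipeline.
Not the LEX code: encoders and explainer outputs are toy placeholders."""
import hashlib
import math

def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Placeholder for a CLIP-style encoder: a deterministic unit vector
    derived from the input string (stands in for text/image embeddings)."""
    seed = hashlib.sha256(text.encode()).digest()
    vals = [(seed[i % len(seed)] / 255.0) - 0.5 for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in vals)) or 1.0
    return [v / norm for v in vals]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Step 1: verb recognition with a verb explainer.
# Each verb is scored against several verb-centric descriptions (assumed to
# come from an LLM) and the similarities are averaged, instead of matching
# the image against the bare class name alone.
verb_descriptions = {
    "sweeping": ["a person pushes a broom across the floor",
                 "dust is gathered into a pile with a brush"],
    "mopping":  ["a person wipes the floor with a wet mop",
                 "a bucket of water stands next to the cleaner"],
}

def recognize_verb(image_repr: str) -> str:
    img_emb = toy_embed(image_repr)  # stand-in for an image embedding
    scores = {
        verb: sum(cosine(img_emb, toy_embed(d)) for d in descs) / len(descs)
        for verb, descs in verb_descriptions.items()
    }
    return max(scores, key=scores.get)

# Step 2: semantic role grounding with a grounding explainer.
# The verb-centric template is rephrased into plain referring expressions
# before being handed to a grounding model.
def ground_roles(verb: str) -> dict[str, str]:
    return {
        "agent": f"the person who is {verb}",
        "tool":  f"the object used for {verb}",
    }

# Step 3: noun recognition with a noun explainer.
# Candidate nouns for each role are re-scored against a scene-specific query
# so the prediction stays consistent with the detected verb.
def recognize_noun(role: str, verb: str, candidates: list[str]) -> str:
    query = toy_embed(f"{role} of {verb}")
    return max(candidates, key=lambda n: cosine(query, toy_embed(n)))

if __name__ == "__main__":
    verb = recognize_verb("photo of someone cleaning a floor with a broom")
    roles = ground_roles(verb)
    noun = recognize_noun("tool", verb, ["broom", "mop", "vacuum"])
    print(verb, roles, noun)
```

In a real system the toy encoder would be swapped for a VLM such as CLIP, and the grounding step would pass each rephrased role query to an open-set detector (e.g., Grounding DINO) to obtain bounding boxes rather than returning the phrases themselves.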

Authors (5)
  1. Lin Li (329 papers)
  2. Chunping Wang (23 papers)
  3. Jun Xiao (134 papers)
  4. Long Chen (395 papers)
  5. JiaMing Lei (3 papers)