
VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation (2311.16492v2)

Published 27 Nov 2023 in cs.CV

Abstract: Panoptic Scene Graph Generation (PSG) aims to achieve comprehensive image understanding by simultaneously segmenting objects and predicting relations among them. However, the long-tail problem among relations leads to unsatisfactory results in real-world applications. Prior methods predominantly rely on vision information or use only limited language information, such as object or relation names, thereby overlooking the utility of language information. Leveraging recent progress in large language models (LLMs), we propose to use language information to assist relation prediction, particularly for rare relations. To this end, we propose the Vision-Language Prompting (VLPrompt) model, which acquires vision information from images and language information from LLMs. Then, through a prompter network based on an attention mechanism, it achieves precise relation prediction. Our extensive experiments show that VLPrompt significantly outperforms previous state-of-the-art methods on the PSG dataset, demonstrating the effectiveness of incorporating language information and alleviating the long-tail problem of relations. Code is available at https://github.com/franciszzj/TP-SIS.
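
The abstract does not spell out how the prompter network combines the two modalities, so the following is only a minimal sketch of generic scaled dot-product cross-attention, in which a vision feature attends over language features. It is not the authors' implementation. The function names, the toy vectors, and the choice of a single query with no learned projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product attention of one query vector over key/value vectors.

    Returns the attended (fused) vector and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    fused = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return fused, weights


# Hypothetical toy data: one vision feature for a subject-object pair and
# two language features (e.g. embeddings of LLM-written relation descriptions).
vision_feature = [1.0, 0.0, 1.0, 0.0]
language_features = [
    [1.0, 0.0, 1.0, 0.0],  # description aligned with the vision feature
    [0.0, 1.0, 0.0, 1.0],  # description orthogonal to the vision feature
]

fused, weights = cross_attention(vision_feature, language_features, language_features)
```

In this toy setup the aligned description receives the larger attention weight, so the fused vector is dominated by it. A real model would add learned projections, multiple heads, and a relation classifier on top of the fused representation.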

Authors (3)
  1. Zijian Zhou (63 papers)
  2. Miaojing Shi (53 papers)
  3. Holger Caesar (31 papers)
Citations (8)

