LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation (2310.10404v8)

Published 16 Oct 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Weakly-Supervised Scene Graph Generation (WSSGG) research has recently emerged as an alternative to the fully-supervised approach that heavily relies on costly annotations. In this regard, studies on WSSGG have utilized image captions to obtain unlocalized triplets while primarily focusing on grounding the unlocalized triplets over image regions. However, they have overlooked the two issues involved in the triplet formation process from the captions: 1) Semantic over-simplification issue arises when extracting triplets from captions, where fine-grained predicates in captions are undesirably converted into coarse-grained predicates, resulting in a long-tailed predicate distribution, and 2) Low-density scene graph issue arises when aligning the triplets in the caption with entity/predicate classes of interest, where many triplets are discarded and not used in training, leading to insufficient supervision. To tackle the two issues, we propose a new approach, i.e., LLM for weakly-supervised SGG (LLM4SGG), where we mitigate the two issues by leveraging the LLM's in-depth understanding of language and reasoning ability during the extraction of triplets from captions and alignment of entity/predicate classes with target data. To further engage the LLM in these processes, we adopt the idea of Chain-of-Thought and the in-context few-shot learning strategy. To validate the effectiveness of LLM4SGG, we conduct extensive experiments on Visual Genome and GQA datasets, showing significant improvements in both Recall@K and mean Recall@K compared to the state-of-the-art WSSGG methods. A further appeal is that LLM4SGG is data-efficient, enabling effective model training with a small amount of training images.

Authors (7)
  1. Kibum Kim (16 papers)
  2. Kanghoon Yoon (16 papers)
  3. Jaehyeong Jeon (3 papers)
  4. Yeonjun In (17 papers)
  5. Jinyoung Moon (13 papers)
  6. Donghyun Kim (129 papers)
  7. Chanyoung Park (83 papers)
Citations (7)

Summary

Exploring LLMs for Weakly Supervised Scene Graph Generation

The paper studies the application of LLMs to Weakly-Supervised Scene Graph Generation (WSSGG), which sidesteps the cost and complexity of fully-supervised scene graph generation. It introduces LLM4SGG, a method that leverages LLMs to extract structured visual knowledge by parsing image captions into triplets and aligning the parsed entities and predicates with the target classes of the dataset.

Background and Problem Statement

Scene Graph Generation (SGG) is a core computer vision task that identifies objects and their relationships within an image. Fully-supervised SGG methods depend on exhaustively annotated scene graphs, which are labor-intensive and expensive to produce. To reduce the reliance on such costly annotations, WSSGG has emerged, using easily obtainable image captions as supervision instead.

Nevertheless, two significant issues affect existing WSSGG approaches: (1) semantic over-simplification during triplet formation, where fine-grained predicates in captions are collapsed into coarse, less informative ones (for example, 'lying on' reduced to 'on'), producing a long-tailed predicate distribution, and (2) low-density scene graphs, where many parsed triplets fail to align with the entity and predicate classes of interest and are discarded, leaving insufficient supervision.

Methodology

The LLM4SGG framework addresses these challenges by utilizing LLMs to improve both the extraction and alignment of triplets from captions. The approach is divided into two primary processes:

  1. Triplet Extraction (Chain-1): The LLM is used to extract triplets—structures consisting of a subject, predicate, and object—from both original and paraphrased image captions. By leveraging the LLM's language understanding capabilities, fine-grained predicates are better captured, mitigating the semantic over-simplification issue.
  2. Class Alignment (Chain-2): The LLM then aligns the extracted triplet components with the predefined entity and predicate classes of the target dataset, minimizing the discard of useful triplets and thus reducing the low-density scene graph issue (a code sketch of both chains follows this list).
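
To make the two chains concrete, below is a minimal Python sketch of how such an LLM-driven extraction and alignment pipeline could be wired together. The prompt wording, the llm() placeholder, and the toy class lists are illustrative assumptions, not the paper's actual prompts or code.

```python
# Minimal sketch of the two LLM chains (illustrative only).
from typing import Optional

ENTITY_CLASSES = ["man", "dog", "surfboard"]           # stand-in for the target entity classes
PREDICATE_CLASSES = ["riding", "lying on", "holding"]  # stand-in for the target predicate classes


def llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to whichever LLM API is used."""
    raise NotImplementedError("plug in an LLM client here")


def extract_triplets(caption: str) -> list:
    """Chain-1: paraphrase the caption, then extract <subject, predicate, object> triplets."""
    paraphrase = llm(f"Paraphrase this caption in one sentence: {caption}")
    raw = llm(
        "List every (subject, predicate, object) triplet in the caption and its paraphrase, "
        "one per line as 'subject | predicate | object', keeping predicates fine-grained.\n"
        f"Caption: {caption}\nParaphrase: {paraphrase}"
    )
    triplets = []
    for line in raw.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:
            triplets.append(tuple(parts))
    return triplets


def align_triplet(triplet) -> Optional[tuple]:
    """Chain-2: map each slot onto the target classes; drop the triplet if any slot has no match."""
    subj, pred, obj = triplet
    raw = llm(
        f"Map subject '{subj}', predicate '{pred}', object '{obj}' onto the closest classes from "
        f"entities {ENTITY_CLASSES} and predicates {PREDICATE_CLASSES}. "
        "Reply as 'subject | predicate | object', writing 'None' for a slot with no reasonable match."
    )
    parts = [p.strip() for p in raw.split("|")]
    if len(parts) != 3 or "None" in parts:
        return None
    return tuple(parts)
```

In the full pipeline, the aligned triplets are then grounded over image regions to serve as weak supervision for training the SGG model.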

Importantly, to make the LLM more effective at both extraction and alignment, LLM4SGG adopts Chain-of-Thought prompting, which elicits step-by-step reasoning, together with in-context few-shot learning, which adapts the LLM to these tasks without any model fine-tuning.
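
For instance, the alignment query could interleave a step-by-step reasoning instruction with a few worked exemplars, roughly along the following lines; the wording and exemplars here are illustrative assumptions rather than the paper's actual prompts.

```python
# Illustrative prompt skeleton combining Chain-of-Thought with in-context few-shot examples.
ALIGNMENT_PROMPT = """\
Align the given triplet with the target predicate classes. Think step by step.

Example 1:
Triplet: (man, is sitting atop, horse)
Reasoning: 'is sitting atop' describes a mounted relation; the closest target predicate is 'riding'.
Answer: (man, riding, horse)

Example 2:
Triplet: (dog, sprawled across, couch)
Reasoning: 'sprawled across' describes lying; the closest target predicate is 'lying on'.
Answer: (dog, lying on, couch)

Now align:
Triplet: ({subject}, {predicate}, {object})
Reasoning:"""
```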

Experimental Analysis

The approach is validated on the Visual Genome and GQA datasets, showing significant gains in both Recall@K, the fraction of ground-truth triplets recovered among the top-K predictions, and mean Recall@K, which averages recall per predicate class and therefore rewards performance on rare predicates. Compared with baseline WSSGG methods, LLM4SGG is also markedly more data-efficient, remaining effective even when trained on a small number of images.
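
For reference, a minimal sketch of how the two metrics are typically computed per image is given below; real SGG evaluation additionally checks bounding-box localization, which is omitted here.

```python
# Per-image Recall@K and mean Recall@K over triplets (illustrative sketch only).
from collections import defaultdict


def recall_at_k(pred_triplets, gt_triplets, k=50):
    """Fraction of ground-truth triplets recovered among the top-k predictions."""
    top_k = set(pred_triplets[:k])  # predictions assumed sorted by confidence, best first
    hits = sum(1 for t in gt_triplets if t in top_k)
    return hits / max(len(gt_triplets), 1)


def mean_recall_at_k(pred_triplets, gt_triplets, k=50):
    """Average of per-predicate recalls, so rare predicates count as much as frequent ones."""
    top_k = set(pred_triplets[:k])
    per_pred = defaultdict(lambda: [0, 0])  # predicate -> [hits, total ground-truth triplets]
    for subj, pred, obj in gt_triplets:
        per_pred[pred][1] += 1
        if (subj, pred, obj) in top_k:
            per_pred[pred][0] += 1
    recalls = [hits / total for hits, total in per_pred.values()]
    return sum(recalls) / max(len(recalls), 1)
```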

Implications and Future Directions

LLM4SGG's success opens new avenues for enhancing WSSGG processes and potentially other computer vision tasks through LLMs. By improving triplet formation for scene graphs, the method holds promise for advancing systems that require detailed image understanding, such as autonomous vehicles, robotics, and complex visual question-answering systems.

Future research could explore using LLMs to ground triplets directly in image regions, bypassing traditional object detectors, and integrating vision-language models that translate visual content into textual representations amenable to LLM processing.

In conclusion, LLM4SGG demonstrates the potential of LLMs to address key limitations of weakly-supervised scene graph generation by intelligently parsing and aligning image-caption-based data, thereby facilitating more efficient and expansive scene graph generation frameworks.
