
RelationVLM: Making Large Vision-Language Models Understand Visual Relations (2403.12801v1)

Published 19 Mar 2024 in cs.CV

Abstract: The development of Large Vision-Language Models (LVLMs) is striving to catch up with the success of LLMs, yet it faces more challenges to be resolved. Very recent works enable LVLMs to localize object-level visual contents and ground text to them. Nonetheless, current LVLMs still struggle to precisely understand visual relations due to the lack of relevant data. In this work, we present RelationVLM, a large vision-language model capable of comprehending various levels and types of relations, whether across multiple images or within a video. Specifically, we devise a multi-stage relation-aware training scheme and a series of corresponding data configuration strategies to endow RelationVLM with the capabilities of understanding semantic relations, temporal associations, and geometric transforms. Extensive case studies and quantitative evaluations show that RelationVLM has strong capability in understanding such relations and demonstrates an impressive emergent in-context capability of reasoning from few-shot examples by comparison. This work fosters the advancement of LVLMs by enabling them to support a wider range of downstream applications toward artificial general intelligence.

Authors (5)
  1. Zhipeng Huang (34 papers)
  2. Zhizheng Zhang (60 papers)
  3. Zheng-Jun Zha (143 papers)
  4. Yan Lu (179 papers)
  5. Baining Guo (53 papers)
Citations (3)