
CoLLaVO: Crayon Large Language and Vision mOdel (2402.11248v4)

Published 17 Feb 2024 in cs.CV

Abstract: The remarkable success of large language models (LLMs) and instruction tuning drives the evolution of vision language models (VLMs) towards a versatile general-purpose model. Yet it remains unexplored whether current VLMs genuinely possess object-level image understanding, as probed by questions such as 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt, a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy, Dual QLoRA, that keeps object-level image understanding from being forgotten during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting.

Authors (4)
  1. Byung-Kwan Lee (14 papers)
  2. Beomchan Park (6 papers)
  3. Chae Won Kim (10 papers)
  4. Yong Man Ro (90 papers)
Citations (14)