Enhancing Advanced Visual Reasoning Ability of Large Language Models (2409.13980v1)

Published 21 Sep 2024 in cs.CV and cs.AI

Abstract: Recent advancements in Vision-Language (VL) research have sparked new benchmarks for complex visual reasoning, challenging models' advanced reasoning ability. Traditional Vision-Language Models (VLMs) perform well in visual perception tasks but struggle with complex reasoning scenarios. Conversely, Large Language Models (LLMs) demonstrate robust text reasoning capabilities; however, they lack visual acuity. To bridge this gap, we propose Complex Visual Reasoning Large Language Models (CVR-LLM), capitalizing on VLMs' visual perception proficiency and LLMs' extensive reasoning capability. Unlike recent multimodal LLMs (MLLMs) that require a projection layer, our approach transforms images into detailed, context-aware descriptions using an iterative self-refinement loop and leverages LLMs' text knowledge for accurate predictions without extra training. We also introduce a novel multi-modal in-context learning (ICL) methodology to enhance LLMs' contextual understanding and reasoning. Additionally, we introduce Chain-of-Comparison (CoC), a step-by-step comparison technique that contrasts various aspects of predictions. Our CVR-LLM presents the first comprehensive study across a wide array of complex visual reasoning tasks and achieves state-of-the-art (SOTA) performance on all of them.
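
The iterative self-refinement loop mentioned in the abstract can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `refine_description`, the `captioner` and `critic` callables (standing in for a VLM and an LLM), the prompt wording, and the stopping rule are all hypothetical, since the abstract does not specify them.

```python
from typing import Callable


def refine_description(
    image: object,
    question: str,
    captioner: Callable[[object, str], str],  # hypothetical VLM wrapper: (image, prompt) -> description
    critic: Callable[[str, str], str],        # hypothetical LLM wrapper: (description, question) -> feedback
    max_rounds: int = 3,
) -> str:
    """Refine an image description until the critic has no further feedback.

    Assumption: an empty feedback string means the description is
    detailed enough for the task; the paper may use a different criterion.
    """
    prompt = f"Describe the image in detail for the task: {question}"
    description = captioner(image, prompt)

    for _ in range(max_rounds):
        feedback = critic(description, question)
        if not feedback:
            break  # critic is satisfied, stop refining
        # Fold the feedback into the prompt and ask for a new description.
        prompt = f"{prompt}\nAlso address: {feedback}"
        description = captioner(image, prompt)

    return description
```

The final description would then be passed as plain text to the LLM for prediction, which is how the approach avoids a projection layer and extra training.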

Authors (6)
  1. Zhiyuan Li (304 papers)
  2. Dongnan Liu (47 papers)
  3. Chaoyi Zhang (51 papers)
  4. Heng Wang (136 papers)
  5. Tengfei Xue (23 papers)
  6. Weidong Cai (118 papers)