
I Know About "Up"! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction (2407.14133v2)

Published 19 Jul 2024 in cs.CL

Abstract: Visual Language Models (VLMs) are essential for various tasks, particularly visual reasoning tasks, due to their robust multi-modal information integration, visual reasoning capabilities, and contextual awareness. However, the visual spatial reasoning capabilities of existing VLMs are often inadequate: they struggle even with basic tasks such as distinguishing left from right. To address this, we propose ZeroVLM, a model designed to enhance the visual spatial reasoning abilities of VLMs. ZeroVLM employs Zero-1-to-3, a 3D reconstruction model, to obtain different views of the input image, and incorporates a prompting mechanism to further improve visual spatial reasoning. Experimental results on four visual spatial reasoning datasets show that ZeroVLM achieves up to a 19.48% accuracy improvement, indicating the effectiveness of its 3D reconstruction and prompting mechanisms.

Authors (3)
  1. Zaiqiao Meng (42 papers)
  2. Hao Zhou (351 papers)
  3. Yifang Chen (31 papers)
Citations (1)