
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want (2403.20271v2)

Published 29 Mar 2024 in cs.CV

Abstract: The interaction between humans and AI is a crucial factor that reflects the effectiveness of multimodal LLMs (MLLMs). However, current MLLMs primarily focus on image-level comprehension and limit interaction to textual instructions, thereby constraining their flexibility in usage and depth of response. In this paper, we introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting. Specifically, we propose SPHINX-V, a new end-to-end trained MLLM that connects a vision encoder, a visual prompt encoder, and an LLM to support various visual prompts (points, bounding boxes, and free-form shapes) alongside language understanding. To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench. MDVP-Data is a multi-domain dataset containing 1.6M unique image-visual prompt-text instruction-following samples, spanning natural images, document images, OCR images, mobile screenshots, web screenshots, and multi-panel images. Furthermore, we present MDVP-Bench, a comprehensive and challenging benchmark to assess a model's capability in understanding visual prompting instructions. Our experiments demonstrate SPHINX-V's impressive multimodal interaction capabilities through visual prompting, revealing significant improvements in detailed pixel-level description and question-answering abilities.
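The abstract describes an architecture that fuses three inputs: image tokens from a vision encoder, embeddings of visual prompts (points or bounding boxes), and text instructions, all fed to an LLM. The toy sketch below illustrates that fusion pattern only; it is not the authors' implementation, and every name, the 4-dimensional embedding size, and the coordinate-normalization scheme are illustrative assumptions.

```python
# Hypothetical sketch of visual-prompt encoding for an MLLM input
# sequence. Not SPHINX-V's actual code; names and dimensions are
# assumptions chosen for illustration.

EMBED_DIM = 4  # toy size; real models use hundreds or thousands of dims

def encode_point(x, y, img_w, img_h):
    """Normalize a point prompt into [0, 1] and pad to EMBED_DIM."""
    vec = [x / img_w, y / img_h]
    return vec + [0.0] * (EMBED_DIM - len(vec))

def encode_box(x1, y1, x2, y2, img_w, img_h):
    """Normalize a bounding-box prompt into [0, 1] corner coordinates."""
    return [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]

def build_input_sequence(image_tokens, prompt_tokens, text_tokens):
    """Concatenate image, visual-prompt, and text embeddings into one
    sequence, the shape an end-to-end trained MLLM would consume."""
    return image_tokens + prompt_tokens + text_tokens

# Example: a click at (320, 240) and a left-half box in a 640x480 image.
point = encode_point(320, 240, 640, 480)    # -> [0.5, 0.5, 0.0, 0.0]
box = encode_box(0, 0, 320, 480, 640, 480)  # -> [0.0, 0.0, 0.5, 1.0]
seq = build_input_sequence([[0.1] * EMBED_DIM], [point, box],
                           [[0.2] * EMBED_DIM])
```

The key design point the paper highlights is that prompt embeddings are produced by a dedicated visual prompt encoder and trained end-to-end with the vision encoder and LLM, rather than being converted to text coordinates.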

Authors (9)
  1. Weifeng Lin (15 papers)
  2. Xinyu Wei (15 papers)
  3. Ruichuan An (14 papers)
  4. Peng Gao (401 papers)
  5. Bocheng Zou (6 papers)
  6. Yulin Luo (13 papers)
  7. Siyuan Huang (123 papers)
  8. Shanghang Zhang (172 papers)
  9. Hongsheng Li (340 papers)
Citations (20)