
Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels? (2311.17647v2)

Published 29 Nov 2023 in cs.CV, cs.AI, and cs.CL

Abstract: Recent multimodal LLMs (MLLMs) have shown promising instruction-following capabilities on vision-language tasks. In this work, we introduce VISUAL MODALITY INSTRUCTION (VIM) and investigate how well multimodal models can understand textual instructions provided in pixels, despite not being explicitly trained on such data during pretraining or fine-tuning. We adapt VIM to eight benchmarks, including OKVQA, MM-Vet, MathVista, and MMMU, and probe diverse MLLMs in both the text-modality instruction (TEM) setting and the VIM setting. Notably, we observe a significant performance disparity between the original TEM and VIM settings for open-source MLLMs, indicating that open-source MLLMs face greater challenges when text instructions are presented solely in image form. To address this issue, we train v-MLLM, a generalizable model capable of robust instruction following with both text-modality and visual-modality instructions.
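The core manipulation in the VIM setting is presenting the instruction as pixels rather than as a text prompt. As a minimal sketch of that idea (not the paper's actual pipeline; the font, layout, and how the rendered instruction is composited with the task image are all assumptions here), one could render an instruction string onto a blank image with PIL and feed that image to an MLLM in place of the textual instruction:

```python
from PIL import Image, ImageDraw, ImageFont

def render_instruction(instruction: str, width: int = 512, height: int = 64) -> Image.Image:
    """Render a text instruction onto a blank image, a rough proxy for the VIM setting."""
    img = Image.new("RGB", (width, height), color="white")
    draw = ImageDraw.Draw(img)
    # Any legible font works; size and placement affect how hard the text is to read from pixels.
    font = ImageFont.load_default()
    draw.text((8, 8), instruction, fill="black", font=font)
    return img

# Example: turn a VQA-style instruction into pixels. In the VIM setting this image
# would be given to the MLLM (alongside or composited with the task image) instead
# of passing the instruction as a text prompt, as in the TEM setting.
vim_image = render_instruction("Answer the question in the image with a short phrase.")
vim_image.save("vim_instruction.png")
```

Comparing a model's accuracy on the same benchmark items under the TEM prompt versus this pixel-rendered variant is what exposes the performance gap the paper reports for open-source MLLMs.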

Authors (6)
  1. Yujie Lu (42 papers)
  2. Xiujun Li (37 papers)
  3. William Yang Wang (254 papers)
  4. Yejin Choi (287 papers)
  5. Zhe Gan (135 papers)
  6. Jianfeng Gao (344 papers)
Citations (1)