Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks (2305.13782v1)

Published 23 May 2023 in cs.CL

Abstract: LLMs have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms. While being actively researched, multimodal models that can additionally handle images as input have yet to catch up in size and generality with language-only models. In this work, we ask whether language-only models can be utilised for tasks that require visual input -- but also, as we argue, often require a strong reasoning component. Similar to some recent related work, we make visual information accessible to the LLM using separate verbalisation models. Specifically, we investigate the performance of open-source, open-access LLMs against GPT-3 on five vision-language tasks when given textually-encoded visual information. Our results suggest that LLMs are effective for solving vision-language tasks even with limited samples. This approach also enhances the interpretability of a model's output by providing a means of tracing the output back through the verbalised image content.

Authors (2)
  1. Sherzod Hakimov (37 papers)
  2. David Schlangen (51 papers)
Citations (4)