Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering (2109.08029v3)

Published 15 Sep 2021 in cs.CV and cs.AI

Abstract: Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained LLMs have been shown to include world knowledge, we propose a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained LLMs. Our results on a visual question answering task that requires external knowledge (OK-VQA) show that our text-only model outperforms pretrained multimodal (image-text) models with a comparable number of parameters. In contrast, our model is less effective on a standard VQA task (VQA 2.0), confirming that our text-only method is especially effective for tasks requiring external knowledge. In addition, we show that increasing the LLM's size notably improves performance: our largest model yields results comparable to the state of the art, significantly outperforming current multimodal systems, even those augmented with external knowledge. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be balanced by the better inference ability of the text-only LLMs. Our work opens up possibilities to further improve inference in visio-linguistic tasks.
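The text-only pipeline the abstract describes can be sketched as follows. The captioner and language model below are hypothetical stand-ins (simple stubs), not the models used in the paper; the sketch only illustrates the interface: caption the image, concatenate caption and question into one text prompt, and let the LM's world knowledge supply the answer.

```python
# Sketch of a caption-then-answer pipeline (stand-in components, not the
# paper's actual models).

def caption_image(image_id: str) -> str:
    """Stand-in for an off-the-shelf image captioning model."""
    captions = {
        "img_001": "a red double-decker bus driving down a city street",
    }
    return captions.get(image_id, "an image")

def answer_with_lm(caption: str, question: str) -> str:
    """Stand-in for a pretrained LM. The key idea is that the caption and
    question are joined into a single text prompt, so the task becomes
    purely textual."""
    prompt = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    # A real LM would generate a continuation of `prompt`; here we return
    # a fixed answer to show where external knowledge would come in.
    if "country" in question and "double-decker" in caption:
        return "england"  # world knowledge a pretrained LM could supply
    return "unknown"

caption = caption_image("img_001")
print(answer_with_lm(caption, "Which country is this bus associated with?"))
```

Note that, as the qualitative analysis points out, the caption may omit details needed for the question; in this framing the LM's inference ability has to compensate for that loss.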

Authors (5)
  1. Ander Salaberria (8 papers)
  2. Gorka Azkune (14 papers)
  3. Oier Lopez de Lacalle (19 papers)
  4. Aitor Soroa (29 papers)
  5. Eneko Agirre (53 papers)
Citations (46)