
Multimodal grid features and cell pointers for Scene Text Visual Question Answering (2006.00923v2)

Published 1 Jun 2020 in cs.CV

Abstract: This paper presents a new model for the task of scene text visual question answering, in which questions about a given image can only be answered by reading and understanding the scene text present in it. The proposed model is based on an attention mechanism that attends to multi-modal features conditioned on the question, allowing it to reason jointly about the textual and visual modalities in the scene. The output weights of this attention module over the grid of multi-modal spatial features are interpreted as the probability that a certain spatial location of the image contains the answer text to the given question. Our experiments demonstrate competitive performance on two standard datasets. Furthermore, this paper provides a novel analysis of the ST-VQA dataset based on a human performance study.
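
The idea of reading attention weights over a feature grid as answer-location probabilities can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product scoring, the function name `cell_pointer`, and the toy 2x2 grid are all assumptions made for illustration, standing in for the paper's attention module and its learned multi-modal features.

```python
import math

def cell_pointer(grid, question):
    """Return (probabilities, best_cell) for a grid of feature vectors.

    grid: list of rows, each row a list of cell feature vectors.
    question: a feature vector with the same length as each cell vector.
    """
    # Score each cell by its dot product with the question vector.
    # This is an illustrative stand-in, not the paper's attention module.
    scores = [[sum(f * q for f, q in zip(cell, question)) for cell in row]
              for row in grid]
    # Softmax over all cells, subtracting the max for numerical stability.
    peak = max(max(row) for row in scores)
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    probs = [[e / total for e in row] for row in exps]
    # The pointer selects the cell with the highest probability.
    cells = [(r, c) for r in range(len(probs)) for c in range(len(probs[0]))]
    best = max(cells, key=lambda rc: probs[rc[0]][rc[1]])
    return probs, best

# Hypothetical 2x2 grid of 2-dimensional multi-modal features.
grid = [
    [[0.1, 0.0], [0.2, 0.1]],
    [[0.9, 0.8], [0.0, 0.3]],
]
question = [1.0, 1.0]
probs, best = cell_pointer(grid, question)
```

Here `probs` sums to one across the grid, and `best` is the (row, column) of the cell the sketch would point to as containing the answer text.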

Authors (7)
  1. Ali Furkan Biten (17 papers)
  2. Rubèn Tito (12 papers)
  3. Marçal Rusiñol (20 papers)
  4. Ernest Valveny (28 papers)
  5. Dimosthenis Karatzas (80 papers)
  6. Lluís Gómez (3 papers)
  7. Andrés Mafla (4 papers)
Citations (20)