RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering (2010.12917v1)

Published 24 Oct 2020 in cs.CV and cs.AI

Abstract: Text-based visual question answering (VQA) requires reading and understanding text in an image to correctly answer a given question. However, most current methods simply add optical character recognition (OCR) tokens extracted from the image into the VQA model without considering the contextual information of the OCR tokens or mining the relationships between OCR tokens and scene objects. In this paper, we propose a novel text-centered method called RUArt (Reading, Understanding and Answering the Related Text) for text-based VQA. Taking an image and a question as input, RUArt first reads the image and obtains text and scene objects. Then, it understands the question, the OCRed text and the objects in the context of the scene, and further mines the relationships among them. Finally, it answers the related text for the given question through text semantic matching and reasoning. We evaluate RUArt on two text-based VQA benchmarks (ST-VQA and TextVQA) and conduct extensive ablation studies to explore the reasons behind RUArt's effectiveness. Experimental results demonstrate that our method can effectively explore the contextual information of the text and mine the stable relationships between the text and objects.
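
The read, understand, answer flow described in the abstract can be illustrated with a toy sketch. This is not the RUArt implementation: the names `OcrToken`, `read_image_stub`, `score`, and `answer` are hypothetical, the "read" step is a hard-coded stand-in for OCR and object detection, and simple word overlap is swapped in for the paper's semantic matching and reasoning.

```python
from dataclasses import dataclass


@dataclass
class OcrToken:
    text: str      # the recognized text (hypothetical example data)
    context: set   # words describing nearby scene objects (assumed, not from the paper)


def read_image_stub():
    """Stand-in for the 'read' step; a real system would run OCR and object detection."""
    return [
        OcrToken("STOP", {"sign", "red", "street"}),
        OcrToken("Pepsi", {"bottle", "blue", "table"}),
    ]


def score(question, token):
    """Toy 'understand' step: count question words that appear in the token's context."""
    q_words = set(question.lower().replace("?", "").split())
    return len(q_words & token.context)


def answer(question, tokens):
    """Toy 'answer' step: return the OCR text whose context best matches the question."""
    return max(tokens, key=lambda t: score(question, t)).text


tokens = read_image_stub()
print(answer("What does the red sign say?", tokens))
```

The point of the sketch is only that the answer is chosen among OCR tokens using their surrounding scene context, rather than by treating the tokens as isolated words.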

Authors (7)
  1. Zan-Xia Jin (1 paper)
  2. Heran Wu (1 paper)
  3. Chun Yang (45 papers)
  4. Fang Zhou (44 papers)
  5. Jingyan Qin (4 papers)
  6. Lei Xiao (68 papers)
  7. Xu-Cheng Yin (35 papers)
Citations (29)