Separate and Locate: Rethink the Text in Text-based Visual Question Answering (2308.16383v1)

Published 31 Aug 2023 in cs.CV and cs.MM

Abstract: Text-based Visual Question Answering (TextVQA) aims to answer questions about the text in images. Most works in this field focus on designing network structures or pre-training tasks. All of these methods list the OCR texts in reading order (from left to right and top to bottom) to form a sequence, which is then treated as a natural language "sentence". However, they ignore the fact that most OCR words in the TextVQA task have no semantic contextual relationship with one another. In addition, these approaches use a 1-D position embedding to model the spatial relations between OCR tokens sequentially, which is not reasonable: a 1-D position embedding can only represent the left-to-right order of words in a sentence, not their complex spatial relationships in the image. To tackle these problems, we propose a novel method named Separate and Locate (SaL) that explores text contextual cues and designs a spatial position embedding to construct spatial relations between OCR texts. Specifically, we propose a Text Semantic Separate (TSS) module that helps the model recognize whether words have semantic contextual relations. We then introduce a Spatial Circle Position (SCP) module that helps the model better construct and reason about the spatial position relationships between OCR texts. Our SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on the TextVQA and ST-VQA datasets, respectively. Compared with the state-of-the-art pre-training method, which is pre-trained on 64 million samples, our method, without any pre-training tasks, still achieves 2.68% and 2.52% accuracy improvements on TextVQA and ST-VQA, respectively. Our code and models will be released at https://github.com/fangbufang/SaL.
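The limitation of reading-order sequences described in the abstract can be illustrated with a small sketch. This is not the authors' SCP module; it only contrasts 1-D sequence positions with 2-D distances. The `OcrToken` type, the `reading_order`, `index_gap`, and `pixel_distance` helpers, the `row_height` bucketing, and the sample coordinates are all assumptions made for illustration.

```python
import math
from typing import List, NamedTuple


class OcrToken(NamedTuple):
    """Hypothetical OCR token: a word plus its box center in pixels."""
    word: str
    cx: int
    cy: int


def reading_order(tokens: List[OcrToken], row_height: int = 40) -> List[OcrToken]:
    """Sort top-to-bottom, then left-to-right, bucketing rows by row_height."""
    return sorted(tokens, key=lambda t: (t.cy // row_height, t.cx))


def index_gap(order: List[OcrToken], a: OcrToken, b: OcrToken) -> int:
    """Distance between two tokens as seen by a 1-D position index."""
    return abs(order.index(a) - order.index(b))


def pixel_distance(a: OcrToken, b: OcrToken) -> float:
    """Euclidean distance between box centers in the image plane."""
    return math.hypot(a.cx - b.cx, a.cy - b.cy)


# Two signs side by side: "STOP / AHEAD" on the left, "EXIT / LEFT" on the right.
stop = OcrToken("STOP", 100, 50)
ahead = OcrToken("AHEAD", 100, 90)
exit_ = OcrToken("EXIT", 500, 50)
left = OcrToken("LEFT", 500, 90)

order = reading_order([left, ahead, exit_, stop])
print([t.word for t in order])

# In the 1-D sequence, EXIT sits next to STOP and AHEAD is further away,
# although AHEAD is the word directly below STOP on the same sign.
print(index_gap(order, stop, exit_), index_gap(order, stop, ahead))
print(pixel_distance(stop, exit_), pixel_distance(stop, ahead))
```

In this toy layout the sequence index places "EXIT" closer to "STOP" than "AHEAD", while the image-plane distance says the opposite, which is the kind of mismatch a spatial position embedding is meant to address.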


Authors (5)

Chengyang Fang
Jiangnan Li
Liang Li
Can Ma
Dayong Hu

