MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering (2405.11985v3)

Published 20 May 2024 in cs.CV

Abstract: Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. Nonetheless, most existing TEC-VQA benchmarks have focused on high-resource languages like English and Chinese. Despite pioneering works to expand multilingual QA pairs in non-text-centric VQA datasets through translation engines, the translation-based protocol encounters a substantial "visual-textual misalignment" problem when applied to TEC-VQA. Specifically, it prioritizes the text in question-answer pairs while disregarding the visual text present in images. Moreover, it fails to address complexities related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we tackle multilingual TEC-VQA by introducing MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages, consisting of 6,778 question-answer pairs across 2,116 images. Further, by comprehensively evaluating numerous state-of-the-art Multimodal LLMs (MLLMs), including Qwen2-VL, GPT-4o, GPT-4V, Claude3, and Gemini, on the MTVQA benchmark, it is evident that there is still considerable room for performance improvement (Qwen2-VL scoring 30.9 versus 79.7 for human performance), underscoring the value of MTVQA. Additionally, we supply multilingual training data within the MTVQA dataset, demonstrating that straightforward fine-tuning with this data can substantially enhance multilingual TEC-VQA performance. We aspire that MTVQA will offer the research community fresh insights and stimulate further exploration in multilingual visual text comprehension. The project homepage is available at https://bytedance.github.io/MTVQA/.
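To make the benchmark-evaluation setup concrete, below is a minimal Python sketch of how one might score an MLLM on MTVQA-style question-answer pairs, broken down per language. This is not the authors' evaluation code: the JSON layout (records with "image", "question", "answer", and "language" fields), the answer-containment scoring rule, and the `answer_question` stub are assumptions standing in for the actual release, which should be checked against the project homepage.

```python
"""Hypothetical per-language evaluation sketch for an MTVQA-style benchmark.

Assumed (not from the paper): annotations are a JSON list of records with
"image", "question", "answer", and "language" fields, and a prediction counts
as correct when the ground-truth answer string appears in the model output.
"""
import json
from collections import defaultdict


def answer_question(image_path: str, question: str) -> str:
    """Placeholder for a real multimodal LLM call; replace with your model/API."""
    return ""  # stub answer so the script runs end to end


def evaluate(annotation_file: str) -> dict:
    """Return per-language accuracy over the QA pairs in `annotation_file`."""
    with open(annotation_file, encoding="utf-8") as f:
        records = json.load(f)

    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        pred = answer_question(rec["image"], rec["question"]).strip().lower()
        gold = rec["answer"].strip().lower()
        total[rec["language"]] += 1
        if gold in pred:  # containment match, per the assumption noted above
            correct[rec["language"]] += 1

    return {lang: correct[lang] / total[lang] for lang in total}


if __name__ == "__main__":
    scores = evaluate("mtvqa_test.json")  # hypothetical annotation file name
    for lang, acc in sorted(scores.items()):
        print(f"{lang}: {acc:.1%}")
```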

Authors (15)
  1. Jingqun Tang (22 papers)
  2. Qi Liu (485 papers)
  3. Yongjie Ye (8 papers)
  4. Jinghui Lu (28 papers)
  5. Shu Wei (17 papers)
  6. Chunhui Lin (9 papers)
  7. Wanqing Li (53 papers)
  8. Mohamad Fitri Faiz Bin Mahmood (1 paper)
  9. Hao Feng (83 papers)
  10. Zhen Zhao (85 papers)
  11. Yanjie Wang (18 papers)
  12. Yuliang Liu (82 papers)
  13. Hao Liu (497 papers)
  14. Xiang Bai (221 papers)
  15. Can Huang (43 papers)