MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering (2405.11985v3)

Published 20 May 2024 in cs.CV

Abstract: Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. Nonetheless, most existing TEC-VQA benchmarks have focused on high-resource languages like English and Chinese. Despite pioneering works to expand multilingual QA pairs in non-text-centric VQA datasets through translation engines, the translation-based protocol encounters a substantial "visual-textual misalignment" problem when applied to TEC-VQA. Specifically, it prioritizes the text in question-answer pairs while disregarding the visual text present in images. Moreover, it fails to address complexities related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we tackle multilingual TEC-VQA by introducing MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages, consisting of 6,778 question-answer pairs across 2,116 images. Further, by comprehensively evaluating numerous state-of-the-art Multimodal LLMs (MLLMs), including Qwen2-VL, GPT-4o, GPT-4V, Claude3, and Gemini, on the MTVQA benchmark, it is evident that there is still considerable room for performance improvement (Qwen2-VL scoring 30.9 versus 79.7 for human performance), underscoring the value of MTVQA. Additionally, we supply multilingual training data within the MTVQA dataset, demonstrating that straightforward fine-tuning with this data can substantially enhance multilingual TEC-VQA performance. We aspire that MTVQA will offer the research community fresh insights and stimulate further exploration in multilingual visual text comprehension. The project homepage is available at https://bytedance.github.io/MTVQA/.
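To make the benchmark-evaluation setup concrete, below is a minimal Python sketch of how one might score an MLLM on MTVQA-style question-answer pairs, broken down per language. This is not the authors' evaluation code: the JSON layout (records with "image", "question", "answer", and "language" fields), the answer-containment scoring rule, and the `answer_question` stub are assumptions standing in for the actual release, which should be checked against the project homepage.

```python
"""Hypothetical per-language evaluation sketch for an MTVQA-style benchmark.

Assumed (not from the paper): annotations are a JSON list of records with
"image", "question", "answer", and "language" fields, and a prediction counts
as correct when the ground-truth answer string appears in the model output.
"""
import json
from collections import defaultdict


def answer_question(image_path: str, question: str) -> str:
    """Placeholder for a real multimodal LLM call; replace with your model/API."""
    return ""  # stub answer so the script runs end to end


def evaluate(annotation_file: str) -> dict:
    """Return per-language accuracy over the QA pairs in `annotation_file`."""
    with open(annotation_file, encoding="utf-8") as f:
        records = json.load(f)

    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        pred = answer_question(rec["image"], rec["question"]).strip().lower()
        gold = rec["answer"].strip().lower()
        total[rec["language"]] += 1
        if gold in pred:  # containment match, per the assumption noted above
            correct[rec["language"]] += 1

    return {lang: correct[lang] / total[lang] for lang in total}


if __name__ == "__main__":
    scores = evaluate("mtvqa_test.json")  # hypothetical annotation file name
    for lang, acc in sorted(scores.items()):
        print(f"{lang}: {acc:.1%}")
```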

Authors (15)
  1. Jingqun Tang (22 papers)
  2. Qi Liu (485 papers)
  3. Yongjie Ye (8 papers)
  4. Jinghui Lu (28 papers)
  5. Shu Wei (17 papers)
  6. Chunhui Lin (9 papers)
  7. Wanqing Li (53 papers)
  8. Mohamad Fitri Faiz Bin Mahmood (1 paper)
  9. Hao Feng (83 papers)
  10. Zhen Zhao (85 papers)
  11. Yanjie Wang (18 papers)
  12. Yuliang Liu (82 papers)
  13. Hao Liu (497 papers)
  14. Xiang Bai (221 papers)
  15. Can Huang (43 papers)