MMR: Evaluating Reading Ability of Large Multimodal Models (2408.14594v1)

Published 26 Aug 2024 in cs.CV

Abstract: Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images. Most existing text-rich image benchmarks consist of simple extraction-based question answering, and many LMMs now easily achieve high scores on them. This means that current benchmarks fail to distinguish the performance of different models, motivating a new benchmark that evaluates their complex reasoning and spatial understanding abilities. In this work, we propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs on text-rich image understanding. MMR is the first text-rich image benchmark built on human annotations with the help of LLMs. Evaluations of several state-of-the-art LMMs, including GPT-4o, reveal the limited capabilities of existing models, underscoring the value of our benchmark.
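
As a rough illustration of how an evaluation on a benchmark like MMR might look (this is a minimal sketch, not the authors' harness; the dataset id "MMR/benchmark" and the field names "image", "question", and "answer" are assumptions, and exact-match scoring stands in for whatever task-specific metrics the paper uses):

```python
# Minimal sketch: score a multimodal model on MMR-style image QA.
import base64
import io

from datasets import load_dataset  # pip install datasets
from openai import OpenAI          # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def image_to_data_url(image) -> str:
    """Encode a PIL image as a base64 data URL for the chat API."""
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

ds = load_dataset("MMR/benchmark", split="test")  # hypothetical dataset id

correct = 0
for example in ds:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": example["question"]},
                {"type": "image_url",
                 "image_url": {"url": image_to_data_url(example["image"])}},
            ],
        }],
    )
    prediction = response.choices[0].message.content.strip()
    # Naive exact-match scoring; reasoning and spatial tasks would likely
    # require more forgiving, task-specific metrics.
    correct += prediction.lower() == example["answer"].lower()

print(f"Accuracy: {correct / len(ds):.3f}")
```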

Authors (6)
  1. Jian Chen (257 papers)
  2. Ruiyi Zhang (98 papers)
  3. Yufan Zhou (36 papers)
  4. Ryan Rossi (67 papers)
  5. Jiuxiang Gu (73 papers)
  6. Changyou Chen (108 papers)
Citations (1)
