MMR: Evaluating Reading Ability of Large Multimodal Models (2408.14594v1)

Published 26 Aug 2024 in cs.CV

Abstract: Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images. Most existing text-rich image benchmarks consist of simple extraction-based question answering, and many LMMs now easily achieve high scores on them. This means that current benchmarks fail to distinguish the performance of different models, motivating a new benchmark that evaluates their complex reasoning and spatial understanding abilities. In this work, we propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs on text-rich image understanding. MMR is the first text-rich image benchmark built on human annotations with the help of LLMs. Evaluations of several state-of-the-art LMMs, including GPT-4o, reveal the limited capabilities of existing models, underscoring the value of our benchmark.
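
As a rough illustration of how an evaluation on a benchmark like MMR might look (this is a minimal sketch, not the authors' harness; the dataset id "MMR/benchmark" and the field names "image", "question", and "answer" are assumptions, and exact-match scoring stands in for whatever task-specific metrics the paper uses):

```python
# Minimal sketch: score a multimodal model on MMR-style image QA.
import base64
import io

from datasets import load_dataset  # pip install datasets
from openai import OpenAI          # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def image_to_data_url(image) -> str:
    """Encode a PIL image as a base64 data URL for the chat API."""
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

ds = load_dataset("MMR/benchmark", split="test")  # hypothetical dataset id

correct = 0
for example in ds:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": example["question"]},
                {"type": "image_url",
                 "image_url": {"url": image_to_data_url(example["image"])}},
            ],
        }],
    )
    prediction = response.choices[0].message.content.strip()
    # Naive exact-match scoring; reasoning and spatial tasks would likely
    # require more forgiving, task-specific metrics.
    correct += prediction.lower() == example["answer"].lower()

print(f"Accuracy: {correct / len(ds):.3f}")
```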

Authors (6)
  1. Jian Chen (257 papers)
  2. Ruiyi Zhang (98 papers)
  3. Yufan Zhou (36 papers)
  4. Ryan Rossi (67 papers)
  5. Jiuxiang Gu (73 papers)
  6. Changyou Chen (108 papers)
Citations (1)
