An Evaluation of OCR Capabilities in Large Multimodal Models: Introduction of OCRBench v2
The paper "OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning" makes a substantial contribution to the evaluation of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) tasks. Previous benchmarks have acknowledged the strength of LMMs in text recognition but have not adequately examined their capabilities on more complex tasks such as text localization, handwritten content extraction, and logical reasoning. To bridge these gaps, the authors present OCRBench v2, a comprehensive bilingual text-centric benchmark.
OCRBench v2 is distinguished by its expansive coverage: it features four times as many tasks as prior benchmarks and spans 31 scenarios, from street scenes to scientific diagrams. It includes a wide variety of text-centric tasks, supported by 10,000 human-verified question-answering pairs and evaluation metrics tailored to each task.
Upon evaluating 38 state-of-the-art LMMs, the authors find that 36 of them score below 50 out of 100, revealing five recurring limitations: recognition of less frequently encountered text, fine-grained perception, layout perception, complex element parsing, and logical reasoning. These empirical findings underscore that, despite recent advances, LMMs are not yet able to handle the full range of challenges posed by diverse text-rich environments.
Key Contributions and Methodology
OCRBench v2 offers a rigorous framework that breaks OCR capability down into eight core areas: text recognition, text referring, text spotting, relation extraction, element parsing, mathematical calculation, visual text understanding, and knowledge reasoning. This categorization makes it possible to dissect the strengths and weaknesses of current LMMs area by area. The benchmark's methodological breadth ensures that the full span of visual text processing is evaluated, pushing beyond merely recognizing text toward understanding its context and details within broader scenarios.
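To make the capability breakdown concrete, here is a minimal sketch of how per-sample scores could be aggregated into the eight areas; the record fields ("capability", "score") and the 0-1 score scale are illustrative assumptions, not the actual OCRBench v2 data schema or scoring protocol.

```python
# Sketch: average per-sample scores within each of the eight capability areas.
# The input record format is hypothetical, for illustration only.
from collections import defaultdict

CAPABILITIES = [
    "text recognition", "text referring", "text spotting", "relation extraction",
    "element parsing", "mathematical calculation", "visual text understanding",
    "knowledge reasoning",
]

def aggregate_by_capability(results):
    """Average per-sample scores (assumed 0-1) within each capability area."""
    buckets = defaultdict(list)
    for record in results:
        buckets[record["capability"]].append(record["score"])
    return {
        cap: sum(buckets[cap]) / len(buckets[cap]) if buckets[cap] else 0.0
        for cap in CAPABILITIES
    }

# Example with two toy results.
demo = [
    {"capability": "text recognition", "score": 0.82},
    {"capability": "knowledge reasoning", "score": 0.41},
]
print(aggregate_by_capability(demo))
```

Reporting a per-area average like this is what lets the paper pinpoint where models fall short (for example, strong recognition but weak reasoning) rather than collapsing everything into a single score.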
The benchmark uses metrics matched to each task's output format, including TEDS for parsing tasks and IoU for text localization, so that every task is scored with an assessment tool suited to its nature. For tasks involving logical reasoning and comprehension, metrics such as BLEU, METEOR, and ANLS are employed.
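As an illustration of the localization metric, below is a minimal sketch of computing IoU between a predicted text box and a ground-truth box; the (x1, y1, x2, y2) box format and the 0.5 match threshold are assumptions made for the example, not details drawn from the paper.

```python
# Sketch: Intersection-over-Union for axis-aligned text boxes (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (clamped to zero if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # Union = sum of the two areas minus the intersection.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: predicted text box vs. its ground-truth box.
pred = (10, 10, 110, 40)
gt = (20, 12, 120, 42)
score = iou(pred, gt)
print(f"IoU = {score:.3f}, match = {score >= 0.5}")  # 0.5 threshold is illustrative
```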
Implications and Future Directions
The authors demonstrate that LMMs, despite their strong zero-shot capabilities, still struggle with tasks that demand higher-order text understanding and reasoning, which real-world applications often require. The results suggest that LMMs need further improvement in fine-grained visual-textual analysis, perception of complex spatial relationships, and logical reasoning over textual content.
Practically, this research can guide future work on LMM architectures that handle high-resolution inputs, use tokens more efficiently, and benefit from better task-specific pretraining data. Theoretically, it underscores the value of continued exploration into models that unify visual and textual processing more effectively, for instance through more sophisticated contextual understanding mechanisms or hybrid approaches that combine traditional OCR techniques with LMMs.
In conclusion, OCRBench v2 is a critical resource for advancing multimodal AI and for bringing more nuanced and demanding OCR tasks within reach of future LMMs. The benchmark compels the research community to acknowledge and address the complexities of visual text environments, paving the way for more robust and intelligent multimodal systems.