On the Hidden Mystery of OCR in Large Multimodal Models (2305.07895v5)

Published 13 May 2023 in cs.CV and cs.CL

Abstract: Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). To facilitate the assessment of Optical Character Recognition (OCR) capabilities in Large Multimodal Models, we propose OCRBench, a comprehensive evaluation benchmark. Our study encompasses 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expression recognition. Most importantly, the baseline results showcased in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline and benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.

Evaluation of OCR Abilities in Large Multimodal Models

The paper "On the Hidden Mystery of OCR in Large Multimodal Models" provides an in-depth analysis of Optical Character Recognition (OCR) capabilities within Large Multimodal Models (LMMs) such as GPT4V and Gemini. It introduces OCRBench, an extensive evaluation benchmark that assesses these models across a range of text-related visual tasks. This research is notable for its comprehensive coverage, utilizing 29 datasets, thus offering a formidable foundation for understanding the strengths and limitations of LMMs in text-centric environments.

Methodology and Key Findings

The research evaluates OCR capabilities over five distinct tasks:

  1. Text Recognition
  2. Scene Text-Centric Visual Question Answering (VQA)
  3. Document-Oriented VQA
  4. Key Information Extraction (KIE)
  5. Handwritten Mathematical Expression Recognition (HMER)

LMMs demonstrated competitive performance on regular and semantically meaningful text, holding their own in tasks typically dominated by domain-specific methods. However, the paper highlights that these models struggle substantially with handwritten text, multilingual text such as Chinese, and non-semantic character strings, indicating an over-reliance on semantic cues during recognition.
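To make the protocol concrete, the following is a minimal sketch of how an OCRBench-style evaluation loop over these tasks might be organized. The task names follow the paper, but the dataset groupings, the `load_samples` loader, and the `query_model` callable are hypothetical placeholders rather than the released pipeline's actual API, and the relaxed containment metric is a common convention for scoring free-form LMM output, not necessarily the paper's exact criterion.

```python
# Illustrative sketch of an OCRBench-style evaluation loop.
# Task names follow the paper; the dataset lists, load_samples, and query_model
# are hypothetical placeholders, not the released pipeline's actual API.
from typing import Callable, Dict, List

TASKS: Dict[str, List[str]] = {
    "Text Recognition":           ["IIIT5K", "SVT", "IC13", "IC15"],
    "Scene Text-Centric VQA":     ["STVQA", "TextVQA", "OCR-VQA"],
    "Document-Oriented VQA":      ["DocVQA", "InfographicVQA", "ChartQA"],
    "Key Information Extraction": ["SROIE", "FUNSD", "POIE"],
    "Handwritten Math (HMER)":    ["HME100K"],
}

def normalize(text: str) -> str:
    """Lowercase and strip whitespace so matching ignores trivial formatting."""
    return "".join(text.lower().split())

def evaluate(query_model: Callable[[str, str], str],
             load_samples: Callable[[str], List[dict]]) -> Dict[str, float]:
    """Score a model (image path + prompt -> free-form text) on each task."""
    results: Dict[str, float] = {}
    for task, datasets in TASKS.items():
        correct, total = 0, 0
        for dataset in datasets:
            for sample in load_samples(dataset):  # {"image": ..., "prompt": ..., "answers": [...]}
                prediction = query_model(sample["image"], sample["prompt"])
                # Relaxed matching: count a hit if any reference answer
                # appears verbatim inside the model's (normalized) response.
                if any(normalize(ans) in normalize(prediction) for ans in sample["answers"]):
                    correct += 1
                total += 1
        results[task] = correct / max(total, 1)
    return results
```

Because LMMs answer in free-form text, relaxed containment matching avoids penalizing verbose but correct responses, though it can over-credit answers embedded in otherwise wrong output.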

Numerical Results

Despite promising results in certain areas, LMM performance trails supervised state-of-the-art techniques, particularly on handwritten and visually complex text. Accuracy on handwritten data, for example, was markedly lower, underscoring the gap between LMM capabilities and domain-specific solutions. LMMs were most severely challenged by handwritten mathematical expression recognition, where their performance was nearly negligible.

Implications and Future Directions

The paper posits that while LMMs hold considerable potential, their current limitations point to the need for task-specific enhancements, especially in processing fine-grained visual detail and performing character-level recognition. This opens pathways for refining multimodal approaches and encourages research into more sophisticated OCR instruction tuning.
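As a rough illustration of what OCR instruction-tuning data could look like, the sketch below pairs an image with a text-reading instruction and a target transcription. The field names, prompt templates, and conversation layout are assumptions made for illustration, not a format prescribed by the paper.

```python
# Hypothetical format for an OCR instruction-tuning sample; field names and
# prompt templates are illustrative assumptions, not the paper's specification.
import json
import random

PROMPT_TEMPLATES = [
    "What text is written in this image?",
    "Transcribe all characters exactly as they appear, including non-words.",
    "Read the handwritten content in the image.",
]

def make_sample(image_path: str, transcription: str) -> dict:
    """Wrap one labeled OCR example as an instruction-following training record."""
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": random.choice(PROMPT_TEMPLATES)},
            {"from": "assistant", "value": transcription},
        ],
    }

if __name__ == "__main__":
    # Example with a made-up image path and label.
    print(json.dumps(make_sample("scene_text/000123.jpg", "EXIT 25B"), indent=2))
```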

Future investigations should focus on augmenting the training data of LMMs with text-centric datasets to potentially overcome the highlighted shortcomings. Exploring the scalability of LMM architectures to support higher-resolution inputs could enhance their utility in document-oriented and KIE tasks.
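One common way to let a fixed-resolution vision encoder see more detail in dense document pages is to split the page into overlapping crops and encode each crop separately. The sketch below (using Pillow, with tile size and overlap chosen arbitrarily) illustrates that general idea only; it is not any specific model's preprocessing pipeline.

```python
# Sketch of tiling a high-resolution document page into crops for a
# fixed-resolution vision encoder. Tile size and overlap are arbitrary choices;
# this is an illustration, not a specific model's preprocessing.
from typing import List
from PIL import Image

def tile_page(image_path: str, tile: int = 448, overlap: int = 64) -> List[Image.Image]:
    """Split a page into overlapping square crops, padding boundary crops."""
    page = Image.open(image_path).convert("RGB")
    width, height = page.size
    stride = tile - overlap
    crops = []
    for top in range(0, max(height - overlap, 1), stride):
        for left in range(0, max(width - overlap, 1), stride):
            box = (left, top, min(left + tile, width), min(top + tile, height))
            crop = page.crop(box)
            if crop.size != (tile, tile):  # pad edge crops to a uniform size
                padded = Image.new("RGB", (tile, tile), (255, 255, 255))
                padded.paste(crop, (0, 0))
                crop = padded
            crops.append(crop)
    return crops
```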

Additionally, the research raises intriguing questions about how the balance of multimodal training data affects OCR proficiency. Insights from evaluations such as this one could well inform the next wave of advances in OCR technology.

In essence, while LMMs like GPT4V and Gemini show an ability to generalize across diverse text recognition tasks, the paper underscores the need for specialized and enhanced training methodologies to close the remaining performance gaps with domain-specific models. The implications for both theoretical exploration and practical, AI-driven OCR are substantial, setting the stage for future advances in the field.

Authors (8)
  1. Yuliang Liu
  2. Zhang Li
  3. Biao Yang
  4. Chunyuan Li
  5. Xucheng Yin
  6. Lianwen Jin
  7. Xiang Bai
  8. Cheng-Lin Liu