On the Hidden Mystery of OCR in Large Multimodal Models (2305.07895v5)

Published 13 May 2023 in cs.CV and cs.CL

Abstract: Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). To facilitate the assessment of Optical Character Recognition (OCR) capabilities in Large Multimodal Models, we propose OCRBench, a comprehensive evaluation benchmark. Our study encompasses 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expression recognition. Most importantly, the baseline results showcased in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline and benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.

Evaluation of OCR Abilities in Large Multimodal Models

The paper "On the Hidden Mystery of OCR in Large Multimodal Models" provides an in-depth analysis of Optical Character Recognition (OCR) capabilities within Large Multimodal Models (LMMs) such as GPT4V and Gemini. It introduces OCRBench, an extensive evaluation benchmark that assesses these models across a range of text-related visual tasks. This research is notable for its comprehensive coverage, utilizing 29 datasets, thus offering a formidable foundation for understanding the strengths and limitations of LMMs in text-centric environments.

Methodology and Key Findings

The research evaluates OCR capabilities over five distinct tasks:

  1. Text Recognition
  2. Scene Text-Centric Visual Question Answering (VQA)
  3. Document-Oriented VQA
  4. Key Information Extraction (KIE)
  5. Handwritten Mathematical Expression Recognition (HMER)

LMMs demonstrated competitive performance on regular and semantically meaningful text, holding their own in tasks typically dominated by domain-specific methods. However, the paper highlights that these models struggle substantially with handwritten text, multilingual text such as Chinese, and non-semantic character strings, indicating an over-reliance on semantic cues during recognition.
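To make the protocol concrete, the following is a minimal sketch of how an OCRBench-style evaluation loop over these tasks might be organized. The task names follow the paper, but the dataset groupings, the `load_samples` loader, and the `query_model` callable are hypothetical placeholders rather than the released pipeline's actual API, and the relaxed containment metric is a common convention for scoring free-form LMM output, not necessarily the paper's exact criterion.

```python
# Illustrative sketch of an OCRBench-style evaluation loop.
# Task names follow the paper; the dataset lists, load_samples, and query_model
# are hypothetical placeholders, not the released pipeline's actual API.
from typing import Callable, Dict, List

TASKS: Dict[str, List[str]] = {
    "Text Recognition":           ["IIIT5K", "SVT", "IC13", "IC15"],
    "Scene Text-Centric VQA":     ["STVQA", "TextVQA", "OCR-VQA"],
    "Document-Oriented VQA":      ["DocVQA", "InfographicVQA", "ChartQA"],
    "Key Information Extraction": ["SROIE", "FUNSD", "POIE"],
    "Handwritten Math (HMER)":    ["HME100K"],
}

def normalize(text: str) -> str:
    """Lowercase and strip whitespace so matching ignores trivial formatting."""
    return "".join(text.lower().split())

def evaluate(query_model: Callable[[str, str], str],
             load_samples: Callable[[str], List[dict]]) -> Dict[str, float]:
    """Score a model (image path + prompt -> free-form text) on each task."""
    results: Dict[str, float] = {}
    for task, datasets in TASKS.items():
        correct, total = 0, 0
        for dataset in datasets:
            for sample in load_samples(dataset):  # {"image": ..., "prompt": ..., "answers": [...]}
                prediction = query_model(sample["image"], sample["prompt"])
                # Relaxed matching: count a hit if any reference answer
                # appears verbatim inside the model's (normalized) response.
                if any(normalize(ans) in normalize(prediction) for ans in sample["answers"]):
                    correct += 1
                total += 1
        results[task] = correct / max(total, 1)
    return results
```

Because LMMs answer in free-form text, relaxed containment matching avoids penalizing verbose but correct responses, though it can over-credit answers embedded in otherwise wrong output.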

Numerical Results

Despite promising results in certain areas, LMM performance trails supervised state-of-the-art techniques, particularly on handwritten and visually complex text. Accuracy on handwritten data, for example, was markedly lower, underscoring the gap between LMM capabilities and domain-specific solutions. LMMs were most severely challenged by handwritten mathematical expression recognition, where their performance was nearly negligible.

Implications and Future Directions

The paper posits that while LMMs hold considerable potential, their current limitations point to the need for task-specific enhancements, especially in processing fine-grained visual detail and performing character-level recognition. This opens pathways for refining multimodal approaches and encourages research into more sophisticated OCR instruction tuning.
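As a rough illustration of what OCR instruction-tuning data could look like, the sketch below pairs an image with a text-reading instruction and a target transcription. The field names, prompt templates, and conversation layout are assumptions made for illustration, not a format prescribed by the paper.

```python
# Hypothetical format for an OCR instruction-tuning sample; field names and
# prompt templates are illustrative assumptions, not the paper's specification.
import json
import random

PROMPT_TEMPLATES = [
    "What text is written in this image?",
    "Transcribe all characters exactly as they appear, including non-words.",
    "Read the handwritten content in the image.",
]

def make_sample(image_path: str, transcription: str) -> dict:
    """Wrap one labeled OCR example as an instruction-following training record."""
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": random.choice(PROMPT_TEMPLATES)},
            {"from": "assistant", "value": transcription},
        ],
    }

if __name__ == "__main__":
    # Example with a made-up image path and label.
    print(json.dumps(make_sample("scene_text/000123.jpg", "EXIT 25B"), indent=2))
```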

Future investigations should focus on augmenting the training data of LMMs with text-centric datasets to potentially overcome the highlighted shortcomings. Exploring the scalability of LMM architectures to support higher-resolution inputs could enhance their utility in document-oriented and KIE tasks.
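One common way to let a fixed-resolution vision encoder see more detail in dense document pages is to split the page into overlapping crops and encode each crop separately. The sketch below (using Pillow, with tile size and overlap chosen arbitrarily) illustrates that general idea only; it is not any specific model's preprocessing pipeline.

```python
# Sketch of tiling a high-resolution document page into crops for a
# fixed-resolution vision encoder. Tile size and overlap are arbitrary choices;
# this is an illustration, not a specific model's preprocessing.
from typing import List
from PIL import Image

def tile_page(image_path: str, tile: int = 448, overlap: int = 64) -> List[Image.Image]:
    """Split a page into overlapping square crops, padding boundary crops."""
    page = Image.open(image_path).convert("RGB")
    width, height = page.size
    stride = tile - overlap
    crops = []
    for top in range(0, max(height - overlap, 1), stride):
        for left in range(0, max(width - overlap, 1), stride):
            box = (left, top, min(left + tile, width), min(top + tile, height))
            crop = page.crop(box)
            if crop.size != (tile, tile):  # pad edge crops to a uniform size
                padded = Image.new("RGB", (tile, tile), (255, 255, 255))
                padded.paste(crop, (0, 0))
                crop = padded
            crops.append(crop)
    return crops
```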

Additionally, the research raises intriguing questions about how the balance of multimodal training data affects OCR proficiency. Insights from evaluations such as this one could well inform the next wave of advances in OCR technology.

In essence, while LMMs like GPT4V and Gemini show an ability to generalize across diverse text recognition tasks, the paper underscores the need for specialized and enhanced training methodologies to close the remaining performance gaps with domain-specific models. The implications for both theoretical exploration and practical, AI-driven OCR are substantial, setting the stage for future advances in the field.

Authors (8)
  1. Yuliang Liu
  2. Zhang Li
  3. Biao Yang
  4. Chunyuan Li
  5. Xucheng Yin
  6. Lianwen Jin
  7. Xiang Bai
  8. Cheng-Lin Liu