
On the Hidden Mystery of OCR in Large Multimodal Models

(arXiv 2305.07895)
Published May 13, 2023 in cs.CV and cs.CL

Abstract

Large models have recently come to dominate natural language processing and multimodal vision-language learning, yet their effectiveness on text-related visual tasks remains relatively unexplored. In this paper, we conduct a comprehensive evaluation of Large Multimodal Models, such as GPT-4V and Gemini, on a range of text-related visual tasks: Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). To facilitate the assessment of Optical Character Recognition (OCR) capabilities in Large Multimodal Models, we propose OCRBench, a comprehensive evaluation benchmark. Our study encompasses 29 datasets, making it the most comprehensive OCR evaluation benchmark available. It reveals both the strengths and weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expressions. Most importantly, the baseline results reported here can serve as a foundation for designing and assessing new strategies for enhancing zero-shot multimodal techniques. The evaluation pipeline and benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.
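The released pipeline at the repository above implements the full evaluation protocol. As a rough illustration of the kind of zero-shot scoring the abstract describes (a minimal sketch under assumed conventions, not the authors' code), the loop below evaluates a multimodal model on text recognition by checking whether the normalized ground truth appears in the normalized model response. Here `query_model` is a hypothetical stand-in for whatever LMM API is under test, and the JSON-lines sample format is likewise an assumption.

```python
import json
import re

def normalize(text: str) -> str:
    # Lowercase and strip non-alphanumeric characters, a common
    # normalization before scoring free-form OCR answers.
    return re.sub(r"[^0-9a-z]", "", text.lower())

def query_model(image_path: str, prompt: str) -> str:
    # Hypothetical stand-in: replace with a call to the multimodal
    # model under evaluation (API client or local inference).
    raise NotImplementedError

def evaluate(samples_path: str) -> float:
    # Assumed format, one JSON object per line:
    #   {"image": "path/to/img.jpg", "answer": "ground-truth text"}
    with open(samples_path) as f:
        samples = [json.loads(line) for line in f]

    correct = 0
    for sample in samples:
        response = query_model(sample["image"],
                               "What is written in this image?")
        # Count a hit when the normalized ground truth is contained
        # in the normalized response.
        if normalize(sample["answer"]) in normalize(response):
            correct += 1
    return correct / len(samples)
```

Containment rather than exact match is used here so that a chatty response ("The image says HELLO") still counts as a correct recognition; a stricter metric would simply swap the membership test for equality.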


