An Empirical Study of Scaling Law for OCR (2401.00028v3)

Published 29 Dec 2023 in cs.CV

Abstract: The laws of model size, data volume, computation and model performance have been extensively studied in the field of NLP. However, the scaling laws in Optical Character Recognition (OCR) have not yet been investigated. To address this, we conducted comprehensive studies that involved examining the correlation between performance and the scale of models, data volume and computation in the field of text recognition. Conclusively, the study demonstrates smooth power laws between performance and model size, as well as training data volume, when other influencing factors are held constant. Additionally, we have constructed a large-scale dataset called REBU-Syn, which comprises 6 million real samples and 18 million synthetic samples. Based on our scaling law and new dataset, we have successfully trained a scene text recognition model, achieving a new state-of-the-art on 6 common test benchmarks with a top-1 average accuracy of 97.42%. The models and dataset are publicly available at https://github.com/large-ocr-model/large-ocr-model.github.io.

An Empirical Study of Scaling Law for OCR: A Comprehensive Analysis

This paper investigates the previously unexplored area of scaling laws within Optical Character Recognition (OCR), specifically focusing on text recognition tasks. Unlike their well-studied counterparts in NLP, scaling laws in OCR had remained uninvestigated until now. The authors undertake extensive empirical research to provide concrete evidence of how model performance correlates with model size, data volume, and computational investment in OCR systems, particularly for scene text recognition.

Key to this paper is the construction of a novel, large-scale dataset named REBU-Syn, which pairs 6 million real samples with 18 million synthetic ones, providing a robust environment for experimentation. Combining real and synthetic data lets the model benefit from a diverse and rich data distribution.

The experiments employ state-of-the-art architectures such as TrOCR and PARSeq, extending them to larger models with parameter sizes ranging from 22 million to 1 billion. The paper confirms that there exists a smooth power law in OCR: model performance follows predictable patterns of improvement with proportional increases in model size, data, and computation. Notably, the authors report achieving a top-1 average accuracy of 97.42% on standard benchmarks, setting a new standard in the field.
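
To make the reported relationship concrete, the sketch below fits such a power law, E(N) = a * N^b, between model size N and error E in log-log space. It is illustrative only: the data points are invented, and this is not the authors' code or their measured values.

```python
import numpy as np

# Hypothetical (parameter count, error rate) pairs for illustration only;
# these are not the paper's measurements.
model_sizes = np.array([22e6, 86e6, 300e6, 1e9])
error_rates = np.array([0.082, 0.061, 0.047, 0.036])

# A power law E(N) = a * N^b is linear in log-log space:
# log E = log a + b * log N, so an ordinary least-squares line fit suffices.
b, log_a = np.polyfit(np.log(model_sizes), np.log(error_rates), 1)
a = np.exp(log_a)

print(f"Fitted scaling law: E(N) = {a:.3g} * N^({b:.3f})")
```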

Several key observations emerge. Large models prove more sample-efficient than their smaller counterparts, reaching lower error rates from the same amount of training data. The paper also highlights the significance of data composition in training regimes: an optimal mix of real and synthetic data is crucial for improved performance (see the sketch below). Furthermore, aligning pretraining with the OCR task matters: task-specific pretraining is more effective than generic image-centric pretraining.
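
The sketch below illustrates one way to realize such a mix: draw each batch from the real and synthetic pools at a fixed ratio. The generator and its defaults are assumptions for illustration, not the authors' training pipeline; the 1:3 default mirrors REBU-Syn's 6 million real to 18 million synthetic composition.

```python
import random

def mixed_batches(real_pool, synthetic_pool, real_fraction=0.25, batch_size=256):
    """Yield training batches with a fixed real/synthetic mix.

    Hypothetical sketch, not the authors' pipeline. real_fraction=0.25
    mirrors REBU-Syn's 1:3 real-to-synthetic composition.
    """
    n_real = int(batch_size * real_fraction)
    while True:
        batch = (random.sample(real_pool, n_real)
                 + random.sample(synthetic_pool, batch_size - n_real))
        random.shuffle(batch)  # interleave real and synthetic samples
        yield batch
```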

The implications of these findings are manifold. Practically, the paper offers guidelines for constructing more effective OCR systems: scaling laws can serve as a heuristic for allocating resources, balancing data, computation, and model size to reach a desired outcome efficiently (see the sketch below). Theoretically, this work substantiates the hypothesis of power laws in OCR, aligning it with established scaling laws in the NLP and computer vision domains.
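
For instance, a fitted law can be inverted to estimate the model size a target error level would demand. The coefficients below are placeholders, not values reported in the paper:

```python
def required_model_size(target_error, a, b):
    """Invert E(N) = a * N**b to the model size N that reaches target_error."""
    # a * N**b = target_error  =>  N = (target_error / a) ** (1 / b)
    return (target_error / a) ** (1.0 / b)

# With placeholder coefficients a=0.2 and b=-0.1, an error of 2.58%
# (i.e. 97.42% top-1 accuracy) maps to roughly 0.8 billion parameters.
print(f"{required_model_size(0.0258, a=0.2, b=-0.1):.3g} parameters")
```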

These contributions fit a broader narrative in which scaling laws act as fundamental descriptions of how model capacity and data interact. Given the substantial performance improvements demonstrated, future work may explore similar scaling phenomena in more challenging OCR tasks, such as handwriting recognition or historical document transcription.

In conclusion, this paper sets a benchmark for exploring scaling laws in OCR, grounding its claims in empirical evidence and opening opportunities for further research in optimizing OCR technology. Researchers and developers stand to gain a deeper understanding of the dynamics that govern OCR performance, aiding the creation of increasingly proficient recognition systems. The systematic and thorough approach taken by the authors makes a significant contribution to both the academic and practical landscapes of OCR research.

Authors (5)
  1. Miao Rang
  2. Zhenni Bi
  3. Chuanjian Liu
  4. Yunhe Wang
  5. Kai Han