An Empirical Study of Scaling Law for OCR (2401.00028v3)
Abstract: Scaling laws relating model size, data volume, computation, and performance have been studied extensively in NLP, but they have not yet been investigated for Optical Character Recognition (OCR). To address this, we conducted comprehensive studies examining the correlation between performance and the scale of models, data volume, and computation in text recognition. The study demonstrates smooth power laws between performance and model size, as well as training data volume, when other influencing factors are held constant. Additionally, we constructed a large-scale dataset called REBU-Syn, comprising 6 million real samples and 18 million synthetic samples. Based on our scaling law and new dataset, we trained a scene text recognition model that achieves a new state-of-the-art on 6 common test benchmarks with a top-1 average accuracy of 97.42%. The models and dataset are publicly available at https://github.com/large-ocr-model/large-ocr-model.github.io.
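The "smooth power law" the abstract refers to has the form error ≈ a · N^b, where N is model size (or training data volume) and b < 0. A minimal sketch of how such a law can be fit is shown below; the coefficients and size values are hypothetical placeholders for illustration, not numbers from the paper.

```python
import numpy as np

# Hypothetical power-law coefficients (NOT values from the paper):
# error = a * N**b, with N = model parameter count and b < 0.
a_true, b_true = 2.0, -0.25
sizes = np.array([1e6, 1e7, 1e8, 1e9])   # synthetic model sizes
errors = a_true * sizes ** b_true        # noise-free synthetic error rates

# A power law is linear in log-log space:
#   log(error) = log(a) + b * log(N),
# so a least-squares line fit on the logs recovers the exponent b
# and the prefactor a.
b_fit, log_a_fit = np.polyfit(np.log(sizes), np.log(errors), 1)
a_fit = np.exp(log_a_fit)

print(f"a ≈ {a_fit:.3f}, b ≈ {b_fit:.3f}")  # recovers a ≈ 2.0, b ≈ -0.25
```

In practice, noisy measurements at each scale replace the synthetic error values, and the fitted exponent b quantifies how quickly error decreases as model size or data volume grows, holding the other factors constant as the abstract describes.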