An Empirical Study of Scaling Law for OCR (2401.00028v3)
Abstract: Scaling laws relating model size, data volume, computation, and performance have been studied extensively in NLP, but they have not yet been investigated for Optical Character Recognition (OCR). To address this, we conducted comprehensive studies examining the correlation between performance and the scale of models, data volume, and computation in text recognition. The study demonstrates smooth power laws between performance and model size, as well as training data volume, when other influencing factors are held constant. Additionally, we constructed a large-scale dataset called REBU-Syn, comprising 6 million real samples and 18 million synthetic samples. Based on our scaling law and new dataset, we trained a scene text recognition model that achieves a new state-of-the-art on 6 common test benchmarks with a top-1 average accuracy of 97.42%. The models and dataset are publicly available at https://github.com/large-ocr-model/large-ocr-model.github.io.
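The "smooth power law" the abstract refers to has the form error ≈ a · N^b, where N is model size (or training data volume) and b < 0. A minimal sketch of how such a law can be fit is shown below; the coefficients and size values are hypothetical placeholders for illustration, not numbers from the paper.

```python
import numpy as np

# Hypothetical power-law coefficients (NOT values from the paper):
# error = a * N**b, with N = model parameter count and b < 0.
a_true, b_true = 2.0, -0.25
sizes = np.array([1e6, 1e7, 1e8, 1e9])   # synthetic model sizes
errors = a_true * sizes ** b_true        # noise-free synthetic error rates

# A power law is linear in log-log space:
#   log(error) = log(a) + b * log(N),
# so a least-squares line fit on the logs recovers the exponent b
# and the prefactor a.
b_fit, log_a_fit = np.polyfit(np.log(sizes), np.log(errors), 1)
a_fit = np.exp(log_a_fit)

print(f"a ≈ {a_fit:.3f}, b ≈ {b_fit:.3f}")  # recovers a ≈ 2.0, b ≈ -0.25
```

In practice, noisy measurements at each scale replace the synthetic error values, and the fitted exponent b quantifies how quickly error decreases as model size or data volume grows, holding the other factors constant as the abstract describes.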