An Empirical Study of Scaling Law for OCR: A Comprehensive Analysis
This paper investigates the largely unexplored topic of scaling laws in Optical Character Recognition (OCR), with a specific focus on text recognition. Unlike their well-studied counterparts in NLP, scaling laws for OCR have until now received little empirical attention. The authors conduct extensive experiments to provide concrete evidence of how model performance correlates with model size, data volume, and computational investment in OCR systems, particularly for scene text recognition.
Central to the study is the construction of REBU-Syn, a novel large-scale dataset of 6 million real samples paired with 18 million synthetic ones, which together provide a robust testbed for experimentation. Combining real and synthetic data allows the models to benefit from a diverse and rich data landscape.
The experiments employ state-of-the-art architectures such as TrOCR and PARSeq, extending them to larger models with parameter counts ranging from 22 million to 1 billion. The paper confirms that a smooth power law holds in OCR: model performance improves predictably with proportional increases in model size, data, and computation. Notably, the authors report a top-1 average accuracy of 97.42% on standard benchmarks, establishing a new state of the art in the field.
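To make the "smooth power law" concrete, the sketch below fits a curve of the form E(N) = (N_c / N)^alpha to error rates measured at several model sizes by linear regression in log-log space. The data points and the fitted constants are illustrative placeholders, not values reported in the paper.

```python
import numpy as np

# Hypothetical (model size, word error rate) pairs -- illustrative
# placeholders, not the paper's measurements.
model_sizes = np.array([22e6, 85e6, 300e6, 1e9])
error_rates = np.array([0.080, 0.062, 0.048, 0.038])

# A power law E(N) = (N_c / N)**alpha is a straight line in log-log space:
#   log E = alpha * log N_c - alpha * log N
slope, intercept = np.polyfit(np.log(model_sizes), np.log(error_rates), deg=1)
alpha = -slope                     # power-law exponent
n_c = np.exp(intercept / alpha)    # characteristic scale

print(f"alpha = {alpha:.3f}, N_c = {n_c:.3g}")

# Extrapolate: predicted error for a hypothetical 2B-parameter model.
predicted = (n_c / 2e9) ** alpha
print(f"Predicted error at 2e9 params: {predicted:.4f}")
```

The same fitting procedure applies when the x-axis is data volume or training compute instead of parameter count; only the variable being scaled changes.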
Several key observations emerge. Large models are more sample-efficient than their smaller counterparts, reaching lower error with the same amount of training data. The paper also highlights the importance of data composition in training regimes: an appropriate mix of real and synthetic data is crucial for improved performance. Finally, aligning pretraining with the OCR task itself matters, as task-specific pretraining is more effective than generic image-centric pretraining. A minimal sketch of fixed-ratio data mixing follows below.
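To illustrate the data-composition point, here is a minimal sketch of drawing training batches with a fixed real-to-synthetic proportion. The function name and the `real_fraction` value are hypothetical placeholders; in practice the ratio would be tuned empirically, as the paper does for its real/synthetic mix.

```python
import random

def mixed_batch(real_pool, synth_pool, batch_size, real_fraction=0.25):
    """Draw a batch with a fixed proportion of real vs. synthetic samples.

    real_fraction is a hypothetical knob, not the ratio the authors report.
    """
    n_real = round(batch_size * real_fraction)
    n_synth = batch_size - n_real
    batch = random.sample(real_pool, n_real) + random.sample(synth_pool, n_synth)
    random.shuffle(batch)  # avoid ordering effects within the batch
    return batch

# Toy usage with placeholder sample identifiers.
real_pool = [f"real_{i}" for i in range(1000)]
synth_pool = [f"synth_{i}" for i in range(3000)]
print(mixed_batch(real_pool, synth_pool, batch_size=8))
```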
The implications of these findings are both practical and theoretical. Practically, the paper offers guidelines for building more effective OCR systems, suggesting that scaling laws can serve as a heuristic for allocating resources, balancing data, computation, and model size to reach a desired accuracy efficiently. Theoretically, the work supports the hypothesis of power-law behavior in OCR, aligning it with established scaling laws in NLP and computer vision.
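As one example of how a fitted scaling law could guide such resource decisions, the snippet below inverts the power law from the earlier sketch to estimate the model size needed to reach a target error rate. The constants are placeholder values of the kind the earlier illustrative fit would produce, not figures from the paper.

```python
# Invert E(N) = (N_c / N)**alpha to solve for the model size N that
# reaches a target error rate:  N = N_c * E_target**(-1 / alpha).
# alpha and n_c are illustrative placeholders, e.g. from the fit above.
alpha, n_c = 0.195, 52.4

def required_params(target_error, n_c, alpha):
    return n_c * target_error ** (-1.0 / alpha)

for target in (0.05, 0.04, 0.03):
    print(f"target error {target:.2f} -> ~{required_params(target, n_c, alpha):.3g} params")
```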
These contributions fit a broader narrative in which scaling laws are treated as fundamental descriptions of how models and data interact. Given the substantial performance improvements demonstrated, future work may explore similar scaling behavior in more challenging OCR tasks, such as handwriting recognition or historical document transcription.
In conclusion, this paper sets a benchmark for exploring scaling laws in OCR, grounding its claims in empirical evidence and opening opportunities for further research into optimizing OCR technology. Researchers and practitioners gain a deeper understanding of the dynamics that govern OCR performance, which should aid the development of increasingly capable recognition systems. The systematic and thorough approach taken by the authors makes a significant contribution to both the academic and practical landscapes of OCR research.