Traditional Chinese Synthetic Datasets Verified with Labeled Data for Scene Text Recognition (2111.13327v2)

Published 26 Nov 2021 in cs.CV

Abstract: Scene text recognition (STR) has been widely studied in academia and industry. Training a text recognition model often requires a large amount of labeled data, but data labeling can be difficult, expensive, or time-consuming, especially for Traditional Chinese text recognition. To the best of our knowledge, public datasets for Traditional Chinese text recognition are lacking. This paper presents a framework for a Traditional Chinese synthetic data engine that aims to improve text recognition model performance. We generated over 20 million synthetic samples and collected over 7,000 manually labeled samples, TC-STR 7k-word, as a benchmark. Experimental results show that a text recognition model can achieve much better accuracy either by training from scratch with our generated synthetic data or by further fine-tuning with TC-STR 7k-word.

Authors (4)
  1. Yi-Chang Chen (14 papers)
  2. Yu-Chuan Chang (3 papers)
  3. Yen-Cheng Chang (4 papers)
  4. Yi-Ren Yeh (6 papers)
Citations (4)