Synthetic Japanese OCR Dataset
- Synthetic Japanese OCR Dataset is an artificially generated collection of Japanese text images paired with transcriptions that serves as a controlled benchmark for OCR and multimodal model evaluation.
- It leverages programmatic text generation and real-world PDF extraction to capture diverse layouts, vertical and horizontal writing, and a wide range of fonts for comprehensive model testing.
- The dataset enables detailed performance analysis with metrics like CER and BLEU-1, while highlighting transferability challenges between synthetic and real-world scanned documents.
A synthetic Japanese Optical Character Recognition (OCR) dataset is an artificially generated collection of Japanese text images and corresponding transcriptions, engineered to facilitate the development, fine-tuning, and systematic evaluation of OCR and multimodal models on Japanese language documents. Such datasets serve as critical benchmarks and training resources, particularly for testing model generalization to specific typographical conventions unique to Japanese, such as vertical writing, diverse scripts (kanji, hiragana, katakana), and a wide variety of document layouts. Recent advances in vision–language modeling and Japanese multimodal benchmarks have underscored the importance of high-fidelity synthetic OCR datasets for accelerating progress in Japanese document understanding (Baek et al., 20 Feb 2025, Sasagawa et al., 19 Nov 2025).
1. Rationale and Design Principles
The synthesis of Japanese OCR datasets addresses the limited availability of high-coverage, annotated real-world Japanese document corpora suitable for optical character recognition tasks. Traditional approaches relying on translated English datasets or manual annotation fail to capture the complexities of Japanese orthography, vertical writing, and font diversity. A synthetic pipeline offers precise control over layout, font distribution, and textual variability, enabling systematic analysis and model improvement. In the context of Large Multimodal Models (LMMs) and Multimodal LLMs (MLLMs), synthetic OCR benchmarks provide controlled testbeds for evaluating reading capabilities under both horizontal and vertical conventions and for diagnosing transferability to real-world scanned documents (Sasagawa et al., 19 Nov 2025).
2. Dataset Construction Pipelines
Two canonical methodologies define recent synthetic Japanese OCR dataset construction:
A. Synthetic Rendering from LLM-Generated Text
This approach, exemplified by the JSSODa dataset (Sasagawa et al., 19 Nov 2025), involves programmatic text generation and image synthesis:
- Text Generation: Paragraph-scale Japanese texts are synthesized by prompting an LLM (LLM-jp-3.1-instruct) with nouns drawn from a lexical resource (e.g., the JUMAN dictionary), filtering for outputs between 100 and 3,000 characters.
- Layout and Sampling: Generated texts are sorted by length and mapped to 1–4 column configurations. Each configuration is rendered in both horizontal and vertical writing, resulting in eight equally proportioned "layout types."
- Image Rendering: The Pillow library renders text character by character on blank white backgrounds, selecting randomly from ~200 Japanese font files. No synthetic noise, distortion, or compression artifacts are applied; the dataset is intentionally "clean." A minimal rendering sketch follows this list.
- Annotation and Structure: Each rendered image is paired with a plaintext transcription. No bounding box or character-level annotations are provided.
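The rendering step can be illustrated with a short Pillow sketch. This is a minimal, hypothetical approximation of the pipeline rather than the released generation script: the font directory, canvas size, margins, and naive line-wrapping policy are assumptions, and only a single-column horizontal layout is shown.

```python
import random
from pathlib import Path
from PIL import Image, ImageDraw, ImageFont

FONT_DIR = Path("fonts_ja")   # hypothetical directory of Japanese .ttf/.otf files
CANVAS = (1024, 1448)         # assumed roughly A4-proportioned white page
FONT_SIZE = 28
MARGIN = 60

def render_horizontal(text: str, out_path: str) -> None:
    """Render text character by character on a blank white page (single column, horizontal)."""
    font_path = random.choice(sorted(FONT_DIR.glob("*.[ot]tf")))  # random font per image
    font = ImageFont.truetype(str(font_path), FONT_SIZE)
    img = Image.new("RGB", CANVAS, "white")
    draw = ImageDraw.Draw(img)

    x, y = MARGIN, MARGIN
    for ch in text:
        if ch == "\n" or x > CANVAS[0] - MARGIN:
            x, y = MARGIN, y + int(FONT_SIZE * 1.4)  # naive line wrap
            if ch == "\n":
                continue
        draw.text((x, y), ch, font=font, fill="black")
        x += FONT_SIZE  # fixed advance for simplicity; real glyph widths vary

    img.save(out_path)

render_horizontal("合成日本語OCRデータセットの例。", "sample_horizontal.png")
```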
B. Automated Extraction from Real-World PDFs
Baek et al. (Baek et al., 20 Feb 2025) detail a semi-synthetic approach where image-text pairs are mined from a large corpus of Japanese PDF documents:
- PDF Layout Analysis: Source PDFs (≤5 pages) from the National Diet Library corpus are parsed using PyMuPDF, filtering for pages with embedded images.
- Region Proposal and OCR: High-resolution JPEGs of each page are analyzed by Surya, which proposes bounding boxes labeled "text" or "image." Cropped text regions are fed to Surya’s transformer–CNN OCR backbone trained on 90+ languages.
- Vision–Language Pairing: For each page, image regions and corresponding OCR text are embedded using Japanese-Cloob (a CLIP ViT-B/16 variant), and pairs are selected by maximum cosine similarity; a minimal sketch of this step appears after this list.
- Filtering and Enhancement: Pairs with insufficient image/text size or non-Japanese labeling are removed. GPT-4o-mini is used for NSFW/PII filtering, to optionally synthesize "PDF-style" rewrites of each text region, and to produce multimodal instruction-tuning data.
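The extraction side can be sketched in two pieces: page filtering and rasterization with PyMuPDF, and greedy cosine-similarity pairing of precomputed embeddings. The Surya region-proposal/OCR calls and the Japanese-Cloob encoder invocation are intentionally elided (their exact APIs are not specified here), so the pairing function simply assumes embedding matrices as inputs.

```python
import numpy as np
import fitz  # PyMuPDF

def pages_with_images(pdf_path: str, dpi: int = 200):
    """Yield (page_number, PNG bytes) for pages that embed at least one image."""
    doc = fitz.open(pdf_path)
    zoom = dpi / 72.0
    for page in doc:
        if not page.get_images(full=True):  # skip pages without embedded images
            continue
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        yield page.number, pix.tobytes("png")

def pair_by_cosine(image_embs: np.ndarray, text_embs: np.ndarray) -> list[tuple[int, int]]:
    """Greedy image-text pairing by maximum cosine similarity.

    image_embs: (n_images, d) embeddings of cropped image regions (e.g. Japanese-Cloob).
    text_embs:  (n_texts, d) embeddings of OCR'd text regions.
    Returns (image_index, best_text_index) pairs.
    """
    a = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    b = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = a @ b.T  # (n_images, n_texts) cosine-similarity matrix
    return [(i, int(np.argmax(sims[i]))) for i in range(sims.shape[0])]
```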
3. Dataset Statistics and Analysis
JSSODa Synthetic Dataset
| Property | Value/Range | Notes |
|---|---|---|
| Total images | 22,493 | Across 8 layout types (horizontal/vertical × 1–4 columns) |
| Text length per image | 100–3,000 chars | Mean ~706 chars (train/test) |
| Fonts | ~200 unique | Sourced from Google Fonts and free-fonts.jp |
| Splits | 17,991/2,256/2,246 | Train/val/test |
| Annotations | Full transcript (UTF-8) | No bounding boxes |
PDF-derived OCR Dataset
| Property | Value/Range | Notes |
|---|---|---|
| PDFs processed | 200,000 first pages | From National Diet Library of Japan |
| Initial candidate pairs | ~400,000–600,000 | 2–3 per page (varies) |
| Filtered high-confidence pairs | ~300,000 | After size/language filtering |
| Instruction-tuning pairs | 362,000 | Q&A or instruction–response format |
| Text length/image resolution | ~5–200 chars, 100–800 px | Means/variances not published |
The two datasets cover complementary aspects of linguistic and layout diversity: JSSODa offers deterministic ground truth, while the PDF-derived dataset achieves broader authenticity at the cost of some residual OCR noise.
4. Evaluation Metrics and Benchmarks
Standard OCR evaluation metrics are employed for synthetic Japanese OCR datasets, particularly for vertical and horizontal reading:
- Character Error Rate (CER): $\mathrm{CER} = (S + D + I)/N$, where $S$ is the number of character substitutions, $D$ the number of deletions, $I$ the number of insertions, and $N$ the total number of reference characters. A minimal implementation sketch follows this list.
- BLEU-1 Score (character-level): Calculated via SacreBLEU with NFKC normalization and whitespace handling.
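For concreteness, CER can be computed as below with a standard character-level Levenshtein distance and NFKC normalization; this is a standalone, standard-library sketch rather than the benchmark's official scoring script (which additionally reports BLEU-1 via SacreBLEU).

```python
import unicodedata

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: (S + D + I) / N, computed via Levenshtein distance."""
    ref = unicodedata.normalize("NFKC", reference)
    hyp = unicodedata.normalize("NFKC", hypothesis)
    # One-row dynamic-programming edit distance over characters.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,              # deletion (reference char missing)
                        dp[j - 1] + 1,          # insertion (extra hypothesis char)
                        prev_diag + (r != h))   # substitution, or match if equal
            prev_diag = cur
    return dp[-1] / max(len(ref), 1)

print(cer("縦書きの日本語", "縦書の日本語"))  # one deletion over 7 chars -> ~0.143
```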
JSSODa supports the benchmarking of model fine-tuning effects:
- On JSSODa test (vertical writing): Qwen2.5-VL-7B (raw) CER is 26.8%; after fine-tuning, CER declines dramatically to 0.104% and BLEU-1 rises to 99.8%.
- On real-world vertical (VJRODa): fine-tuning on clean synthetic data alone does not confer equivalent gains (CER rises to 65.1%), suggesting limits in transferability from synthetic to real OCR scenarios (Sasagawa et al., 19 Nov 2025).
Baek et al. (Baek et al., 20 Feb 2025) do not report CER or WER for their PDF-derived dataset, noting the absence of ground-truth transcriptions.
5. Applications
Synthetic Japanese OCR datasets have several critical applications:
- Model Pretraining and Fine-tuning: They enable domain-shift adaptation for OCR, support vision–language pretraining, and impart vertical-writing reading capability to LMMs and MLLMs.
- Benchmarking: Used as controlled testbeds to identify breakdowns in OCR pipeline performance, especially across different writing orientations and layout complexities.
- Instruction Tuning: PDF-derived datasets augmented via LLMs support multimodal instruction-following tasks, providing Q&A or contextualized response generation grounded in visual text regions.
- Error Analysis and Robustness Research: Facilitate diagnosis of model failures in recognizing Japanese scripts, parsing multicolumn layouts, or generalizing from synthetic to real document aesthetics.
6. Release, Distribution, and Limitations
- JSSODa and all related code are publicly available at https://github.com/LLM-jp/eval_vertical_ja. The dataset is organized into train/val/test splits with associated image and transcript files, plus rendering scripts for synthetic extension. Metadata is provided in JSON with per-image attributes (layout, columns, font, char_count, filename, transcript); an illustrative record appears after this list.
- PDF-derived datasets are to be provided in a HuggingFace-compatible format upon acceptance, featuring tuples of image crop, OCR text, optional synthetic prose, and instruction data (Baek et al., 20 Feb 2025). Licensing is permissive for research use, with final terms to be determined at publication.
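For orientation, a JSSODa-style metadata record might look like the snippet below. The field names follow the description above; the values and file name are purely illustrative, not taken from the released files.

```python
import json

record = {
    "filename": "train_000123.png",      # hypothetical example values throughout
    "layout": "vertical",
    "columns": 3,
    "font": "NotoSerifJP-Regular",
    "char_count": 712,
    "transcript": "（画像に描画された本文の全文がここに入る）",
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```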
Notably, both datasets forgo classical image augmentations (blur, noise, distortion); JSSODa is "clean" by design to isolate orientation effects, while the PDF-derived data carries only the degradation already present in its source documents. This suggests that improving real-world generalization may require future work on synthetic noise and other document perturbations.
7. Current Limitations and Future Directions
The main limitations observed are:
- Gap in Transferability: Models fine-tuned on clean synthetic data (JSSODa) show improvement on synthetic test sets but can perform worse on real-world scanned documents, especially for vertically written text, highlighting a domain shift (Sasagawa et al., 19 Nov 2025).
- Annotation Scope: Absence of bounding box or character-level ground truth in JSSODa restricts analysis to whole-image transcription accuracy.
- Incomplete Evaluation Statistics: Some key statistics (means, variances for text/image size) and OCR system metrics (CER/WER for PDF-mined data) are not published (Baek et al., 20 Feb 2025).
Future developments may address hybrid synthetic/real pipelines, augmentation strategies, expansion to additional scripts and layouts, and refined annotation schemes to close the gap between synthetic and real-world Japanese OCR system performance.