600K-KS-OCR Dataset for Kashmiri OCR
- 600K-KS-OCR Dataset is a comprehensive synthetic corpus of about 602K word-level images designed to advance OCR research on the Kashmiri script.
- Data augmentation and diverse typefaces—including Naskh, Nastaleeq, and Nakash—simulate real-world document degradations for robust model training.
- Multi-format annotations compatible with CRNN and Transformer models enable seamless integration into various OCR training and evaluation pipelines.
The 600K-KS-OCR Dataset is a large-scale synthetic corpus of approximately 602,000 word-level, pre-segmented images designed to advance optical character recognition (OCR) research on the Kashmiri script, a modified Perso-Arabic system used to write an endangered Dardic language spoken by an estimated seven million people. The dataset addresses the acute resource gap for Kashmiri OCR by delivering ground-truth transcriptions in multiple formats compatible with both convolutional recurrent neural network (CRNN) and Transformer-based architectures such as TrOCR. Data augmentation strategies emulate real-world document degradations, while a diverse set of traditional typefaces and background textures provides visual variance. The entire corpus is released under a permissive Creative Commons Attribution 4.0 (CC-BY-4.0) license, facilitating open research and practical development for low-resource script digitization and preservation (Malik, 3 Jan 2026).
1. Corpus Structure and Distribution
The dataset comprises approximately 602,000 images, each showing a single Kashmiri word rendered at 256×64 pixels in RGB color and stored as a lossless PNG. For efficient distribution, the data is partitioned into ten ZIP archives (P1 through P10), each under roughly 1.5 GB. Each archive contains:
- images/: directory of word-level PNG files named sequentially (e.g., image_000001.png).
- data.csv: filename-text pairs (CSV format).
- data.jsonl: JSON Lines for TrOCR-style ingestion.
- labels.txt: CRNN (Connectionist Temporal Classification, CTC) format with tab-separated filename and label.
- metadata.json: archive-level metadata, including counts, font names, and augmentation proportions.
The sample distribution across partitions is as follows:
| Partition | # Samples | % of Total |
|---|---|---|
| P1_OCR_dataset | 50,000 | 8.3% |
| P2_OCR_dataset | 53,815 | 8.9% |
| P3_OCR_dataset | 68,741 | 11.4% |
| P4_OCR_dataset | 69,886 | 11.6% |
| P5_OCR_dataset | 69,637 | 11.6% |
| P6_OCR_dataset | 69,506 | 11.5% |
| P7_OCR_dataset | 58,228 | 9.7% |
| P8_OCR_dataset | 35,720 | 5.9% |
| P9_OCR_dataset | 86,635 | 14.4% |
| P10_OCR_dataset | 41,401 | 6.9% |
| Total | ≈602,000 | 100% |
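As a quick arithmetic check on the table above, the following sketch tallies the listed per-partition counts and recomputes each share. The dictionary simply copies the table; the helper name `share` is illustrative. Summing the listed counts gives 603,569, slightly above the rounded total of about 602,000, and the published percentages appear to be computed against the rounded figure.

```python
# Sample counts copied from the partition table.
partition_counts = {
    "P1_OCR_dataset": 50_000,
    "P2_OCR_dataset": 53_815,
    "P3_OCR_dataset": 68_741,
    "P4_OCR_dataset": 69_886,
    "P5_OCR_dataset": 69_637,
    "P6_OCR_dataset": 69_506,
    "P7_OCR_dataset": 58_228,
    "P8_OCR_dataset": 35_720,
    "P9_OCR_dataset": 86_635,
    "P10_OCR_dataset": 41_401,
}

total = sum(partition_counts.values())


def share(count, denominator):
    """Percentage of `denominator` represented by `count`, to one decimal."""
    return round(100 * count / denominator, 1)


# Shares relative to the exact sum of the listed counts.
shares = {name: share(n, total) for name, n in partition_counts.items()}
largest = max(partition_counts, key=partition_counts.get)

print(f"sum of listed counts: {total:,}")
print(f"largest partition: {largest} ({shares[largest]}%)")
```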
2. Typeface and Script Design
The dataset covers three traditional Kashmiri typefaces, each comprising approximately one-third of the data and representing distinct textual domains:
- Afan Koshur Naksh (Naskh style): Emulates book-print letterforms with clear structure.
- Nastaleeq: Uses slanted, flowing strokes and a diagonal baseline, characteristic of calligraphic manuscripts.
- Nakash (Narqalam): Simulates natural handwriting with variable stroke width and slant.
Representative samples include:
- Afan_Koshur_Naksh: image_012345.png → "سٔلام", image_045678.png → "کتاب"
- Nastaleeq: image_102345.png → "محبت", image_156789.png → "زبان"
- Nakash: image_202345.png → "دوست", image_256789.png → "درس"
This font diversity is critical for robust OCR model generalization across print and handwritten-style inputs.
3. Image Rendering and Augmentation Pipeline
Synthetic sample creation follows a deterministic pipeline:
- Base layer: 256×64 px background, white or textured.
- Text overlay: Black (#000000), right-to-left Kashmiri string in one font.
- Composite: Stored as PNG.
Augmentation is systematically applied to 60% of samples; 40% remain clean. Transformations include:
- Geometric: Rotation (θ ∈ [–5°, +5°]), perspective warp (3×3 homography, ±0.02 corner noise), skew (s ∈ [–0.1, +0.1]).
- Blur: Gaussian blur (σ ∈ [0.5, 1.5]), motion blur (L ∈ [5, 15] px, angle φ ∈ [0°, 360°]).
- Noise: Additive Gaussian (μ=0, σ²=(5/255)²), salt-and-pepper (p ∈ [0.001, 0.005]).
- Photometric: Brightness (b ∈ [0.8, 1.2]), contrast (c ∈ [0.8, 1.2]), simulated JPEG artifacts (q ∈ [30, 80]), scaling (r ∈ [0.9, 1.1]).
- Document-specific: Paper textures (18 total), shadow/gradient (α∈[0.3,0.7]), ink bleed (morphological dilation 1–2 px + blur).
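The parameter ranges listed above can be collected into a small sampler. This is a minimal sketch and not the dataset author's pipeline: the function name, the dictionary keys, and the choice to draw every parameter jointly for each augmented sample are assumptions, since the source does not say whether all transformations or only a subset are applied to a given image. Only the 60/40 split and the numeric ranges come from the text.

```python
import random


def sample_augmentation(rng, augment_prob=0.6):
    """Return None for a clean sample, else a dict of sampled parameters.

    Key names are hypothetical; ranges follow the list above.
    """
    if rng.random() >= augment_prob:
        return None  # roughly 40% of samples stay clean
    return {
        # Geometric
        "rotation_deg": rng.uniform(-5.0, 5.0),
        "perspective_corner_noise": [rng.uniform(-0.02, 0.02) for _ in range(8)],
        "skew": rng.uniform(-0.1, 0.1),
        # Blur
        "gaussian_blur_sigma": rng.uniform(0.5, 1.5),
        "motion_blur_len_px": rng.randint(5, 15),
        "motion_blur_angle_deg": rng.uniform(0.0, 360.0),
        # Noise
        "gaussian_noise_sigma": 5 / 255,
        "salt_pepper_p": rng.uniform(0.001, 0.005),
        # Photometric
        "brightness": rng.uniform(0.8, 1.2),
        "contrast": rng.uniform(0.8, 1.2),
        "jpeg_quality": rng.randint(30, 80),
        "scale": rng.uniform(0.9, 1.1),
        # Document-specific
        "texture_id": rng.randrange(18),
        "shadow_alpha": rng.uniform(0.3, 0.7),
        "ink_bleed_px": rng.randint(1, 2),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
configs = [sample_augmentation(rng) for _ in range(10_000)]
augmented_fraction = sum(c is not None for c in configs) / len(configs)
print(f"augmented fraction: {augmented_fraction:.2f}")
```

Separating parameter sampling from the image operations makes it easy to log the exact configuration used for each sample, whichever image library performs the transformations.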
Background textures are blended with the rendered text by per-pixel mixing.
Texture categories comprise pure white, aged parchment, antique book paper, notebook/ledger, newspaper grain, distressed effects, and 14 additional custom backgrounds.
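The exact blending formula is not given here. One common reading of per-pixel mixing is a convex combination of the rendered text image and the texture, sketched below on grayscale values in [0, 1]. The function names and the convention that `alpha` weights the text layer are assumptions made for illustration.

```python
def blend_pixel(text_value, texture_value, alpha):
    """Convex combination: alpha weights the text layer, (1 - alpha) the texture."""
    return alpha * text_value + (1.0 - alpha) * texture_value


def blend(text_img, texture_img, alpha):
    """Blend two equally sized images given as nested lists of floats in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [blend_pixel(t, b, alpha) for t, b in zip(text_row, texture_row)]
        for text_row, texture_row in zip(text_img, texture_img)
    ]


# One row with an ink pixel (0.0) and a blank pixel (1.0) over a light texture.
text_img = [[0.0, 1.0]]
texture_img = [[0.8, 0.8]]
blended = blend(text_img, texture_img, alpha=0.75)
print(blended)
```

With this convention, a higher `alpha` keeps the ink darker and the texture fainter; a multiplicative blend would be another plausible choice.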
4. Ground-Truth Annotation Formats
Each archive supplies four annotation files: three redundant, model-agnostic transcription formats and one metadata file:
- CRNN (labels.txt): image_name<TAB>Kashmiri_text (for CTC models).
- TrOCR (data.jsonl): JSON objects: {"file_name":..., "text":...}
- CSV (data.csv): filename,text (two-column CSV).
- Metadata (metadata.json): Key-value pairs recording archive configuration (e.g., number of clean vs. augmented, fonts used).
These multi-format transcriptions ensure compatibility with a wide spectrum of OCR training pipelines and evaluation regimes.
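The three transcription formats can be read with the Python standard library alone. The sketch below parses in-memory samples of each and checks that they yield the same pairs. The sample file names and the presence of a `filename,text` header row in data.csv are assumptions; the `file_name` and `text` JSON keys follow the format description above.

```python
import csv
import io
import json


def parse_labels_txt(text):
    """CRNN/CTC style: one 'filename<TAB>label' pair per line."""
    pairs = []
    for line in text.splitlines():
        if line.strip():
            name, label = line.split("\t", 1)
            pairs.append((name, label))
    return pairs


def parse_jsonl(text):
    """TrOCR style: one JSON object per line with file_name and text keys."""
    records = (json.loads(line) for line in text.splitlines() if line.strip())
    return [(rec["file_name"], rec["text"]) for rec in records]


def parse_csv(text):
    """Two-column CSV; assumes a 'filename,text' header row."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["filename"], row["text"]) for row in reader]


# Illustrative samples; real archives hold one entry per image.
labels_txt = "image_000001.png\tکتاب\nimage_000002.png\tزبان\n"
data_jsonl = (
    '{"file_name": "image_000001.png", "text": "کتاب"}\n'
    '{"file_name": "image_000002.png", "text": "زبان"}\n'
)
data_csv = "filename,text\nimage_000001.png,کتاب\nimage_000002.png,زبان\n"

pairs = parse_labels_txt(labels_txt)
print(pairs == parse_jsonl(data_jsonl) == parse_csv(data_csv))
```

When reading the real files from disk, opening them with `encoding="utf-8"` avoids platform-dependent decoding of the Kashmiri text.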
5. Dataset Compatibility and Research Utility
600K-KS-OCR supports direct ingestion by standard OCR frameworks:
- CRNN-based OCR (CTC loss)
- Transformer encoder-decoder OCR (TrOCR and similar)
- Generic machine learning pipelines via CSV/JSONL
- Hugging Face Datasets:
load_dataset("Omarrran/600k_KS_OCR_Word_Segmented_Dataset")
Primary use cases include:
- End-to-end Kashmiri OCR training
- Benchmarking algorithms for low-resource scripts
- Transfer learning across Perso-Arabic script families
- Manuscript, newspaper, and archival document digitization
- Computational preservation and improved accessibility of endangered-language texts
6. Licensing, Access, and Statistical Summary
The dataset is published under a Creative Commons Attribution 4.0 International (CC-BY-4.0) license, which permits use, modification, and redistribution, including for commercial purposes, provided that attribution is given. Distribution occurs via the Hugging Face Datasets Hub: https://huggingface.co/datasets/Omarrran/600k_KS_OCR_Word_Segmented_Dataset.
A tabular digest of key dataset properties is provided:
| Statistic | Value |
|---|---|
| Total images | ~602,000 |
| Image size | 256×64 px |
| Fonts | 3 (~200,000 images each) |
| Clean vs. Augmented | 40% / 60% |
| Augmentation types | Geometric, blur, noise, photometric, document effects |
| Background variations | 18 textures |
7. Notes on Usage, Recommendations, and Evaluation
For stable convergence and robustness, initial training on the clean subset (40%) is recommended, followed by fine-tuning with augmented samples. If downstream data (e.g., handwriting-heavy documents) diverges substantially from the synthetic data, domain adaptation or enrichment with additional authentic samples is advised. No baseline OCR accuracy is provided; standard evaluation protocols recommend reporting Character Error Rate (CER) and Word Error Rate (WER) on held-out real-world Kashmiri test sets. The dataset is packaged for immediate download, ingestion, and model training to accelerate OCR research on the low-resource Kashmiri script (Malik, 3 Jan 2026).
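CER and WER are both edit-distance ratios, which the following sketch computes with a plain Levenshtein distance. It is a minimal illustration, not an official evaluation script: the function names are illustrative, and counting characters as Unicode code points is an assumption (Kashmiri uses combining diacritics, so a real evaluation may first normalize the text or count grapheme clusters instead).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Character Error Rate: total character edits / total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)


def wer(refs, hyps):
    """Word Error Rate: total word edits / total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return errors / sum(len(r.split()) for r in refs)


# Two word-level samples; the second prediction drops its final letter.
refs = ["کتاب", "زبان"]
hyps = ["کتاب", "زبا"]
print(f"CER = {cer(refs, hyps):.3f}, WER = {wer(refs, hyps):.3f}")
```

For word-level images such as these, WER reduces to the fraction of words that are not recognized exactly, while CER gives partial credit for near misses.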