
600K-KS-OCR Dataset for Kashmiri OCR

Updated 10 January 2026
  • 600K-KS-OCR Dataset is a comprehensive synthetic corpus of about 602K word-level images designed to advance OCR research on the Kashmiri script.
  • Data augmentation and diverse typefaces—including Naskh, Nastaleeq, and Nakash—simulate real-world document degradations for robust model training.
  • Multi-format annotations compatible with CRNN and Transformer models enable seamless integration into various OCR training and evaluation pipelines.

The 600K-KS-OCR Dataset is a large-scale synthetic corpus of approximately 602,000 word-level, pre-segmented images designed to advance optical character recognition (OCR) research on the Kashmiri script, a modified Perso-Arabic system used to write Kashmiri, an endangered Dardic language spoken by an estimated seven million people. The dataset addresses the acute resource gap for Kashmiri OCR by delivering ground-truth transcriptions in multiple formats compatible with both convolutional recurrent neural network (CRNN) and Transformer-based architectures such as TrOCR. Data augmentation strategies emulate real-world document degradations, while a diverse array of traditional typefaces and background textures provides visual variance. The entire corpus is released under a permissive Creative Commons Attribution 4.0 (CC-BY-4.0) license, facilitating open research and practical development for low-resource script digitization and preservation (Malik, 3 Jan 2026).

1. Corpus Structure and Distribution

The dataset comprises approximately 602,000 images, each representing a single Kashmiri word rendered at 256×64 pixels in RGB color space and stored as a lossless PNG file. For distribution, the data is partitioned into ten ZIP archives (P1 through P10), each under roughly 1.5 GB. Each archive contains:

  • Directory images/ with word-level PNG files named sequentially (e.g., image_000001.png).
  • data.csv: filename-text pairs (CSV format).
  • data.jsonl: JSON Lines for TrOCR-style ingestion.
  • labels.txt: tab-separated filename and label, the layout commonly used for CRNN training with Connectionist Temporal Classification (CTC) loss.
  • metadata.json: archive-level metadata, including counts, font names, and augmentation proportions.
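
As a minimal sketch of how a downloaded archive could be sanity-checked against the stated image properties (256×64 px, RGB, PNG), the following reads only the PNG header of each member in a ZIP file. The function names are illustrative and not part of the dataset's tooling, and the check assumes the images are stored as truecolor PNGs (color type 2) rather than palette-optimized files.

```python
import struct
import zipfile

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
EXPECTED_SIZE = (256, 64)  # (width, height) as stated in the text
RGB_COLOR_TYPE = 2         # PNG color type 2 = truecolor (RGB); an assumption


def png_header(data: bytes):
    """Parse (width, height, bit_depth, color_type) from the first 26 bytes of a PNG."""
    if len(data) < 26 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    return struct.unpack(">IIBB", data[16:26])


def find_unexpected_images(zip_source):
    """Return names of .png members that are not 256x64 RGB PNGs."""
    bad = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".png"):
                continue  # skip data.csv, data.jsonl, labels.txt, metadata.json
            with archive.open(name) as member:
                head = member.read(26)  # signature + IHDR fields are enough
            try:
                width, height, _depth, color_type = png_header(head)
            except ValueError:
                bad.append(name)
                continue
            if (width, height) != EXPECTED_SIZE or color_type != RGB_COLOR_TYPE:
                bad.append(name)
    return bad
```

A call such as `find_unexpected_images("P1_OCR_dataset.zip")` would return an empty list for a conforming archive; the `.zip` file name here is a guess based on the partition names.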

The sample distribution across partitions is as follows:

Partition          # Samples  % of Total
P1_OCR_dataset        50,000        8.3%
P2_OCR_dataset        53,815        8.9%
P3_OCR_dataset        68,741       11.4%
P4_OCR_dataset        69,886       11.6%
P5_OCR_dataset        69,637       11.6%
P6_OCR_dataset        69,506       11.5%
P7_OCR_dataset        58,228        9.7%
P8_OCR_dataset        35,720        5.9%
P9_OCR_dataset        86,635       14.4%
P10_OCR_dataset       41,401        6.9%
Total               ≈602,000        100%

2. Typeface and Script Design

The dataset covers three traditional Kashmiri typefaces, each comprising approximately one-third of the data and representing distinct textual domains:

  • Afan Koshur Naksh (Naskh style): Emulates book-print letterforms with clear structure.
  • Nastaleeq: Uses slanted, flowing lines and a diagonal baseline, characteristic of calligraphic manuscripts.
  • Nakash (Narqalam): Simulates natural handwriting with variable stroke width and slant.

Representative samples include:

  • Afan_Koshur_Naksh: image_012345.png → "سٔلام", image_045678.png → "کتاب"
  • Nastaleeq: image_102345.png → "محبت", image_156789.png → "زبان"
  • Nakash: image_202345.png → "دوست", image_256789.png → "درس"

This font diversity is critical for robust OCR model generalization across print and handwritten-style inputs.

3. Image Rendering and Augmentation Pipeline

Synthetic sample creation follows a deterministic pipeline:

  • Base layer: 256×64 px background, white or textured.
  • Text overlay: Black (#000000), right-to-left Kashmiri string in one font.
  • Composite: Stored as PNG.

Augmentation is systematically applied to 60% of samples; 40% remain clean. Transformations include:

  • Geometric: Rotation (θ ∈ [–5°, +5°]), perspective warp (3×3 homography, ±0.02 corner noise), skew (s ∈ [–0.1, +0.1]).
  • Blur: Gaussian blur (σ ∈ [0.5, 1.5]), motion blur (L ∈ [5, 15] px, angle φ ∈ [0°, 360°]).
  • Noise: Additive Gaussian (μ=0, σ²=(5/255)²), salt-and-pepper (p ∈ [0.001, 0.005]).
  • Photometric: Brightness (b ∈ [0.8, 1.2]), contrast (c ∈ [0.8, 1.2]), simulated JPEG artifacts (q ∈ [30, 80]), scaling (r ∈ [0.9, 1.1]).
  • Document-specific: Paper textures (18 total), shadow/gradient (α∈[0.3,0.7]), ink bleed (morphological dilation 1–2 px + blur).
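
The parameter ranges listed above can be sketched as a sampling routine. This is an illustration only: the source does not state how parameters are distributed within each range or which transformations are combined on a single image, so the sketch assumes uniform sampling and draws every parameter for each augmented sample. The function and key names are hypothetical.

```python
import random

AUGMENT_PROBABILITY = 0.60  # share of augmented samples stated in the text


def sample_augmentation(rng: random.Random):
    """Return None for a clean sample, else a dict of sampled parameters.

    Assumption: uniform sampling within each documented range.
    """
    if rng.random() >= AUGMENT_PROBABILITY:
        return None  # roughly 40% of samples stay clean
    return {
        # Geometric
        "rotation_deg": rng.uniform(-5.0, 5.0),
        "perspective_corner_noise": [rng.uniform(-0.02, 0.02) for _ in range(8)],
        "skew": rng.uniform(-0.1, 0.1),
        # Blur
        "gaussian_blur_sigma": rng.uniform(0.5, 1.5),
        "motion_blur_length_px": rng.randint(5, 15),
        "motion_blur_angle_deg": rng.uniform(0.0, 360.0),
        # Noise
        "gaussian_noise_sigma": 5 / 255,  # fixed: variance (5/255)^2, mean 0
        "salt_pepper_p": rng.uniform(0.001, 0.005),
        # Photometric
        "brightness": rng.uniform(0.8, 1.2),
        "contrast": rng.uniform(0.8, 1.2),
        "jpeg_quality": rng.randint(30, 80),
        "scale": rng.uniform(0.9, 1.1),
        # Document-specific
        "shadow_alpha": rng.uniform(0.3, 0.7),
        "ink_bleed_px": rng.randint(1, 2),
    }
```

Passing a seeded `random.Random` instance keeps the draws reproducible, which is convenient when regenerating a subset for debugging.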

Background texture blending employs per-pixel mixing:

$$I(x,y) = \alpha \cdot T_{\text{text}}(x,y) + (1-\alpha) \cdot B_{\text{texture}}(x,y), \quad \alpha \in [0.9, 1.0]$$

Texture categories comprise pure white, aged parchment, antique book paper, notebook/ledger, newspaper grain, distressed effects, and 14 additional custom backgrounds.
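
The per-pixel mixing rule above can be illustrated in a few lines. The sketch below works on grayscale images represented as nested lists of 0–255 values, which is a simplification chosen to avoid third-party dependencies; the actual pipeline operates on RGB images and its implementation is not described in the source. The function name is hypothetical.

```python
def blend_images(text_layer, texture, alpha):
    """Apply I = alpha*T + (1 - alpha)*B per pixel to two equal-sized images.

    Images are lists of rows of 0-255 intensities (grayscale simplification).
    """
    if not 0.9 <= alpha <= 1.0:
        raise ValueError("alpha is expected in [0.9, 1.0]")
    same_shape = len(text_layer) == len(texture) and all(
        len(t_row) == len(b_row) for t_row, b_row in zip(text_layer, texture)
    )
    if not same_shape:
        raise ValueError("layers must have the same shape")
    return [
        [
            # Round to the nearest integer and clamp to the valid 8-bit range.
            min(255, max(0, round(alpha * t + (1.0 - alpha) * b)))
            for t, b in zip(t_row, b_row)
        ]
        for t_row, b_row in zip(text_layer, texture)
    ]
```

With α close to 1 the text layer dominates, so the texture contributes at most 10% of each pixel's intensity under the stated range.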

4. Ground-Truth Annotation Formats

Each archive supplies transcriptions in three redundant, model-agnostic formats, alongside an archive-level metadata file:

  • CRNN (labels.txt): image_name<TAB>Kashmiri_text (for CTC models).
  • TrOCR (data.jsonl): JSON objects: {"file_name":..., "text":...}
  • CSV (data.csv): filename,text (two-column CSV).
  • Metadata (metadata.json): Key-value pairs recording archive configuration (e.g., number of clean vs. augmented, fonts used).

These multi-format transcriptions ensure compatibility with a wide spectrum of OCR training pipelines and evaluation regimes.
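
A minimal sketch of reading the three transcription files into the same list of (filename, text) pairs follows. The field names `file_name` and `text` and the tab-separated layout come from the descriptions above; whether `data.csv` begins with a `filename,text` header row is not stated, so the sketch skips such a row only if it is present. The function names are illustrative.

```python
import csv
import json


def read_labels_txt(lines):
    """Parse labels.txt lines of the form image_name<TAB>text."""
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        name, sep, text = line.partition("\t")
        if not sep:
            raise ValueError(f"missing tab separator: {line!r}")
        pairs.append((name, text))
    return pairs


def read_data_jsonl(lines):
    """Parse data.jsonl lines such as {"file_name": ..., "text": ...}."""
    pairs = []
    for line in lines:
        if line.strip():
            record = json.loads(line)
            pairs.append((record["file_name"], record["text"]))
    return pairs


def read_data_csv(lines):
    """Parse data.csv rows of filename,text; drops a header row if present."""
    rows = [row for row in csv.reader(lines) if row]
    if rows and rows[0] == ["filename", "text"]:  # assumed header, may be absent
        rows = rows[1:]
    return [(row[0], row[1]) for row in rows]
```

Each function accepts any iterable of strings, so an open file works directly, for example `read_labels_txt(open("labels.txt", encoding="utf-8"))`. For the CSV file, opening with `newline=""` is the usual recommendation for the `csv` module.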

5. Dataset Compatibility and Research Utility

600K-KS-OCR supports direct ingestion by standard OCR frameworks:

  • CRNN-based OCR (CTC loss)
  • Transformer encoder-decoder OCR (TrOCR and similar)
  • Generic machine learning pipelines via CSV/JSONL
  • Hugging Face Datasets: load_dataset("Omarrran/600k_KS_OCR_Word_Segmented_Dataset")
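
For a quick look at the data without downloading all ten archives, the Hugging Face `datasets` library supports streaming. The sketch below is hedged on two points that the source does not confirm: the split name `"train"` and the column names of the returned records. Both should be checked against the dataset card. The helper name is hypothetical.

```python
from itertools import islice

REPO_ID = "Omarrran/600k_KS_OCR_Word_Segmented_Dataset"  # identifier given in the text


def peek_samples(split="train", limit=3):
    """Return the first `limit` records, streamed rather than fully downloaded."""
    # Third-party dependency, imported lazily: pip install datasets
    from datasets import load_dataset

    # The split name "train" is an assumption; see the dataset card.
    stream = load_dataset(REPO_ID, split=split, streaming=True)
    return list(islice(stream, limit))
```

Printing `record.keys()` for each item returned by `peek_samples()` is a simple way to discover the actual column names before writing a training loop.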

Primary use cases include:

  • End-to-end Kashmiri OCR training
  • Benchmarking algorithms for low-resource scripts
  • Transfer learning across Perso-Arabic script families
  • Manuscript, newspaper, and archival document digitization
  • Computational preservation and improved accessibility of endangered-language texts

6. Licensing, Access, and Statistical Summary

The dataset is published under a Creative Commons Attribution 4.0 International (CC-BY-4.0) license, which permits use, modification, and redistribution, including for commercial purposes, provided that appropriate attribution is given. Distribution occurs via the Hugging Face Datasets Hub: https://huggingface.co/datasets/Omarrran/600k_KS_OCR_Word_Segmented_Dataset.

A tabular digest of key dataset properties is provided:

Statistic               Value
Total images            ~602,000
Image size              256×64 px
Images per font         ~200,000 (3 fonts)
Clean vs. augmented     40% / 60%
Augmentation types      Geometric, blur, noise, photometric, document effects
Background variations   18 textures

7. Notes on Usage, Recommendations, and Evaluation

For stable convergence and robustness, initial training on the clean subset (40% of samples) is recommended, followed by fine-tuning on the augmented samples. If downstream data, such as handwriting-heavy documents, diverges substantially from the synthetic images, domain adaptation or enrichment with additional authentic samples is advised. No baseline OCR accuracy is provided; results should be reported as Character Error Rate (CER) and Word Error Rate (WER) on held-out real-world Kashmiri test sets. The dataset is packaged for immediate download, ingestion, and model training to accelerate OCR research on the low-resource Kashmiri script (Malik, 3 Jan 2026).
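
Since CER and WER are the recommended metrics, a minimal reference computation may be useful. The sketch below uses the standard Levenshtein edit distance and aggregates errors over the whole test set (total edits divided by total reference length). Applying Unicode NFC normalization before comparison is an assumption added here, motivated by the combining marks used in Kashmiri orthography; the source does not prescribe a normalization step.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def _error_rate(refs, hyps, tokenize):
    edits = length = 0
    for ref, hyp in zip(refs, hyps):
        ref_units = tokenize(unicodedata.normalize("NFC", ref))
        hyp_units = tokenize(unicodedata.normalize("NFC", hyp))
        edits += edit_distance(ref_units, hyp_units)
        length += len(ref_units)
    if length == 0:
        raise ValueError("references are empty")
    return edits / length


def cer(refs, hyps):
    """Character Error Rate over Unicode code points."""
    return _error_rate(refs, hyps, list)


def wer(refs, hyps):
    """Word Error Rate over whitespace-separated tokens."""
    return _error_rate(refs, hyps, str.split)
```

Because each image in this corpus holds a single word, WER on such word-level samples effectively measures the fraction of words not recognized exactly, while CER gives partial credit for near misses.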
