BanglaWriting: Bengali Handwriting Dataset
- The BanglaWriting dataset is a comprehensive offline Bengali handwriting resource featuring high-quality page images and dense word-level annotations.
- It includes detailed demographic metadata and dual imaging sources, capturing real-world writing artifacts essential for robust OCR and biometric studies.
- The dataset’s structured annotations and explicit error marking enable reproducible benchmarking for transcription, generative modeling, and writer identification.
The BanglaWriting dataset is a multi-purpose offline Bengali handwriting resource designed to support advanced research in optical character recognition (OCR), handwriting generation, segmentation, and writer identification. It provides high-quality page-level images of handwritten Bengali text, dense word-level manual annotations, and comprehensive demographic metadata, facilitating tasks that require fine-grained linguistic and biometric analysis. The dataset’s detailed structure and robust annotation protocols make it a cornerstone for evaluating and developing state-of-the-art algorithms in the Bengali script domain (Mridha et al., 2020).
1. Dataset Composition and Statistical Properties
BanglaWriting comprises handwritten samples from 260 distinct individuals, each contributing a single A4 page of unconstrained Bengali text. The total content encompasses 21,234 word instances and 32,787 characters, with a vocabulary size of 5,470 unique Bengali words. The dataset integrates real-world writing phenomena, including 261 cases of partially overwritten but legible words and 450 occurrences characterized by full strikethroughs, mistakes, or random ink-marks.
Key dataset statistics:
- Average words per page: 21,234 / 260 ≈ 81.7
- Average characters per word: 32,787 / 21,234 ≈ 1.54
- Unique-word ratio: 5,470 / 21,234 ≈ 0.26
The word-count per page follows a near-normal distribution centered around 80, with some pages exceeding 150 words. The prevalence of words containing two to four graphemes reflects the agglutinative character of Bengali orthography. This statistical profile demonstrates both the vocabulary diversity and the typical morphological structure of the language in handwritten form (Mridha et al., 2020).
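As a quick check, the three ratios can be recomputed from the totals reported above. The snippet below is only an arithmetic sketch; the function and constant names are illustrative, and the characters-per-word figure depends on how the source counts characters.

```python
# Totals as reported in the text above.
NUM_PAGES = 260
NUM_WORDS = 21_234
NUM_CHARS = 32_787
VOCAB_SIZE = 5_470


def summary_stats(pages, words, chars, vocab):
    """Return the three derived ratios as a dict."""
    return {
        "words_per_page": words / pages,
        "chars_per_word": chars / words,
        "unique_word_ratio": vocab / words,
    }


stats = summary_stats(NUM_PAGES, NUM_WORDS, NUM_CHARS, VOCAB_SIZE)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```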
2. Annotation Protocols and Data Organization
Each page is annotated at the word level, with precise bounding boxes and Unicode string labels. Annotations are manually generated and verified using the labelme tool. The dataset provides both the original (“raw”) digitized images and an enhanced (“converted”) variant processed for illumination and background consistency, produced using OpenCV and a supplemental Python script.
Each raw and processed image is named using the schema personIdentifier_age_gender.jpg, accompanied by a JSON annotation file with the same base name. The JSON “shapes” array records bounding boxes and labels as:
    {
      "label": "UnicodeString",
      "points": [[xmin, ymin], [xmax, ymax]]
    }
The label takes one of three forms:
- Exact UTF-8 transcription for normal words.
- When overwriting occurs but the word remains readable, the label omits struck-out graphemes and appends an asterisk “*”.
- For fully struck-through words or indiscriminate ink-marks, the label is “*” only.
All bounding boxes and transcriptions are manually reviewed, maximizing annotation fidelity (Mridha et al., 2020).
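The labelling conventions above can be turned into a small reader. The following is a minimal sketch, assuming each annotation file holds a "shapes" array whose entries carry exactly the "label" and two-corner "points" fields described in this section. The function names and the sample annotation are hypothetical and are not part of the dataset's tooling.

```python
import json


def classify_label(label):
    """Map a label string to 'struck', 'overwritten', or 'normal'."""
    if label == "*":
        return "struck"          # fully struck-through word or stray ink-mark
    if label.endswith("*"):
        return "overwritten"     # readable word with a trailing asterisk
    return "normal"


def read_words(annotation_text):
    """Parse one annotation document and return a list of word records."""
    doc = json.loads(annotation_text)
    words = []
    for shape in doc.get("shapes", []):
        (x1, y1), (x2, y2) = shape["points"]
        # Normalise to (xmin, ymin, xmax, ymax) in case the two corners
        # were recorded in a different order.
        box = (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
        label = shape["label"]
        kind = classify_label(label)
        if kind == "struck":
            text = None                # nothing to transcribe
        elif kind == "overwritten":
            text = label[:-1]          # drop the trailing asterisk
        else:
            text = label
        words.append({"text": text, "kind": kind, "box": box})
    return words


# Hypothetical three-word annotation, used only for illustration.
sample = json.dumps({
    "shapes": [
        {"label": "বাংলা", "points": [[120, 40], [30, 80]]},
        {"label": "লেখা*", "points": [[140, 40], [220, 82]]},
        {"label": "*", "points": [[230, 40], [260, 80]]},
    ]
})
records = read_words(sample)
```

Keeping the category separate from the text lets a recognition pipeline drop struck entries while still using overwritten words as harder training examples.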
3. Demographic, Geographic, and Acquisition Details
The dataset encodes age (8 years to mature adult) and binary gender (0 = female, 1 = male) directly in filenames, and geographic origin is documented by district, supporting handwriting variation studies along these demographic axes. Contributors are distributed across eight districts of Bangladesh: Dhaka, Gopalganj, Comilla, Gazipur, Tangail, Netrakona, Kishoreganj, and Mymensingh, with 14 to 48 pages sourced per district. The gender ratio is not specified numerically; the published distribution plots present the sample as balanced across age and gender.
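The filename schema personIdentifier_age_gender.jpg can be unpacked into writer metadata with a few lines of code. This is a sketch under two assumptions that the source does not state: the age field is an integer, and the last two underscore-separated fields are always age and gender. The names Writer, GENDER_CODES, and parse_filename, and the example filenames, are hypothetical.

```python
from pathlib import PurePath
from typing import NamedTuple


class Writer(NamedTuple):
    person_id: str
    age: int
    gender: str


# Gender codes as given in the text: 0 = female, 1 = male.
GENDER_CODES = {"0": "female", "1": "male"}


def parse_filename(name):
    """Split 'personIdentifier_age_gender.ext' into a Writer record."""
    stem = PurePath(name).stem
    # Split from the right so an identifier containing underscores
    # (an assumption, not something the source confirms) stays intact.
    person_id, age, gender = stem.rsplit("_", 2)
    return Writer(person_id, int(age), GENDER_CODES[gender])


# Illustrative filename only; not taken from the dataset.
example = parse_filename("17_22_1.jpg")
```

A malformed name raises ValueError or KeyError rather than returning partial metadata, which makes unexpected files easy to spot.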
Acquisition was performed with two device types to reflect practical conditions: 52 pages were scanned via flatbed, while 208 relied on smartphone camera photography. This dual-source protocol introduces lighting variations, shadows, and device-specific color profiles, further increasing ecological validity for downstream document analysis tasks (Mridha et al., 2020).
4. File Structure and Accessibility
The data repository is structured as follows:
    banglawriting/
    ├─ raw/
    │  ├─ personId_age_gender.jpg
    │  └─ personId_age_gender.json
    └─ converted/
       ├─ personId_age_gender.jpg
       └─ personId_age_gender.json
- Both images and annotations are provided for each instance.
- Researchers can download the resource as two ZIP archives (“raw.zip” and “converted.zip”) from https://data.mendeley.com/datasets/r43wkvdk4w/1.
- The dataset is open access for academic use, with citation required for derivative works (Mridha et al., 2020).
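Given the layout above, each image can be matched to its annotation by base name. The sketch below assumes the archives have been extracted into a folder containing raw/ and converted/ subdirectories. The helper paired_samples is hypothetical, and the demonstration builds a temporary directory with placeholder files rather than touching real data.

```python
import tempfile
from pathlib import Path


def paired_samples(root, variant="raw"):
    """Yield (image_path, annotation_path) for each image with a matching JSON file."""
    folder = Path(root) / variant
    for image_path in sorted(folder.glob("*.jpg")):
        annotation_path = image_path.with_suffix(".json")
        if annotation_path.exists():
            yield image_path, annotation_path


# Demonstration on empty placeholder files with illustrative names.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    raw.mkdir()
    for name in ("1_20_0.jpg", "1_20_0.json", "2_31_1.jpg"):
        (raw / name).touch()
    found = [(img.name, ann.name) for img, ann in paired_samples(tmp)]
```

Images without an annotation file are skipped silently here; a stricter loader might raise instead.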
5. Applications and Research Significance
BanglaWriting is immediately deployable for:
- Optical Word Recognition (OWR): Real-world imaging artifacts enable evaluation of OWR systems’ robustness to noise, non-uniform backgrounds, and writer variability.
- Robust Transcription: Explicit annotation of mistakes and overwriting events supports research in error-aware modeling and correction systems.
- Writer Identification/Verification: Unique writer IDs, age, gender, and geographic metadata facilitate biometric handwriting research and demographic handwriting analysis.
- Generative Modeling: Clean, word-segmented data is suitable for training generative models to synthesize handwritten Bengali words, enabling text-to-handwriting conversion systems.
- Segmentation-Plus-Recognition: The presence of hand-drawn bounding boxes and word-level labels enables end-to-end training for architectures combining segmentation and transcription.
No single train/validation/test split is prescribed, so the dataset admits flexible partitioning (e.g., an 80/10/10 writer-wise allocation); stating the chosen split keeps benchmarks comparable and reproducible across handwriting recognition and identification frameworks (Mridha et al., 2020).
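One way to realise a writer-wise 80/10/10 allocation is a seeded shuffle over writer identifiers. This is a sketch, not a split endorsed by the dataset authors; the function name, the seed, and the use of integers 1 to 260 as stand-in identifiers are all assumptions. Because each writer contributes one page, splitting by writer also keeps every page in exactly one partition.

```python
import random


def split_writers(writer_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle writer IDs deterministically and cut them into train/val/test."""
    ids = sorted(set(writer_ids))          # fixed starting order
    random.Random(seed).shuffle(ids)       # seeded, so the split is repeatable
    n_train = round(len(ids) * fractions[0])
    n_val = round(len(ids) * fractions[1])
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],     # remainder goes to test
    }


# Stand-in identifiers for the 260 writers.
splits = split_writers(range(1, 261))
print({name: len(ids) for name, ids in splits.items()})
```

Publishing the seed alongside the results, or the resulting ID lists themselves, lets others reproduce the same partition.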
6. Comparative Context and Distinction
Relative to other Bengali handwriting resources—such as BN-HTRd, which supplies multi-level annotation for large-scale page, line, and word recognition tasks (Rahman et al., 2022), or BanglaLekha-Isolated, which focuses on isolated character recognition (Biswas et al., 2017)—BanglaWriting is distinguished by its:
- Dense word-level manual annotation,
- Explicit representation of over-writing and error events,
- Balanced demographic and geographic sampling,
- Dual-channel imaging strategy with both scanner and smartphone acquisition.
Together, these properties make it a versatile and challenging resource for the offline recognition and generative modeling of Bengali handwriting, supporting both applied OCR research and foundational studies of handwritten script variation (Mridha et al., 2020).