Bharat Scene Text Dataset (BSTD)
- Bharat Scene Text Dataset (BSTD) is a comprehensive, multilingual benchmark featuring over 100,000 word instances manually annotated across 12 scripts for Indian languages and English.
- It supports four core tasks—detection, script identification, cropped word recognition, and end-to-end recognition—with robust evaluation protocols and detailed baseline metrics.
- The dataset addresses challenges of script diversity and font variations in real-world scenes, providing a crucial resource for advancing research in Indian scene text understanding.
The Bharat Scene Text Dataset (BSTD) is a comprehensive, large-scale benchmark specifically designed to advance scene text understanding for Indian languages and English. Addressing a longstanding deficit in high-quality, multilingual datasets, BSTD comprises over 100,000 words extracted from 6,582 scene images sourced from diverse public spaces across India. The dataset offers meticulous manual annotation and robust evaluation protocols, supporting four principal scene text tasks: detection, script identification, cropped word recognition, and end-to-end recognition. BSTD has enabled systematic benchmarking of adapted and fine-tuned state-of-the-art models, providing crucial insights into the challenges of Indian language scene text recognition (De et al., 28 Nov 2025).
1. Motivation and Dataset Scope
Scene text recognition for English has achieved significant maturity, yet Indian languages—used by approximately 1.4 billion people—remain underexplored, primarily due to extreme script diversity, non-standardized fonts, varying writing styles, and the historic lack of high-quality annotated corpora. BSTD addresses these deficiencies by presenting a multi-lingual, multi-script, and publicly accessible benchmark, with coverage across Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu, and English. Images in BSTD are sampled from Wikimedia Commons, depicting authentic signage and texts in urban Indian environments—bus stops, railway stations, ATMs, and public billboards—capturing the complexity and natural variation inherent to real-world textual content.
2. Dataset Composition and Statistics
BSTD consists of 6,582 images containing 106,478 manually boxed word instances, alongside 19,814 “Others” instances predominantly representing less-common scripts such as Meitei and Urdu. Annotation granularity extends to 126,292 word-level polygons, supporting precise spatial localization. The dataset is split image-wise into 5,263 training and 1,319 test images (an 80/20 ratio), corresponding to 94,128 and 32,164 word boxes, respectively. Word instance counts for the primary languages in the train/test partitions are detailed below.
| Language | Train Words | Test Words |
|---|---|---|
| Assamese | 2,627 | 1,505 |
| Bengali | 4,936 | 1,368 |
| English | 29,123 | 12,573 |
| Gujarati | 1,884 | 1,015 |
| Hindi | 14,927 | 4,846 |
| Kannada | 2,208 | 720 |
| Malayalam | 2,393 | 547 |
| Marathi | 3,917 | 1,196 |
| Odia | 3,148 | 1,044 |
| Punjabi | 8,319 | 2,880 |
| Tamil | 2,029 | 513 |
| Telugu | 2,215 | 545 |
Word-length and bounding-box size distributions span a spectrum from isolated short texts to long text regions, with most text occupying small spatial extents (59.4% of instances fall between 1,001 and 10,000 square pixels), indicative of incidental and focused signage.
3. Annotation Methodology and Quality Control
Annotation is performed using word-level tight polygons, with coordinates serialized in per-image JSON files. Each word box is assigned a script label (one of the 12 supported scripts), provided by annotators familiar with regional script conventions and image context. Transcriptions are initialized using pseudo-labels from PARSeq models fine-tuned on synthetic Indian text, and subsequently refined by native speaker experts. All visible, legible words are transcribed; unrecognizable instances are marked “###” or omitted. Annotation quality control is two-tiered, imposing stringent consistency and error-correction procedures.
4. Supported Tasks and Evaluation Criteria
BSTD supports four core tasks:
- Scene Text Detection: Input is a raw scene image; output is a set of word-level polygons. Evaluated via precision (P), recall (R), and $F_1$ as computed by TedEval at a fixed IoU threshold.
- Script Identification: Input is a cropped word image; task is to classify into one of 12 scripts. Evaluated by overall accuracy and via confusion matrices.
- Cropped Word Recognition: Given a cropped word image and known script, predict the character sequence. Metric is Word Recognition Rate (WRR).
- End-to-End Scene Text Recognition: Pipeline involves detection → script identification → recognition on raw images. Metrics include WRR and Character Recognition Rate (CRR), calculated as

$$\mathrm{WRR} = \max\!\left(0,\; \frac{N_w - S_w - D_w - I_w}{N_w}\right), \qquad \mathrm{CRR} = \max\!\left(0,\; \frac{N_c - S_c - D_c - I_c}{N_c}\right),$$

where $S$, $D$, $I$ denote substitutions, deletions, and insertions at word ($w$) or character ($c$) granularity, and $N$ indicates the number of ground-truth items; negative values are clipped to zero. Precision/Recall/$F_1$ across recognized and ground-truth word matches are also computed, unconstrained by reading order.
5. Baseline Models and Benchmark Results
BSTD provides rigorous baselines by adapting and fine-tuning leading scene-text models, alongside domain-specific quantitative analyses.
Detection:
TextBPN++ fine-tuned on synthetic Indic text achieved the best performance (P=0.75, R=0.78, $F_1$=0.77). In comparison, other detectors yielded markedly lower $F_1$ scores (EAST, 0.17; CRAFT, 0.19; DBNet, 0.59; Hi-SAM, 0.46).
Script Identification:
The backbone for this task is ViT-Base-Patch16-224 (ImageNet-21k pre-trained). For 3-way classification (regional vs. Hindi vs. English), per-language accuracy ranged from 90%–95% (e.g., Telugu 95.0%, Kannada 94.5%, Bengali 93.3%). For the 12-way task, overall accuracy was 80.5%, with confusions primarily between Assamese/Bengali and Hindi/Marathi owing to script similarities. CLIP baseline accuracy was 67.7%.
Cropped Word Recognition:
The PARSeq model, trained in two stages (synthetic pre-training on SynthText + AI4Bharat data followed by fine-tuning on BSTD), yielded significant performance increases. On BSTD-Test:
- PARSeq (synthetic only): average WRR ≈ 47% (range 32–92%)
- PARSeq + BSTD fine-tune: average WRR ≈ 73% (56–92%)
- Example fine-tuned WRR: English 92%, Marathi 86%, Bengali 82%, Tamil 80%, Hindi 71%, Telugu 56%, Malayalam 58%

Off-the-shelf OCR baselines performed substantially worse (Tesseract ~15%, PaddleOCR ~29%, EasyOCR ~18%).
End-to-End Recognition:
IndicPhotoOCR (an open-source pipeline combining TextBPN++ detection, ViT script identification, and PARSeq recognition) achieved an average WRR of 36% and CRR of 54%. Commercial benchmarks: Google OCR (WRR ≈ 41%, CRR ≈ 55%) and GPT-4 Vision (WRR ≈ 13%, CRR ≈ 21%). Oracle variants using ground-truth detection and/or script identification achieved WRR up to 71% and CRR up to 88%, revealing the impact of error propagation, particularly from detection and script classification. English WRR in the end-to-end setting degraded from 92% (cropped) to ~30% due to these sources of error propagation.
6. Data Availability and Toolkit Integration
BSTD and its documentation are publicly available:
- Project page: https://vl2g.github.io/projects/IndicPhotoOCR/
- Dataset repository: https://github.com/Bhashini-IITJ/BharatSceneTextDataset
An open-source toolkit, “IndicPhotoOCR” (MIT-style license), integrates detection, script identification, and recognition models:
- Toolkit: https://github.com/Bhashini-IITJ/IndicPhotoOCR
- Installation:

```bash
git clone https://github.com/Bhashini-IITJ/IndicPhotoOCR.git
cd IndicPhotoOCR && ./setup.sh
```
- Python API: provides `detect()`, `identify()`, `recognise()`, and `ocr()` methods, with usage examples in the provided README (see the sketch below).
7. Significance and Research Outlook
BSTD represents the first comprehensive, publicly available dataset to support Indian language scene text tasks across detection, script identification, cropped word recognition, and end-to-end evaluation. Benchmarks demonstrate that, unlike the near-solved status of English scene text recognition, Indian language tasks exhibit multiple persisting challenges, including script ambiguities and extensive font diversity. The dataset's open availability, detailed annotation, and strong baseline implementations provide a foundation for further research in multilingual scene text understanding, fostering progress in both model development and application domains where Indian scripts are underrepresented (De et al., 28 Nov 2025).