
Indian Language Scene Text Recognition

Updated 5 December 2025
  • Indian Language Scene Text Recognition is the automated process of detecting, transcribing, and analyzing text in natural images featuring diverse Indic scripts.
  • Modern methodologies utilize deep learning pipelines combining text detection, script identification, and word recognition, evaluated by metrics like WRR, CER, and F1 scores.
  • Key challenges include complex orthography, visual noise, and data scarcity, driving innovations in synthetic pre-training, transfer learning, and script-aware architecture design.

Indian language scene text recognition encompasses the automated detection, transcription, and interpretation of text appearing in natural scene images across India’s diverse scripts. The field addresses significant challenges such as script diversity, complex orthography, idiosyncratic type styles, substantial visual noise, and a pronounced scarcity of high-fidelity annotated datasets. In recent years, the discipline has progressed from ad hoc detection-recognition pipelines to multimodal, large-scale deep learning systems, underpinned by both synthetic and real data, and evaluated through robust, fine-grained benchmarks targeting multiple Indic scripts (De et al., 28 Nov 2025, Lunia et al., 12 Mar 2024, Mathew et al., 2021, Gunna et al., 2022, Nag et al., 2018).

1. Datasets and Benchmarking Corpora

The availability of annotated corpora that reflect the linguistic and visual heterogeneity of the Indian context is a central enabler for model development and evaluation. Table 1 summarizes representative datasets:

| Dataset | Languages | Real Word Count | Scene Images |
|---|---|---|---|
| BSTD (De et al., 28 Nov 2025) | 12 incl. English | 126,292 | 6,582 |
| IndicSTR12 (Lunia et al., 12 Mar 2024) | 12 (excl. English) | 27,000+ | n/a |
| IIIT-ILST (Mathew et al., 2021) | 3 | ≈3,100 | n/a |
| MLT-17/19 | Hindi, Bangla, others | 4,000+ | n/a |

BSTD provides the most comprehensive scene-anchored resource, spanning Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu, and English, and supports multi-task annotation: polygonal detection, cropped word recognition, script ID, and end-to-end evaluation. IndicSTR12 matches large Latin benchmarks in both scale and complexity for 12 Indic scripts, with expert transcriptions and deliberate tagging of orientation, occlusion, and resolution (Lunia et al., 12 Mar 2024). The IIIT-ILST corpus, though smaller, remains a staple for benchmarking Devanagari, Telugu, and Malayalam (Mathew et al., 2021).

Synthetic datasets, typically ranging from two to over ten million word-instances per language, are now standard in pre-training, generated via font, background, and geometric augmentation pipelines adapted to the Unicode properties and orthography of Indian scripts (Lunia et al., 12 Mar 2024, Mathew et al., 2021, Gunna et al., 2022).
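The augmentation side of such a synthetic pipeline can be sketched as a parameter sampler. The spec fields, value ranges, and font file names below are illustrative assumptions, not the configuration of any cited generator, and the glyph-rendering step itself (e.g., with an image library) is omitted:

```python
import random

def sample_word_spec(word, fonts, rng):
    """Sample one synthetic word-instance spec: font choice plus geometric
    and photometric augmentation parameters (ranges are illustrative)."""
    return {
        "text": word,
        "font": rng.choice(fonts),
        "rotation_deg": rng.uniform(-15, 15),       # mild in-plane rotation
        "perspective_warp": rng.uniform(0.0, 0.2),  # fraction of image width
        "blur_sigma": rng.uniform(0.0, 1.5),        # Gaussian blur strength
        "fg_color": tuple(rng.randrange(256) for _ in range(3)),
        "bg_texture": rng.choice(["plain", "gradient", "scene_crop"]),
    }

def build_corpus(vocab, fonts, n_per_word, seed=0):
    """Enumerate n_per_word randomized specs for every vocabulary word."""
    rng = random.Random(seed)
    return [sample_word_spec(w, fonts, rng) for w in vocab for _ in range(n_per_word)]

specs = build_corpus(["नमस्ते", "வணக்கம்"],
                     ["Lohit-Devanagari.ttf", "Lohit-Tamil.ttf"], 3)
print(len(specs))  # 6 specs, one per rendered word instance
```

Scaling the vocabulary and `n_per_word` to the multi-million counts cited above is then only a matter of rendering throughput.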

2. Model Architectures and Recognition Pipelines

Modern pipelines for Indian language scene text recognition are predominantly deep learning-based, integrating three canonical components: text detection, script identification, and word recognition.

Detection

  • EAST variants: Used as a base in earlier works, the Efficient and Accurate Scene Text detector employs a fully convolutional PVAnet backbone, followed by a geometry head that predicts rotated rectangles or quadrilaterals. Non-maximum suppression is calibrated to Indian scene specifics, e.g., τ_NMS = 0.2 for close wordlines (Nag et al., 2018).
  • TextBPN++: A transformer-enhanced boundary proposal network, surpassing prior methods on BSTD with F1 ≈ 0.77, due to high recall of small and incidental word instances (De et al., 28 Nov 2025).
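The calibrated NMS step can be sketched for axis-aligned boxes. EAST itself predicts rotated rectangles or quadrilaterals, whose IoU is more involved, so this is a simplified axis-aligned version using the τ_NMS = 0.2 threshold mentioned above:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, tau=0.2):
    """Greedy non-maximum suppression. A low tau suppresses aggressively,
    which helps separate tightly packed wordlines."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= tau for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 100, 30), (5, 2, 105, 32), (0, 40, 100, 70)]
print(nms(boxes, [0.9, 0.8, 0.85]))  # [0, 2]: the near-duplicate box 1 is dropped
```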

Script Identification

  • ViT-based classifiers: Vision Transformer backbones achieve 80.5% (12-way) to 95% accuracy (3-way, e.g., Telugu) following fine-tuning on BSTD, though confusion persists between glyph-similar pairs such as Assamese/Bengali and Hindi/Marathi (De et al., 28 Nov 2025).
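Glyph-similar confusions of this kind are usually surfaced by ranking the off-diagonal cells of the evaluation confusion matrix. A minimal sketch with hypothetical counts (the matrix values below are invented for illustration, not from the cited evaluation):

```python
def most_confused_pairs(labels, conf, top_k=2):
    """Rank off-diagonal (true, predicted) cells of a confusion matrix;
    useful for surfacing glyph-similar script pairs after evaluation."""
    cells = [
        (conf[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j and conf[i][j] > 0
    ]
    return [(t, p, n) for n, t, p in sorted(cells, reverse=True)[:top_k]]

labels = ["Assamese", "Bengali", "Hindi", "Marathi"]
conf = [  # rows = true script, cols = predicted (hypothetical counts)
    [80, 15,  3,  2],
    [12, 82,  4,  2],
    [ 1,  2, 70, 27],
    [ 0,  1, 31, 68],
]
print(most_confused_pairs(labels, conf))
# [('Marathi', 'Hindi', 31), ('Hindi', 'Marathi', 27)]
```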

Word Recognition

  • CRNN/Hybrid CNN-RNN: Seven-layer VGG-style CNN encoders feed two- or three-layer BiLSTMs, with CTC loss enforcing sequence alignment. Pooling strategy is adjusted (2×1 spatial window) to preserve higher temporal resolution (Mathew et al., 2021, Gunna et al., 2022).
  • PARSeq: A transformer-based autoregressive recognizer, pre-trained on synthetic data, then fine-tuned on real cropped word instances; demonstrates leading WRR in multiscript evaluations (Lunia et al., 12 Mar 2024, De et al., 28 Nov 2025).
  • STARNet: Incorporates a spatial transformer for geometric rectification, Inception-ResNet feature extraction, BiLSTM sequence modeling, and CTC transcription. Fine-tuned STARNet architectures consistently outperform CRNNs on real-word benchmarks (Gunna et al., 2022).

End-to-end pipelines, such as IndicPhotoOCR, integrate all components and set the current open-source state-of-the-art for both detection and recognition in Indian scenes (De et al., 28 Nov 2025).

3. Evaluation Protocols and Metrics

Benchmarking in Indian scene text recognition employs standard and rigorous metrics at all stages:

  • Detection: Precision (P), recall (R), and F1-score under TedEval or similar protocols, using polygonal overlap matching (De et al., 28 Nov 2025, Nag et al., 2018).
  • Script Identification: Classification accuracy on cropped word images, with balanced per-language sampling (De et al., 28 Nov 2025).
  • Recognition: Word Recognition Rate (WRR), Character Recognition Rate (CRR), Word Error Rate (WER), and Character Error Rate (CER):
    • $\mathrm{WRR} = \left(1 - \dfrac{S_w + D_w + I_w}{N_w^{\mathrm{total}}}\right) \times 100\%$
    • $\mathrm{CRR} = \left(1 - \dfrac{S_c + D_c + I_c}{N_c^{\mathrm{total}}}\right) \times 100\%$
    • where $S$, $D$, and $I$ denote substitutions, deletions, and insertions at the word ($w$) and character ($c$) level, and $N^{\mathrm{total}}$ is the reference word/character count (Lunia et al., 12 Mar 2024, De et al., 28 Nov 2025, Mathew et al., 2021).
  • End-to-end: F1-score, WRR, and CRR over full scene predictions.
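The character-level metrics above reduce to Levenshtein edit operations between reference and predicted strings. A minimal sketch (the exact-match WRR shown is the variant most scene-text papers report; the error-count formulation above is equivalent at the character level):

```python
def edit_ops(ref, hyp):
    """Levenshtein distance = minimal substitutions + deletions + insertions."""
    n = len(hyp)
    d = list(range(n + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(
                d[j] + 1,                            # deletion
                d[j - 1] + 1,                        # insertion
                prev + (ref[i - 1] != hyp[j - 1]),   # substitution / match
            )
    return d[n]

def crr(refs, hyps):
    """Character Recognition Rate over paired reference/hypothesis words."""
    errs = sum(edit_ops(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return (1 - errs / total) * 100

def wrr(refs, hyps):
    """Word Recognition Rate: fraction of exactly matched words."""
    return 100 * sum(r == h for r, h in zip(refs, hyps)) / len(refs)

refs, hyps = ["नमस्ते", "दिल्ली"], ["नमस्ते", "दिल्ल"]
print(wrr(refs, hyps))  # 50.0: one word matches exactly, one drops a matra
```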

Performance is script-dependent; for instance, on BSTD, fine-tuned PARSeq achieves WRRs ranging from 56% (Telugu) to 92% (English), with South-Indian scripts notably lagging (De et al., 28 Nov 2025). On IndicSTR12, the best baseline reaches WRR ≈ 78% for Punjabi, ≈ 44% for Urdu (Lunia et al., 12 Mar 2024). End-to-end F1 scores for open-source pipelines currently plateau around 0.45–0.51, depending on ground-truth injection at various stages (De et al., 28 Nov 2025).

4. Script-Specific and Multiscript Challenges

Indian language scene text recognition must address script-level and visual obstacles that are absent in Latin-script contexts:

  • Alphabet size: Indic scripts require output vocabularies of 100–250+ characters, several times larger than Latin’s 26 (Lunia et al., 12 Mar 2024).
  • Glyph structure: Compound characters, ligatures, and “matra” diacritics above, below, or to the side of base graphemes introduce sequence and placement ambiguity (Lunia et al., 12 Mar 2024, De et al., 28 Nov 2025).
  • Font and appearance variety: Hand-painted, nonstandard typefaces with irregular kerning, stroke width, or incomplete glyphs severely confound feature learning (De et al., 28 Nov 2025, Nag et al., 2018).
  • Environmental degradation: Occlusion, low light, blur, perspective warping, and background clutter are pervasive, as documented by explicit difficulty tags in IndicSTR12 (Lunia et al., 12 Mar 2024).
  • Multiscript and code-mixed signage: Co-occurrence of English and regional scripts, or multiple Indian scripts, causes frequent confusion in both script detection and word recognition (De et al., 28 Nov 2025).

Error analysis reveals persistent issues with matra/non-base mark detection at low resolution, cross-script glyph similarity (e.g., Punjabi “ਸ” vs Hindi “स”), hallucinated consonant clusters, and mis-segmentation under close wordline arrangement (Lunia et al., 12 Mar 2024, De et al., 28 Nov 2025).

5. Advances in Training Strategies and Transfer Learning

A substantial body of work demonstrates the critical role of transfer learning and synthetic data in overcoming resource scarcity:

  • Synthetic pre-training: All state-of-the-art models initiate with multi-million–sized synthetic corpora per script, rendered with wide font, color, scene, and geometric variation (Lunia et al., 12 Mar 2024, Mathew et al., 2021, Gunna et al., 2022).
  • Cross-script transfer: Transfer learning between Indian scripts (Indic→Indic) yields consistent performance gains, attributed to shared visual and n-gram patterns, as measured by low KL-divergence of n-gram distributions. In contrast, English→Indic transfer is largely ineffective; filters trained on Latin scripts fail to capture key features such as top connectors or matra positions (Gunna et al., 2022).
  • Fine-tuning on real data: Even modest amounts of carefully annotated, script-specific real data significantly improve WRR/CRR, with absolute gains of 5–26% over synthetic-only models (De et al., 28 Nov 2025, Mathew et al., 2021, Gunna et al., 2022).
  • Correction modules: The addition of posthoc BiLSTM correction layers can further reduce character-level confusion, particularly in complex scripts such as Bangla (Gunna et al., 2022).
  • Multilingual and cross-script pre-training: Early results indicate that pooling training across closely related scripts improves generalization, e.g., joint Hindi+Gujarati training yielded +4% WRR on Hindi (Lunia et al., 12 Mar 2024).
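The n-gram KL-divergence measurement mentioned above can be sketched as follows; the word lists are toy placeholders, whereas the cited study computes the statistic over large per-script corpora:

```python
import math
from collections import Counter

def ngram_dist(words, n=2):
    """Character n-gram distribution of a word list, as a probability dict."""
    counts = Counter(g for w in words
                     for g in (w[i:i + n] for i in range(len(w) - n + 1)))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q), with epsilon smoothing for n-grams unseen in q.
    A low value suggests the two scripts share sequential statistics,
    which is the rationale for Indic-to-Indic transfer."""
    return sum(pv * math.log(pv / q.get(g, eps)) for g, pv in p.items())

hindi = ["नमस्ते", "दिल्ली", "भारत"]     # toy vocabularies only;
marathi = ["नमस्कार", "मुंबई", "भारत"]    # real studies use full corpora
print(kl_divergence(ngram_dist(hindi), ngram_dist(marathi)))
```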

6. Limitations and Future Research Directions

Despite marked progress, multiple open research problems persist:

  • Detection under occlusion and low signal: Current detectors degrade sharply under severe occlusion, small or fragmented text, and extreme noise (Nag et al., 2018, De et al., 28 Nov 2025).
  • Disambiguation of place and language: For end-to-end systems mapping recognized address text to regional language, toponym resolution when multiple candidates exist remains unsolved (Nag et al., 2018).
  • Data sparsity for low-resource scripts: Languages such as Malayalam, Urdu, and Marathi are under-represented in all real-data corpora, and augmentation pipelines are yet to yield Latin-benchmark parity for these cases (Lunia et al., 12 Mar 2024, De et al., 28 Nov 2025).
  • Cross-script confusion: Scripts sharing Unicode blocks or similar base glyphs (e.g., Devanagari block for Hindi and Marathi) resist reliable script ID and recognition segmentation (De et al., 28 Nov 2025).
  • Integration of script awareness: Architectures jointly modeling script, spatial transcription, and detection, rather than composing standalone modules, represent a promising direction (De et al., 28 Nov 2025).
  • Domain adaptation: The gap between synthetic and real scene performance, especially in complex environments, persists. There is growing interest in GAN-augmented data, multimodal pre-training (CLIP-style objectives for Indic scripts), and self- or semi-supervised adaptation (De et al., 28 Nov 2025, Lunia et al., 12 Mar 2024).

Future work will likely involve larger and more diverse real data acquisition, domain-adaptive architectures, script-specialized modeling that captures sub-character structure, and integration of fine-grained lexical and topological context.

7. Summary Table of Core Model Performance

For reference, typical performance benchmarks (WRR unless noted) across recent large-scale corpora:

| Architecture/Data | BSTD (Avg WRR) (De et al., 28 Nov 2025) | IndicSTR12 (Avg WRR) (Lunia et al., 12 Mar 2024) | IIIT-ILST (Hindi) (Mathew et al., 2021, Gunna et al., 2022) |
|---|---|---|---|
| PARSeq (real fine-tune) | 0.73 | 0.44–0.78 (by script) | n/a |
| STARNet (fine-tune) | n/a | 0.58–0.91 (synthetic only) | 0.47–0.75 |
| CRNN (fine-tune) | n/a | 0.39–0.82 (synthetic only) | 0.43 |
| IndicPhotoOCR (end-to-end, OS) | 0.36 | n/a | n/a |
| Google OCR | 0.41 | n/a | n/a |

The Indian language scene text recognition field is thus defined by high script variability, diverse and challenging image quality, and rapid evolution of benchmarks and models. Progress is tightly coupled to the growth of multiscript, real-scene corpora and innovations in script-aware deep learning architectures (De et al., 28 Nov 2025, Lunia et al., 12 Mar 2024, Mathew et al., 2021, Gunna et al., 2022, Nag et al., 2018).
