- The paper introduces Naamah, which combines DBpedia seeding and a 24B parameter LLM to generate a silver-standard Sanskrit NER dataset.
- The methodology leverages SPARQL extraction, generative data augmentation, and heuristic postprocessing to ensure BIO-tag consistency.
- Experimental results show IndicBERTv2 achieves an F1 of 0.9615, underscoring the importance of tokenizer and pretraining domain alignment.
Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation
Introduction and Motivation
The paper introduces Naamah, a large-scale synthetic dataset for Named Entity Recognition (NER) in Sanskrit, addressing the acute scarcity of annotated corpora that hinders advances in computational linguistics for classical languages. Given Sanskrit’s morphological richness, extensive inflection, and sandhi phenomena, developing high-quality NER resources is non-trivial. Existing corpora are limited, domain-specific, or compromised by projection errors arising from parallel alignment with structurally divergent languages such as English. Manual annotation is not scalable due to the requirement for specialized expertise.
This work combines entity extraction from DBpedia with a generative LLM tailored for Indic scripts, producing a silver-standard dataset of over 102k sentences. The resulting corpus supports benchmarking of NER architectures under realistic and morphologically challenging linguistic conditions.
Methodology
Entity Harvesting via DBpedia
A SPARQL-driven pipeline extracts a broad and diverse set of entities—Person, Location, Organization—from DBpedia. Coverage is intentionally global, including non-native and transliterated proper nouns, which discourages heuristics based solely on lexical familiarity and forces models to generalize over complex entity morphology and irregularities in Sanskrit.
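The harvesting step can be pictured with a small query builder. This is a minimal sketch, not the authors' pipeline: the label-to-class mapping, the English-label filter, and the paging parameters are all assumptions made for illustration. `dbo:Person`, `dbo:Place`, and `dbo:Organisation` are real DBpedia ontology classes, but the summary does not say which classes the paper actually queried.

```python
# Hypothetical mapping from NER label to DBpedia ontology class.
# The classes exist in DBpedia; the choice of these three is an assumption.
ENTITY_CLASSES = {
    "PER": "dbo:Person",
    "LOC": "dbo:Place",
    "ORG": "dbo:Organisation",
}


def build_entity_query(label: str, limit: int = 1000, offset: int = 0) -> str:
    """Return a SPARQL query listing labelled entities of one class."""
    dbo_class = ENTITY_CLASSES[label]
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?entity ?name WHERE {\n"
        f"  ?entity a {dbo_class} ;\n"
        "          rdfs:label ?name .\n"
        # Assumption: English labels are used as seeds for generation.
        '  FILTER (lang(?name) = "en")\n'
        "}\n"
        # ORDER BY keeps LIMIT/OFFSET paging stable across requests.
        "ORDER BY ?entity\n"
        f"LIMIT {limit} OFFSET {offset}"
    )


query = build_entity_query("LOC", limit=500)
print(query)
```

The resulting string could be sent to a SPARQL endpoint with any HTTP client; network access is deliberately left out so the sketch stays self-contained.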
Generative Data Augmentation
Moving beyond deterministic morphological engines, the authors deploy Sarvam-M, a 24B parameter hybrid LLM optimized for Indic languages, to generate contextually fluent Sanskrit sentences embedding these entities. This approach bypasses rigid template-based synthesis and allows naturalistic variation in case endings, syntactic constructions, and sandhi, reflecting real Sanskrit usage. After generation, heuristic postprocessing enforces token-label consistency and valid BIO tagging, and removes malformed examples.
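The summary does not list the specific heuristics, so the following is only an illustrative sketch of what such a postprocessing pass might look like. It assumes two rules: examples whose token and tag counts differ are discarded, and an `I-` tag that does not continue an entity of the same type is rewritten as `B-`. The function name `clean_example` is hypothetical.

```python
from typing import List, Optional, Tuple


def clean_example(
    tokens: List[str], tags: List[str]
) -> Optional[Tuple[List[str], List[str]]]:
    """Return a repaired (tokens, tags) pair, or None if the example is malformed."""
    # Empty output or a token/label length mismatch is treated as malformed.
    if not tokens or len(tokens) != len(tags):
        return None

    fixed: List[str] = []
    prev = "O"
    for tag in tags:
        # Reject anything that is not O, B-<type>, or I-<type>.
        if tag != "O" and tag[:2] not in ("B-", "I-"):
            return None
        # An I- tag must continue a B-/I- tag of the same entity type;
        # otherwise it is promoted to the start of a new entity.
        if tag.startswith("I-") and (prev == "O" or prev[2:] != tag[2:]):
            tag = "B-" + tag[2:]
        fixed.append(tag)
        prev = tag
    return tokens, fixed


tokens = ["रामः", "अयोध्यायाम्", "वसति"]
repaired = clean_example(tokens, ["B-PER", "I-LOC", "O"])
print(repaired)  # the orphan I-LOC becomes B-LOC
```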
Corpus Composition
The final dataset consists of 102,942 sentences, formatted in JSONL with full BIO-tag annotation and divided into training and validation splits. It contains 123,923 unique tokens and 127,397 annotated entities, providing the morphological and semantic variety needed for NER generalization.
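To make the format concrete, here is a sketch of reading BIO-annotated JSONL and computing the same kind of statistics. The field names `tokens` and `ner_tags` and the two sample sentences are assumptions for illustration; the actual schema of the released files may differ.

```python
import json
from collections import Counter
from typing import Dict, Iterable

# Two made-up records in an assumed schema: one JSON object per line.
sample_jsonl = "\n".join(
    [
        json.dumps(
            {"tokens": ["रामः", "अयोध्यायाम्", "वसति"],
             "ner_tags": ["B-PER", "B-LOC", "O"]},
            ensure_ascii=False,
        ),
        json.dumps(
            {"tokens": ["सीता", "मिथिलायाम्", "अजायत"],
             "ner_tags": ["B-PER", "B-LOC", "O"]},
            ensure_ascii=False,
        ),
    ]
)


def corpus_stats(lines: Iterable[str]) -> Dict[str, object]:
    """Count sentences, unique tokens, and entities per type."""
    vocab = set()
    entity_counts: Counter = Counter()
    n_sentences = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        n_sentences += 1
        vocab.update(record["tokens"])
        # Each B- tag opens exactly one entity, so counting B- tags counts entities.
        for tag in record["ner_tags"]:
            if tag.startswith("B-"):
                entity_counts[tag[2:]] += 1
    return {
        "sentences": n_sentences,
        "unique_tokens": len(vocab),
        "entities": dict(entity_counts),
    }


stats = corpus_stats(sample_jsonl.splitlines())
print(stats)
```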
Experimental Evaluation
Model Selection and Training
The study benchmarks two transformer-based models:
- XLM-RoBERTa Base: A multilingual encoder with a 250k token shared vocabulary and minimal Sanskrit-specific pretraining.
- IndicBERTv2 MLM Only: A compact, Indic script-optimized transformer (~130MB), leveraging domain vocabulary alignment for improved linguistic fit.
Both models are fine-tuned on the Naamah data using standard token-classification protocols for NER, without any de-sandhi splitting during preprocessing. Each word is passed through the model's subword tokenizer, and its BIO label is assigned to the first subword only; the remaining subwords are masked out of the loss. The models must therefore learn entity boundaries directly over inflected and sandhi-fused forms.
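The first-subword labelling scheme can be sketched in a few lines. The example assumes a list of per-subword word indices of the kind returned by `word_ids()` on Hugging Face fast tokenizers, with `None` marking special tokens, and uses -100 as the mask value because that is the default `ignore_index` of PyTorch's cross-entropy loss. The function name and the sample indices are illustrative, not taken from the paper.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(
    word_ids: List[Optional[int]], word_labels: List[int]
) -> List[int]:
    """Give each word's label to its first subword and mask everything else."""
    aligned: List[int] = []
    previous: Optional[int] = None
    for wid in word_ids:
        if wid is None:
            # Special tokens such as [CLS] / [SEP] carry no label.
            aligned.append(IGNORE_INDEX)
        elif wid != previous:
            # First subword of a new word keeps the word-level label.
            aligned.append(word_labels[wid])
        else:
            # Later subwords of the same word are masked from the loss.
            aligned.append(IGNORE_INDEX)
        previous = wid
    return aligned


# Word 0 is split into three subwords, word 1 into one.
word_ids = [None, 0, 0, 0, 1, None]
word_labels = [1, 0]  # e.g. 1 = B-LOC, 0 = O in a hypothetical label map
aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, -100, -100, 0, -100]
```

Masking the trailing subwords means a heavily fragmented word contributes one training signal rather than several, so the loss is not skewed toward long or poorly tokenized words.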
Results
IndicBERTv2 achieves an F1 of 0.9615 on the 10,295-sentence validation set, surpassing XLM-RoBERTa’s 0.9506. This performance is achieved despite IndicBERTv2’s order-of-magnitude smaller parameter count, suggesting that tokenizer and vocabulary alignment matter more than raw model capacity for languages with rich inflection and compounding morphology.
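The summary does not say whether the reported F1 is computed at the token level or the entity level. As one common convention, the sketch below computes a strict entity-level micro F1, where a predicted entity counts as correct only if its span and type both match the gold annotation exactly. The function names are illustrative, and this is not claimed to be the authors' evaluation script.

```python
from typing import List, Sequence, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def extract_spans(tags: Sequence[str]) -> Set[Span]:
    """Collect entity spans from one BIO sequence; orphan I- tags are ignored."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def micro_f1(gold: List[Sequence[str]], pred: List[Sequence[str]]) -> float:
    """Strict entity-level micro F1 over parallel lists of tag sequences."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = extract_spans(g_tags), extract_spans(p_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


gold = [["B-PER", "O", "B-LOC", "I-LOC"]]
pred = [["B-PER", "O", "B-LOC", "O"]]  # LOC boundary is wrong
score = micro_f1(gold, pred)
print(score)  # 0.5: one of two entities matched exactly
```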
Accuracy and validation loss follow the same pattern, favoring IndicBERTv2. Combined with its compact size, this makes it a plausible candidate for deployment on resource-constrained edge devices.
Qualitative Analysis
The token-level error analysis exposes significant weaknesses in XLM-RoBERTa’s generic tokenizer, which frequently fragments Sanskrit entities (e.g., "Kuruksetre") into subword roots and suffixes. This leads to BIO tag instability and frequent misclassification, especially when a suffix fragment is assigned a different entity type from its stem (e.g., organization instead of location). IndicBERTv2’s domain-aligned tokenizer preserves entity spans and semantic coherence more reliably, which supports the case for specialized pretraining and vocabulary coverage in this linguistic context.
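One way to surface the kind of tag instability described above is to flag words whose subword fragments were predicted with different entity types. The diagnostic below is a sketch under assumptions: it takes per-subword word indices (as in Hugging Face fast tokenizers, with `None` for special tokens) and per-subword predicted tags, both invented here for illustration. The summary does not describe the tooling the authors used for their error analysis.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set


def words_with_mixed_types(
    word_ids: List[Optional[int]], subword_tags: List[str]
) -> List[int]:
    """Return indices of words whose subwords disagree on entity type."""
    types: Dict[int, Set[str]] = defaultdict(set)
    for wid, tag in zip(word_ids, subword_tags):
        if wid is None:
            continue  # skip special tokens
        # Strip the B-/I- prefix so only the entity type is compared.
        types[wid].add("O" if tag == "O" else tag.split("-", 1)[1])
    return sorted(w for w, seen in types.items() if len(seen) > 1)


# Hypothetical predictions: word 0 is split into three fragments and the
# last fragment drifts from LOC to ORG.
word_ids = [None, 0, 0, 0, 1, None]
subword_tags = ["O", "B-LOC", "I-LOC", "I-ORG", "O", "O"]
unstable = words_with_mixed_types(word_ids, subword_tags)
print(unstable)  # [0]
```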
Implications and Future Directions
The construction of Naamah represents a scalable, domain-adaptive alternative to manual annotation, enabling effective NER for low-resource, morphologically complex languages. The results challenge the assumption that model scale or multilingual breadth can substitute for tokenizer and pretraining domain fit, especially for tasks demanding high morphological sensitivity.
Avenues for future work include:
- Gold Standard Expansion: Integrating manually annotated Sanskrit texts for error analysis and hybrid training, anchoring the silver standard with expert validation.
- Complex Sandhi Modeling: Enhancing synthetic pipelines to faithfully capture and evaluate multi-word sandhi phenomena prevalent in authentic literature.
- Schema Extensions: Adapting NER label sets and annotation methodologies to historical and philological tasks, including diachronic entity types not present in contemporary corpora.
Conclusion
The Naamah dataset and accompanying generation pipeline significantly advance the state of Sanskrit NER, illustrating the feasibility and necessity of hybrid knowledge-centric and LLM-based synthetic data generation in resource-deficient domains. The demonstrated superiority of IndicBERTv2 over XLM-RoBERTa underscores the criticality of script and domain alignment in both tokenization and model pretraining. In sum, this research enables more nuanced applications in Sanskrit computational linguistics and provides a template for addressing similar challenges in other classical or underresourced languages (2604.26456).