Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation

Published 29 Apr 2026 in cs.CL and cs.AI | (2604.26456v1)

Abstract: The digitisation of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition. While recent methodologies utilise generic LLMs for data augmentation, these approaches remain prone to error and often lack the reasoning depth required for classical grammar. In this work, we introduce Naamah, a high quality silver standard Sanskrit NER dataset comprising 102,942 sentences. We propose a methodology that combines entity extraction from DBpedia with the generative capabilities of a 24B parameter hybrid reasoning model to create grammatically natural and synthetically diverse training data. We utilize this dataset to benchmark two transformer architectures: the massive multilingual XLM RoBERTa and the parameter efficient IndicBERTv2.


Authors (2)

Summary

The paper introduces Naamah, which combines DBpedia seeding and a 24B parameter LLM to generate a silver-standard Sanskrit NER dataset.
The methodology leverages SPARQL extraction, generative data augmentation, and heuristic postprocessing to ensure BIO-tag consistency.
Experimental results show IndicBERTv2 achieves an F1 of 0.9615, underscoring the importance of tokenizer and pretraining domain alignment.


Introduction and Motivation

The paper introduces Naamah, a large-scale synthetic dataset for Named Entity Recognition (NER) in Sanskrit, addressing the acute scarcity of annotated corpora that hinders advances in computational linguistics for classical languages. Given Sanskrit’s morphological richness, extensive inflection, and sandhi phenomena, developing high-quality NER resources is non-trivial. Existing corpora are limited, domain-specific, or compromised by projection errors arising from parallel alignment with structurally divergent languages such as English. Manual annotation is not scalable due to the requirement for specialized expertise.

This work systematically combines entity extraction from DBpedia with generative LLMs tailored for Indic scripts, producing a silver standard dataset of over 102k sentences. The resulting corpus enables robust benchmarking of NER architectures under realistic and morphologically challenging linguistic conditions.

Methodology

Entity Harvesting via DBpedia

A SPARQL-driven pipeline extracts a broad and diverse set of entities—Person, Location, Organization—from DBpedia. Coverage is intentionally global, including non-native and transliterated proper nouns, which discourages heuristics based solely on lexical familiarity and forces models to generalize over complex entity morphology and irregularities in Sanskrit.
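A harvesting step of this kind could be sketched as follows. The summary does not give the authors' actual query, so the ontology classes (dbo:Person, dbo:Place, dbo:Organisation), the English-label filter, the paging parameters, and both helper names are assumptions made for illustration. The sample response is hand-written in the standard SPARQL JSON results shape, so the sketch runs without network access.

```python
# Hypothetical mapping from NER types to DBpedia ontology classes (an assumption,
# not taken from the paper).
ENTITY_CLASSES = {
    "PER": "dbo:Person",
    "LOC": "dbo:Place",
    "ORG": "dbo:Organisation",
}


def build_query(ner_type: str, limit: int = 1000, offset: int = 0) -> str:
    """Return a SPARQL query selecting English labels for one entity class."""
    dbo_class = ENTITY_CLASSES[ner_type]
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?entity ?label WHERE {\n"
        f"  ?entity a {dbo_class} ;\n"
        "          rdfs:label ?label .\n"
        '  FILTER (lang(?label) = "en")\n'
        "}\n"
        f"LIMIT {limit} OFFSET {offset}"
    )


def parse_bindings(result: dict, ner_type: str) -> list[tuple[str, str]]:
    """Extract (label, type) pairs from a SPARQL JSON results document."""
    rows = result["results"]["bindings"]
    return [(row["label"]["value"], ner_type) for row in rows]


# Hand-written sample response; a real run would send the query to an endpoint.
sample = {"results": {"bindings": [
    {"entity": {"type": "uri", "value": "http://dbpedia.org/resource/Kurukshetra"},
     "label": {"type": "literal", "xml:lang": "en", "value": "Kurukshetra"}},
]}}

print(build_query("LOC", limit=5))
print(parse_bindings(sample, "LOC"))
```

Paging with LIMIT and OFFSET is one simple way to stay under endpoint result caps; the pipeline described in the paper may handle this differently.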

Generative Data Augmentation

Rather than relying on deterministic morphological engines, the authors use Sarvam-M, a 24B-parameter hybrid LLM optimized for Indic languages, to generate contextually fluent Sanskrit sentences that embed these entities. This avoids rigid template-based synthesis and allows natural variation in case endings, syntactic constructions, and sandhi, as found in real Sanskrit usage. After generation, heuristic postprocessing enforces token-label consistency, checks BIO tagging, and removes malformed samples.
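The postprocessing checks can be illustrated with a minimal sketch. The exact heuristics are not listed in this summary, so the two rules below (equal token and tag counts, and every I- tag continuing an entity of the same type) are assumptions, as are the function names, the "tokens" and "tags" keys, and the toy sentences.

```python
def is_valid_bio(tokens: list[str], tags: list[str]) -> bool:
    """Check token/tag alignment and BIO well-formedness for one sentence."""
    if not tokens or len(tokens) != len(tags):
        return False
    prev = "O"
    for tag in tags:
        if tag == "O":
            prev = tag
            continue
        prefix, _, etype = tag.partition("-")
        if prefix not in ("B", "I") or not etype:
            return False
        # An I- tag must continue an entity of the same type.
        if prefix == "I" and prev not in (f"B-{etype}", f"I-{etype}"):
            return False
        prev = tag
    return True


def filter_corpus(samples: list[dict]) -> list[dict]:
    """Keep only samples that pass the checks above."""
    return [s for s in samples if is_valid_bio(s["tokens"], s["tags"])]


# Toy samples invented for this sketch (romanized Sanskrit).
samples = [
    {"tokens": ["rāmaḥ", "ayodhyāyām", "vasati"], "tags": ["B-PER", "B-LOC", "O"]},
    {"tokens": ["rāmaḥ", "vasati"], "tags": ["O", "I-LOC"]},      # dangling I- tag
    {"tokens": ["rāmaḥ", "vasati"], "tags": ["B-PER"]},            # length mismatch
]
print(len(filter_corpus(samples)))
```

Dropping a malformed sample is the simplest policy; repairing it (for example, promoting a dangling I- tag to B-) is an alternative that keeps more data at the cost of possible label noise.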

Corpus Composition

The final dataset consists of 102,942 sentences in JSONL format with full BIO-tag annotation, divided into training and validation splits. It contains 123,923 unique tokens and 127,397 entities, giving the morphological and semantic variety needed for NER generalization.
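Corpus statistics of this kind can be recomputed from a JSONL file with a few lines of code. The sketch below assumes one JSON object per line with hypothetical field names "tokens" and "ner_tags"; the released files may use a different schema. Entities are counted as the number of B- tags, which is one reasonable convention and not necessarily the one used in the paper.

```python
import json


def corpus_stats(lines) -> dict:
    """Count sentences, unique tokens, and entities in JSONL-formatted lines."""
    vocab = set()
    n_sentences = 0
    n_entities = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        n_sentences += 1
        vocab.update(record["tokens"])
        # Each entity span starts with exactly one B- tag.
        n_entities += sum(tag.startswith("B-") for tag in record["ner_tags"])
    return {
        "sentences": n_sentences,
        "unique_tokens": len(vocab),
        "entities": n_entities,
    }


# Toy records invented for this sketch; a real run would iterate over an open file.
toy_lines = [
    '{"tokens": ["rāmaḥ", "ayodhyāyām", "vasati"], "ner_tags": ["B-PER", "B-LOC", "O"]}',
    '{"tokens": ["rāmaḥ", "gacchati"], "ner_tags": ["B-PER", "O"]}',
]
print(corpus_stats(toy_lines))
```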

Experimental Evaluation

Model Selection and Training

The study benchmarks two transformer-based models:

XLM-RoBERTa Base: A multilingual encoder with a 250k token shared vocabulary and minimal Sanskrit-specific pretraining.
IndicBERTv2 MLM Only: A compact, Indic script-optimized transformer (~130MB), leveraging domain vocabulary alignment for improved linguistic fit.

Both models are fine-tuned on Naamah using a standard token-classification setup for NER, without any sandhi-splitting preprocessing step. Each word is split by the model's subword tokenizer, its BIO label is assigned to the first sub-token, and the remaining sub-tokens are masked out of the loss. The models must therefore learn entity boundaries directly over unsegmented forms.
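The first-sub-token labeling scheme can be sketched in plain Python. The `word_ids` list mirrors what Hugging Face fast tokenizers return from `BatchEncoding.word_ids()` (one word index per sub-token, `None` for special tokens), but the particular split, the label ids, and the function name are made up for illustration. The mask value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids: list, word_labels: list[int]) -> list[int]:
    """Give each word's label to its first sub-token and mask the rest."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # Special tokens such as the sentence start/end markers.
            aligned.append(IGNORE_INDEX)
        elif wid != previous:
            # First sub-token of a new word carries the word's label.
            aligned.append(word_labels[wid])
        else:
            # Later sub-tokens of the same word are masked out.
            aligned.append(IGNORE_INDEX)
        previous = wid
    return aligned


# Three words with hypothetical label ids: O=0, B-PER=1, B-LOC=3.
word_labels = [1, 3, 0]
# Hypothetical tokenization: the second word is split into three sub-tokens.
word_ids = [None, 0, 1, 1, 1, 2, None]
print(align_labels(word_ids, word_labels))
# → [-100, 1, 3, -100, -100, 0, -100]
```

With this scheme only one prediction per word contributes to training and evaluation, so a word fragmented into many pieces is still scored once.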

Results

IndicBERTv2 achieves an F1 of 0.9615 on the 10,295-sentence validation set, ahead of XLM-RoBERTa’s 0.9506. It does so despite an order-of-magnitude smaller parameter count, which suggests that tokenizer and vocabulary alignment can matter more than raw model capacity for languages with rich inflection and compounding.
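For readers who want to reproduce a score of this kind, the sketch below computes a micro-averaged, entity-level F1 from BIO sequences, where a predicted entity counts as correct only if its span and type both match the gold annotation. This summary does not state whether the reported F1 is entity-level or token-level, so that choice is an assumption, as are the function names and the toy data.

```python
def extract_spans(tags: list[str]) -> set:
    """Return the set of (start, end, type) entity spans in a BIO sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold_sents: list, pred_sents: list) -> float:
    """Micro-averaged F1 over exact span-and-type matches."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the PER span matches, the LOC span is cut short.
gold = [["B-PER", "O", "B-LOC", "I-LOC"]]
pred = [["B-PER", "O", "B-LOC", "O"]]
print(entity_f1(gold, pred))  # → 0.5
```

Here one of two gold entities is recovered and one of two predictions is correct, so precision and recall are both 0.5.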

Accuracy and validation loss point the same way, with IndicBERTv2 showing the stronger learning signal, which supports its use on resource-constrained edge devices.

Qualitative Analysis

Token-level error analysis shows that XLM-RoBERTa’s generic tokenizer frequently fragments Sanskrit entities (e.g., "Kuruksetre") into subword roots and suffixes. This destabilizes BIO tagging and causes frequent misclassification, especially when a suffix is assigned a different entity type from its stem (e.g., organization instead of location). IndicBERTv2’s domain-aligned tokenizer preserves entity spans and semantic coherence, which is consistent with the benefit of specialized pretraining and vocabulary coverage for this language.
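One simple way to surface this failure mode is to group sub-token predictions by word and flag words whose pieces disagree on entity type. The sketch below is a diagnostic written for this summary, not the authors' analysis procedure; the function name, the three-way split of the example word, and the predicted tags are all hypothetical.

```python
from collections import defaultdict


def inconsistent_words(word_ids: list, subword_tags: list[str]) -> list[int]:
    """Return indices of words whose sub-tokens received different entity types."""
    types_by_word = defaultdict(set)
    for wid, tag in zip(word_ids, subword_tags):
        if wid is None:
            continue  # skip special tokens
        # Compare entity types only, ignoring the B-/I- prefix.
        etype = "O" if tag == "O" else tag.split("-", 1)[1]
        types_by_word[wid].add(etype)
    return sorted(w for w, types in types_by_word.items() if len(types) > 1)


# Hypothetical fragmentation of one location name into three pieces,
# followed by a single-piece verb.
pieces = ["Kuru", "kset", "re", "gacchati"]
word_ids = [0, 0, 0, 1]
predicted = ["B-LOC", "I-LOC", "I-ORG", "O"]  # suffix drifts to ORG
print(inconsistent_words(word_ids, predicted))  # → [0]
```

Counting flagged words per model gives a rough measure of how often fragmentation leads to mixed-type predictions, though it says nothing about which piece is wrong.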

Implications and Future Directions

The construction of Naamah offers a scalable, domain-adaptive alternative to manual annotation, enabling effective NER for low-resource, morphologically complex languages. The results challenge the assumption that model scale or multilingual breadth can substitute for tokenizer and pretraining-domain fit, especially for tasks that demand high morphological sensitivity.

Avenues for future work include:

Gold Standard Expansion: Integrating manually annotated Sanskrit texts for error analysis and hybrid training, anchoring the silver standard with expert validation.
Complex Sandhi Modeling: Enhancing synthetic pipelines to faithfully capture and evaluate multi-word sandhi phenomena prevalent in authentic literature.
Schema Extensions: Adapting NER label sets and annotation methodologies to historical and philological tasks, including diachronic entity types not present in contemporary corpora.

Conclusion

The Naamah dataset and its generation pipeline advance Sanskrit NER and illustrate that combining knowledge-base seeding with LLM-based synthetic data generation is feasible in resource-deficient domains. The advantage of IndicBERTv2 over XLM-RoBERTa underscores the importance of script and domain alignment in both tokenization and model pretraining. In sum, this research enables more nuanced applications in Sanskrit computational linguistics and provides a template for addressing similar challenges in other classical or under-resourced languages (2604.26456).
