NagaNLP: Toolkit for Nagamese NLP
- NagaNLP is an open-source toolkit for Nagamese that combats digital data scarcity using a synthetic-hybrid corpus generation approach.
- It provides curated datasets, fine-tuned discriminative and generative models, and annotation tools that establish state-of-the-art benchmarks for Nagamese POS tagging and NER.
- Its scalable, expert-guided pipeline offers a transferable blueprint for creating NLP resources in severely under-resourced languages.
NagaNLP is the first comprehensive open-source toolkit designed to advance NLP for Nagamese, an Assamese-lexified creole with minimal digital representation. Developed to address the acute data scarcity that impedes the creation of NLP models for low-resource languages, NagaNLP employs a synthetic-hybrid corpus generation method that combines LLM-driven data synthesis with rigorous human validation. The release includes datasets, fine-tuned discriminative and generative models, and a full suite of code and annotation tools, constituting both a fundamental resource for Nagamese and a transferable paradigm for other severely under-resourced languages (Maiti et al., 14 Dec 2025).
1. Motivation and Scope
Nagamese, a creole spoken throughout Nagaland, exemplifies languages that face the “digital cliff”—an absence of accessible corpora, annotated datasets, or preexisting models, creating a self-reinforcing barrier for technological development. Prior to NagaNLP, digital resources for Nagamese were limited to a single, manually constructed Part-of-Speech (POS) dataset. The NagaNLP toolkit directly targets this “chicken-and-egg” problem by enabling data and model creation via a scalable expert-guided, LLM-to-human synthetic data pipeline. This hybrid approach seeks to bootstrap foundational NLP resources for languages starting from a practical zero-resource baseline. All artifacts are made available under an open-source license, specifically to provide both infrastructure for Nagamese and a generalizable workflow for similar contexts (Maiti et al., 14 Dec 2025).
2. Synthetic-Hybrid Data Generation Pipeline
At the core of NagaNLP lies a structured, multi-stage pipeline for generating and annotating data:
- Stage 1 (Persona and Task Definition): Gemini 2.5 Pro is configured as an “AI linguist” with explicit knowledge acquisition objectives, eliciting language structure by interacting with a human expert rather than performing generic text synthesis.
- Stage 2 (Interactive Grammatical Elicitation): The expert presents authentic Nagamese texts; Gemini actively queries, posits grammatical generalizations, and is iteratively corrected to enhance output authenticity.
- Stage 3 (Knowledge Consolidation): Extracted rules and structures are distilled by Gemini into a formalized grammar representation to guide subsequent synthetic generation.
- Stage 4 (Scaled Data Generation): Using few-shot prompting and iterative grammar rule reinforcement, Gemini yields two primary resources: a collection of declarative sentences (for annotation) and 10,018 conversational instruction-response pairs (for generative fine-tuning), each formatted as JSONL.
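The sketch below illustrates how Stage 4 generation could be driven programmatically. It assumes the google-generativeai client library; the prompt wording, file names, and seed examples are placeholders rather than the released pipeline scripts.

```python
# Sketch of Stage 4 scaled generation, assuming the google-generativeai client.
# Prompt wording, file names, and seed examples are illustrative placeholders.
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")          # hypothetical key handling
model = genai.GenerativeModel("gemini-2.5-pro")  # model name as reported in the paper

grammar_rules = open("nagamese_grammar.md", encoding="utf-8").read()               # Stage 3 output (placeholder file)
seed_pairs = [json.loads(l) for l in open("seed_pairs.jsonl", encoding="utf-8")]   # expert-validated examples

prompt = (
    "You are an AI linguist generating Nagamese instruction-response pairs.\n"
    f"Follow these grammatical rules strictly:\n{grammar_rules}\n\n"
    "Examples:\n"
    + "\n".join(json.dumps(p, ensure_ascii=False) for p in seed_pairs[:5])
    + "\n\nGenerate 20 new pairs, one JSON object per line."
)

response = model.generate_content(prompt)
with open("generated_pairs.jsonl", "a", encoding="utf-8") as out:
    for line in response.text.splitlines():
        line = line.strip()
        if line.startswith("{"):                 # keep only JSON lines; humans review everything downstream
            out.write(line + "\n")
```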
Human-in-the-loop validation is integral at all stages. Four annotators (three native, one fluent) review generated outputs. Declarative sentences pass through human correction, LLM-assisted preliminary POS and Named Entity Recognition (NER) annotation, and final human review, yielding sequences labeled with Universal Dependencies v2 POS tags (17 classes) and IOB2-style NER tags (PER, LOC, ORG, MISC). Inter-annotator agreement is high: Cohen’s κ = 0.92 for POS and κ = 0.88 for NER, both indicating near-perfect agreement (Maiti et al., 14 Dec 2025).
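Such agreement figures can be computed with scikit-learn, as in the brief sketch below, assuming token-aligned label sequences from two annotators (the tag values are illustrative).

```python
# Token-level inter-annotator agreement, assuming two aligned tag sequences
# (one label per token) from a pair of annotators; tag values are illustrative.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["NOUN", "VERB", "ADP", "NOUN", "PUNCT"]
annotator_b = ["NOUN", "VERB", "ADP", "PROPN", "PUNCT"]

print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```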
3. Released Datasets
NagaNLP introduces two principal corpora:
| Dataset | Size | Format | Primary Use |
|---|---|---|---|
| Conversational Corpus | 10,018 pairs | JSONL (inst-resp) | Generative modeling |
| Annotated POS/NER Corpus | 214 sentences (4,839 tokens) | CoNLL-style | Discriminative tasks (POS, NER) |
The conversational dataset encompasses 311,684 tokens and a unique vocabulary of 22,998, split 80/10/10 for train/dev/test. Each entry provides a Nagamese “instruction” and its corresponding “response,” post-edited for grammaticality and idiomaticity. The annotated corpus, split 171/21/22 for train/dev/test, provides gold-standard labels for POS and NER tasks, employing universal tag sets and guidelines for code-switching phenomena. Tag/entity distributions demonstrate linguistic coverage, e.g., 19% nouns, 16% verbs; for NER, 36% MISC, 32% LOC (Maiti et al., 14 Dec 2025).
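A minimal reader for the CoNLL-style annotated corpus is sketched below. It assumes one token per line with whitespace-separated token, POS, and NER columns and blank lines between sentences; the released files may use a different column order or file name.

```python
# Minimal reader for a CoNLL-style file: token, POS tag, and IOB2 NER tag per line
# (whitespace-separated), blank lines between sentences. Column order is an assumption.
def read_conll(path):
    sentences, tokens, pos_tags, ner_tags = [], [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                       # sentence boundary
                if tokens:
                    sentences.append({"tokens": tokens, "pos": pos_tags, "ner": ner_tags})
                    tokens, pos_tags, ner_tags = [], [], []
                continue
            tok, pos, ner = line.split()[:3]
            tokens.append(tok)
            pos_tags.append(pos)
            ner_tags.append(ner)
    if tokens:                                 # flush the final sentence
        sentences.append({"tokens": tokens, "pos": pos_tags, "ner": ner_tags})
    return sentences

train = read_conll("naganlp_train.conll")      # hypothetical file name
print(len(train), "sentences")
```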
4. Model Architectures, Training, and Benchmarks
Discriminative Models
NagaNLP leverages both classical and transformer-based models for its discriminative tasks:
- Architectures: bert-base-multilingual-cased and xlm-roberta-base are fine-tuned with AdamW (LR=2e-5, weight decay=0.01, batch size 16) for up to 20 epochs with macro-F1-based checkpointing; a minimal training sketch follows this list.
- Baselines: Performance is contextualized against zero-shot xlm-roberta-large and CRF-based POS/NER systems, including prior work and replication on new data.
- Results:
- POS Tagging: Fine-tuned xlm-roberta-base achieves 93.81% accuracy and 0.90 macro F1, establishing a new state-of-the-art for Nagamese POS tagging. The previous CRF system yielded 85.70% accuracy / 0.86 macro F1; retrained on the new data, the CRF attains a comparable 93.84% / 0.91.
- NER: xlm-roberta-base reaches 95.13% strict accuracy and 0.75 macro F1 (first such benchmark for Nagamese), outperforming both multilingual transformer and CRF baselines.
- Zero-shot transformer performance remains minimal (≈0% for NER), demonstrating the necessity of in-language training data.
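The fine-tuning recipe above can be expressed with HuggingFace Transformers roughly as follows. The label list is an illustrative subset of the UD v2 tag set, the dataset objects are assumed to be prepared (tokenised, with word-aligned label ids) elsewhere, and the argument names follow recent Transformers releases.

```python
# Token-classification fine-tuning sketch with HuggingFace Transformers.
# train_ds / dev_ds are assumed to be tokenised datasets with word-aligned
# label ids (-100 for padding and sub-word positions).
import numpy as np
from sklearn.metrics import f1_score
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments

model_name = "xlm-roberta-base"
labels = ["NOUN", "VERB", "ADP", "PRON", "PROPN", "PUNCT"]   # illustrative subset of UD v2 tags
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

def compute_metrics(eval_pred):
    logits, gold = eval_pred
    pred = np.argmax(logits, axis=-1)
    mask = gold != -100                        # ignore padding / sub-word positions
    return {"f1": f1_score(gold[mask], pred[mask], average="macro")}

args = TrainingArguments(
    output_dir="naganlp-pos-xlmr",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    num_train_epochs=20,
    eval_strategy="epoch",                     # `evaluation_strategy` in older releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",                # macro-F1-based checkpointing
)

train_ds, dev_ds = ..., ...                    # prepared Dataset objects (see the reader above)
trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=dev_ds, compute_metrics=compute_metrics)
trainer.train()
```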
Generative Model: NagaLLaMA
NagaLLaMA, a Llama-3.2-3B-Instruct derivative, is adapted via LoRA (r=16, α=32, dropout=0.05) applied to key projection layers, trained for 3 epochs (LR=2e-4, effective batch size 16). Notable outcomes:
- Perplexity: NagaLLaMA attains a perplexity of 3.85, compared to 96.76 for the few-shot Llama baseline, signifying dramatically improved fluency.
- ROUGE-L: Increases from 11.28 (few-shot) to 20.77 (NagaLLaMA).
- Machine Translation Benchmarks: For Eng→Nag, BLEU = 14.25 (vs. 1.64 for NLLB-200), chrF++=41.83, COMET = 0.6668; for Nag→Eng, BLEU = 34.97, chrF++ = 53.17, COMET = 0.7338.
- Data Scaling Effects: Perplexity decreases monotonically as more synthetic-hybrid data is added (5.33 at 25% data, 3.85 at full size), indicating continual benefit from increased data volume.
Qualitative assessment confirms that NagaLLaMA is capable of context-aware generation in Nagamese, including code-switching and idiomatic usage, surpassing few-shot LLM outputs (Maiti et al., 14 Dec 2025).
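A minimal LoRA adaptation sketch matching the configuration reported above (r=16, α=32, dropout 0.05, LR 2e-4, 3 epochs, effective batch size 16) is shown below, assuming the peft and transformers libraries; the target module names and data handling are assumptions rather than the authors' exact setup.

```python
# LoRA adaptation sketch for NagaLLaMA, assuming the peft and transformers libraries.
# Target module names and data handling are assumptions, not the authors' exact setup.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

base = "meta-llama/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projection layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

args = TrainingArguments(
    output_dir="nagallama-lora",
    learning_rate=2e-4,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,             # effective batch size 16
    bf16=True,
)

train_ds = ...  # tokenised instruction-response pairs from the conversational JSONL corpus
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()
model.save_pretrained("nagallama-lora")        # saves only the LoRA adapter weights
```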
5. Toolkit Components and Usage
The public release of NagaNLP comprises:
- Data:
- NagaNLP Annotated Corpus (POS + NER, 214 sentences, CoNLL format)
- Conversational Corpus (10,018 JSONL pairs)
- Models:
- Fine-tuned checkpoints for bert-base-multilingual-cased, xlm-roberta-base, and CRF POS/NER models
- NagaLLaMA (Llama-3.2-3B with LoRA adapters)
- Code and Scripts:
- Complete synthetic data generation workflow (Stages 1–4)
- Annotation tools/guidelines for POS and NER
- Training scripts for discriminative and generative models (HuggingFace Transformers, LoRA fine-tuning)
- Standardized evaluation scripts (seqeval for sequence labeling, ROUGE, BLEU, COMET); a usage sketch follows this list
- Installation and Replication:
- Environment files (Conda/YAML, Python 3.10)
- Standard install: `pip install naganlp` or git-based deployment
- Configuration templates support rapid adaptation of the full pipeline to any zero-resource language
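A usage sketch of the underlying evaluation libraries follows; the released scripts' exact entry points may differ, and the example strings are placeholders.

```python
# Evaluation sketch using the underlying libraries; the released scripts' exact
# interfaces may differ, and the example strings are placeholders.
from seqeval.metrics import classification_report, f1_score
import sacrebleu

# Sequence labeling (NER): gold vs. predicted IOB2 tags, one list per sentence.
gold = [["B-LOC", "O", "B-PER", "I-PER"]]
pred = [["B-LOC", "O", "B-PER", "O"]]
print(f1_score(gold, pred))
print(classification_report(gold, pred))

# Machine translation: corpus BLEU and chrF++ via sacrebleu.
hypotheses = ["placeholder system output"]
references = [["placeholder reference translation"]]   # one reference stream
print(sacrebleu.corpus_bleu(hypotheses, references).score)
print(sacrebleu.corpus_chrf(hypotheses, references, word_order=2).score)  # word_order=2 gives chrF++
```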
All assets are designed to be modular, facilitating extensibility and reproducibility for both Nagamese and other extremely low-resource linguistic contexts (Maiti et al., 14 Dec 2025).
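As a usage illustration, the released LoRA adapters can be applied to the base model for inference roughly as follows; the adapter identifier and the prompt are placeholders, not confirmed artifacts.

```python
# Inference sketch: apply the released LoRA adapters to the base model.
# The adapter path and the prompt are placeholders.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, "nagallama-lora")   # hypothetical adapter path
model.eval()

prompt = "Nagaland laga rajdhani ki ase?"      # illustrative Nagamese instruction
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```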
6. Implications and Generalizability
NagaNLP demonstrates that rapid NLP bootstrapping for creole and other typologically complex, digitally marginalized languages is viable through expert-guided, LLM-augmented synthetic data pipelines coupled with rigorous human validation. By establishing robust new benchmarks (POS accuracy 93.81%, NER macro F1 0.75, NagaLLaMA perplexity 3.85), it provides both empirical evidence and an infrastructural foundation for further research in underrepresented language technologies. The systematic, adaptable workflow, supported by open-source release, constitutes a transferable blueprint for enabling NLP resources in languages with near-zero pre-existing digital presence (Maiti et al., 14 Dec 2025).
7. Significance for the NLP Community
By bridging methodological advances in LLM prompting, synthetic data curation, and annotation best practices, NagaNLP directly addresses a major bottleneck in NLP: data scarcity in the global majority of languages. Its empirical achievements in Nagamese establish new state-of-the-art baselines and showcase the effectiveness of synthetic-hybrid strategies where conventional manual annotation is infeasible. The toolkit and its documentation enable both domain experts and applied practitioners to replicate and adapt the approach, setting a precedent for sustainable, community-informed NLP resource development for under-resourced and endangered languages (Maiti et al., 14 Dec 2025).