KyrgyzBERT: Transformer Model for Kyrgyz NLP
- KyrgyzBERT is a monolingual BERT-style model for Kyrgyz, designed to handle agglutinative morphology with a custom WordPiece tokenizer.
- It uses a compact 6-layer Transformer encoder with 35.9M parameters, achieving near parity with fine-tuned mBERT while remaining resource-efficient.
- The model is fine-tuned on a custom-created kyrgyz-sst2 benchmark for sentiment analysis, providing foundational tools for Kyrgyz NLP research.
KyrgyzBERT is a monolingual BERT-style transformer encoder developed specifically for Kyrgyz, an agglutinative and low-resource language. As the first publicly available foundational model for Kyrgyz, it provides a compact architecture, a tokenizer custom-tailored to the language’s morphological complexity, and evaluation resources for sentiment analysis. Its development addresses the substantial lack of neural NLP tools for Kyrgyz, facilitating further research and downstream application development in the language (Metinov et al., 25 Nov 2025).
1. Model Architecture and Parameterization
KyrgyzBERT utilizes a bidirectional Transformer encoder consistent with the BERT pretraining paradigm. The core architectural characteristics are as follows:
- Layers (Transformer blocks): 6
- Hidden dimension: 512
- Attention heads per layer: 8
- Feed-forward inner dimension: 4 × hidden size ≃ 2048 (not explicitly stated, inferred from standard practice)
- Vocabulary size: 30,522 subword tokens
- Total parameters: 35.9 million
The compact design is achieved by halving the depth (6 vs. 12 layers in BERT-Base), reducing the width (512 vs. 768 hidden units), and using fewer attention heads (8 vs. 12). The result is a parameter count roughly one-fifth that of mBERT (177 million), at the cost of only a small downstream performance drop (≈1.2 percentage points in weighted F1-score on the kyrgyz-sst2 benchmark). This size-performance tradeoff is specifically advantageous for resource-constrained deployments and facilitates rapid experimentation (Metinov et al., 25 Nov 2025).
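The following is a minimal sketch of an equivalent configuration using the Hugging Face transformers API; the intermediate (feed-forward) size uses the inferred 4 × hidden value, and all unlisted settings follow library defaults rather than confirmed details of the released model.

```python
from transformers import BertConfig, BertForMaskedLM

# Reported KyrgyzBERT hyperparameters; intermediate_size is inferred from the
# standard 4 x hidden-size convention and is not stated in the paper.
config = BertConfig(
    vocab_size=30_522,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
)

model = BertForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # ~35.9M expected
```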
2. Tokenizer Design and Morphological Considerations
KyrgyzBERT employs “bert-kyrgyz-tokenizer,” a WordPiece-based tokenizer engineered to exploit the agglutinative properties of Kyrgyz. The tokenizer is trained from scratch on a 1.5-million-sentence Kyrgyz corpus.
Key algorithmic steps (as pseudocode):
```
1. Initialize token set T = all characters in corpus C
2. Repeat until |T| = V (the target vocabulary size):
   a. Count frequencies of all token-pair concatenations (t_i, t_j) in C
   b. Select the most frequent pair p = (t_i, t_j)
   c. Merge p into a new token t_new = t_i + t_j
   d. Replace all occurrences of (t_i, t_j) in C with t_new
   e. Add t_new to T
3. Output vocabulary T
```
The tokenizer is motivated by the morphological richness of Kyrgyz (e.g., affixation, case markers, possessive endings). Frequent morphemes are reliably encoded as subwords, improving the model’s granularity and efficiency. At input time, tokenization uses greedy longest-match segmentation (Metinov et al., 25 Nov 2025).
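As a concrete counterpart to the pseudocode, the sketch below trains a BERT-style WordPiece tokenizer from scratch with the Hugging Face tokenizers library; the corpus path, casing, and special-token choices are assumptions, and the published “bert-kyrgyz-tokenizer” may differ in its exact settings.

```python
from tokenizers import BertWordPieceTokenizer

# "kyrgyz_corpus.txt" (one sentence per line) is a placeholder path.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["kyrgyz_corpus.txt"],
    vocab_size=30_522,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes vocab.txt

# Greedy longest-match segmentation splits an inflected word such as
# "китептеримден" ("from my books") into frequent morpheme-like subwords.
print(tokenizer.encode("китептеримден").tokens)
```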
3. Pretraining Data and Learning Objectives
The pretraining corpus consists of approximately 1.5 million Kyrgyz sentences aggregated from public and private sources. Details of domain makeup are not fully specified. Sentences underwent normalization, segmentation, and noise filtering prior to model ingestion.
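Since the exact preprocessing rules are not published, the following is only an illustrative sketch of the kind of normalization and noise filtering described; all file names, thresholds, and heuristics are assumptions, and sentence segmentation is assumed to have been applied upstream.

```python
import re

def normalize(line: str) -> str:
    """Collapse whitespace; a stand-in for the unspecified normalization step."""
    return re.sub(r"\s+", " ", line).strip()

def looks_like_noise(sent: str) -> bool:
    """Heuristic filter (illustrative thresholds): drop very short lines and
    lines that are mostly non-Cyrillic characters."""
    if len(sent.split()) < 3:
        return True
    cyrillic = sum("а" <= ch <= "я" or ch in "өүңё" for ch in sent.lower())
    return cyrillic / len(sent) < 0.5

# Placeholder file names; one sentence per line assumed.
with open("raw_kyrgyz.txt", encoding="utf-8") as src, \
     open("kyrgyz_corpus.txt", "w", encoding="utf-8") as dst:
    for raw in src:
        sent = normalize(raw)
        if sent and not looks_like_noise(sent):
            dst.write(sent + "\n")
```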
Pretraining is limited to the Masked Language Modeling (MLM) objective, with the loss over masked tokens defined as

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in \mathcal{M}} \log P\left(x_i \mid \mathbf{x}_{\setminus \mathcal{M}}\right),$$

where $\mathcal{M}$ denotes the set of masked position indices and $\mathbf{x}_{\setminus \mathcal{M}}$ represents the input sequence with those positions masked.
Training was executed on a single NVIDIA RTX 3090 GPU. Batch size, learning rate schedule, and number of steps are not detailed in the publication. The optimizer is not explicitly indicated for pretraining, but standard practice with BERT is assumed (e.g., AdamW, linear warmup/decay) (Metinov et al., 25 Nov 2025).
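A condensed sketch of a single MLM pretraining step with transformers and PyTorch is shown below; the 15% masking rate follows standard BERT practice, while the optimizer, learning rate, and batch contents are placeholders since the paper does not report them.

```python
import torch
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling)

# Reported architecture; intermediate_size is inferred, not stated in the paper.
config = BertConfig(vocab_size=30_522, hidden_size=512, num_hidden_layers=6,
                    num_attention_heads=8, intermediate_size=2048)
model = BertForMaskedLM(config)

# "vocab.txt" is a placeholder for the WordPiece vocabulary trained earlier.
tokenizer = BertTokenizerFast(vocab_file="vocab.txt", do_lower_case=False)

# Standard BERT masking: 15% of tokens contribute to the MLM loss.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

# Optimizer and learning rate are assumptions; the paper does not report them.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

sentences = ["Бул бир мисал сүйлөм."]  # placeholder corpus slice
enc = tokenizer(sentences, truncation=True, max_length=128)
batch = collator([{"input_ids": ids} for ids in enc["input_ids"]])

outputs = model(**batch)        # cross-entropy over masked positions only
outputs.loss.backward()
optimizer.step()
```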
4. Benchmarking and Fine-tuning Protocol
To assess downstream performance, the kyrgyz-sst2 benchmark was constructed by machine-translating the SST-2 (Stanford Sentiment Treebank) train and validation splits and manually re-annotating the test split (1,821 sentences) to correct for translation artifacts.
- Training set: ~67,000 sentences
- Validation set: ~872 sentences
- Test set: 1,821 sentences (manually labeled by a native Kyrgyz speaker)
Fine-tuning of KyrgyzBERT on kyrgyz-sst2 involved (a minimal sketch follows below):
- 3 epochs
- AdamW optimizer
- Learning rate:
- Evaluation via weighted F1-score and accuracy
- Train/validation splits matched those of the translated SST-2
The labeling process for the gold-standard test set ensures fidelity in ground truth sentiment, mitigating machine translation bias (Metinov et al., 25 Nov 2025).
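A minimal fine-tuning sketch with the transformers Trainer API (which uses AdamW by default) is given below; the repository and dataset identifiers, learning rate, batch size, and column names are assumptions rather than confirmed details of the released artifacts.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Repository and dataset ids are assumptions; verify them on the Hub.
tokenizer = AutoTokenizer.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")
model = AutoModelForSequenceClassification.from_pretrained(
    "metinovadilet/KyrgyzBERT", num_labels=2)
dataset = load_dataset("metinovadilet/kyrgyz-sst2")  # assumed SST-2-style columns

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds),
            "weighted_f1": f1_score(labels, preds, average="weighted")}

args = TrainingArguments(
    output_dir="kyrgyzbert-sst2",
    num_train_epochs=3,              # reported
    learning_rate=2e-5,              # placeholder; see the paper for the actual value
    per_device_train_batch_size=32,  # assumed
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["validation"],
                  compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate(dataset["test"]))  # weighted F1 and accuracy on the gold test set
```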
5. Empirical Evaluation and Comparative Analysis
Model performance is summarized in the table below:
| Model | F1-score (weighted) | Size (M parameters) |
|---|---|---|
| KyrgyzBERT (ft) | 0.8280 | 35.9 |
| mBERT (ft) | 0.8401 | 177.0 |
| XLM-R (zero-shot) | 0.3221 | 270.0 |
| mBERT (zero-shot) | 0.3509 | 177.0 |
Fine-tuned mBERT achieves only a slight edge over KyrgyzBERT (≈1.2 weighted-F1 points) despite having roughly five times as many parameters. Zero-shot inference with large multilingual models yields near-random performance, underscoring the necessity of monolingual pretraining and Kyrgyz-specific resources. No formal statistical significance testing is reported, but the observed differences are consistent across independent runs. Error analysis highlights persistent difficulty with sentences that combine negation and idiomatic constructions, a common failure mode in low-resource settings (Metinov et al., 25 Nov 2025).
6. Model Availability and Directions for Further Research
KyrgyzBERT and its associated resources are released via the Hugging Face Hub under the “metinovadilet” namespace. Publicly released artifacts include the following (a brief loading sketch follows the list):
- Pretrained base model
- “bert-kyrgyz-tokenizer”
- Fine-tuned sentiment models (kyrgyzbert_sst2, mbert_kyrgyz_sst2_finetuned)
- kyrgyz-sst2 dataset with gold-standard test set
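The sketch below loads the released tokenizer and fine-tuned sentiment model from the Hub; the exact repository identifiers under the “metinovadilet” namespace are assumptions inferred from the artifact names and should be verified.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Repository ids are assumed from the artifact names above; verify them on the Hub.
tokenizer = AutoTokenizer.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")
model = AutoModelForSequenceClassification.from_pretrained("metinovadilet/kyrgyzbert_sst2")

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Бул фильм абдан жакшы экен!"))  # example positive-sentiment input
```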
Acknowledged limitations concern corpus scale (limited size, insufficient domain diversity), evaluation scope (only binary sentiment), and tuning thoroughness (feed-forward size and pretraining hyperparameters not exhaustively explored). Proposed areas for future work encompass:
- Curation of larger and domain-diverse Kyrgyz corpora (news, social media, technical texts)
- Pretraining of deeper/wider models as more data becomes available (e.g., 12-layer, 768-hidden)
- Extension to additional NLP tasks such as NER, dependency parsing, QA, and machine translation
- Augmentation objectives targeting morpheme prediction to further exploit agglutination in Kyrgyz
These directions are aimed at broadening both model capability and downstream task coverage for Kyrgyz NLP (Metinov et al., 25 Nov 2025).