KyrgyzBERT: Transformer Model for Kyrgyz NLP
- KyrgyzBERT is a monolingual BERT-style model for Kyrgyz, designed to handle agglutinative morphology with a custom WordPiece tokenizer.
- It uses a compact 6-layer Transformer encoder with 35.9M parameters, achieving near parity with fine-tuned mBERT while remaining resource-efficient.
- The model is fine-tuned on a custom-created kyrgyz-sst2 benchmark for sentiment analysis, providing foundational tools for Kyrgyz NLP research.
KyrgyzBERT is a monolingual BERT-style transformer encoder developed specifically for Kyrgyz, an agglutinative and low-resource language. As the first publicly available foundational model for Kyrgyz, it provides a compact architecture, a tokenizer custom-tailored to the language’s morphological complexity, and evaluation resources for sentiment analysis. Its development addresses the substantial lack of neural NLP tools for Kyrgyz, facilitating further research and downstream application development in the language (Metinov et al., 25 Nov 2025).
1. Model Architecture and Parameterization
KyrgyzBERT utilizes a bidirectional Transformer encoder consistent with the BERT pretraining paradigm. The core architectural characteristics are as follows:
- Layers (Transformer blocks): 6
- Hidden dimension: 512
- Attention heads per layer: 8
- Feed-forward inner dimension: 4 × hidden size ≃ 2048 (not explicitly stated, inferred from standard practice)
- Vocabulary size: 30,522 subword tokens
- Total parameters: 35.9 million
The compact design is achieved by halving the depth (6 vs. 12 layers in BERT-Base), reducing the width (512 vs. 768 hidden units), and using fewer attention heads (8 vs. 12). The result is a parameter count roughly one-fifth that of mBERT (177 million), at the cost of only a small downstream performance drop (≈1.2 percentage points in weighted F1-score on the kyrgyz-sst2 benchmark). This size-performance tradeoff is specifically advantageous for resource-constrained deployments and facilitates rapid experimentation (Metinov et al., 25 Nov 2025).
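The following is a minimal sketch of an equivalent configuration using the Hugging Face transformers API; the intermediate (feed-forward) size uses the inferred 4 × hidden value, and all unlisted settings follow library defaults rather than confirmed details of the released model.

```python
from transformers import BertConfig, BertForMaskedLM

# Reported KyrgyzBERT hyperparameters; intermediate_size is inferred from the
# standard 4 x hidden-size convention and is not stated in the paper.
config = BertConfig(
    vocab_size=30_522,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
)

model = BertForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # ~35.9M expected
```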
2. Tokenizer Design and Morphological Considerations
KyrgyzBERT employs “bert-kyrgyz-tokenizer,” a WordPiece-based tokenizer engineered to exploit the agglutinative properties of Kyrgyz. The tokenizer is trained from scratch on a 1.5-million-sentence Kyrgyz corpus.
Key algorithmic steps (as pseudocode):
```
1. Initialize token set T = all characters in corpus C
2. Repeat until |T| = V (the target vocabulary size):
   a. Count frequencies of all token-pair concatenations (t_i, t_j) in C
   b. Select the most frequent pair p = (t_i, t_j)
   c. Merge p into a new token t_new = t_i + t_j
   d. Replace all occurrences of (t_i, t_j) in C with t_new
   e. Add t_new to T
3. Output vocabulary T
```
The tokenizer is motivated by the morphological richness of Kyrgyz (e.g., affixation, case markers, possessive endings). Frequent morphemes are reliably encoded as subwords, improving the model’s granularity and efficiency. At input time, tokenization uses greedy longest-match segmentation (Metinov et al., 25 Nov 2025).
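As a concrete counterpart to the pseudocode, the sketch below trains a BERT-style WordPiece tokenizer from scratch with the Hugging Face tokenizers library; the corpus path, casing, and special-token choices are assumptions, and the published “bert-kyrgyz-tokenizer” may differ in its exact settings.

```python
from tokenizers import BertWordPieceTokenizer

# "kyrgyz_corpus.txt" (one sentence per line) is a placeholder path.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["kyrgyz_corpus.txt"],
    vocab_size=30_522,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes vocab.txt

# Greedy longest-match segmentation splits an inflected word such as
# "китептеримден" ("from my books") into frequent morpheme-like subwords.
print(tokenizer.encode("китептеримден").tokens)
```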
3. Pretraining Data and Learning Objectives
The pretraining corpus consists of approximately 1.5 million Kyrgyz sentences aggregated from public and private sources. Details of domain makeup are not fully specified. Sentences underwent normalization, segmentation, and noise filtering prior to model ingestion.
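Since the exact preprocessing rules are not published, the following is only an illustrative sketch of the kind of normalization and noise filtering described; all file names, thresholds, and heuristics are assumptions, and sentence segmentation is assumed to have been applied upstream.

```python
import re

def normalize(line: str) -> str:
    """Collapse whitespace; a stand-in for the unspecified normalization step."""
    return re.sub(r"\s+", " ", line).strip()

def looks_like_noise(sent: str) -> bool:
    """Heuristic filter (illustrative thresholds): drop very short lines and
    lines that are mostly non-Cyrillic characters."""
    if len(sent.split()) < 3:
        return True
    cyrillic = sum("а" <= ch <= "я" or ch in "өүңё" for ch in sent.lower())
    return cyrillic / len(sent) < 0.5

# Placeholder file names; one sentence per line assumed.
with open("raw_kyrgyz.txt", encoding="utf-8") as src, \
     open("kyrgyz_corpus.txt", "w", encoding="utf-8") as dst:
    for raw in src:
        sent = normalize(raw)
        if sent and not looks_like_noise(sent):
            dst.write(sent + "\n")
```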
Pretraining is limited to the Masked Language Modeling (MLM) objective, with the loss over masked tokens defined as

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in \mathcal{M}} \log P\left(x_i \mid \mathbf{x}_{\setminus \mathcal{M}}\right),$$

where $\mathcal{M}$ denotes the set of masked position indices and $\mathbf{x}_{\setminus \mathcal{M}}$ represents the input sequence with those positions masked.
Training was executed on a single NVIDIA RTX 3090 GPU. Batch size, learning rate schedule, and number of steps are not detailed in the publication. The optimizer is not explicitly indicated for pretraining, but standard practice with BERT is assumed (e.g., AdamW, linear warmup/decay) (Metinov et al., 25 Nov 2025).
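A condensed sketch of a single MLM pretraining step with transformers and PyTorch is shown below; the 15% masking rate follows standard BERT practice, while the optimizer, learning rate, and batch contents are placeholders since the paper does not report them.

```python
import torch
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling)

# Reported architecture; intermediate_size is inferred, not stated in the paper.
config = BertConfig(vocab_size=30_522, hidden_size=512, num_hidden_layers=6,
                    num_attention_heads=8, intermediate_size=2048)
model = BertForMaskedLM(config)

# "vocab.txt" is a placeholder for the WordPiece vocabulary trained earlier.
tokenizer = BertTokenizerFast(vocab_file="vocab.txt", do_lower_case=False)

# Standard BERT masking: 15% of tokens contribute to the MLM loss.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

# Optimizer and learning rate are assumptions; the paper does not report them.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

sentences = ["Бул бир мисал сүйлөм."]  # placeholder corpus slice
enc = tokenizer(sentences, truncation=True, max_length=128)
batch = collator([{"input_ids": ids} for ids in enc["input_ids"]])

outputs = model(**batch)        # cross-entropy over masked positions only
outputs.loss.backward()
optimizer.step()
```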
4. Benchmarking and Fine-tuning Protocol
To assess downstream performance, the kyrgyz-sst2 benchmark was constructed by machine-translating the SST-2 (Stanford Sentiment Treebank) train and validation splits and manually re-annotating the test split (1,821 sentences) to correct for translation artifacts.
- Training set: ~67,000 sentences
- Validation set: ~872 sentences
- Test set: 1,821 sentences (manually labeled by a native Kyrgyz speaker)
Fine-tuning of KyrgyzBERT on kyrgyz-sst2 involved (a minimal sketch follows below):
- 3 epochs
- AdamW optimizer
- Learning rate:
- Evaluation via weighted F1-score and accuracy
- Train/validation splits matched those of the translated SST-2
The labeling process for the gold-standard test set ensures fidelity in ground truth sentiment, mitigating machine translation bias (Metinov et al., 25 Nov 2025).
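A minimal fine-tuning sketch with the transformers Trainer API (which uses AdamW by default) is given below; the repository and dataset identifiers, learning rate, batch size, and column names are assumptions rather than confirmed details of the released artifacts.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Repository and dataset ids are assumptions; verify them on the Hub.
tokenizer = AutoTokenizer.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")
model = AutoModelForSequenceClassification.from_pretrained(
    "metinovadilet/KyrgyzBERT", num_labels=2)
dataset = load_dataset("metinovadilet/kyrgyz-sst2")  # assumed SST-2-style columns

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds),
            "weighted_f1": f1_score(labels, preds, average="weighted")}

args = TrainingArguments(
    output_dir="kyrgyzbert-sst2",
    num_train_epochs=3,              # reported
    learning_rate=2e-5,              # placeholder; see the paper for the actual value
    per_device_train_batch_size=32,  # assumed
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["validation"],
                  compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate(dataset["test"]))  # weighted F1 and accuracy on the gold test set
```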
5. Empirical Evaluation and Comparative Analysis
Model performance is summarized in the table below:
| Model | F1-score (weighted) | Size (M parameters) |
|---|---|---|
| KyrgyzBERT (ft) | 0.8280 | 35.9 |
| mBERT (ft) | 0.8401 | 177.0 |
| XLM-R (zero-shot) | 0.3221 | 270.0 |
| mBERT (zero-shot) | 0.3509 | 177.0 |
Fine-tuned mBERT achieves only a slight edge over KyrgyzBERT (≈1.2 weighted-F1 points) despite having roughly five times as many parameters. Zero-shot inference with large multilingual models yields near-random performance, underscoring the necessity of monolingual pretraining and Kyrgyz-specific resources. No formal statistical significance testing is reported, but the observed differences are consistent across independent runs. Error analysis highlights persistent difficulty with sentences that combine negation and idiomatic constructions, a common failure mode in low-resource settings (Metinov et al., 25 Nov 2025).
6. Model Availability and Directions for Further Research
KyrgyzBERT and its associated resources are released via the Hugging Face Hub under the “metinovadilet” namespace. Publicly released artifacts include the following (a brief loading sketch follows the list):
- Pretrained base model
- “bert-kyrgyz-tokenizer”
- Fine-tuned sentiment models (kyrgyzbert_sst2, mbert_kyrgyz_sst2_finetuned)
- kyrgyz-sst2 dataset with gold-standard test set
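The sketch below loads the released tokenizer and fine-tuned sentiment model from the Hub; the exact repository identifiers under the “metinovadilet” namespace are assumptions inferred from the artifact names and should be verified.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Repository ids are assumed from the artifact names above; verify them on the Hub.
tokenizer = AutoTokenizer.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")
model = AutoModelForSequenceClassification.from_pretrained("metinovadilet/kyrgyzbert_sst2")

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Бул фильм абдан жакшы экен!"))  # example positive-sentiment input
```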
Acknowledged limitations concern corpus scale (limited size, insufficient domain diversity), evaluation scope (only binary sentiment), and tuning thoroughness (feed-forward size and pretraining hyperparameters not exhaustively explored). Proposed areas for future work encompass:
- Curation of larger and domain-diverse Kyrgyz corpora (news, social media, technical texts)
- Pretraining of deeper/wider models as more data becomes available (e.g., 12-layer, 768-hidden)
- Extension to additional NLP tasks such as NER, dependency parsing, QA, and machine translation
- Augmentation objectives targeting morpheme prediction to further exploit agglutination in Kyrgyz
These directions are aimed at broadening both model capability and downstream task coverage for Kyrgyz NLP (Metinov et al., 25 Nov 2025).