TabiBERT: Modern Turkish NLP Transformer
- TabiBERT is a modern monolingual Turkish Transformer model that enhances NLP with long-context support and a unified GLUE-style benchmark, TabiBench.
- It implements 22 encoder layers using Rotary Positional Embeddings and FlashAttention, and is trained from scratch on a rigorously curated 84.88B-token Turkish corpus.
- TabiBERT achieves state-of-the-art performance on Turkish NLU and retrieval tasks, outperforming previous models with notable improvements in QA and code retrieval metrics.
TabiBERT is a monolingual Turkish encoder-only Transformer model based on the ModernBERT architectural family, designed to advance Turkish NLP by providing large-scale, scratch-trained representations and standardized evaluation. Leveraging advances such as Rotary Positional Embeddings (RoPE), FlashAttention, and pre-norm layer normalization, TabiBERT is trained on one trillion tokens sampled from a rigorously curated, multi-domain 84.88B-token Turkish corpus. Its release introduces both a state-of-the-art backbone and a unified GLUE-style benchmark (TabiBench) spanning 28 datasets and eight task categories, setting new standards for Turkish NLU and retrieval models (Türker et al., 28 Dec 2025).
1. Architectural Innovations and Model Configuration
TabiBERT implements the "Modern-base" configuration, embodying architectural upgrades that support both scalability and computational efficiency:
- Transformer Depth and Capacity: 22 encoder blocks, hidden size 768, intermediate size 1,152 (GLU-expanded to 2,304), and 12 attention heads. The vocabulary consists of 50,176 BPE tokens, yielding 149 million non-embedding parameters.
- Rotary Positional Embeddings (RoPE): Unlike absolute position embeddings, RoPE introduces relative position through complex-valued rotations. Each even–odd pair of hidden dimensions $(x_{2i}, x_{2i+1})$ at position $m$ is transformed as
  $$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}, \qquad \theta_i = 10000^{-2i/d},$$
  enabling stable extrapolation to much longer contexts.
- FlashAttention: Replaces standard self-attention with a blockwise fused kernel that combines the QK product, softmax, and value aggregation, reducing forward-pass attention memory from $O(L^2)$ to $O(L)$ in sequence length and substantially boosting inference throughput for very long sequences. This permits batch sizes and context lengths beyond what legacy architectures can sustain.
- Pre-Norm Layer Normalization: Layer normalization precedes each sublayer (self-attention/feed-forward), supporting deeper networks and more stable optimization:
  $$\mathbf{h}' = \mathbf{h} + \mathrm{Sublayer}\bigl(\mathrm{LayerNorm}(\mathbf{h})\bigr),$$
  facilitating convergence in wider/deeper stacks. A minimal sketch combining these components appears after this list.
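To make these ingredients concrete, the following is a minimal PyTorch sketch (not the released TabiBERT implementation) of a pre-norm self-attention block that applies RoPE to queries and keys and calls `torch.nn.functional.scaled_dot_product_attention`, which can dispatch to a FlashAttention-style fused kernel on supported hardware. The hidden size and head count follow the Modern-base configuration quoted above; all module and variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN, HEADS = 768, 12                      # Modern-base sizes from the paper

def rope_rotate(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to a (batch, heads, seq, head_dim) tensor."""
    b, h, seq, d = x.shape
    pos = torch.arange(seq, device=x.device, dtype=x.dtype)                            # position m
    inv_freq = base ** (-torch.arange(0, d, 2, device=x.device, dtype=x.dtype) / d)    # theta_i
    ang = torch.einsum("s,f->sf", pos, inv_freq)                                       # (seq, d/2)
    cos, sin = ang.cos(), ang.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin     # x'_{2i}   = x_{2i} cos - x_{2i+1} sin
    out[..., 1::2] = x_even * sin + x_odd * cos     # x'_{2i+1} = x_{2i} sin + x_{2i+1} cos
    return out

class PreNormSelfAttention(nn.Module):
    """Pre-norm residual block: h + Attn(LayerNorm(h)), with RoPE applied to q and k."""
    def __init__(self, hidden: int = HIDDEN, heads: int = HEADS):
        super().__init__()
        self.norm = nn.LayerNorm(hidden)
        self.qkv = nn.Linear(hidden, 3 * hidden, bias=False)
        self.out = nn.Linear(hidden, hidden, bias=False)
        self.heads = heads

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        b, s, d = h.shape
        x = self.norm(h)                                        # pre-norm: LN before the sublayer
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, s, self.heads, d // self.heads)
        q, k, v = (t.view(*shape).transpose(1, 2) for t in (q, k, v))
        q, k = rope_rotate(q), rope_rotate(k)                   # relative positions via rotation
        # Fused attention; PyTorch may dispatch this to a FlashAttention kernel.
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(b, s, d)
        return h + self.out(attn)                               # residual connection
```

For instance, `PreNormSelfAttention()(torch.randn(2, 128, 768))` returns a tensor of the same shape; stacking 22 such blocks (plus GLU feed-forward sublayers, omitted here) would mirror the depth described above.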
2. Pre-Training Corpus and Regimen
TabiBERT is trained from scratch on a multi-domain, extensively deduplicated Turkish corpus with careful domain balancing and oversampling strategies:
| Corpus | Tokens (B) | Sampling Rate |
|---|---|---|
| FineWeb2-Turkish | 56.04 | 1× |
| Scientific | 16.62 | 1× |
| Source Code | 5.42 | 1× |
| Mathematical Content | 0.26 | 1× |
| Books | 0.62 | 3× |
| ParlaMint-TR | 0.068 | 5× |
- Total Training Exposure: 1 trillion tokens across three pre-training phases (850B, 125B, and 25B tokens), with the curated corpus repeated across phases following the standard ModernBERT recipe for coverage.
- Preprocessing: BPE vocabulary construction, pre-tokenization, and striding of documents into fixed-length token windows.
- Optimization: Masked language modeling (MLM) objective with a 30% mask rate, the StableAdamW optimizer, and a trapezoidal learning-rate schedule (warm-up, stable plateau, decay); a minimal sketch of the objective and schedule follows this list. Full pre-training completed in 117 GPU-hours on 8×H100.
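As an illustration of this regimen, the sketch below builds a 30% masking MLM data collator with HuggingFace `transformers` and a trapezoidal (warm-up / stable / decay) learning-rate lambda. The tokenizer identifier, step counts, and peak learning rate are placeholders rather than the paper's released configuration, and plain AdamW stands in for StableAdamW for brevity.

```python
import torch
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Placeholder tokenizer id; the released TabiBERT tokenizer would be used in practice.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# MLM objective with the paper's 30% mask rate.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.30
)

def trapezoidal_lr(step: int, warmup: int = 3_000,
                   stable: int = 90_000, decay: int = 7_000) -> float:
    """Trapezoidal schedule: linear warm-up, flat plateau, linear decay to zero.
    Step counts here are illustrative, not the paper's values."""
    if step < warmup:                        # ramp up
        return step / max(1, warmup)
    if step < warmup + stable:               # hold at peak
        return 1.0
    remaining = warmup + stable + decay - step
    return max(0.0, remaining / max(1, decay))   # ramp down

# Usage with any optimizer (the paper uses StableAdamW; AdamW shown here).
model = torch.nn.Linear(8, 8)                # stand-in for the encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=trapezoidal_lr)
```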
3. Long-Context Support and Computational Efficiency
By integrating RoPE and FlashAttention, TabiBERT can natively process up to 8,192 tokens per sequence, 16 times the 512-token maximum context of BERT-base/large and comparable Turkish models.
- Inference Speedup: At the 8,192-token context limit, TabiBERT attains higher throughput and lower peak GPU memory than BERTurk, with parity at 512 tokens.
- Downstream Relevance: Extended context lengths enable TabiBERT to address Turkish QA and reading-comprehension tasks (e.g., TQuAD) where 95th-percentile input lengths exceed legacy models’ windows, eliminating truncation and the resulting context loss.
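For illustration, the hedged sketch below shows how a long Turkish document could be encoded in a single pass once a released checkpoint is available on the HuggingFace Hub. The model identifier is a placeholder (not a confirmed Hub name), and the 8,192-token limit follows from the paper's stated 16× extension over the 512-token BERT window.

```python
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "example-org/tabibert-base"   # placeholder id, not the confirmed Hub name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

with open("long_turkish_document.txt", encoding="utf-8") as f:
    text = f.read()

# Encode up to 8,192 tokens in one pass instead of chunking or truncating to 512.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
outputs = model(**inputs)
doc_repr = outputs.last_hidden_state.mean(dim=1)   # simple mean-pooled document embedding
print(doc_repr.shape)                               # (1, 768)
```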
4. TabiBench: Standardized Evaluation Framework
TabiBench is introduced to unify Turkish language understanding and retrieval evaluation, spanning 28 datasets in 8 categories:
- Categories: Text classification, token classification, semantic textual similarity, natural language inference, question answering, information retrieval, code retrieval, and academic domain tasks.
- Protocols: Stratified 70/15/15 splits where no official split is provided, macro-averaged GLUE-style aggregation, uniform hyperparameter sweeps across all models, and per-category performance breakdowns (the split and aggregation logic is sketched after the table below).
- Performance Summary:
- Overall Total Avg: 77.58 (TabiBERT) vs. 75.96 (BERTurk).
- Notable TabiBERT improvements: QA (+9.55 F1), code retrieval (+2.41 NDCG), document retrieval (+0.60 NDCG).
- TabiBERT surpasses per-task best Turkish models, including TurkishBERTweet, by +1.47 points on average.
| Model | Text Clf F1 | Token Clf F1 | QA F1 | CodeRet NDCG | Total Avg |
|---|---|---|---|---|---|
| TurkishBERTweet | 79.71 | 92.02 | 38.13 | 43.49 | 67.48 |
| YTU-Cosmos-BERT | 84.25 | 93.60 | 31.50 | 53.80 | 72.26 |
| BERTurk | 83.42 | 93.67 | 60.16 | 54.54 | 75.96 |
| TabiBERT | 83.44 | 93.42 | 69.71 | 56.95 | 77.58 |
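The evaluation protocol above (stratified 70/15/15 splits when no official split exists, followed by GLUE-style macro-averaging over categories) can be reproduced with a short sketch such as the one below. The dataset variables and category scores are illustrative placeholders, not TabiBench data or official split logic.

```python
from sklearn.model_selection import train_test_split

def stratified_70_15_15(texts, labels, seed=42):
    """Split into 70% train / 15% dev / 15% test, preserving label ratios."""
    x_train, x_rest, y_train, y_rest = train_test_split(
        texts, labels, test_size=0.30, stratify=labels, random_state=seed
    )
    x_dev, x_test, y_dev, y_test = train_test_split(
        x_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=seed
    )
    return (x_train, y_train), (x_dev, y_dev), (x_test, y_test)

def macro_average(category_scores: dict) -> float:
    """GLUE-style aggregation: unweighted mean of per-category average scores."""
    per_category = [sum(v) / len(v) for v in category_scores.values()]
    return sum(per_category) / len(per_category)

# Illustrative numbers only (not TabiBench results).
print(macro_average({"qa": [69.7], "code_retrieval": [56.9], "text_clf": [83.4]}))
```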
5. Cross-Domain Generalization and Task-Specific Outcomes
TabiBERT exhibits robust cross-domain adaptation, attaining state-of-the-art across five of eight TabiBench categories, including substantial gains in Turkish QA, code, and document retrieval tasks. Improvements are attributable to:
- Model Capacity and RoPE Extrapolation: The architecture’s representational power and context range.
- Corpus Diversity: Inclusion and oversampling of specialized domains (science, code, math, legal, creative writing).
- Long-Context Window: Untruncated context essential for complex document-level tasks, particularly reading comprehension and retrieval benchmarks.
A plausible implication is that the architecture, tokenization, and corpus composition jointly optimize for both head and tail domain coverage in Turkish-language data, aligning TabiBERT’s generalization profile with ModernBERT’s English benchmark results.
6. Reproducibility, Availability, and Ecosystem Impact
All pre-trained model weights, codebases, and benchmarking pipelines are released:
- Weights and Configurations: HuggingFace Model Hub
- Code: Pre-training and fine-tuning repositories, data scripts on GitHub
- Benchmark: TabiBench on HuggingFace Datasets
- Standardized Protocols: Each TabiBench dataset includes split scripts, clear annotation guidelines, and validated evaluation code to ensure replicability.
7. Significance and Research Context
TabiBERT represents the first monolingual Turkish encoder to realize ModernBERT’s contemporary design patterns in a foundation-model format with open, reproducible benchmarking. These advances enable:
- Stable and efficient deep encoder training and inference
- Direct long-context deployment in Turkish NLU and retrieval domains
- Transparent, replicable, and task-agnostic empirical evaluation
This foundation raises standardized Turkish NLP evaluation to parity with English ModernBERT benchmarks and lowers the barrier for further architectural and application-driven research in the Turkish language community (Türker et al., 28 Dec 2025).