TabiBench: Turkish NLP Benchmark
- TabiBench is a unified Turkish NLP benchmark that standardizes evaluation across 28 datasets and eight task categories using macro-averaged scoring.
- TabiBERT, the accompanying monolingual Turkish encoder, integrates modern architectural advances including rotary positional embeddings, FlashAttention, and GLU feed-forward expansion to enable efficient, scalable long-context modeling.
- On TabiBench, TabiBERT achieves state-of-the-art results, with notable improvements in question answering, code retrieval, and document retrieval over prior Turkish models.
TabiBERT is a monolingual Turkish encoder-only Transformer built on ModernBERT architectural advances, pre-trained at unprecedented scale for Turkish and evaluated on a unified Turkish NLP benchmark. Developed to address the absence of Turkish foundation models incorporating rotary positional embeddings, efficient long-context attention, and stabilized deep optimization, TabiBERT provides a new computational backbone for Turkish natural language understanding and retrieval, together with reproducible state-of-the-art results across diverse application domains (Türker et al., 28 Dec 2025).
1. ModernBERT Architecture and Engineering
TabiBERT leverages the "Modern-base" configuration, systematically integrating architectural elements previously established in contemporary large-scale LLMs:
- 22 Transformer encoder blocks with a hidden size of 768 and 12 attention heads, yielding 149 million non-embedding parameters.
- Feed-forward expansion via Gated Linear Units (GLU): the nominal intermediate size of 1,152 is doubled to a 2,304-dimensional gated projection, affording greater model expressiveness per layer.
- Rotary Positional Embeddings (RoPE): Relative token positions are encoded using element-wise rotations in the complex plane. For hidden dimension $d$, positional index $m$, and $\theta_i = 10000^{-2i/d}$, each feature pair $(x_{2i}, x_{2i+1})$ is rotated by the angle $m\theta_i$:
$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$
This enables robust modeling of token order and extrapolation to large context windows; a minimal numerical sketch of the rotation follows this list.
- FlashAttention: The attention mechanism is re-engineered with block-wise matrix operations and on-chip buffer reuse, so that attention memory grows linearly rather than quadratically with sequence length. This yields substantial speedups at 8,192-token context windows and a marked reduction in peak GPU memory relative to classical Transformer attention.
- Pre-Norm Layer Normalization: LayerNorm is applied before each sublayer (attention/MLP), avoiding gradient instability in deep stacks:
$$x_{l+1} = x_l + \mathrm{Sublayer}\big(\mathrm{LayerNorm}(x_l)\big)$$
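As a concrete illustration of the RoPE rotation above, the following NumPy sketch applies it to a matrix of per-token feature vectors. The function name `rope_rotate` and the use of NumPy are illustrative choices, not taken from the TabiBERT codebase.

```python
import numpy as np

def rope_rotate(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each feature pair (x_{2i}, x_{2i+1}) of x (shape: seq_len x d, d even)
    by the angle m * theta_i, where theta_i = base^(-2i/d) and m is the token position."""
    _, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)        # theta_i for i = 0 .. d/2 - 1
    angles = positions[:, None] * inv_freq[None, :]     # (seq_len, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin           # 2-D rotation, "real" component
    out[:, 1::2] = x_even * sin + x_odd * cos           # 2-D rotation, "imaginary" component
    return out

# Example: rotate query vectors for an 8-token sequence with hidden size 768
q = np.random.randn(8, 768)
q_rot = rope_rotate(q, positions=np.arange(8))
```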
These cumulative design choices enable stable training and scalable inference, with particular advantages in long-context scenarios.
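To make these choices concrete, here is a compact PyTorch sketch of a single pre-norm encoder block with a GLU feed-forward at the stated dimensions (hidden size 768, 12 heads, 1,152-dimensional gated intermediate). It is a simplified illustration, not the TabiBERT implementation: RoPE, FlashAttention, and the alternating local–global attention pattern are deliberately omitted, and all class and parameter names are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreNormGLUBlock(nn.Module):
    """One pre-norm encoder block with a GLU feed-forward (illustrative only)."""

    def __init__(self, hidden=768, heads=12, intermediate=1152):
        super().__init__()
        self.attn_norm = nn.LayerNorm(hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(hidden)
        # GLU expansion: project 768 -> 2,304, then split into value and gate halves
        self.ffn_in = nn.Linear(hidden, 2 * intermediate)
        self.ffn_out = nn.Linear(intermediate, hidden)

    def forward(self, x, key_padding_mask=None):
        # Pre-norm attention sublayer: normalize *before* attention, add residual after
        h = self.attn_norm(x)
        h, _ = self.attn(h, h, h, key_padding_mask=key_padding_mask, need_weights=False)
        x = x + h
        # Pre-norm GLU feed-forward sublayer
        h = self.ffn_in(self.ffn_norm(x))
        value, gate = h.chunk(2, dim=-1)
        return x + self.ffn_out(value * F.gelu(gate))

# Example: a batch of 2 sequences, 128 tokens each, hidden size 768
block = PreNormGLUBlock()
out = block(torch.randn(2, 128, 768))
```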
2. Pre-training Corpus and Tokenization
TabiBERT's pre-training utilizes a large-scale, carefully constructed multi-domain corpus for Turkish, comprising 86.58 billion tokens (after oversampling); its major components include:
- Web text (FineWeb-2 Turkish): 56.04B tokens (73%)
- Scientific publications (Dergipark, Yöktez): 16.62B tokens (20%)
- Source Code (GitHub, CodeBagel, Alpaca): 5.42B tokens (6%)
- Mathematical content (MathQA, MMLU, Competition Math): 0.26B tokens (0.3%)
Specialized corpora such as books, creative texts, and parliamentary transcripts (e.g., Books, ParlaMint-TR) are oversampled to enhance domain diversity. The tokenizer is a Byte-Pair Encoding (BPE) scheme over a vocabulary of 50,176 tokens. Pre-training follows the masked language modeling (MLM) objective with a 30% mask rate, as per the ModernBERT recipe, exposing the model to one trillion tokens over three curriculum phases of decreasing size:
- Phase 1: 850B tokens; Phase 2: 125B tokens; Phase 3: 25B tokens.
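The masking objective can be illustrated with a generic NumPy sketch at the 30% rate. The 80/10/10 replacement scheme and special-token handling shown here follow the classical BERT recipe and are assumptions; the exact ModernBERT/TabiBERT masking details may differ.

```python
import numpy as np

def mask_for_mlm(token_ids, mask_token_id, vocab_size, special_ids, mask_rate=0.30, seed=None):
    """Select ~30% of non-special tokens as MLM targets; replace 80% of them with
    [MASK], 10% with random tokens, and keep 10% unchanged (classical BERT scheme)."""
    rng = np.random.default_rng(seed)
    ids = np.asarray(token_ids).copy()
    labels = np.full_like(ids, -100)              # -100 = position ignored by the loss
    maskable = ~np.isin(ids, special_ids)         # never corrupt special tokens
    selected = maskable & (rng.random(ids.shape) < mask_rate)
    labels[selected] = ids[selected]              # loss is computed only on selected slots
    roll = rng.random(ids.shape)
    ids[selected & (roll < 0.8)] = mask_token_id
    random_slot = selected & (roll >= 0.8) & (roll < 0.9)
    ids[random_slot] = rng.integers(0, vocab_size, size=int(random_slot.sum()))
    return ids, labels

# Example with a toy sequence (vocabulary of 50,176; [MASK] id chosen arbitrarily)
corrupted, labels = mask_for_mlm([2, 815, 94, 3021, 77, 3], mask_token_id=4,
                                 vocab_size=50_176, special_ids=[2, 3], seed=0)
```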
TabiBERT training maximizes batch and context length efficiency by striding data into 8,192-token chunks, reaching an 11.8-epoch effective exposure (Türker et al., 28 Dec 2025).
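A minimal sketch of the chunking step, assuming simple non-overlapping 8,192-token windows over a concatenated token stream; the actual packing logic in the training pipeline may differ.

```python
def stride_into_chunks(token_stream, chunk_len=8192):
    """Slice one long token stream into fixed-length training chunks;
    a trailing fragment shorter than the window is dropped here for simplicity."""
    return [token_stream[i:i + chunk_len]
            for i in range(0, len(token_stream) - chunk_len + 1, chunk_len)]

# Example: a 20,000-token stream yields two full 8,192-token chunks
chunks = stride_into_chunks(list(range(20_000)))
assert [len(c) for c in chunks] == [8192, 8192]
```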
3. Long-Context Modeling and Computational Efficiency
The combined use of RoPE, FlashAttention, local–global alternating attention, and IO-aware unpadding enables TabiBERT to handle sequence windows of up to 8,192 tokens, 16 times the capacity of BERT. This long-context support is critical in applications where input-length distributions exhibit heavy right tails, such as complex document retrieval and comprehensive question answering. Benchmarking indicates that TabiBERT's throughput at maximum context length exceeds that of BERTurk while requiring approximately half the peak GPU memory.
These improvements directly impact downstream applications by enabling end-to-end fine-tuning and inference on lengthy Turkish documents without truncation—a key bottleneck in legacy BERT-based pipelines (Türker et al., 28 Dec 2025).
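The unpadding idea can be sketched as follows: padded batches are flattened into a single packed token stream plus cumulative sequence lengths, the layout consumed by variable-length attention kernels. This is a conceptual illustration, not TabiBERT's actual data path, and the helper name is hypothetical.

```python
import numpy as np

def unpad_batch(input_ids: np.ndarray, attention_mask: np.ndarray):
    """Drop padding from a (batch, seq_len) matrix, returning the packed 1-D token
    stream and cumulative sequence lengths marking each example's boundaries."""
    seq_lens = attention_mask.sum(axis=1)                  # real tokens per example
    packed = input_ids[attention_mask.astype(bool)]        # row-major, padding removed
    cu_seqlens = np.concatenate([[0], np.cumsum(seq_lens)])
    return packed, cu_seqlens

batch = np.array([[5, 8, 2, 0, 0],
                  [7, 3, 9, 4, 0]])
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 0]])
packed, cu_seqlens = unpad_batch(batch, mask)
# packed -> [5 8 2 7 3 9 4], cu_seqlens -> [0 3 7]
```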
4. TabiBench: Unified and Standardized Turkish NLP Benchmark
To address the fragmented evaluation landscape in Turkish NLP, TabiBERT is accompanied by TabiBench, a GLUE-style benchmark with 28 datasets covering eight task categories:
- Text classification: NewsCat, BilTweetNews, Gender Hate Speech, Product Reviews
- Token classification: WikiNER, WikiANN-TR, POS UD-IMST, POS UD-BOUN
- Semantic textual similarity: SICK-TR, STSb-TR
- Natural language inference: SNLI-TR, MultiNLI-TR
- Question answering: TQuAD, XQuAD
- Information retrieval: BiText, MsMarco-TR, Scifact-TR, Fiqa-TR, NFCorpus-TR, Quora-TR
- Code retrieval: Apps-TR, CosQA-TR, StackOverflowQA-TR, CodeSearchNet-TR
- Academic domain tasks: MedNLI-TR, PubMedRCT-20K-TR, SciCite-TR, ThesisAbstractClassification-11K
Splits are stratified 70/15/15 when official splits are not provided; all models are fine-tuned with a common hyperparameter sweep, and scores are aggregated with GLUE-style macro-averaging across tasks and categories.
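A hedged sketch of this evaluation plumbing: a 70/15/15 stratified split for datasets without official splits, and macro-averaging of per-task scores into a category score. The helper names, random seed, and use of scikit-learn are illustrative assumptions, not the benchmark's actual tooling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_70_15_15(texts, labels, seed=42):
    """Split into 70% train, 15% dev, 15% test while preserving label proportions."""
    x_train, x_rest, y_train, y_rest = train_test_split(
        texts, labels, test_size=0.30, stratify=labels, random_state=seed)
    x_dev, x_test, y_dev, y_test = train_test_split(
        x_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=seed)
    return (x_train, y_train), (x_dev, y_dev), (x_test, y_test)

def category_score(task_scores):
    """Macro-average: every task in a category contributes equally to the category score."""
    return float(np.mean(list(task_scores.values())))

# e.g. category_score({"TQuAD": 71.2, "XQuAD": 68.2}) -> 69.7
```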
In consolidated evaluation, TabiBERT achieves a "Total Avg" of 77.58, outperforming BERTurk by 1.62 points, and establishes a new state of the art in five of eight categories, with particularly large gains in question answering (69.71 vs. 60.16 F1), code retrieval (56.95 vs. 54.54 NDCG), and document retrieval. It also improves, on average, over the best prior Turkish models, including specialized architectures (Türker et al., 28 Dec 2025). The table below summarizes topline scores:
| Model | Text Clf F1 | Token Clf F1 | QA F1 | CodeRet NDCG | Total Avg |
|---|---|---|---|---|---|
| BERTurk | 83.42 | 93.67 | 60.16 | 54.54 | 75.96 |
| TabiBERT | 83.44 | 93.42 | 69.71 | 56.95 | 77.58 |
5. Reproducibility and Research Artifacts
All critical assets for TabiBERT are publicly released to foster transparency and reproducibility:
- Model weights and configuration: available via HuggingFace at https://huggingface.co/boun-tabilab/TabiBERT
- Training/fine-tuning code and data scripts: available on GitHub at https://github.com/boun-tabi-LMG/TabiBERT
- TabiBench dataset and evaluation recipes: accessible through https://huggingface.co/collections/boun-tabilab/tabibench
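A minimal loading sketch using the repository id from the HuggingFace URL above. The fill-mask usage, the availability of a masked-LM head on the hub, and the need for a recent `transformers` release (for ModernBERT-style architectures) are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

repo_id = "boun-tabilab/TabiBERT"                      # from the HuggingFace URL above
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)

# Predict the masked word in a short Turkish sentence
text = f"Türkiye'nin en büyük şehri {tokenizer.mask_token} olarak bilinir."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_positions].argmax(dim=-1)))
```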
Standardization across dataset splits, evaluation protocols, and open-source infrastructure positions TabiBERT as a reproducible reference for Turkish encoder research.
6. Contributions and Impact
TabiBERT delivers several substantive contributions to Turkish NLP:
- Architectural parity with contemporary English ModernBERT variants via RoPE, FlashAttention, Pre-Norm LayerNorm, and GLU expansion
- Long-context capacity (8,192 tokens) with high computational efficiency, addressing sequence truncation in downstream tasks
- Large-scale, multi-domain pre-training from scratch on a corpus representative of Turkish web, science, code, and mathematical text
- Unified, open-source benchmarking (TabiBench) that enables systematic macro-averaged comparison across application domains
- State-of-the-art performance in natural language understanding, retrieval, question answering, and code-related tasks for Turkish
These advancements position TabiBERT as the foundation model of record for Turkish monolingual and cross-domain NLP research, replacing prior ad hoc or cross-lingual transfer solutions (Türker et al., 28 Dec 2025).