TR-MMLU: Turkish Multitask Language Benchmark
- TR-MMLU is a comprehensive benchmark designed to evaluate Turkish language models with high-quality, natively authored academic content and culturally specific challenges.
- It features a rigorously curated set of 6,200 multiple-choice questions across 62 sections, ensuring linguistic precision and alignment with Turkey's educational standards.
- The benchmark employs zero-shot evaluation and specialized Turkish tokenization metrics to benchmark performance across 39 diverse language models.
The Turkish Massive Multitask Language Understanding (TR-MMLU) benchmark is a comprehensive, natively crafted evaluation suite designed to measure the linguistic and conceptual capabilities of LLMs for Turkish. TR-MMLU addresses the specific challenges of morphologically rich and culturally distinct languages, providing a standard for assessing LLM performance with respect to both factual knowledge and deeper linguistic phenomena. The benchmark incorporates high-quality, curriculum-aligned content spanning a wide array of academic and professional domains rooted in the Turkish education system, and it emphasizes the importance of tokenization strategies tailored for Turkish morphological characteristics (Bayram et al., 2024, Bayram et al., 10 Feb 2025, Isbarov et al., 16 Feb 2025, Yüksel et al., 2024, Bayram et al., 18 Aug 2025).
1. Benchmark Construction and Dataset Characteristics
TR-MMLU distinguishes itself by its scale, native authorship, and granularity of topic coverage. The principal dataset comprises 6,200 multiple-choice questions allocated across 62 sections, drawn from a candidate pool of 280,000 items originating from standardized Turkish exams such as the University Entrance Exam, AUZEF, KPSS, TUS, and the national Driver’s License exam. Disciplines span law, medicine, social sciences, history, natural sciences, and professional licensing, covering over 800 unique topics. Each section contains approximately 100 questions, and questions are evenly split among three difficulty levels—Easy, Medium, Hard—empirically determined from historical student performance (Bayram et al., 2024, Bayram et al., 18 Aug 2025).
Key dataset construction features include:
- All questions are originally authored by Turkish subject-matter experts, not translated, ensuring direct alignment with Turkish morphosyntax, agglutination, and cultural content.
- Curriculum specialists and linguists perform rigorous filtering to eliminate duplicates, outdated material, and items with potential overlap with LLM pretraining data, thereby preventing data leakage.
- Each question undergoes validation for linguistic correctness (orthography, diacritics, grammar) and contextual appropriateness (legal and historical terms, idiomatic expressions specific to Turkish society) (Bayram et al., 2024, Isbarov et al., 16 Feb 2025).
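The duplicate-elimination step above can be illustrated with a minimal sketch. This is not the authors' actual pipeline; it assumes a simple normalize-and-hash strategy, with Turkish-aware lowercasing (dotted/dotless "i") handled before case-folding:

```python
import hashlib
import unicodedata

def normalize(text: str) -> str:
    """Collapse whitespace and case so near-identical items hash alike.
    Turkish 'I' lowercases to dotless 'ı' and 'İ' to 'i', which Python's
    default lower() does not do, so we map those first."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("I", "ı").replace("İ", "i").lower()
    return " ".join(text.split())

def deduplicate(questions: list[str]) -> list[str]:
    """Keep only the first occurrence of each normalized question."""
    seen, unique = set(), []
    for q in questions:
        key = hashlib.sha256(normalize(q).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique
```

A real curation pass would additionally use fuzzy or semantic matching to catch paraphrased duplicates, which exact hashing misses by design.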
2. Evaluation Protocol and Metrics
TR-MMLU employs a suite of standardized evaluation protocols:
- Zero-shot evaluation is predominant: models are presented with questions and options without in-context demonstrations or fine-tuning on the evaluation set.
- Prompting is adapted through five Turkish-centric prompt templates. The dominant template requests only the correct answer option:
"Sana soru ve seçenekleri veriyorum. Sadece hangi seçeneğin sorunun doğru cevabı olduğunu yaz." ("I am giving you a question and its options. Write only which option is the correct answer to the question.")
- Scoring: The primary metric is accuracy, defined per model or per section as the number of correctly answered questions divided by the total number of questions.
Both overall and macro-averaged accuracies across sections are reported. Model responses are normalized via semantic-similarity checks with the "paraphrase-multilingual-mpnet-base-v2" model to handle variation in answer formatting (Bayram et al., 2024, Bayram et al., 18 Aug 2025).
- Section-level analysis enables fine-grained diagnostics, for example on TUS (medical specialization), KPSS (public personnel exams), and domain-specific categories.
Controlled randomness (seed=42) and consistent hardware are maintained for all model evaluations, with detailed logs of processing time and accuracy (Bayram et al., 18 Aug 2025).
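The scoring side of this protocol can be sketched in a few lines. The record fields below (`section`, `predicted`, `gold`) are illustrative, not TR-MMLU's actual schema, and exact letter matching stands in for the semantic-similarity normalization described above:

```python
from collections import defaultdict

def score(results):
    """Compute overall accuracy (all questions pooled) and macro-averaged
    accuracy (mean of per-section accuracies, so small sections weigh
    equally). Each result is a dict with 'section', 'predicted', 'gold'."""
    per_section = defaultdict(lambda: [0, 0])  # section -> [correct, total]
    for r in results:
        stats = per_section[r["section"]]
        stats[0] += int(r["predicted"] == r["gold"])
        stats[1] += 1
    overall = (sum(c for c, _ in per_section.values())
               / sum(t for _, t in per_section.values()))
    macro = sum(c / t for c, t in per_section.values()) / len(per_section)
    return overall, macro
```

The two aggregates diverge whenever sections differ in size or difficulty, which is why the benchmark reports both.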
3. Model Benchmarks and Performance Analysis
A diverse array of 39 LLMs—both open-source and proprietary—have been systematically benchmarked. Key leaderboard results are summarized:
| Model | Family | Param Size | Correct | Accuracy (%) | Time (s) |
|---|---|---|---|---|---|
| GPT-4o | GPT | - | 5260 | 84.84 | 5021 |
| Claude-3.5 Sonnet | Claude | - | 5233 | 84.40 | 7379 |
| Llama3.3 | Llama | 70.6B | 4924 | 79.42 | 13355 |
| Gemini-1.5 | Gemini | - | 4758 | 76.74 | 4985 |
| Gemma2-27B | Gemma2 | 27.2B | 4470 | 72.10 | 5506 |
Domain-level analysis shows that factual and legal questions (e.g., Driver’s License) yield the highest accuracy (≈97%), while reasoning-intensive sections (e.g., KPSS) show lower scores (as low as 66%). Medical specialization domains approach or surpass 90% for models with relevant fine-tuning. Catastrophic forgetting is observed in models fine-tuned on partial Turkish domains, highlighting the importance of continual learning strategies (Bayram et al., 2024, Bayram et al., 18 Aug 2025).
4. Tokenization Methodologies and Linguistic Challenges
Tokenization is central to Turkish language modeling performance. TR-MMLU explicitly evaluates tokenizer quality using metrics that reflect both computational efficiency and linguistic fidelity:
- Vocabulary Size (|V|): Larger vocabularies capture rare Turkish forms but can inflate model capacity unnecessarily.
- Token Count (N): Efficient tokenization minimizes sequence length, reducing context window waste.
- Turkish Token Percentage (%TR): Fraction of unique tokens aligning to valid Turkish words, as validated by ITU Web Service and Kalbur.
- Pure Token Percentage (%Pure): Fraction of unique tokens corresponding to atomic Turkish morphemes.
- Processing Time (T): Speed of full corpus tokenization (Bayram et al., 10 Feb 2025).
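The vocabulary-level metrics above can be sketched as a simple set computation. Here the `valid_words` and `valid_morphemes` lexicon sets stand in for the ITU Web Service and Kalbur validators used in the paper:

```python
def tokenizer_metrics(vocab, valid_words, valid_morphemes):
    """Compute |V|, %TR, and %Pure for a tokenizer.
    vocab: set of unique tokens the tokenizer produced on the corpus
    (subword markers such as '##' or 'Ġ' assumed already stripped).
    valid_words: tokens that are well-formed Turkish words.
    valid_morphemes: tokens that are atomic Turkish morphemes."""
    n = len(vocab)
    pct_tr = 100.0 * sum(t in valid_words for t in vocab) / n
    pct_pure = 100.0 * sum(t in valid_morphemes for t in vocab) / n
    return {"|V|": n, "%TR": round(pct_tr, 2), "%Pure": round(pct_pure, 2)}
```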
Empirical results:
- %TR strongly correlates with downstream MMLU accuracy (r = +0.90, p < 0.05), outperforming %Pure (r = +0.68, p ≈ 0.10).
- Each point increase in %TR yields ~+1.26% accuracy, confirming that high linguistic alignment, not mere model scale, determines reasoning performance.
- Gemma-2 achieves a %TR of 48.63% and top open-source accuracy; morphological-informed tokenizers (as in Aya-Expanse, %TR=50.67%) are competitive (Bayram et al., 10 Feb 2025).
Key best practices:
- Aim for Turkish token percentages approaching 50%, requiring vocabularies of 200,000–300,000 subwords.
- Incorporate morphological analyzers (e.g., Kalbur) in vocabulary construction.
- Validate all output tokens against a native Turkish lexicon to maximize semantic fidelity (Bayram et al., 10 Feb 2025).
5. Comparative Evaluation and Related Benchmarks
TR-MMLU links closely with broader Turkic and Turkish evaluation initiatives:
- TUMLU extends the MMLU paradigm to Turkic languages using questions natively authored in each respective language, thus avoiding translation artifacts and maintaining morphosyntactic fidelity (Isbarov et al., 16 Feb 2025).
- TurkishMMLU provides a larger (10,032 questions) high-school–level benchmark, drawn from the EBA platform of the Turkish Ministry of Education, with explicit empirical difficulty calibration based on student correctness ratios (Yüksel et al., 2024).
Performance on these related benchmarks reaffirms that top proprietary LLMs (GPT-4o, Claude-3.5) attain ≈85% accuracy with few-shot or Chain-of-Thought prompting, while open-source LLMs lag (~70%). Chain-of-Thought yields only modest improvements except for select subjects (notably Turkish Literature), and mathematical reasoning remains a clear deficit.
6. Error Analysis, Limitations, and Recommendations
Several error modes hamper model performance:
- Mis-segmentation of Turkish agglutinated or morphologically complex tokens leads to 5–10% drops in accuracy.
- Culturally nuanced distractors and reasoning-heavy questions (multi-step inference, paraphrasing, synonym/antonym discrimination) yield high rates of model confusion.
- Catastrophic forgetting during Turkish-specific fine-tuning is prevalent unless continual learning techniques (such as Elastic Weight Consolidation) are employed.
- Difficulty is systematically higher in native Turkish items compared to machine-translated Turkic datasets, complicating cross-linguistic benchmarking (Bayram et al., 2024, Bayram et al., 10 Feb 2025, Isbarov et al., 16 Feb 2025, Bayram et al., 18 Aug 2025).
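Elastic Weight Consolidation, mentioned above as a mitigation for catastrophic forgetting, adds a quadratic penalty that anchors important parameters near their pre-fine-tuning values. The papers do not give an implementation; this is a pure-Python sketch of the standard EWC regularizer (Kirkpatrick et al.), which in practice is added to the fine-tuning loss over whole tensors:

```python
def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """(lam / 2) * sum_i F_i * (theta_i - theta*_i)^2, where theta_star are
    the parameters after training on the original task and F_i is the
    diagonal Fisher information estimated on that task (a proxy for how
    much each parameter matters to old-task performance)."""
    return 0.5 * lam * sum(
        f * (t - ts) ** 2 for f, t, ts in zip(fisher, theta, theta_star)
    )
```

Parameters with high Fisher values are thus expensive to move, preserving original-task (e.g., general Turkish) competence while fine-tuning on a narrow domain.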
Recommended development directions:
- Invest in Turkish-aware subword tokenizer design with explicit modeling of agglutination and vowel harmony.
- Explore robust fine-tuning protocols (layer freezing, curriculum learning) for reduced forgetting.
- Extend evaluation beyond multiple-choice to open-ended, generative, and multimodal tasks.
- Augment datasets via controlled morphological transformations and synthetic data augmentation (Bayram et al., 2024, Bayram et al., 10 Feb 2025, Bayram et al., 18 Aug 2025).
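As a toy instance of the "controlled morphological transformations" direction, the Turkish plural suffix is selected by two-way vowel harmony. This deliberately simplified sketch ignores loanword exceptions and consonant alternations:

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")

def pluralize(noun: str) -> str:
    """Attach the Turkish plural suffix by two-way vowel harmony:
    last vowel back (a, ı, o, u) -> '-lar'; front (e, i, ö, ü) -> '-ler'.
    Real augmentation must also handle exceptions such as 'saat' -> 'saatler'."""
    for ch in reversed(noun.lower()):
        if ch in BACK_VOWELS:
            return noun + "lar"
        if ch in FRONT_VOWELS:
            return noun + "ler"
    raise ValueError(f"no vowel found in {noun!r}")
```

Generating such harmony-consistent variants (and their deliberately violated counterparts as distractors) is one way to probe whether a model has internalized the phonological rule rather than memorized surface forms.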
7. Cultural and Linguistic Relevance
TR-MMLU ensures representativeness not only through linguistic alignment but also cultural authenticity:
- Questions systematically cover local canonical knowledge (e.g., Ottoman history, Anatolian geography), idiomatic usage, and statutory terminology.
- Crafting by Turkish educators and experts maintains morphosyntactic awareness (agglutination, suffixation, vowel harmony), demanding deeper understanding from LLMs than dictionary lookups or translated corpora provide (Bayram et al., 2024, Isbarov et al., 16 Feb 2025).
Conclusion
TR-MMLU establishes a transparent, rigorous, and natively aligned benchmark for Turkish NLP, effectively extending the multitask evaluation framework of MMLU to a morphologically and culturally complex, resource-limited language. Its methodological rigor in question curation, tokenization assessment, and multi-domain coverage offers a template for similar benchmarks in other low-resource languages. While state-of-the-art LLMs demonstrate strong baseline reasoning, persistent challenges remain in mathematical reasoning, handling linguistic nuance, and mitigating domain-specific catastrophic forgetting. Future work will benefit from continued expansion into open-ended tasks, even tighter linguistic alignment, and elaborated qualitative error analyses, driving equitable advances in Turkish—and broader multilingual—language modeling (Bayram et al., 2024, Bayram et al., 10 Feb 2025, Isbarov et al., 16 Feb 2025, Yüksel et al., 2024, Bayram et al., 18 Aug 2025).