
TR-MMLU Benchmark: Evaluating Turkish LLMs

Updated 3 February 2026
  • TR-MMLU Benchmark is a comprehensive evaluation suite for Turkish LLMs, featuring native questions sourced from standardized exams.
  • It employs curriculum alignment, rigorous annotation, and Turkish-specific tokenization metrics to overcome translation artifacts and morphological challenges.
  • Key metrics include micro/macro accuracies and token quality, with models achieving up to 84% accuracy, showcasing improved performance with tailored tokenizers.

The Turkish Massive Multitask Language Understanding (TR-MMLU) benchmark is a comprehensive, high-fidelity suite for evaluating the reasoning, linguistic, and knowledge capabilities of large language models (LLMs) in Turkish. Built to address the limitations of machine-translated benchmarks and the unique demands of Turkish’s agglutinative morphology, it combines native, curriculum-aligned multiple-choice questions with task-specific metrics, including models’ tokenization quality, to both benchmark and drive progress in Turkish NLP and LLM evaluation (Bayram et al., 10 Feb 2025, Bayram et al., 2024, Bayram et al., 18 Aug 2025, Yüksel et al., 2024).

1. Motivation and Background

Historically, MMLU-style benchmarks—originally built in English—have been ported to other languages through machine translation, which introduces artifacts, translation errors, and cultural incongruities. This approach obscures true LLM capability, especially for Turkish, whose morphosyntactic complexity and cultural content are poorly represented by out-of-the-box translation pipelines (Singh et al., 2024, Plaza et al., 2024). TR-MMLU was designed to address these limitations using native Turkish questions, rigorous annotation, and linguistic validation (Bayram et al., 2024, Bayram et al., 10 Feb 2025).

2. Dataset Construction and Characteristics

TR-MMLU consists of 6,200 native Turkish multiple-choice questions uniformly distributed across 62 sections (or “bölüm”), reflecting the breadth and depth of the Turkish education and exam system (Bayram et al., 2024, Bayram et al., 18 Aug 2025):

  • Source pool: Over 280,000 questions curated from standardized national exams (University Entrance, AUZEF, KPSS, TUS, etc.) and refined by Turkish curriculum experts.
  • Sectional coverage: Encompasses natural sciences (biology, chemistry, physics), mathematics, social sciences, law, history, health, arts, and technical fields.
  • Cultural and linguistic grounding: No machine-translated questions. All items were authored or verified by subject-matter experts to ensure authenticity and curricular fit.
  • Difficulty balancing: Questions span easy, medium, and hard difficulty tiers calibrated from student response statistics (Bayram et al., 18 Aug 2025, Yüksel et al., 2024).
  • Format and metadata: Each entry provides a question stem, four answer options (A–D), the correct key, section/topic, and difficulty statistics. For fair evaluation, the dataset avoids overlap with common pretraining corpora.
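
A minimal sketch of what one benchmark entry might look like, following the fields listed above; the field names and the sample question are illustrative, not the dataset's released schema:

```python
# Hypothetical TR-MMLU item layout; field names are illustrative,
# not the dataset's actual schema.
sample_item = {
    "question": "Türkiye Cumhuriyeti hangi yılda kurulmuştur?",
    # "In what year was the Republic of Turkey founded?"
    "options": {"A": "1920", "B": "1921", "C": "1923", "D": "1938"},
    "answer": "C",                 # correct key
    "section": "Tarih",            # section/topic ("History")
    "difficulty": "easy",          # easy / medium / hard tier
    "correct_rate": 0.87,          # student response statistic (illustrative)
}

def check_prediction(item: dict, predicted_key: str) -> bool:
    """Return True if the model's chosen option key matches the answer key."""
    return predicted_key == item["answer"]
```

Scoring a model's output then reduces to comparing its chosen option key against `answer`, which is what the accuracy metrics in the next section aggregate.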

3. Evaluation Protocols and Metrics

TR-MMLU’s central task is multiple-choice question answering, typically evaluated under zero-shot and few-shot regimes:

  • Overall (micro) accuracy:

$$\mathrm{Acc}_{\mathrm{micro}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\hat{y}_i = y_i)$$

where $N$ is the number of questions and $\mathbf{1}(\cdot)$ is the indicator function.

  • Section and macro accuracy:

$$\mathrm{Acc}_s = \frac{1}{|S_s|}\sum_{i \in S_s} \mathbf{1}(\hat{y}_i = y_i), \qquad \mathrm{Acc}_{\mathrm{macro}} = \frac{1}{|S|}\sum_{s=1}^{|S|} \mathrm{Acc}_s$$

  • Macro-averaged F1 over the four answer options (A–D):

$$\text{Macro-}F_1 = \frac{1}{4}\sum_{c=1}^{4} F_{1,c}$$

where $F_{1,c}$ is the F1 score for answer option $c$.
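
The three metrics above can be computed directly in plain Python; the sketch below assumes predictions and gold keys are option letters and makes no reference to any particular evaluation harness:

```python
from collections import defaultdict

def evaluate(preds, golds, sections):
    """Micro accuracy over all items; macro accuracy averaged over sections."""
    assert len(preds) == len(golds) == len(sections)
    micro = sum(p == g for p, g in zip(preds, golds)) / len(golds)

    per_section = defaultdict(list)
    for p, g, s in zip(preds, golds, sections):
        per_section[s].append(p == g)
    # Each section contributes equally, regardless of its size.
    macro = sum(sum(v) / len(v) for v in per_section.values()) / len(per_section)
    return micro, macro

def macro_f1(preds, golds, classes=("A", "B", "C", "D")):
    """Unweighted mean of per-class F1 over the four answer keys."""
    f1s = []
    for c in classes:
        tp = sum(p == c and g == c for p, g in zip(preds, golds))
        fp = sum(p == c and g != c for p, g in zip(preds, golds))
        fn = sum(p != c and g == c for p, g in zip(preds, golds))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(classes)
```

Macro accuracy weights each of the 62 sections equally, so a model cannot inflate its score by excelling only on the largest sections.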

Prominent leaderboards show closed-source models such as GPT-4o and Claude 3.5 Sonnet achieving over 84% accuracy, outperforming open-source LLMs (e.g., Llama 3.3: 79.4%, Gemma 2: 72.1%) (Bayram et al., 2024, Bayram et al., 18 Aug 2025).

4. Tokenization: Intrinsic Metrics and Correlation with Performance

Unique among MMLU-style evaluations, TR-MMLU systematically quantifies tokenizer effectiveness—crucial given Turkish’s high morphemic productivity (Bayram et al., 10 Feb 2025):

  • Vocabulary size ($|V|$): Total unique tokens in the tokenizer’s vocabulary.
  • Token count ($T$): Total tokens generated when tokenizing the full TR-MMLU corpus.
  • Processing time ($\tau$): Wall-clock time to tokenize the full dataset.
  • Turkish-specific token percentage (%TR):

$$\%\mathrm{TR} = \frac{\#\,\text{valid Turkish tokens}}{T} \times 100$$

  • Token purity (%Pure): The share of tokens corresponding to atomic morphemes (roots or valid affixes):

$$\%\mathrm{Pure} = \frac{\#\,\text{morpheme-aligned tokens}}{T} \times 100$$

Empirical analysis demonstrates that %TR is the strongest predictor of model performance on TR-MMLU ($\rho_{\%\mathrm{TR},\,\text{accuracy}} = +0.90$), far surpassing raw model size or parameter count (Bayram et al., 10 Feb 2025). Models with Turkish-morpheme–informed segmentation outperform those using generic multilingual subword tokenization by 3–7 percentage points (Bayram et al., 2024). Excess vocabulary fragmentation and slow tokenization correlate negatively with both %TR and downstream accuracy.
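
Given a token stream and some validity check, the two percentages reduce to simple ratios. A sketch, with caller-supplied predicate functions standing in for the Turkish morphological analyzer / lexicon lookup that a real implementation would use:

```python
def tokenizer_metrics(tokens, is_turkish, is_morpheme):
    """Compute %TR and %Pure over a token list.

    `is_turkish` and `is_morpheme` are placeholder predicates; the
    benchmark itself relies on proper Turkish morphological analysis,
    which is considerably more involved than a set lookup.
    """
    T = len(tokens)
    pct_tr = 100.0 * sum(1 for t in tokens if is_turkish(t)) / T
    pct_pure = 100.0 * sum(1 for t in tokens if is_morpheme(t)) / T
    return pct_tr, pct_pure
```

For example, with a toy lexicon `{"ev", "ler", "de"}` and the token list `["ev", "ler", "de", "##xq"]`, both percentages come out to 75%: three of four tokens are valid Turkish morphemes, one is a fragment.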

5. Error Analysis and Benchmark Validation

To address concerns regarding reliability, TR-MMLU draws from the error-correction and annotation protocols established for MMLU-Redux (Gema et al., 2024):

  • Hierarchical error annotation: Each question is independently validated by at least two domain experts. Ambiguities or disagreements are adjudicated by a third reviewer, with Cohen’s κ statistics reported for inter-annotator agreement.
  • Ground-truth validation: Each answer key is cross-checked with original question sources or authoritative references.
  • Taxonomy of errors: Explicit labeling of question clarity, options clarity, and answer key accuracy.
  • Sustained quality assurance: Automated pre-checks, source linking, and periodic review cycles to limit ground-truth drift.
  • Dataset versioning: Provenance and change logs are maintained for transparent updates (Gema et al., 2024).
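
The inter-annotator agreement reported above uses Cohen’s κ, which compares observed agreement against chance agreement. A standard two-rater formulation, sketched here for illustration (not code from the benchmark itself):

```python
def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators' labels over the same items."""
    assert len(ann_a) == len(ann_b)
    n = len(ann_a)
    # Observed agreement: fraction of items where the annotators match.
    po = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Expected (chance) agreement from each annotator's label frequencies.
    labels = set(ann_a) | set(ann_b)
    pe = sum((ann_a.count(l) / n) * (ann_b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe) if pe != 1 else 1.0
```

A κ of 0 means agreement no better than chance; 1 means perfect agreement. Reporting κ alongside raw agreement guards against sections where one label dominates and chance agreement is already high.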

This protocol yields an error rate substantially lower than the 6.5%–57% error rates observed in English MMLU subsets.

6. Turkish and Multilingual Context

While TR-MMLU is uniquely native to Turkish, it interfaces with several related multilingual benchmarks:

  • TUMLU: A parallel initiative covering eight Turkic languages (including Turkish) with questions natively authored by teacher–student communities—not by translation (Isbarov et al., 16 Feb 2025). The Turkish subset overlaps in scope and style with TR-MMLU, with robust cross-lingual validation.
  • Global MMLU: A 42-language expansion of MMLU using post-edited human translations (with Turkish community annotation), annotated for cultural/geographic sensitivity to counteract Western-centric bias (Singh et al., 2024). Culturally sensitive (CS) and culturally agnostic (CA) splits reveal large performance gaps and ranking volatility across translation approaches and subject domains.
  • TurkishMMLU: A larger benchmark (over 10,000 questions, 9 subjects) with curriculum-based, expert-authored items. It confirms similar findings regarding the necessity of native development and the inadequacy of translation-only approaches (Yüksel et al., 2024).

7. Challenges, Recommendations, and Future Directions

TR-MMLU reveals unique evaluation challenges for Turkish LLMs:

  • Agglutinative morphology: Standard subword tokenizers induce excessive fragmentation, leading to semantic loss and out-of-vocabulary errors. Tokenizers tailored to Turkish morphemic boundaries consistently improve accuracy (Bayram et al., 10 Feb 2025, Bayram et al., 2024).
  • Data scarcity: The absence of large-scale, domain-diverse Turkish datasets limits pretraining and transfer.
  • Fine-tuning risks: Domain-specific Turkish fine-tuning can cause catastrophic forgetting of general knowledge (Bayram et al., 2024).
  • Prompt sensitivity: Minor changes to Turkish instructions can swing LLM scores by 5–10 percentage points (Bayram et al., 18 Aug 2025).
  • Translation artifacts: Blind translation introduces translation-induced test failures, cultural mismatch, and proper-name mistranslations—necessitating rigorous, multi-stage human review (Plaza et al., 2024, Singh et al., 2024).
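
The morphology point can be made concrete with a toy segmentation of evlerimizde (“in our houses”), which decomposes as ev+ler+imiz+de (root + plural + possessive + locative). The greedy splitter below is an illustrative sketch against a hand-listed morpheme inventory, not a real morphological analyzer:

```python
# Correct morpheme decomposition of "evlerimizde"; a generic multilingual
# subword tokenizer typically produces fragments that cross these boundaries.
MORPHEMES = ["ev", "ler", "imiz", "de"]

def greedy_morpheme_split(word, inventory):
    """Greedy longest-prefix segmentation against a morpheme inventory."""
    out, rest = [], word
    while rest:
        match = max((m for m in inventory if rest.startswith(m)),
                    key=len, default=None)
        if match is None:      # no morpheme matches: fall back to one char
            match = rest[0]
        out.append(match)
        rest = rest[len(match):]
    return out

print(greedy_morpheme_split("evlerimizde", MORPHEMES))
# -> ['ev', 'ler', 'imiz', 'de']
```

Segments that land on morpheme boundaries preserve the meaning of each affix, which is exactly what the %Pure metric in Section 4 rewards.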

Best practices include maximizing %TR in tokenization, integrating morphological analyzers into tokenizer training, balancing vocabulary size, and using continual-learning mechanisms for fine-tuning. Reporting both culturally agnostic and culturally sensitive subsets, as in Global MMLU, enables more robust cross-lingual benchmarking.

Ongoing directions: Dataset enrichment (e.g., coverage of open-ended and generative tasks), richer section tagging, release of adaptation scripts/checkpoints, and expansion into informal language registers (Bayram et al., 2024, Isbarov et al., 16 Feb 2025, Yüksel et al., 2024, Bayram et al., 18 Aug 2025).


By uniting quality-controlled, natively composed questions, deep linguistic validation, and quantitative tokenization analysis, TR-MMLU establishes a rigorous, reproducible foundation for Turkish LLM evaluation that generalizes to other morphologically complex, low-resource languages (Bayram et al., 10 Feb 2025, Bayram et al., 2024, Bayram et al., 18 Aug 2025).
