
TrGLUE: Turkish NLU Benchmark

Updated 2 January 2026
  • TrGLUE is a benchmark designed to evaluate Turkish natural language understanding, featuring eight tasks that address agglutinative morphology and cultural nuances.
  • It builds tasks from Turkish-native corpora using a semi-automated annotation pipeline coupled with expert human validation to ensure linguistic naturalness.
  • Baseline evaluations with transformer-based models like BERTurk demonstrate competitive performance, highlighting TrGLUE's effectiveness in NLU evaluation.

TrGLUE (Turkish General Language Understanding Evaluation) is a standardized benchmark designed to evaluate natural language understanding (NLU) in Turkish, mirroring the scope of the established English GLUE benchmark while addressing the linguistic properties specific to Turkish. It comprises eight core tasks (two single-sentence classification tasks, five sentence-pair classification tasks, and one regression task) drawn from Turkish-native corpora and labeled via a reproducible, semi-automated annotation workflow. TrGLUE aims to provide a robust, reproducible framework for benchmarking neural models, including pre-trained transformer models and LLMs, on a wide spectrum of Turkish NLU tasks (Altinok, 26 Dec 2025).

1. Design Motivation and Benchmark Scope

TrGLUE was developed to address the absence of a comprehensive NLU benchmark for Turkish, in analogy to GLUE (English), CLUE (Chinese), FLUE (French), and JGLUE (Japanese). Existing approaches relying on translation from English benchmarks do not account for Turkish’s agglutinative morphology, pro-drop syntax, and cultural context. TrGLUE is constructed from native text sources and task formulations designed to ensure linguistic naturalness, thereby mitigating translation artifacts and enabling culturally and morphologically appropriate evaluation (Altinok, 26 Dec 2025).

The eight core task types in TrGLUE include:

  • Single-sentence classification: TrCoLA (acceptability), TrSST-2 (binary sentiment)
  • Sentence-pair classification: TrMRPC (paraphrase), TrQQP (paraphrase in Q&A), TrMNLI (three-way NLI), TrQNLI (QA-turned-NLI), TrRTE (binary entailment)
  • Regression: TrSTS-B (semantic textual similarity).

The benchmark explicitly omits WNLI due to Turkish’s syntactic and morphological features that preclude the ambiguity central to Winograd-style challenges.

2. Corpus Sources and Task Construction

Each TrGLUE task is constructed from Turkish-native sources tailored to its objective:

  • TrCoLA (acceptability): Draws on sentences from Turkish linguistics textbooks; negative (ungrammatical) variants generated via LLM prompts.
  • TrSST-2 (sentiment): Movie reviews scraped from Sinefil.com and Beyazperde.com, with star ratings binarized for polarity.
  • TrMRPC (paraphrase): Generated from the Havadis news corpus using a multi-stage filtering pipeline (lemma overlap, classifier scoring, and human audits); a sketch of the lemma-overlap stage follows this list.
  • TrSTS-B (similarity): Derived from translated STS-B English sentence pairs, refined via “translate-then-edit” by native speakers.
  • TrQQP (Q&A paraphrase): Compiles data from six Turkish Q&A forums using metadata and retrieval-based positive/negative pairing with extensive manual auditing.
  • TrMNLI (NLI): Premises sourced from the BellaTurca corpus; hypotheses generated by instruction-tuned LLMs across genres and validated by experts.
  • TrQNLI (QA-turned-NLI): Constructs NLI pairs from generated question-answer triplets on Turkish Wikipedia spanning a broad topical range.
  • TrRTE (binary entailment): Pairs licensed news or academic text premises with manually written hypotheses under several stylistic prompts.
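
As referenced above, the TrMRPC pipeline's first stage filters candidate pairs by lemma overlap. A minimal sketch, assuming Jaccard overlap over lemma sets and illustrative thresholds (the paper specifies neither the exact overlap measure nor the cut-off values):

```python
def lemma_overlap(lemmas_a: set, lemmas_b: set) -> float:
    """Jaccard overlap between the lemma sets of two candidate sentences."""
    if not lemmas_a or not lemmas_b:
        return 0.0
    return len(lemmas_a & lemmas_b) / len(lemmas_a | lemmas_b)

def is_paraphrase_candidate(lemmas_a: set, lemmas_b: set,
                            low: float = 0.3, high: float = 0.9) -> bool:
    # Keep pairs with moderate overlap: enough shared content to plausibly
    # paraphrase, not so much that the pair is a near-duplicate.
    # The thresholds here are illustrative, not from the paper.
    return low <= lemma_overlap(lemmas_a, lemmas_b) <= high

# Surviving candidates then proceed to classifier scoring and human audits.
```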

Dataset sizes range from ≈2,500 examples (TrSTS-B) to over 300,000 (TrQQP), with fixed training, development, and test splits:

Task      Data Source(s)          Size (Train / Dev / Test)
TrCoLA    Linguistics textbooks   7.9K / 1K / 1K
TrSST-2   Movie reviews           60K / 8.9K / 8.9K
TrMRPC    News articles           3.18K / 1K / 1K
TrSTS-B   STS-B translations      2.46K / 0.6K / —
TrQQP     Q&A platforms           309K / 30K / 30K
TrMNLI    BellaTurca corpus       165K / 18.4K / 18.4K
TrQNLI    Turkish Wikipedia       120K / 10K / 10K
TrRTE     News/academia           3.78K / 1K / 1K

3. Annotation and Quality Control Pipeline

TrGLUE employs a semi-automated annotation pipeline termed "disagreement-driven triage":

  • LLM-Based Pre-Labeling: For each instance, a sentence-transformer classifier (trained on translated seeds) assigns coarse probability estimates to candidate labels. In parallel, an instruction-tuned LLM (Snowflake Arctic) generates label hypotheses under controlled temperature and diversity constraints.
  • Threshold-Driven Auto-Labeling: High-confidence outputs (classifier p ≥ 0.9 or p ≤ 0.1) are provisionally auto-labeled and randomly audited (see the sketch after this list).
  • Cross-Model Agreement Checks: Disagreements between the classifier and the LLM, as well as low-confidence cases, are escalated for manual annotation. This strategy concentrates expert effort on ambiguous, edge, or difficult cases.
  • Human Validation: Native Turkish annotators review and correct machine-generated or tentative labels, ensuring linguistic naturalness and orthographic soundness. Multi-annotator consensus is required for certain tasks (e.g., TrCoLA demands at least three-out-of-four agreement; Krippendorff's α = 0.91).
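
A minimal sketch of the triage routing logic, assuming one classifier probability and one LLM label hypothesis per instance (the function and return values are hypothetical; only the 0.9/0.1 thresholds come from the paper):

```python
def triage(classifier_prob: float, llm_label: int) -> str:
    """Route one candidate instance through disagreement-driven triage.

    classifier_prob: P(label == 1) from the sentence-transformer classifier.
    llm_label: 0/1 hypothesis from the instruction-tuned LLM.
    """
    # Threshold-driven auto-labeling: act only on high-confidence outputs.
    if classifier_prob >= 0.9:
        classifier_label = 1
    elif classifier_prob <= 0.1:
        classifier_label = 0
    else:
        return "manual"  # low confidence: escalate to human annotators

    # Cross-model agreement check: disagreement also escalates.
    if classifier_label != llm_label:
        return "manual"
    return "auto"  # provisionally auto-labeled; subject to random audits
```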

Task-specific pipeline adaptations include full human authorship (TrCoLA and TrRTE hypotheses), label derivation from star ratings with spot checks (TrSST-2), and human-edited translations for TrSTS-B.

4. Evaluation Protocols and Metrics

TrGLUE adheres closely to the GLUE benchmark’s data split and metric conventions, supporting direct compatibility with GLUE infrastructure (notably Hugging Face’s run_glue.py and evaluation utilities). Selected metrics include:

  • TrCoLA: Matthews correlation coefficient (MCC; checked numerically after this list), defined as

    \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}

  • TrSST-2, TrMRPC, TrQQP: Binary classification accuracy and F1 score.
  • TrSTS-B: Pearson correlation (r) and Spearman correlation (ρ) between predicted and gold-standard similarities.
  • TrMNLI: Accuracy on matched (in-domain) and mismatched (cross-domain) sets.
  • TrQNLI, TrRTE: Classification accuracy.
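
The MCC formula above can be verified numerically against scikit-learn's implementation; the toy labels below are our own:

```python
import math
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

# Confusion-matrix counts for the binary case.
tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
assert abs(mcc - matthews_corrcoef(y_true, y_pred)) < 1e-9
print(mcc)  # 0.5 for this toy example
```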

This design ensures direct comparability with international benchmarks and facilitates rapid benchmarking of transformer-based systems for Turkish.

5. Baseline Model Performance

Baseline results are reported for BERTurk (a Turkish-pretrained BERT-base: 12 layers, hidden size 768, 12 attention heads, 32k WordPiece vocabulary), alongside reference English GLUE results for BERT-base and RoBERTa-base. Under the standard fine-tuning recipe (a minimal sketch appears at the end of this section), BERTurk achieves performance in line with GLUE-style expectations, though task-specific variance reflects both Turkish morpho-syntax and the dataset construction process.

Task      BERTurk       Metric(s)              BERT (En)     RoBERTa (En)
TrCoLA    42            MCC                    52.1          59.8
TrSST-2   87.4 / 91.3   acc / F1               91.6 / 91.9   94.2 / 94.3
TrMRPC    74.4 / 72.7   acc / F1               78.4 / 85.4   88.9 / 92.0
TrSTS-B   71.3 / 69.9   r / ρ                  87.2 / 86.9   90.5 / 90.2
TrQQP     95.4 / 94.3   acc / F1               90.1 / 86.7   90.8 / 88.7
TrMNLI    87.9 / 90.8   matched / mismatched   84.6 / 83.7   86.9 / 87.3
TrQNLI    90.6          acc                    90.5          92.3
TrRTE     92.2          acc                    67.8          75.4

Fractional-data learning curves indicate that most classification tasks reach at least 95% of full-data performance with only 60–80% of the training examples. TrRTE saturates earlier and at a higher level in Turkish than its English counterpart, which is attributed to its natively constructed examples (Altinok, 26 Dec 2025).
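
For concreteness, the standard fine-tuning recipe referenced above corresponds roughly to the following Hugging Face sketch. The checkpoint dbmdz/bert-base-turkish-cased is the public BERTurk release; the local data files and the "sentence"/"label" column names are assumptions, since TrGLUE's distribution format is not restated here:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "dbmdz/bert-base-turkish-cased"  # public BERTurk checkpoint
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=2)

# Hypothetical local JSON exports of a TrGLUE task (e.g., TrSST-2).
ds = load_dataset("json", data_files={"train": "trsst2_train.json",
                                      "validation": "trsst2_dev.json"})

def encode(batch):
    # Single-sentence task; pair tasks would pass two text columns instead.
    return tok(batch["sentence"], truncation=True, max_length=128)

ds = ds.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="berturk-trsst2",
                           num_train_epochs=3,
                           per_device_train_batch_size=32,
                           learning_rate=2e-5),
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    tokenizer=tok,  # enables dynamic padding via the default collator
)
trainer.train()
# Reports eval loss; task metrics (accuracy, F1, MCC, ...) would be added
# via a compute_metrics callback.
print(trainer.evaluate())
```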

6. Linguistic Analyses and Quality Outcomes

The corpus selection and annotation protocols emphasize a native-first strategy, eschewing full-scale translation except for TrSTS-B. This approach preserves Turkish-specific grammatical features, cultural context, and tokenization properties:

  • Over 90% of candidate examples are retained for most tasks, a higher retention rate than translation-heavy pipelines typically achieve.
  • High inter-annotator agreement is achieved (TrCoLA α = 0.91), alongside substantial session-level consistency (SentiTurca hate speech ICC = 0.912).
  • Analyses of morpho-syntactic features highlight the degree of subword fragmentation (BERTurk: 1.58 subwords/token vs. 1.29 for English BERT; a snippet for estimating this ratio follows this list), dependency distances, named-entity density, and the prevalence of non-canonical word orders.
  • Turkish’s pro-drop phenomena (73.6% of finite clauses), flexible word order (3% OSV/OVS/VSO/VOS), and agglutination justify tailored filtering strategies in task construction.
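
The subword-fragmentation ratio cited above can be estimated for any text with the public BERTurk tokenizer. A minimal sketch, using our own sample sentence rather than the paper's corpus or protocol:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

def subwords_per_token(text: str) -> float:
    """Average number of WordPiece subwords per whitespace-delimited token."""
    words = text.split()
    return len(tok.tokenize(text)) / len(words)

# Agglutinative forms fragment heavily, e.g. "evlerimizdekilerden"
# ("from the ones in our houses").
print(subwords_per_token("Evlerimizdekilerden bahsetmiştik."))
```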

A plausible implication is that benchmarks designed with this degree of linguistic fidelity may provide truer estimates of model performance and data efficiency in morphologically rich, low-resource languages.

7. Reproducibility and Impact

TrGLUE provides a scalable, reproducible blueprint for NLU evaluation in Turkish and languages with similar properties. By releasing all prompts, processing scripts, and full data provenance, TrGLUE offers transparency for future dataset curation and adaptation. The semi-automated pipeline harmonizes the scalability of LLM-based annotation with high-quality human validation, balancing cost and accuracy while ensuring cultural and linguistic appropriateness (Altinok, 26 Dec 2025).

TrGLUE fills a substantive gap in Turkish NLP infrastructure, supporting robust evaluation of transformer-based models and LLMs, and offers methodological guidance for future benchmarks in other low-resource, morphologically complex languages.
