- The paper introduces TurBLiMP, a benchmark that evaluates 16 linguistic phenomena in Turkish using 1000 minimal pairs per phenomenon.
- It reports experiments on a broad set of monolingual and multilingual language models, comparing their preferences against human acceptability judgments to assess syntactic and morphological performance.
- The findings reveal that advanced models struggle with Turkish agglutination and flexible word order, highlighting areas for future language model improvements.
An Overview of TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs
TurBLiMP represents a significant advancement in the evaluation of LLMs, particularly for the Turkish language. The benchmark introduces linguistic minimal pairs, a method that has been foundational in linguistic diagnostics since the mid-20th century. This approach has gained traction in evaluating LLMs, providing a systematic method to probe their understanding of syntactic and morphological phenomena.
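In benchmarks of this kind, a model is typically credited with a pair when it assigns higher probability to the grammatical sentence than to its ungrammatical counterpart. The following is a minimal sketch of that scoring loop; `toy_log_prob` is an illustrative stand-in for a real language-model scorer, not part of TurBLiMP's actual pipeline:

```python
def toy_log_prob(sentence: str) -> float:
    """Placeholder scorer: a real evaluation would sum token log-probabilities
    from a language model; here we fake a score from sentence length."""
    return -len(sentence.split())

def minimal_pair_accuracy(pairs, score=toy_log_prob):
    """Fraction of (grammatical, ungrammatical) pairs where the grammatical
    sentence outscores the ungrammatical one (ties count as failures)."""
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)
```

In practice `score` would wrap a pretrained model (e.g., summing per-token log-probabilities under a causal LM), but the accept/reject logic stays the same.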
Design and Scope
TurBLiMP is designed to assess 16 linguistic phenomena central to Turkish grammar, including word-order flexibility and morphologically marked subordination. Each phenomenon is represented by 1000 minimal pairs, making TurBLiMP a comprehensive resource for testing LLMs across diverse syntactic and morphological configurations. The benchmark fills a critical gap: syntactic evaluation resources for Turkish have been remarkably sparse compared to those for high-resource languages such as English.
Experimental Findings
The paper reports experiments on a wide array of LLMs, both monolingual and multilingual. Even state-of-the-art models struggle with Turkish grammatical phenomena that are trivial for native speakers, particularly agglutinative morphology and flexible word order, two hallmark properties of the language.
The evaluation extends beyond model-to-model comparison by collecting human acceptability judgments, which situate model performance against a human baseline. This comparison is crucial because it pinpoints where models diverge from native-speaker intuition, particularly on subtle linguistic contrasts.
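One simple way to surface such divergences is to break accuracy down by phenomenon, so that categories where models fall far below the near-ceiling human baseline stand out. The sketch below assumes per-item records of a phenomenon label and a correctness flag; it is an illustrative analysis, not the paper's exact methodology:

```python
from collections import defaultdict

def per_phenomenon_accuracy(results):
    """results: iterable of (phenomenon, model_correct) tuples.
    Returns a dict mapping each phenomenon to the model's accuracy on it,
    making divergences from human performance easy to spot."""
    totals, hits = defaultdict(int), defaultdict(int)
    for phenomenon, correct in results:
        totals[phenomenon] += 1
        hits[phenomenon] += bool(correct)
    return {p: hits[p] / totals[p] for p in totals}
```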
Implications and Future Directions
The implications of TurBLiMP are multifaceted. Practically, it offers a robust, typologically diverse evaluation framework that can guide the development of more sophisticated, linguistically aware models for underrepresented languages. Theoretically, TurBLiMP paves the way for deeper inquiry into model capabilities concerning morphological and syntactic generalization, particularly in agglutinative contexts like Turkish.
Looking forward, TurBLiMP could catalyze research into improving multilingual LLMs to counteract the performance deficits observed in languages with complex morphological structures. Future development might explore enhanced training data augmentation techniques or refinements in sub-word tokenization strategies specifically tailored to agglutinative languages.
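A common diagnostic when assessing tokenizer fit for agglutinative languages is subword fertility, the average number of subword tokens produced per word; morphologically rich languages often show high fertility under tokenizers trained mostly on English-heavy data. A toy sketch follows, in which the fixed-width splitter is a hypothetical stand-in for a real subword tokenizer:

```python
def subword_fertility(tokenize, words):
    """Average number of subword tokens per word; higher fertility
    often signals a poorer fit between tokenizer and language."""
    counts = [len(tokenize(word)) for word in words]
    return sum(counts) / len(counts)

# Hypothetical tokenizer: splits every 3 characters. A real subword
# tokenizer (BPE, unigram, etc.) would segment along learned units.
def toy_tokenize(word):
    return [word[i:i + 3] for i in range(0, len(word), 3)]
```

For example, a long agglutinated form like "evlerimizden" ("from our houses") yields far more toy tokens than the bare root "ev" ("house"), mirroring the fertility gap that morpheme-aware tokenization aims to reduce.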
Conclusion
TurBLiMP stands as a pioneering effort in broadening the typological spectrum of LLM evaluation benchmarks. Its rigorous design and extensive scope provide invaluable insights into the linguistic capabilities of LLMs with respect to Turkish. As natural language processing endeavors progress, benchmarks like TurBLiMP will be instrumental in ensuring linguistic inclusivity and enhancing the syntactic acumen of LLMs across diverse language landscapes.