- The paper shows that native Turkish speakers raise their high-attachment (HA) response rate from 26.3% to 65.2% when graded plausibility shifts from favoring low attachment to favoring HA.
- It employs a forced-choice comprehension task and norming studies to rigorously compare human and LLM responses to RC ambiguities.
- LLMs showed deficient or reversed plausibility effects, exposing a gap in integrating world knowledge with syntactic processing.
Plausibility-Driven Attachment in Turkish RC Ambiguity: Divergence Between Human and LLM Behavior
Introduction
This paper presents a systematic investigation of whether LLMs utilize graded, context-sensitive plausibility cues as humans do during syntactic ambiguity resolution, focusing on Turkish prenominal relative clause (RC) attachment. The study is positioned within the psycholinguistic tradition of examining RC attachment ambiguities in typologically diverse settings, extending beyond the English-centric paradigm to a morphologically rich, relatively low-resource language. The central diagnostic is whether both humans and LLMs shift their attachment preference (high attachment [HA] vs. low attachment [LA]) in response to finely controlled plausibility manipulations when both parses remain pragmatically viable.
Background and Motivation
The ambiguity in Turkish prenominal RCs arises when a surface string permits multiple hierarchical parses—specifically, whether the RC modifies the higher or lower noun in a complex nominal structure. Existing literature documents that both broad structural heuristics (Late Closure, Right Association, Recency) and discourse/pragmatic context influence human resolution of such ambiguities. Critically, prior studies suggest event-level world knowledge—quantified as "plausibility"—provides a soft, graded bias in online attachment decisions.
In high-resource languages, neural LMs have been evaluated as psycholinguistic models using surprisal-based linking hypotheses. However, their potential for structure-sensitive, world-knowledge-guided disambiguation remains poorly understood outside of English and under fine-grained plausibility conditions. This study targets these gaps by constructing Turkish RC ambiguity materials with precise plausibility manipulations and by directly comparing native speaker performance with that of Turkish and multilingual LLMs.
Experimental Paradigm and Methodology
Human Behavioral Experiment
A speeded, forced-choice comprehension task was administered to 86 native Turkish speakers (the sample retained after screening for reasonable response latencies). Materials comprised 40 ambiguous Turkish prenominal RC items, 20 favoring HA and 20 favoring LA via graded event plausibility, as validated by norming studies. Syntactic configuration and broad pragmatic possibility were held constant across items. On each trial, participants resolved the RC attachment by answering a who-question targeting the ambiguous constituent.
Model-Based Evaluation
Three LLMs were evaluated: a Turkish GPT-2, a Turkish reasoning-adapted Qwen-based model (DeepSeek-R1-Distill-Qwen-1.5B-Turkish), and the instruction-tuned multilingual Qwen3-30B-A3B-Instruct. They were scored using a forced-choice next-token log-probability protocol aligned with the human task: preference was operationalized as the mean log-probability difference (Δ) between HA and LA continuations, with positive Δ indicating an HA preference.
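The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each continuation is scored by its length-normalized (per-token mean) log-probability, and that item-level margins are then averaged. The function names and the token log-probabilities are hypothetical; a real pipeline would obtain the log-probabilities from the model.

```python
from statistics import mean

def continuation_score(token_logprobs):
    """Length-normalized score: mean log-probability per token (an assumption)."""
    return mean(token_logprobs)

def item_margin(ha_token_logprobs, la_token_logprobs):
    """Delta for one item: HA score minus LA score. Positive means HA is preferred."""
    return continuation_score(ha_token_logprobs) - continuation_score(la_token_logprobs)

def preferred_attachment(delta):
    """Map a margin to a forced choice; ties are assigned to LA in this sketch."""
    return "HA" if delta > 0 else "LA"

def mean_margin(item_margins):
    """Average the item-level margins within one condition."""
    return mean(item_margins)

# Hypothetical per-token log-probabilities for one ambiguous item.
ha_logprobs = [-2.1, -0.7, -1.3]
la_logprobs = [-1.9, -0.4]

delta = item_margin(ha_logprobs, la_logprobs)
choice = preferred_attachment(delta)
```

Length normalization is one design choice among several; summed log-probabilities would instead favor shorter continuations.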
Results
Human Attachment Preferences
Humans exhibited a striking, directionally correct plausibility effect: the HA response rate increased from 26.3% in Low-WK contexts (LA favored by world knowledge) to 65.2% in High-WK contexts (HA favored), representing a 38.9 percentage point shift. This effect was highly reliable (logistic regression: $\beta_{\mathrm{WK}} = 1.65 \pm 0.11$, $z = 14.75$, $p < 10^{-50}$), corresponding to an odds ratio of over 5, and was tightly tracked by independent plausibility norming ratings ($\rho = 0.85$, $p = 2 \times 10^{-6}$).
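The reported figures can be cross-checked with a short calculation. The sketch below assumes the value after the ± sign is a standard error and uses a Wald-style 95% interval. The odds ratio computed from raw response rates is only a sanity check, because it ignores whatever additional structure the fitted regression contained.

```python
import math

def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

beta_wk = 1.65   # reported logistic regression coefficient
se_wk = 0.11     # assumed to be the standard error of beta_wk

# Odds ratio implied by the coefficient (roughly 5.2).
or_from_beta = math.exp(beta_wk)

# Wald 95% interval on the odds-ratio scale (an assumption about the error term).
or_ci = (math.exp(beta_wk - 1.96 * se_wk), math.exp(beta_wk + 1.96 * se_wk))

# Percentage-point shift in HA responses between conditions.
shift_pp = 65.2 - 26.3

# Odds ratio recomputed from the raw HA rates, as a rough consistency check.
or_from_rates = odds(0.652) / odds(0.263)
```

Both routes give an odds ratio slightly above 5, consistent with the summary statement.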
LLM Attachment Preferences
None of the LLMs reproduced the robust, plausibility-driven shift found in humans:
- Turkish GPT-2: Exhibited a rigid LA bias (HA: 30% in both WK conditions; ΔHA = 0 percentage points).
- DeepSeek-R1-Distill-Qwen-1.5B-Turkish: Showed only a modest and statistically unreliable shift (HA: 60% in High-WK, 50% in Low-WK; ΔHA = 10 percentage points).
- Qwen3-30B-A3B-Instruct: Displayed a pronounced overall bias toward HA and, critically, a reversed plausibility effect in the tested subset (HA: 70% in High-WK vs. 90% in Low-WK; ΔHA = −20 percentage points). This reversal was statistically significant in the margin-level analysis but ran in the direction opposite to the human psycholinguistic data.
Statistical analysis (Fisher's exact tests, continuous margin comparison) consistently confirmed the absence of a human-like, plausibility-driven attachment shift in the LLMs. In several instances, not only was the effect attenuated, but its direction contrasted with normative human behavior.
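To illustrate the kind of test involved, the following sketch implements a two-sided Fisher's exact test with only the standard library and applies it to the reported HA percentages. The summary does not state how many items per condition were scored, so `N_PER_CONDITION = 10` is a purely hypothetical value chosen to make the percentages convert to whole counts; the resulting p-values are illustrative and are not the paper's.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def weight(x):
        # Unnormalized hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = weight(a)
    # Sum over all tables at least as extreme (no more probable) than observed.
    extreme = sum(weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed)
    return extreme / comb(total, col1)

N_PER_CONDITION = 10  # hypothetical; not stated in the source

def p_for_rates(high_wk_pct, low_wk_pct, n=N_PER_CONDITION):
    """Convert HA percentages to counts and test High-WK against Low-WK."""
    high_ha = round(high_wk_pct / 100 * n)
    low_ha = round(low_wk_pct / 100 * n)
    return fisher_exact_two_sided(high_ha, n - high_ha, low_ha, n - low_ha)

p_gpt2 = p_for_rates(30, 30)       # Turkish GPT-2
p_deepseek = p_for_rates(60, 50)   # DeepSeek-R1-Distill-Qwen-1.5B-Turkish
p_qwen3 = p_for_rates(70, 90)      # Qwen3-30B-A3B-Instruct
```

With counts this small, a categorical test has little power, which is one reason a continuous margin comparison can be informative alongside it.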
Theoretical and Practical Implications
The results delineate a fundamental dissociation: while LLMs encode considerable factual and commonsense knowledge, as evidenced by their aggregate benchmark performance, they do not reliably integrate graded event plausibility with syntactic structure in real-time ambiguity resolution when both interpretations are pragmatically tenable. This is not merely a reflection of absent world knowledge, but a deficit in its deployment as a constraint on syntactic disambiguation. The distinction between knowledge "in storage" (accessed in QA or guided completion) and "in use" (activated in incremental parsing and interpretation) is thus empirically supported.
This points to substantial theoretical limitations on the use of current LLMs as cognitive process models, specifically in the compositional integration of contextual world knowledge with hierarchical syntax, and particularly in typologically complex, morphologically rich languages. It also reinforces the diagnostic utility of RC attachment paradigms, especially when they are designed to probe graded rather than categorical pragmatic contrasts.
From an engineering perspective, strong aggregate performance on broad commonsense or linguistic benchmarks does not ensure human-like, structure-sensitive cue integration, at least for Turkish RC ambiguities. This has ramifications for deploying LLMs in high-precision language technology applications that require nuanced, real-time comprehension in underrepresented languages.
Methodological Considerations and Future Directions
The observed failure modes (stable structural bias, weak modulation, reversed directionality) partially implicate tokenization, model architecture, and the underrepresentation of Turkish in training data. The findings call for further studies that vary the prompt and continuation format, broaden the model pool (including fine-tuned or instruction-optimized Turkish LLMs), and control for possible instruction-tuning artifacts such as over-peaked choice distributions.
Future research should use finer-grained time-course proxies (e.g., region-wise, incremental surprisal) to make human and model studies more comparable at the process level. The dissociation also motivates extension to other languages, inclusion of additional psycholinguistic phenomena, and deeper inquiry into the limits of "commonsense-in-use" in LLMs.
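As a rough illustration of what a region-wise surprisal proxy might look like, the sketch below sums per-token surprisal (in bits) within labeled sentence regions. It assumes the model supplies natural-log token probabilities and that tokens have already been aligned to regions. The region labels and the numeric values are hypothetical.

```python
import math

def region_surprisal(token_logprobs, region_labels):
    """Sum per-token surprisal in bits within each region.

    token_logprobs: natural-log probabilities, one per token (assumed input format).
    region_labels: region name for each token, same length as token_logprobs.
    """
    if len(token_logprobs) != len(region_labels):
        raise ValueError("each token needs exactly one region label")
    totals = {}
    for logprob, region in zip(token_logprobs, region_labels):
        bits = -logprob / math.log(2)  # convert nats to bits
        totals[region] = totals.get(region, 0.0) + bits
    return totals

# Hypothetical example: two tokens in the RC region, one in the first noun region.
example_logprobs = [math.log(0.5), math.log(0.25), math.log(0.5)]
example_regions = ["RC", "RC", "N1"]
example_totals = region_surprisal(example_logprobs, example_regions)
```

Summing within regions keeps the measure comparable across tokenizers that split the same word into different numbers of subword pieces.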
Conclusion
In Turkish prenominal RC attachment, native speakers robustly utilize graded world knowledge to guide syntactic ambiguity resolution even when both parses are plausible. In contrast, contemporary LLMs—including both monolingual and advanced multilingual instruction-tuned models—fail to exhibit the human-like modulation of attachment preferences by plausibility, and in some cases show an anti-human effect. These results reveal a core limitation in the psycholinguistic adequacy of LLMs under subtle cue integration regimes and underscore the continued importance of targeted, diagnostic behavioral paradigms for both cognitive modeling and practical NLP in cross-linguistic contexts (2604.04825).