Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not

Published 6 Apr 2026 in cs.CL and cs.AI | (2604.04825v1)

Abstract: LLMs achieve strong performance on many language tasks, yet it remains unclear whether they integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution. We test this question in Turkish prenominal relative-clause attachment ambiguities, where the same surface string permits high attachment (HA) or low attachment (LA). We construct ambiguous items that keep the syntactic configuration fixed and ensure both parses remain pragmatically possible, while graded event plausibility selectively favors High Attachment vs. Low Attachment. The contrasts are validated with independent norming ratings. In a speeded forced-choice comprehension experiment, humans show a large, correctly directed plausibility effect. We then evaluate Turkish and multilingual LLMs in a parallel preference-based setup that compares matched HA/LA continuations via mean per-token log-probability. Across models, plausibility-driven shifts are weak, unstable, or reversed. The results suggest that, in the tested models, plausibility information does not guide attachment preferences as reliably as it does in human judgments, and they highlight Turkish RC attachment as a useful cross-linguistic diagnostic beyond broad benchmarks.

Authors (1)

Sercan Karakaş

Summary

The paper demonstrates that native Turkish speakers shift HA responses from 26.3% to 65.2% with graded plausibility manipulations.
It employs a forced-choice comprehension task and norming studies to rigorously compare human and LLM responses to RC ambiguities.
LLMs showed deficient or reversed plausibility effects, exposing a gap in integrating world knowledge with syntactic processing.

Plausibility-Driven Attachment in Turkish RC Ambiguity: Divergence Between Human and LLM Behavior

Introduction

This paper presents a systematic investigation of whether LLMs utilize graded, context-sensitive plausibility cues as humans do during syntactic ambiguity resolution, focusing on Turkish prenominal relative clause (RC) attachment. The study is positioned within the psycholinguistic tradition of examining RC attachment ambiguities in typologically diverse settings, extending beyond the English-centric paradigm to a morphologically rich, relatively low-resource language. The central diagnostic is whether both humans and LLMs shift their attachment preference (high attachment [HA] vs. low attachment [LA]) in response to finely controlled plausibility manipulations when both parses remain pragmatically viable.

Background and Motivation

The ambiguity in Turkish prenominal RCs arises when a surface string permits multiple hierarchical parses—specifically, whether the RC modifies the higher or lower noun in a complex nominal structure. Existing literature documents that both broad structural heuristics (Late Closure, Right Association, Recency) and discourse/pragmatic context influence human resolution of such ambiguities. Critically, prior studies suggest event-level world knowledge—quantified as "plausibility"—provides a soft, graded bias in online attachment decisions.

In high-resource languages, neural LMs have been evaluated as psycholinguistic models using surprisal-based linking hypotheses. However, their potential for structure-sensitive, world-knowledge-guided disambiguation remains poorly understood outside of English and under fine-grained plausibility conditions. This study targets these gaps by constructing Turkish RC ambiguity materials with precise plausibility manipulations and by directly comparing native speaker performance with that of Turkish and multilingual LLMs.

Experimental Paradigm and Methodology

Human Behavioral Experiment

A speeded forced-choice comprehension task was administered to 86 native Turkish speakers retained after screening for reasonable response latencies. Materials comprised 40 ambiguous Turkish prenominal RC items: 20 in which graded event plausibility favored HA and 20 in which it favored LA, as validated by norming studies, with syntactic configuration and broad pragmatic possibility held constant. On each trial, participants resolved the RC attachment by answering a who-question targeting the ambiguous constituent.

Model-Based Evaluation

Three LLMs were evaluated: a Turkish GPT-2, a Turkish reasoning-adapted Qwen-based model (DeepSeek-R1-Distill-Qwen-1.5B-Turkish), and the instruction-tuned multilingual Qwen3-30B-A3B-Instruct. They were scored with a forced-choice protocol aligned with the human task: for each item, matched HA and LA continuations were compared by mean per-token log-probability, and preference was operationalized as the difference ($\Delta$) between the HA and LA scores, with positive $\Delta$ indicating an HA preference.
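The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes the per-token log-probabilities of each continuation have already been obtained from a model, and the function names, the tie handling, and the example numbers are all hypothetical.

```python
from typing import Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Mean per-token log-probability of one continuation."""
    if not token_logprobs:
        raise ValueError("continuation must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def attachment_margin(ha_logprobs: Sequence[float],
                      la_logprobs: Sequence[float]) -> float:
    """Delta = mean log-prob(HA continuation) - mean log-prob(LA continuation)."""
    return mean_logprob(ha_logprobs) - mean_logprob(la_logprobs)


def preferred_attachment(delta: float) -> str:
    """Positive Delta means HA is preferred; the tie label is an assumption."""
    if delta > 0:
        return "HA"
    if delta < 0:
        return "LA"
    return "tie"


# Hypothetical token log-probabilities, for illustration only.
ha = [-1.0, -2.0]          # mean = -1.5
la = [-2.0, -3.0, -4.0]    # mean = -3.0
delta = attachment_margin(ha, la)
print(delta, preferred_attachment(delta))
```

Averaging per token rather than summing keeps continuations of different token lengths comparable, which matters for a morphologically rich language where matched HA and LA continuations may tokenize into different numbers of pieces.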

Results

Human Attachment Preferences

Humans exhibited a striking, directionally correct plausibility effect: the HA response rate increased from 26.3% in Low-WK contexts (LA favored by world knowledge) to 65.2% in High-WK contexts (HA favored), representing a 38.9 percentage point shift. This effect was highly reliable (logistic regression: $\beta_{WK}=1.65\pm0.11$, $z=14.75$, $p<10^{-50}$), corresponding to an odds ratio of over 5, and was tightly tracked by independent plausibility norming ratings ($\rho=0.85$, $p=2\times10^{-6}$).
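The arithmetic behind these figures can be checked directly. The sketch below uses only the numbers reported above; the function names are hypothetical, and treating the $\pm0.11$ as a standard error (used here for a Wald interval) is an assumption, not something the summary states.

```python
import math


def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1.0 - p)


def odds_ratio_from_beta(beta: float) -> float:
    """A logistic-regression coefficient is a log odds ratio."""
    return math.exp(beta)


def wald_interval(beta: float, se: float, z: float = 1.96) -> tuple:
    """Approximate 95% interval on the odds-ratio scale (assumes se is a standard error)."""
    return math.exp(beta - z * se), math.exp(beta + z * se)


shift_pp = 65.2 - 26.3                      # percentage-point shift in HA responses
or_model = odds_ratio_from_beta(1.65)       # from the reported coefficient
or_raw = odds(0.652) / odds(0.263)          # from the raw response rates
low, high = wald_interval(1.65, 0.11)       # assumption: 0.11 is a standard error

print(f"shift = {shift_pp:.1f} pp")
print(f"odds ratio (model) = {or_model:.2f}, (raw rates) = {or_raw:.2f}")
print(f"approximate interval = [{low:.2f}, {high:.2f}]")
```

Both routes give an odds ratio slightly above 5, consistent with the statement in the text.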

LLM Attachment Preferences

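None of the LLMs reproduced the robust, plausibility-driven shift found in humans. Here $\Delta\mathrm{HA}$ denotes the HA rate in High-WK minus the HA rate in Low-WK, in percentage points: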

Turkish GPT-2: Exhibited a rigid LA bias (HA: 30% in both WK conditions; $\Delta\mathrm{HA}=0$).
DeepSeek-R1-Distill-Qwen-1.5B-Turkish: Showed only a modest and statistically unreliable shift (HA: 60% in High-WK, 50% in Low-WK; $\Delta\mathrm{HA}=10$).
Qwen3-30B-A3B-Instruct: Displayed a pronounced overall bias toward HA and, critically, a reversed plausibility effect in the tested subset (HA: 70% in High-WK vs. 90% in Low-WK; $\Delta\mathrm{HA}=-20$). The reversal was statistically significant in the margin-level analysis, but in the direction opposite to the human data.

Statistical analysis (Fisher's exact tests, continuous margin comparison) consistently confirmed the absence of a human-like, plausibility-driven attachment shift in the LLMs. For Qwen3-30B-A3B-Instruct, the effect was not merely attenuated: its direction was opposite to that of human behavior.
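For readers unfamiliar with the test, a two-sided Fisher's exact test on a 2x2 table of HA/LA counts by WK condition can be sketched with the standard library alone. The summary reports only percentages, so the counts below (10 items per condition) are an assumption made purely for illustration, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Rows are conditions (e.g. High-WK, Low-WK); columns are HA and LA counts.
    The p-value sums the probabilities of all tables with the same margins
    that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def weight(k: int) -> int:
        # Unnormalised hypergeometric probability of k HA choices in row 1.
        return comb(row1, k) * comb(row2, col1 - k)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = weight(a)
    extreme = sum(weight(k) for k in range(lowest, highest + 1)
                  if weight(k) <= observed)
    return extreme / total


# ASSUMPTION: 10 items per condition, so 70% -> 7 HA and 90% -> 9 HA.
p_value = fisher_exact_two_sided(7, 3, 9, 1)
print(f"p = {p_value:.3f}")
```

Comparing integer weights rather than floating-point probabilities avoids rounding problems when two tables are exactly as likely as the observed one. A count-based test of this kind discards the size of each per-item preference, which is one reason to pair it with a comparison of the continuous margins.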

Theoretical and Practical Implications

The results delineate a fundamental dissociation: while LLMs encode considerable factual and commonsense knowledge, as evidenced by their aggregate benchmark performance, they do not reliably integrate graded event plausibility with syntactic structure in real-time ambiguity resolution when both interpretations are pragmatically tenable. This is not merely a reflection of absent world knowledge, but a deficit in its deployment as a constraint on syntactic disambiguation. The distinction between knowledge "in storage" (accessed in QA or guided completion) and "in use" (activated in incremental parsing and interpretation) is thus empirically supported.

This suggests strong theoretical limitations for the use of current LLMs as cognitive process models, especially concerning the compositional integration of contextual world knowledge with hierarchical syntax, particularly in typologically complex and morphologically rich languages. It also reinforces the diagnostic utility of RC attachment paradigms, especially when designed to probe graded rather than categorical pragmatic contrasts.

From an engineering perspective, strong aggregate performance on broad commonsense or linguistic benchmarks does not ensure human-like, structure-sensitive cue integration, at least for Turkish RC ambiguities. This has ramifications for LLM deployment in high-precision or language technology applications requiring nuanced, real-time comprehension in underrepresented languages.

Methodological Considerations and Future Directions

The failure modes observed (stable structural bias, weak modulation, reversed directionality) partially implicate the specifics of tokenization, model architecture, and the underrepresentation of Turkish in training data. The findings call for further studies that vary the prompt and continuation format, broaden the model pool (including finetuned or instruction-optimized Turkish LLMs), and control for possible instruction-tuning artifacts such as over-peaked choice distributions.

Future research should implement more granular time-course proxies (e.g., region-wise, incremental surprisal) to make human and model studies more comparable at the process level. The dissociation also motivates extension to other languages, inclusion of additional psycholinguistic phenomena, and a deeper inquiry into the limits of "commonsense-in-use" in LLMs.
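As one illustration of what a region-wise proxy might look like, the sketch below sums per-token surprisal within labeled sentence regions. It is not part of the reported study: the function name, the region labels, and the example probabilities are hypothetical, and the input is assumed to be natural-log token probabilities already extracted from a model.

```python
import math
from typing import Dict, Sequence


def region_surprisal(token_logprobs: Sequence[float],
                     region_ids: Sequence[str]) -> Dict[str, float]:
    """Sum per-token surprisal (in bits) within each region.

    token_logprobs: natural-log probability of each token given its prefix.
    region_ids: a region label for each token, in the same order.
    """
    if len(token_logprobs) != len(region_ids):
        raise ValueError("need exactly one region label per token")
    totals: Dict[str, float] = {}
    for logprob, region in zip(token_logprobs, region_ids):
        bits = -logprob / math.log(2)   # surprisal = -log2 p(token | prefix)
        totals[region] = totals.get(region, 0.0) + bits
    return totals


# Hypothetical example: two tokens in the RC region, one in the first noun.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5)]
regions = ["RC", "RC", "N1"]
print(region_surprisal(logprobs, regions))
```

Summing within regions rather than reporting single tokens lets a word split into several subword pieces be compared with a reading-time or response measure taken over the whole word or phrase.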

Conclusion

In Turkish prenominal RC attachment, native speakers robustly utilize graded world knowledge to guide syntactic ambiguity resolution even when both parses are plausible. In contrast, contemporary LLMs—including both monolingual and advanced multilingual instruction-tuned models—fail to exhibit the human-like modulation of attachment preferences by plausibility, and in some cases show an anti-human effect. These results reveal a core limitation in the psycholinguistic adequacy of LLMs under subtle cue integration regimes and underscore the continued importance of targeted, diagnostic behavioral paradigms for both cognitive modeling and practical NLP in cross-linguistic contexts (2604.04825).