- The paper shows that Turkish speakers adjust evidential morphology based on trust, with High-Trust contexts eliciting more direct evidence markers (-DI).
- The study finds that LLMs often default to base-rate biases, displaying limited sensitivity to contextual trust manipulations.
- Quantitative experiments (e.g., OR = 2.029, p = 7.91×10⁻¹²) highlight a robust gap between human evidential reasoning and current LLM performance.
Source-Sensitive Evidential Reasoning in Turkish: Benchmarking Humans and LLMs
Evidentiality in Turkish and Theoretical Framework
Evidentiality, the linguistic marking of information source, is prominent in Turkish through morphological distinctions, particularly between past-tense suffixes -DI and -mIş. The -DI form is canonically linked to direct evidence or stronger speaker commitment, while -mIş typically conveys indirect evidence such as inference or hearsay. However, the distribution and interpretation of these forms are context-sensitive and shaped by discourse factors, aligning with theoretical work that distinguishes between veridical and non-veridical commitment states and emphasizes the role of source trustworthiness in evidential selection.
The paper applies the Giannakidou-Mari framework, which decouples source information from speaker commitment, treating evidential marker choice as a function of trust and epistemic bias rather than categorical classification. The experimental manipulation isolates the effect of perceived reliability of an explicitly external information source (High-Trust vs. Low-Trust), directly testing whether Turkish speakers modulate evidential morphology based on trust cues.
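The capital letters in -DI and -mIş stand for sounds that vary with vowel harmony and consonant voicing, so each suffix surfaces in several forms. As a rough illustration of how a completed verb could be assigned to one of the two categories, the sketch below labels bare third-person-singular forms by their ending. The function names and the "OTHER" label are hypothetical, and the suffix check is only a heuristic, not the coding procedure used in the paper.

```python
# Surface allomorphs of the two past-tense evidential suffixes.
DI_ALLOMORPHS = ("dı", "di", "du", "dü", "tı", "ti", "tu", "tü")
MIS_ALLOMORPHS = ("mış", "miş", "muş", "müş")


def turkish_lower(text: str) -> str:
    """Lowercase using Turkish casing rules (I -> ı, İ -> i)."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def classify_evidential(form: str) -> str:
    """Label a bare 3rd-person-singular past form as 'DI', 'MIS', or 'OTHER'.

    Heuristic only: it inspects the word ending and does not parse morphology,
    so person-marked forms and unrelated words ending in these strings are
    outside its scope.
    """
    word = turkish_lower(form.strip())
    if word.endswith(MIS_ALLOMORPHS):
        return "MIS"
    if word.endswith(DI_ALLOMORPHS):
        stem = word[:-2]
        # -mIş followed by -DI (e.g. "gelmişti") is the pluperfect, not plain -DI.
        if stem.endswith(MIS_ALLOMORPHS):
            return "OTHER"
        return "DI"
    return "OTHER"


print(classify_evidential("geldi"))   # DI
print(classify_evidential("gelmiş"))  # MIS
```

A real coding scheme would need morphological analysis to handle person and number agreement; the ending-based check is kept here only because it makes the allomorph sets explicit.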
Human Production Experiment: Robust Trust Effects
A cloze-style fill-in-the-blank experiment with native Turkish speakers probed evidential suffix choice under manipulated trust conditions. High-Trust contexts (e.g., official municipal alerts) elicited significantly more -DI completions, whereas Low-Trust contexts (e.g., informal neighbor reports) yielded more -mIş completions. The effect persisted across strict and lenient coding schemes and was robust in stratified analyses controlling for content. The odds ratios and significance levels (e.g., OR = 2.029, p = 7.91 × 10⁻¹²) indicate that Turkish evidential morphology is sensitive not only to source type but also to perceived source reliability, supporting a commitment-based account of evidentiality.
This outcome imposes critical constraints on the Turkish evidentiality debate: source attribution alone is insufficient to explain morphological selection. Instead, trust in the external source modulates the speaker’s evidential commitment profile, a result consistent with analyses that prioritize the interaction between evidentiality, trust, and pragmatic reasoning over rigid grammatical labeling.
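To make the odds-ratio statistic concrete, here is a minimal sketch of how an OR and a Wald confidence interval can be computed from a 2×2 table of trust condition by suffix choice. The counts are invented for illustration and are not the paper's data, and the paper's own estimation procedure (for instance, any stratified estimator) may differ.

```python
import math


def odds_ratio(di_high, mis_high, di_low, mis_low):
    """Odds of -DI (vs. -mIş) in High-Trust divided by the odds in Low-Trust."""
    return (di_high * mis_low) / (mis_high * di_low)


def wald_ci(di_high, mis_high, di_low, mis_low, z=1.96):
    """Approximate 95% confidence interval for the odds ratio (Wald, log scale)."""
    log_or = math.log(odds_ratio(di_high, mis_high, di_low, mis_low))
    se = math.sqrt(1 / di_high + 1 / mis_high + 1 / di_low + 1 / mis_low)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical counts (NOT from the paper):
# High-Trust: 60 -DI, 40 -mIş; Low-Trust: 40 -DI, 60 -mIş.
print(odds_ratio(60, 40, 40, 60))  # 2.25
print(wald_ci(60, 40, 40, 60))
```

An odds ratio above 1 whose interval excludes 1 corresponds to the pattern described above: -DI is more likely when the source is trusted.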
LLM Evaluation: Paradigm and Results
The paper benchmarks ten LLMs (both Turkish-focused and multilingual) in three prompting regimes: open gap-fill, explicit past-tense gap-fill, and forced-choice (A/B selection between -DI and -mIş forms). The datasets systematically manipulate source trust across 200 items.
Experiment I: Open Gap-Fill
Model usability varied markedly; usable evidential outputs ranged from 64.5% (Gemma-3-27B-IT) to 0.0% (Trendyol-LLM-7B). Most models exhibited strong default biases toward -DI, and when trust-consistent effects appeared, they were generally weak, unstable, or even reversed (e.g., Orbita-v0.1 showed a significant effect in the opposite direction, p = 0.018). Compliance and output-format limitations, rather than contextual sensitivity, dominated the results.
Experiment II: Explicit Past-Tense Prompting
Explicit instructions improved evidential compliance for some systems, notably Gemma-3-27B-IT, whose -DI share fell reliably from 97.9% (High-Trust) to 84.5% (Low-Trust) (ΔDI = +13.4 pp, p = 1.51 × 10⁻³). However, most models still failed to track the trust manipulation robustly, showing only minor shifts or none.
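The percentage-point shift reported for Gemma-3-27B-IT is simply the High-Trust -DI share minus the Low-Trust -DI share. The sketch below reproduces that arithmetic and, as an assumption, pairs it with a pooled two-proportion z-test; the summary does not say which test the paper used, and the function names are hypothetical.

```python
import math


def delta_pp(share_high: float, share_low: float) -> float:
    """Difference in -DI share in percentage points (High-Trust minus Low-Trust)."""
    return share_high - share_low


def two_proportion_z(x_high, n_high, x_low, n_low):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    x_* are counts of -DI responses and n_* are usable responses per condition.
    """
    p_high = x_high / n_high
    p_low = x_low / n_low
    pooled = (x_high + x_low) / (n_high + n_low)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_high + 1 / n_low))
    z = (p_high - p_low) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


print(round(delta_pp(97.9, 84.5), 1))  # 13.4
```

The significance test needs the underlying counts, which the summary does not report, so only the percentage-point difference is computed from the published figures.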
Experiment III: Forced-Choice A/B Selection
Forced-choice prompting induced high response determinism (replicate agreement >90%) while surfacing modest, directionally correct trust effects in several models (e.g., Gemma-2-9B-IT-TR: Δ = 16.0 pp). Nevertheless, baseline option preferences overwhelmed contextual adaptation: mapping accuracy remained near chance, and no per-model effect survived Holm correction for multiple testing.
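Holm correction is a step-down procedure: p-values are sorted in ascending order and compared against progressively looser thresholds, stopping at the first failure. A minimal sketch follows; the p-values in the example are hypothetical and are not the per-model values from the paper.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure.

    Returns a list of booleans, aligned with p_values, marking which
    hypotheses are rejected at family-wise error rate alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared to alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Hypothetical p-values for four models:
print(holm_reject([0.01, 0.04, 0.03, 0.005]))  # [True, False, False, True]
```

With ten models, the smallest p-value must fall at or below 0.05 / 10 = 0.005 for any effect to survive, which is why effects that look significant in isolation can disappear after correction.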
Exploratory Attention Analysis
Head-level attention modulation in Gemma models revealed only localized, weak changes in trust-cue attention span, with Gemma-3-27B-IT showing more consistent redistribution in higher layers. This provides a qualitative diagnostic of internal sensitivity, but not strong mechanistic evidence for robust trust encoding.
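One simple way to quantify head-level modulation of this kind is to compare the share of a head's attention mass that falls on the trust-cue tokens across the two conditions. The sketch below assumes that attention weights are available as plain lists and that the cue token positions are known; the metric and function names are hypothetical and may not match the paper's analysis.

```python
def cue_attention_share(attention_row, cue_positions):
    """Fraction of one head's attention mass that lands on the trust-cue tokens."""
    total = sum(attention_row)
    if total == 0:
        return 0.0
    return sum(attention_row[i] for i in cue_positions) / total


def head_modulation(row_high, row_low, cue_high, cue_low):
    """Change in cue-attention share between High-Trust and Low-Trust prompts.

    cue_high and cue_low are the cue token positions in each prompt; they can
    differ because the two prompts are tokenized separately.
    """
    return cue_attention_share(row_high, cue_high) - cue_attention_share(row_low, cue_low)


# Hypothetical attention rows for one head over a four-token context:
high = [0.1, 0.5, 0.2, 0.2]
low = [0.3, 0.2, 0.3, 0.2]
print(head_modulation(high, low, cue_high=[1], cue_low=[1]))
```

A value near zero for most heads would match the localized, weak changes described above.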
Implications, Limitations, and Future Directions
The findings underscore a sharp human-LLM gap in context-sensitive evidential reasoning. While Turkish speakers integrate source trust into morphological choice, current LLMs are largely insensitive to this contextual cue, defaulting to base-rate preferences or prompt-dependent behaviors. This gap persists even under constrained response formats, challenging claims that LLMs have achieved human parity in pragmatic linguistic tasks.
Practically, this limits the deployment of Turkish LLMs in applications that require source-sensitive reasoning (retrieval-augmented generation, citation-based QA, legal and journalistic summarization), as trust cues are not reliably encoded or reflected in outputs. Theoretically, the experimental results reinforce the need to model commitment and epistemic stance as dynamic, context-driven constructs, informing the direction of future semantic-pragmatic research.
Limitations include the restriction of direct human evaluation to the smaller dataset, LLM output-format issues (especially in open generation), and potential forced-choice response biases. The attention analysis is preliminary and restricted to select models.
Future research should address compositional generalization across agglutinative inflectional systems, scale human studies, and explore architectural or training interventions that improve morphosyntactic/contextual adaptation in LLMs. Controlled benchmarking against human baselines, particularly for semantics-pragmatics phenomena, remains imperative for model evaluation and design.
Conclusion
The paper provides compelling evidence that Turkish evidential morphology in past tense is robustly modulated by source trustworthiness for humans, but is incompletely and inconsistently tracked by contemporary LLMs across multiple prompting regimes. The results advance theoretical perspectives on evidentiality, contribute practical evaluation resources, and motivate systematic investigation of trust-sensitive reasoning in LLMs, highlighting the need for improved modeling of epistemic commitment and contextual adaptation in Turkish and other evidential languages (2604.24665).