How LLMs Comprehend Temporal Meaning in Narratives: A Case Study in Cognitive Evaluation of LLMs (2507.14307v1)

Published 18 Jul 2025 in cs.CL

Abstract: LLMs exhibit increasingly sophisticated linguistic capabilities, yet the extent to which these behaviors reflect human-like cognition versus advanced pattern recognition remains an open question. In this study, we investigate how LLMs process the temporal meaning of linguistic aspect in narratives that were previously used in human studies. Using an Expert-in-the-Loop probing pipeline, we conduct a series of targeted experiments to assess whether LLMs construct semantic representations and pragmatic inferences in a human-like manner. Our findings show that LLMs over-rely on prototypicality, produce inconsistent aspectual judgments, and struggle with causal reasoning derived from aspect, raising concerns about their ability to fully comprehend narratives. These results suggest that LLMs process aspect fundamentally differently from humans and lack robust narrative understanding. Beyond these empirical findings, we develop a standardized experimental framework for the reliable assessment of LLMs' cognitive and linguistic capabilities.

Summary

  • The paper presents an expert-in-the-loop probing pipeline to assess LLMs' comprehension of the temporal meaning of linguistic aspect in narratives.
  • The experiments reveal that LLMs perform well on prototypical perfective events but struggle with imperfective events, affecting causal inference.
  • The study indicates that LLMs rely on linguistic prototypes, resulting in less human-like causal reasoning and inconsistent aspectual judgments.

LLMs' Temporal Comprehension in Narratives

This paper (2507.14307) investigates the extent to which LLMs comprehend the temporal meaning of linguistic aspect within narratives. The authors employ an Expert-in-the-Loop probing pipeline, adapting experimental methodologies from cognitive science to assess whether LLMs construct semantic representations and pragmatic inferences akin to human cognition. The paper reveals that LLMs exhibit an over-reliance on prototypical linguistic structures, produce inconsistent aspectual judgments, and struggle with causal reasoning derived from aspect, indicating a fundamental difference in how LLMs process aspect compared to humans. The authors also introduce a standardized experimental framework to assess cognitive and linguistic capabilities of LLMs.

Experimental Design and Methodology

The paper is based on narratives previously used in human studies, focusing on the manipulation of grammatical aspect (perfective vs. imperfective) to influence causal inference. The narratives are structured to include two potential causes (Cause 1 and Cause 2) followed by an effect, with the aspect of Cause 1 varied to assess its impact on the perceived causality of the effect.

Figure 1: We examine how LLMs understand differences in aspect by presenting LLMs with narratives that have a key word either in the imperfective or perfective (e.g., "was passing" vs. "passed"), followed by comprehension probes adapted from previous human studies.
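To make the structure of these stimuli concrete, here is a minimal sketch of how one narrative item and its aspect manipulation could be represented in code. The class, field names, and example sentences are invented for illustration and are not drawn from the paper's stimulus set.

```python
from dataclasses import dataclass

@dataclass
class NarrativeStimulus:
    """One narrative item in which the aspect of Cause 1 is manipulated."""
    cause1_imperfective: str  # Cause 1 phrased in the imperfective
    cause1_perfective: str    # Cause 1 phrased in the perfective
    cause2: str               # the competing second cause
    effect: str               # the effect sentence that follows both causes

    def render(self, aspect: str) -> str:
        """Assemble the narrative with Cause 1 in the requested aspect."""
        cause1 = (self.cause1_imperfective if aspect == "imperfective"
                  else self.cause1_perfective)
        return " ".join([cause1, self.cause2, self.effect])

# Illustrative item (invented for this sketch, not from the paper's materials)
item = NarrativeStimulus(
    cause1_imperfective="A runner was passing the leader on the final lap.",
    cause1_perfective="A runner passed the leader on the final lap.",
    cause2="The leader stumbled on a loose patch of track.",
    effect="The race ended in an upset.",
)
print(item.render("imperfective"))
```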

The Expert-in-the-Loop probing pipeline (Figure 2) is designed to facilitate controlled behavioral experiments with LLMs. It incorporates three core stages: prompting, prompt paraphrasing, and model inference. Prompts are constructed to mirror human experimental designs, and prompt paraphrasing introduces controlled variations to ensure robustness. The experiments are run across multiple LLMs so that the observed behaviors are model-agnostic. The pipeline is applied iteratively, with experts assessing intermediate results to gather converging evidence.

Figure 2: A conceptual overview of our Expert-in-the-Loop probing pipeline for assessing cognitive abilities of LLMs, designed in close collaboration with domain experts from cognitive science.
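A rough sketch of this three-stage flow (prompting, paraphrasing, inference) might look like the following; the `paraphrase` stub and the model callables are placeholders under assumed interfaces, not the authors' implementation. Experts would then inspect the aggregated results between iterations, as described above.

```python
from typing import Callable, Dict, List

def paraphrase(prompt: str, n: int) -> List[str]:
    """Stand-in for controlled prompt paraphrasing; a real pipeline would
    generate n meaning-preserving rewrites of the prompt to test robustness."""
    return [prompt] * n  # identity paraphrases as a placeholder

def run_probe(prompt_template: str,
              stimuli: List[Dict[str, str]],
              models: Dict[str, Callable[[str], str]],
              n_paraphrases: int = 30) -> List[dict]:
    """Stage 1: build prompts from stimuli; Stage 2: paraphrase each prompt;
    Stage 3: collect responses from every model for model-agnostic comparison."""
    results = []
    for stimulus in stimuli:
        prompt = prompt_template.format(**stimulus)            # prompting
        for variant in paraphrase(prompt, n_paraphrases):      # paraphrasing
            for name, model in models.items():                 # inference
                results.append({"stimulus": stimulus,
                                "model": name,
                                "prompt": variant,
                                "response": model(variant)})
    return results
```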

Key Findings

The paper presents three main experiments: truth value judgments, word completion tasks, and open-ended causal questions.

Truth Value Judgments

The truth value judgment experiment probes the LLMs' ability to infer the completion of events in narratives. The results indicate that LLMs perform well with perfective events but struggle with imperfective events, demonstrating difficulty with non-prototypical aspect usage.

Figure 3: Accuracy in semantic truth-value judgments for events marked with imperfective aspect for LLMs, when the events are embedded within a narrative (shaded bars) versus not. For imperfective events, LLMs have much lower accuracy rates than humans when judging whether the event's resulting final state is valid. Further, LLMs seem to be heavily affected by the presence or absence of a narrative, especially when judging the negative polarity of final states. Notably, the presence or absence of a narrative changes responses in inconsistent directions. Error bars represent the standard error.
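For illustration, a truth-value probe and a simple scoring rule could be sketched as follows; the exact wording of the paper's prompts and its scoring procedure are not specified here, so treat this purely as an assumed format.

```python
def truth_value_prompt(narrative: str, final_state: str) -> str:
    """Ask whether the event's resulting final state holds after the narrative."""
    return (f"{narrative}\n\n"
            f"Statement: {final_state}\n"
            "Is this statement true or false? Answer with a single word.")

def is_correct(response: str, expected_true: bool) -> bool:
    """Score a response by its first word; anything else counts as incorrect."""
    words = response.strip().lower().split()
    if not words:
        return False
    return words[0].strip(".") == ("true" if expected_true else "false")
```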

Word Completion Task

The word completion task assesses whether LLMs treat aspect in narratives as a temporal signal for retaining or encoding an event. While LLMs exhibit some alignment with human patterns when the word completion is placed immediately after Cause 1, the frequency of target word responses decreases when the completion is placed after the Effect. This suggests that LLMs struggle to integrate causal information across more distant parts of the narrative.

Figure 4: Frequencies at which word completion rates match the target word from Cause 1 across models. Shaded bars are for imperfective aspect in Cause 1. LLM completions have significantly higher match frequencies when the probe directly follows Cause 1 (top) and are reduced after the effect (bottom).
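A back-of-the-envelope version of the match-frequency metric shown in Figure 4 could be computed as below; the paper's exact definition may differ, so this is an assumed reading with hypothetical completions.

```python
def match_frequency(completions: list[str], target_word: str) -> float:
    """Fraction of completions that reproduce the target word from Cause 1."""
    if not completions:
        return 0.0
    target = target_word.lower()
    hits = sum(1 for c in completions if target in c.lower().split())
    return hits / len(completions)

# Hypothetical completions collected right after Cause 1 vs. after the Effect
near_cause1 = ["passing", "running", "passing"]
after_effect = ["cheering", "passing", "finishing"]
print(match_frequency(near_cause1, "passing"))   # 0.67
print(match_frequency(after_effect, "passing"))  # 0.33
```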

Open-Ended Causal Questioning

The open-ended causal questioning examines whether LLMs' causal inferences are affected by aspect. The results show that LLMs are more likely to infer that Cause 1 caused the effect in the imperfective condition, similar to humans. However, LLMs make this inference less frequently than humans, and they are less likely to offer a perfective Cause 1 as their answer, indicating an over-reliance on prototypicality.

Figure 5: As LLM parameter size increases, there is a trend towards more human-like causal inferences with respect to the Cause 1 event when Cause 1 is in the imperfective. When Cause 1 is in the perfective, LLMs are consistently below human causal inference rates.
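A schematic version of the open-ended probe, together with a crude way of coding whether an answer attributes the effect to Cause 1, is sketched below; the prompt wording and the keyword check are assumptions, and the paper's actual answer coding is expert-driven rather than keyword-based.

```python
def causal_question(narrative: str) -> str:
    """Open-ended probe asking what caused the final outcome (wording assumed)."""
    return f"{narrative}\n\nIn one sentence, what caused this outcome?"

def attributes_to_cause1(response: str, cause1_keywords: set[str]) -> bool:
    """Very rough keyword check for whether the answer cites Cause 1."""
    words = set(response.lower().replace(".", " ").split())
    return bool(words & {k.lower() for k in cause1_keywords})
```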

Implications and Future Directions

The paper suggests that LLMs process aspect in narratives differently from humans due to a lack of pragmatic, context-level understanding. The truth-value results for prototypical and non-prototypical pairings of events and aspect indicate that LLMs represent aspect distributionally rather than in terms of the concepts that aspect expresses.

The findings also reveal a disconnect between declarative knowledge and implicit application in LLMs. While LLMs can articulate the definition of aspect, they fail in implicit application tasks. Future research should explore how this tension extends to other linguistic and cognitive domains.

The authors contribute a generalizable experimental pipeline for assessing LLMs' cognitive capabilities. The pipeline (Figures 6 and 7) includes the creation of datasets with multiple stimuli groups. This framework facilitates future research on the cognitive evaluation of LLMs.

Figure 6: We assume as little structure as possible for experimental datasets to allow for generalizability to other domains. Datasets consist of multiple groups of stimuli, which have different independent variable values (users can indicate which fields are independent variables). Human studies often compare metrics across different stimuli groups to draw conclusions about the effects of independent variables.
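One possible layout for such a loosely structured stimulus file, with a user-declared independent variable defining the comparison groups, is sketched below; the column names and rows are hypothetical rather than taken from the authors' datasets.

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical stimulus file: each row is one stimulus, and "aspect" is the
# user-declared independent variable that defines the comparison groups.
stimulus_csv = """item_id,aspect,narrative
1,imperfective,A runner was passing the leader ...
1,perfective,A runner passed the leader ...
2,imperfective,The chef was chopping the onions ...
2,perfective,The chef chopped the onions ...
"""

def group_by_independent_variable(rows, iv_column):
    """Split stimuli into groups according to the independent variable's value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[iv_column]].append(row)
    return groups

rows = list(csv.DictReader(StringIO(stimulus_csv)))
groups = group_by_independent_variable(rows, "aspect")
print({value: len(items) for value, items in groups.items()})
# -> {'imperfective': 2, 'perfective': 2}
```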

Figure 7: Important components of our web application for collaborating with cognitive scientists. Top: A navigation bar showing the steps that cognitive scientists complete to submit an experiment to the pipeline. Middle: Cognitive scientists can share task instructions and select which columns from their uploaded stimulus file should be included in the prompt (we then post-process the prompts and create 30 paraphrased versions). Bottom: Cognitive scientists identify independent variables from their uploaded stimulus file and define groups based on these independent variables.

Conclusion

The paper's comprehensive analysis of LLMs' cognitive capabilities in processing linguistic aspect reveals critical limitations in their narrative comprehension. The Expert-in-the-Loop probing pipeline and the experimental findings contribute to the growing understanding of LLM cognition, underscoring the need for further research into the cognitive foundations of LLM behavior.
