
AIPasta: Benchmark for Narrative State Reasoning

Updated 8 December 2025
  • AIPasta is a dataset and benchmark designed to evaluate AI systems' capacity to infer implicit participant states in short English narratives.
  • The methodology involves a four-step crowd annotation process: state inference, justification, counterfactual perturbation, and minimal story revision.
  • Key evaluation tasks include state inference, story revision, and state change explanation, with performance measured by accuracy, human acceptability, and explanation precision.

AIPasta refers to PASTA (PArticipant STAtes), a crowdsourced dataset and benchmark suite designed for evaluating and advancing AI systems’ capacity to infer, revise, and explain the implicit and often unstated states of characters and objects within short English-language narratives. By modeling participant states—attributes or conditions not mentioned directly in the text but essential for narrative coherence—PASTA challenges existing natural language understanding models to transcend surface event prediction and perform free-form causal and counterfactual reasoning about narrative worlds (Ghosh et al., 2022).

1. Motivation and Conceptual Foundations

Narrative comprehension by humans involves rich, implicit inference about participant states: properties such as ownership, beliefs, emotional disposition, physical status, or numerical facts, which, though unstated, underpin the events and coherence of a story. Standard datasets focus on event order (“What happened next?”) or limited affective categories; they generally lack mechanisms to elicit or evaluate fine-grained inference about open-class, unstated properties, or the impact of their change under counterfactual scenarios. PASTA directly addresses these limitations by targeting three core reasoning capabilities:

  • Entailment: Inferring which participant states are supported by the narrative.
  • Counterfactual Revision: Editing stories to conform to alternative participant states.
  • State Change Explanation: Identifying the specific state transitions between story variants.

This focus on “participant states” provides a testing ground for models’ abilities in open-vocabulary commonsense reasoning, beyond lexical pattern-matching or constrained sentiment classes.

2. Dataset Construction and Annotation Protocol

PASTA is built upon the ROCStories corpus of everyday five-sentence stories. Each source story $S$ is annotated through a rigorous four-step procedure conducted by three independent crowd workers:

  1. State Inference ($\alpha$): Annotators specify a free-form, unstated participant property (e.g., “Kate’s mother values cleanliness”). These states must not duplicate explicit actions or text content.
  2. Justification ($J_\alpha^S$): Annotators select the minimal subset of sentences in $S$ that suffices to plausibly support $\alpha$.
  3. Counterfactual Perturbation ($\alpha'$): Annotators formulate an alternative state that is highly unlikely given $S$ (e.g., “the soda’s lid was tight”).
  4. Minimal Revision ($S \rightarrow S'$): Annotators revise $S$ to $S'$ so that $\alpha'$ is entailed and $\alpha$ is excluded, altering as little as possible.

Quality control includes strict annotator qualifications and expert review. The final corpus contains 10,743 validated 4-tuples $(S, \alpha, \alpha', S')$, partitioned into 8,476 train, 1,350 validation, and 917 test samples.
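To make the data layout concrete, the sketch below models one annotated 4-tuple as a Python structure. The field names and the story text are illustrative assumptions, not the dataset’s actual release schema.

```python
# Minimal sketch of one PASTA example as a Python structure.
# Field names and story text are illustrative assumptions, not the
# dataset's actual release schema.
from dataclasses import dataclass
from typing import List

@dataclass
class PastaExample:
    story: List[str]              # S: the five sentences of the source story
    state: str                    # alpha: free-form, unstated participant state
    justification: List[int]      # indices of sentences in S supporting alpha
    counterfactual_state: str     # alpha': alternative state unlikely given S
    revised_story: List[str]      # S': minimal revision that entails alpha'

example = PastaExample(
    story=[
        "Kate's mother came to visit her apartment.",
        "Kate spent all morning cleaning before she arrived.",
        "Her mother inspected every room.",
        "She said the place looked spotless.",
        "Kate was relieved the visit went well.",
    ],
    state="Kate's mother values cleanliness.",
    justification=[1, 2],
    counterfactual_state="Kate's mother does not care about tidiness.",
    revised_story=[
        "Kate's mother came to visit her apartment.",
        "Kate didn't bother cleaning before she arrived.",
        "Her mother barely glanced at the rooms.",
        "She said she just wanted to catch up.",
        "Kate was relieved the visit went well.",
    ],
)
```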

Key dataset characteristics:

| Statistic | Value | Description |
|---|---|---|
| Unique stories | 5,028 | Distinct five-sentence narratives |
| Avg. tokens/state | 5.7 (inferred), 6.0 (perturbed) | Lexical brevity of states |
| Avg. justification sentences | 1.5 | Sentences supporting state inference |
| Avg. sentences edited ($S \to S'$) | 1.48/5 | Minimal revisions ensure narrative coherence |
| Token overlap ($S$ vs. $S'$) | 90.3% | High lexical retention in revision |
| Token overlap ($\alpha$ vs. $\alpha'$) | 71.9% | Ensures semantic, not lexical, shift |

States span physical (temperature, location), numerical (counts), societal (relationships, norms), and emotional/psychological (beliefs, desires) schemas.
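The overlap statistics above can be approximated with a simple unigram overlap measure. The whitespace tokenization and the exact formula below are assumptions; the authors’ precise computation may differ.

```python
# Sketch of a unigram token-overlap measure between a story S and its
# revision S'. Whitespace tokenization and the formula are illustrative
# assumptions, not the paper's exact computation.
def token_overlap(original: str, revised: str) -> float:
    orig_tokens = original.lower().split()
    rev_tokens = revised.lower().split()
    shared = set(orig_tokens) & set(rev_tokens)
    # Fraction of revised tokens that also appear in the original.
    return sum(tok in shared for tok in rev_tokens) / len(rev_tokens)

s = "Tim shook his soda. The lid was loose and it spilled everywhere."
s_prime = "Tim shook his soda. The lid was tight and nothing spilled."
print(f"overlap = {token_overlap(s, s_prime):.1%}")
```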

3. Formal Task Specifications

PASTA defines three evaluation tasks with precise formalizations:

3.1 State Inference (Classification)

Binary decision: Is state $\alpha$ entailed by story $S$ via justification $J$? Let $\tau$ be a decision threshold.

$$\text{Entail}(\alpha, S, J) = \begin{cases} 1 & \text{if } P(\alpha \mid J; S) \geq \tau \\ 0 & \text{otherwise} \end{cases}$$

Contrastive pairs are central: $(S, J, \alpha)$ is positive and $(S, J, \alpha')$ negative, with analogous construction for $(S', J, \cdot)$.
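The sketch below shows how the four contrastive instances for one tuple can be assembled and how the thresholded decision is applied. The `score_entailment` callable is a hypothetical stand-in for a fine-tuned classifier (e.g., RoBERTa) returning $P(\alpha \mid J; S)$.

```python
# Sketch of the contrastive state-inference setup. `score_entailment` is a
# hypothetical stand-in for a fine-tuned classifier returning P(alpha | J; S).
from typing import Callable, List, Tuple

def build_contrastive_instances(
    S: str, S_prime: str, J: str, alpha: str, alpha_prime: str
) -> List[Tuple[str, str, str, int]]:
    """Instances as (story, justification, state, label)."""
    return [
        (S, J, alpha, 1),              # alpha is entailed by S
        (S, J, alpha_prime, 0),        # alpha' contradicts S
        (S_prime, J, alpha_prime, 1),  # alpha' is entailed by S'
        (S_prime, J, alpha, 0),        # alpha is no longer entailed by S'
    ]

def entail(score_entailment: Callable[[str, str, str], float],
           story: str, J: str, state: str, tau: float = 0.5) -> int:
    """Binary decision Entail(alpha, S, J) with threshold tau."""
    return 1 if score_entailment(story, J, state) >= tau else 0
```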

3.2 Story Revision (Generative)

Given $(S, \alpha')$ with $\alpha'$ contradicting $S$, produce a minimally revised $S'$ such that:

  • $P(\alpha' \mid S') \gg P(\alpha' \mid S)$,
  • $P(\alpha \mid S') \approx 0$.

The model must locate and minimally rewrite the conflicting sentences in $S$.
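One common way to pose this task for a text-to-text model such as T5 is to serialize the story and the counterfactual state into a single input string; the template below is an illustrative assumption, not the paper’s exact input format.

```python
# Sketch of framing story revision as text-to-text generation. The prompt
# template is an illustrative assumption, not the paper's exact format.
def revision_input(story: str, counterfactual_state: str) -> str:
    return f"state: {counterfactual_state} story: {story}"

# A fine-tuned seq2seq model maps this input to S', the minimally edited
# story that entails the counterfactual state and excludes the original one.
```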

3.3 State Change Explanation (Generative)

Given $(S, S')$, output $(\alpha, \alpha')$ such that:

  • $P(\alpha \mid S) \gg P(\alpha \mid S')$,
  • $P(\alpha' \mid S') \gg P(\alpha' \mid S)$.

These represent the participant state altered by the revision and its counterfactual.
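In practice, the two $\gg$ conditions can be operationalized with the same kind of entailment scorer used for state inference. The margin-based check below is a sketch under that assumption; `score` is a hypothetical function returning $P(\text{state} \mid \text{story})$.

```python
# Sketch of checking the state-change-explanation conditions with an assumed
# entailment scorer `score(state, story)` returning P(state | story).
def valid_explanation(score, S: str, S_prime: str,
                      alpha: str, alpha_prime: str,
                      margin: float = 0.5) -> bool:
    # P(alpha | S) should greatly exceed P(alpha | S'), and symmetrically
    # P(alpha' | S') should greatly exceed P(alpha' | S).
    cond_original = score(alpha, S) - score(alpha, S_prime) >= margin
    cond_revised = score(alpha_prime, S_prime) - score(alpha_prime, S) >= margin
    return cond_original and cond_revised
```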

4. Model Benchmarks and Performance Analysis

The PASTA benchmark suite evaluates both discriminative and generative architectures:

  • Classification: BERT-base/large, RoBERTa-base/large (fine-tuned).
  • Generation: T5-base/large (fine-tuned); GPT-3 (text-davinci-002) via few-shot prompting.

Training and inference details: RoBERTa/BERT models use AdamW ($\text{lr} = 5 \times 10^{-6}$) for 5–7 epochs; T5 variants use AdamW ($\text{lr} = 1 \times 10^{-4}$) with nucleus sampling ($p = 0.93$); GPT-3 prompts contain 5–15 in-context examples, with configurations selected by BERTScore on the validation set.
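The stated T5 hyperparameters translate directly into a standard Hugging Face setup, sketched below; everything beyond the reported learning rate and nucleus-sampling value (input text, sequence lengths) is an assumption.

```python
# Sketch of the reported T5 setup with Hugging Face transformers:
# AdamW with lr = 1e-4 and nucleus sampling with p = 0.93.
# Input text and max_length are illustrative assumptions.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # fine-tuning optimizer

inputs = tokenizer("state: the soda's lid was tight story: Tim shook his soda.",
                   return_tensors="pt")
outputs = model.generate(**inputs, do_sample=True, top_p=0.93, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```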

Performance:

| Task | Metric | Best baseline | Value | Human |
|---|---|---|---|---|
| State Inference | Accuracy | RoBERTa-large | 90.6% (standard), 86.6% (contrastive) | ~98% (standard), 89% (contrastive) |
| Story Revision | Human acceptability | T5-large | 54% | — |
| Story Revision | Human acceptability | GPT-3 | ~49% | — |
| State Change Explanation | Human correctness | T5-large | 56% | — |
| State Change Explanation | Human correctness | GPT-3 | ~43% | — |

Automatic proxies (BLEU/GLEU, ROUGE-L, BERTScore) correlate only moderately with human acceptability ($r \approx 0.58$), confirming the necessity of human evaluation.
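Such a correlation corresponds to a standard Pearson computation between per-output metric scores and human judgments; the values in this sketch are illustrative placeholders, not data from the paper.

```python
# Sketch of correlating an automatic metric with human acceptability
# judgments. The arrays are illustrative placeholders, not paper data.
from scipy.stats import pearsonr

bertscore = [0.91, 0.84, 0.88, 0.79, 0.93, 0.81]  # metric score per output
human = [1, 0, 1, 0, 1, 1]                        # human acceptability labels

r, p_value = pearsonr(bertscore, human)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```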

5. Empirical Findings and Error Analysis

Analysis of model outputs reveals several persistent weaknesses:

  • Commonsense Diversity: Greatest difficulty observed for societal and numerical states (e.g., “they have five coworkers”), modest difficulty for emotional states with surface cues.
  • Logical Coherence: In story revision, approximately 30% of outputs are logically incoherent, 20% directly contradict the counterfactual, and 28% edit non-conflicting sentences.
  • Explanation Precision: For state change explanation, 37% of errors contradict the actual story, while 35% suggest changes irrelevant to the textual revision.

These results indicate that large-scale pretrained models, while effective on surface-level lexical tasks, have not yet achieved robust, generalizable reasoning over the wide spectrum of implicitly entailed states in open-form narratives.

6. Directions for Future Research

To address the identified challenges, recommended research directions include:

  • Development of unified models that explicitly integrate diverse commonsense knowledge sources, supplementing large-scale pretraining with structured physical, numerical, and factual information.
  • Exploration of interactive, feedback-driven generative paradigms—e.g., conversational LLMs that self-correct inconsistencies or logical flaws in narrative revisions.
  • Incorporation of symbolic or structured constraints, such as causal or state dependency networks, to enforce semantic coherence and correctness in generation tasks.

PASTA thus establishes a rigorous and multifaceted benchmark for open-vocabulary participant state reasoning, inviting advances at the intersection of commonsense, counterfactual, and narrative understanding (Ghosh et al., 2022).

References

  1. Ghosh et al. (2022). PASTA: A Dataset for Modeling Participant States in Narratives.
