AIPasta: Benchmark for Narrative State Reasoning
- AIPasta is a dataset and benchmark designed to evaluate AI systems' capacity to infer implicit participant states in short English narratives.
- The methodology involves a four-step crowd annotation process: state inference, justification, counterfactual perturbation, and minimal story revision.
- Key evaluation tasks include state inference, story revision, and state change explanation, with performance measured by accuracy, human acceptability, and explanation precision.
AIPasta refers to PASTA (PArticipant STAtes), a crowdsourced dataset and benchmark suite designed for evaluating and advancing AI systems’ capacity to infer, revise, and explain the implicit and often unstated states of characters and objects within short English-language narratives. By modeling participant states—attributes or conditions not mentioned directly in the text but essential for narrative coherence—PASTA challenges existing natural language understanding models to transcend surface event prediction and perform free-form causal and counterfactual reasoning about narrative worlds (Ghosh et al., 2022).
1. Motivation and Conceptual Foundations
Narrative comprehension by humans involves rich, implicit inference about participant states: properties such as ownership, beliefs, emotional disposition, physical status, or numerical facts, which, though unstated, underpin the events and coherence of a story. Standard datasets focus on event order (“What happened next?”) or limited affective categories; they generally lack mechanisms to elicit or evaluate fine-grained inference about open-class, unstated properties, or the impact of their change under counterfactual scenarios. PASTA directly addresses these limitations by targeting three core reasoning capabilities:
- Entailment: Inferring which participant states are supported by the narrative.
- Counterfactual Revision: Editing stories to conform to alternative participant states.
- State Change Explanation: Identifying the specific state transitions between story variants.
This focus on “participant states” provides a testing ground for models’ abilities in open-vocabulary commonsense reasoning, beyond lexical pattern-matching or constrained sentiment classes.
2. Dataset Construction and Annotation Protocol
PASTA is built upon the ROCStories corpus of everyday five-sentence stories. Each source story is annotated through a rigorous four-step procedure conducted by three independent crowd workers:
- State Inference (s): Annotators specify a free-form, unstated participant property s (e.g., “Kate’s mother values cleanliness”). These states must not duplicate explicit actions or text content.
- Justification (J): Annotators select the minimal subset of sentences J in the story S that suffices to plausibly support s.
- Counterfactual Perturbation (s'): Annotators formulate an alternative state s' that is highly unlikely given S (e.g., “the soda’s lid was tight”).
- Minimal Revision (S'): Annotators revise S to S' so that s' is now entailed and s is excluded, altering as little as possible.
Quality control includes strict annotator qualifications and expert review. The final corpus contains 10,743 validated 4-tuples (S, s, s', S'), partitioned into 8,476 train, 1,350 validation, and 917 test samples.
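To make the annotation output concrete, the following sketch models a single record in Python. The field names, the example story, and the revised sentences are illustrative assumptions, not the released data schema; only the overall (S, s, s', S') structure plus justification follows the description above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PastaExample:
    """One hypothetical PASTA record (field names are illustrative, not the released schema)."""
    story: List[str]          # S: the five-sentence ROCStories narrative
    state: str                # s: free-form inferred participant state
    justification: List[int]  # indices of the sentences in S that support s
    perturbed_state: str      # s': counterfactual state unlikely given S
    revised_story: List[str]  # S': minimal revision of S entailing s' and excluding s

example = PastaExample(
    story=[
        "Kate spilled soda on the carpet.",
        "Her mother rushed in with a towel.",
        "She scrubbed the stain for ten minutes.",
        "Kate promised to be more careful.",
        "The carpet looked spotless again.",
    ],
    state="Kate's mother values cleanliness.",
    justification=[1, 2],
    perturbed_state="The soda's lid was tight.",
    revised_story=[
        "Kate knocked over a tightly sealed soda bottle.",
        "Her mother rushed in with a towel.",
        "Nothing had spilled, so she put the towel away.",
        "Kate promised to be more careful.",
        "The carpet looked spotless as always.",
    ],
)
```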
Key dataset characteristics:
| Statistic | Value | Description |
|---|---|---|
| Unique stories | 5,028 | Distinct five-sentence narratives |
| Avg. tokens/state | 5.7 (inferred), 6.0 (perturbed) | Lexical brevity of states |
| Avg. justification sentences | 1.5 | Sentences supporting state inference |
| Avg. sentences edited (S → S') | 1.48/5 | Minimal revisions ensure narrative coherence |
| Token overlap (S vs S') | 90.3% | High lexical retention in revision |
| Token overlap (s vs s') | 71.9% | Ensures semantic, not lexical, shift |
States span physical (temperature, location), numerical (counts), societal (relationships, norms), and emotional/psychological (beliefs, desires) schemas.
3. Formal Task Specifications
PASTA defines three evaluation tasks with precise formalizations:
3.1 State Inference (Classification)
Binary decision: is state s entailed by story S via justification J? Let τ be a decision threshold on the model's entailment score.
Contrastive pairs are central: (S, s) is positive and (S, s') is negative, with the analogous construction for S' ((S', s') positive, (S', s) negative).
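A minimal sketch of this contrastive scoring is shown below, assuming a RoBERTa-style sequence-pair classifier; the checkpoint name "roberta-large" is only a placeholder for a model that would first need to be fine-tuned on PASTA-style entailment pairs, and the threshold value is an assumption.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "roberta-large"  # placeholder; assumes fine-tuning on (story, state) entailment pairs

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

THRESHOLD = 0.5  # decision threshold tau (assumed value)

def entailment_prob(story: str, state: str) -> float:
    """Probability that `state` is entailed by `story` (label 1 = entailed)."""
    inputs = tokenizer(story, state, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def classify_contrastive(story, revised_story, state, perturbed_state):
    """Score all four contrastive (story, state) pairs against the threshold."""
    pairs = {
        "(S, s)":   entailment_prob(story, state),                    # expected positive
        "(S, s')":  entailment_prob(story, perturbed_state),          # expected negative
        "(S', s')": entailment_prob(revised_story, perturbed_state),  # expected positive
        "(S', s)":  entailment_prob(revised_story, state),            # expected negative
    }
    return {pair: prob >= THRESHOLD for pair, prob in pairs.items()}
```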
3.2 Story Revision (Generative)
Given S and a counterfactual state s' that contradicts S, produce a minimally revised S' such that:
- S' entails s',
- S' no longer entails s.
The model must locate and minimally rewrite the conflicting sentences in S.
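The sketch below illustrates how a fine-tuned seq2seq model could be queried for this task. The prompt template and the sampling parameters are assumptions; "t5-large" stands in for a checkpoint that has actually been fine-tuned on PASTA revisions.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "t5-large"  # placeholder; assumes fine-tuning on PASTA story revision

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def revise_story(story: str, perturbed_state: str) -> str:
    """Generate a minimally revised story S' consistent with the counterfactual state s'."""
    prompt = f"revise story: {story} counterfactual state: {perturbed_state}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(
        **inputs,
        do_sample=True,      # nucleus sampling, as in the reported setup
        top_p=0.9,           # assumed value; the exact p is not restated in this section
        max_new_tokens=128,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```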
3.3 State Change Explanation (Generative)
Given the pair (S, S'), output (s, s') such that:
- s is entailed by S but not by S',
- s' is entailed by S' but not by S.
These represent the participant state altered by the revision and its counterfactual.
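A companion sketch for this task, mirroring the revision example above with the same placeholder T5 checkpoint and an assumed prompt format:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Same placeholder fine-tuned T5 checkpoint as in the revision sketch above.
tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")

def explain_state_change(story: str, revised_story: str) -> str:
    """Generate the (s, s') pair that distinguishes S from S' (prompt format is assumed)."""
    prompt = f"explain state change: original: {story} revised: {revised_story}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, do_sample=True, top_p=0.9, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```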
4. Model Benchmarks and Performance Analysis
The PASTA benchmark suite evaluates both discriminative and generative architectures:
- Classification: BERT-base/large, RoBERTa-base/large (fine-tuned).
- Generation: T5-base/large (fine-tuned); GPT-3 (text-davinci-002) via few-shot prompting.
Training and inference details: the RoBERTa/BERT classifiers are fine-tuned with AdamW for 5–7 epochs; the T5 variants are fine-tuned with AdamW and decode via nucleus sampling; GPT-3 prompts contain 5–15 in-context examples, selected by BERTScore on the validation set.
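A minimal configuration sketch for the classification fine-tuning follows; the learning rate and weight decay are assumed, typical values, since the exact hyperparameters are not restated in this section.

```python
import torch
from transformers import AutoModelForSequenceClassification

# AdamW as reported; lr and weight_decay below are assumed, typical values.
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
num_epochs = 6  # within the reported 5-7 epoch range
```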
Performance:
| Task | Metric | Best Baseline | Value | Human |
|---|---|---|---|---|
| State Inference | Accuracy | RoBERTa-large | 90.6% (standard), 86.6% (contrastive) | ~98% (standard), 89% (contrastive) |
| Story Revision | Human acceptability | T5-large | 54% | — |
| Story Revision | Human acceptability | GPT-3 | ~49% | — |
| State Change Explanation | Human correctness | T5-large | 56% | — |
| State Change Explanation | Human correctness | GPT-3 | ~43% | — |
Automatic proxies (BLEU/GLEU, ROUGE-L, BERTScore) correlate only moderately with human acceptability, confirming the necessity of human evaluation.
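For reference, the automatic proxies can be computed with the Hugging Face `evaluate` library as sketched below; the prediction and reference strings are illustrative only.

```python
import evaluate

# Automatic proxies for revision quality; the strings and scores here are illustrative.
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["Kate knocked over a tightly sealed soda bottle."]
references = ["Kate tipped over the soda, but its lid was tight."]

rouge_l = rouge.compute(predictions=predictions, references=references)["rougeL"]
bs_f1 = bertscore.compute(predictions=predictions, references=references, lang="en")["f1"][0]
print(f"ROUGE-L: {rouge_l:.3f}  BERTScore F1: {bs_f1:.3f}")
```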
5. Empirical Findings and Error Analysis
Analysis of model outputs reveals several persistent weaknesses:
- Commonsense Diversity: Greatest difficulty observed for societal and numerical states (e.g., “they have five coworkers”), modest difficulty for emotional states with surface cues.
- Logical Coherence: In story revision, approximately 30% of outputs are logically incoherent, 20% directly contradict the counterfactual, and 28% edit non-conflicting sentences.
- Explanation Precision: For state change explanation, 37% of errors contradict the actual story, while 35% suggest changes irrelevant to the textual revision.
These results indicate that large-scale pretrained models, while effective on surface-level lexical tasks, have not yet achieved robust, generalizable reasoning over the wide spectrum of implicitly entailed states in open-form narratives.
6. Directions for Future Research
To address the identified challenges, recommended research directions include:
- Development of unified models that explicitly integrate diverse commonsense knowledge sources, supplementing large-scale pretraining with structured physical, numerical, and factual information.
- Exploration of interactive, feedback-driven generative paradigms—e.g., conversational LLMs that self-correct inconsistencies or logical flaws in narrative revisions.
- Incorporation of symbolic or structured constraints, such as causal or state dependency networks, to enforce semantic coherence and correctness in generation tasks.
PASTA thus establishes a rigorous and multifaceted benchmark for open-vocabulary participant state reasoning, inviting advances at the intersection of commonsense, counterfactual, and narrative understanding (Ghosh et al., 2022).