Narrative Cloze Test: Evaluating Script Learning

Updated 10 April 2026

Narrative Cloze Test is an evaluation protocol in NLP that tests systems’ ability to predict a missing event in an ordered sequence of narrative events.
Its variants include rank-based and multiple-choice tasks over event chains as well as the sentence-level Story Cloze Test, which is built on the ROCStories corpus; together they measure narrative coherence and causal reasoning.
Modeling approaches range from PMI-based and neural language models to graph neural networks, highlighting challenges in integrating global discourse and commonsense.

The Narrative Cloze Test (NCT) is a central evaluation protocol in script learning and narrative understanding within NLP, designed to probe systems’ ability to model event sequences and predict held-out events in narrative contexts. NCT and its successors, in particular the Story Cloze Test (SCT), serve as controlled benchmarks for assessing event prediction, script induction, and commonsense reasoning in narrative text.

1. Formal Definitions and Task Variants

The original Narrative Cloze Test, introduced by Chambers and Jurafsky (2008), is defined over “event chains”: ordered lists of predicate-argument events extracted from narrative text. One event is removed from the chain, either one chosen at random or the event that comes next in order, and the system must predict the missing event from among all possible candidates. Formally, given context $(e_1, e_2, \ldots, e_{n-1})$, the goal is to predict the held-out $e_n$ as $c^* = \arg\max_{c \in C} \mathrm{Score}(c|e_1, \ldots, e_{n-1})$, where $C$ is a candidate set and $\mathrm{Score}$ is instantiated by an event model (Yao et al., 2018, Li et al., 2018).
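The argmax formulation above can be sketched in a few lines of Python. This is only an illustration: the names `predict_held_out`, `toy_score`, and `TOY_AFFINITY` are hypothetical, events are simplified to bare strings, and the hand-written affinity table stands in for whatever learned event model instantiates $\mathrm{Score}$.

```python
from typing import Callable, Iterable, Sequence

Event = str  # assumption: events reduced to plain strings for illustration


def predict_held_out(
    context: Sequence[Event],
    candidates: Iterable[Event],
    score: Callable[[Event, Sequence[Event]], float],
) -> Event:
    """Return c* = argmax over candidates c of Score(c | e_1, ..., e_{n-1})."""
    return max(candidates, key=lambda c: score(c, context))


# Hypothetical hand-set affinities between a context event and a candidate.
TOY_AFFINITY = {
    ("arrest", "convict"): 2.0,
    ("charge", "convict"): 1.5,
    ("arrest", "eat"): 0.1,
}


def toy_score(candidate: Event, context: Sequence[Event]) -> float:
    # Sum the affinity of the candidate with every context event;
    # pairs missing from the table contribute 0.
    return sum(TOY_AFFINITY.get((e, candidate), 0.0) for e in context)


best = predict_held_out(["arrest", "charge"], ["convict", "eat", "sleep"], toy_score)
print(best)
```

Any scorer with the same signature, such as a PMI table or a neural model, could replace `toy_score` without changing the prediction step.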

The Multiple-Choice Narrative Cloze (MCNC) variant presents $k$ candidate events (often $k=5$), one of which is the gold next event, and the model must choose the correct one. The Story Cloze Test (Mostafazadeh et al., 2016) represents a major evolution: it operates at the sentence level on natural language stories. Given a four-sentence context $C = (s_1, s_2, s_3, s_4)$ (here $C$ denotes the story context, not a candidate set) and two candidate endings $\{e^+, e^-\}$ (gold and foil), the system predicts $\hat{y} = \arg\max_{e \in \{e^+, e^-\}} P(e|C)$, where $P(e|C)$ is typically a model’s conditional probability or scoring function (Mostafazadeh et al., 2016, Chen et al., 2018, Liu et al., 2018).

2. Historical Development and Motivations

NCT was motivated by the problem of learning and evaluating “scripts”: formalizations of stereotypical event sequences that encode common world knowledge about how situations unfold. The Chambers & Jurafsky NCT paradigm initially framed evaluation in terms of ranking the omitted event (mean rank, recall@$N$), with heavy reliance on predicate–argument co-occurrence statistics and pointwise mutual information (PMI). However, researchers identified several drawbacks:

Simple frequency-based or PMI-based baselines produced “easy wins” (Mostafazadeh et al., 2016).
Abstract event tuple formats limited the complexity and naturalness of the reasoning involved.
Gold events were unreliable because of annotation noise and event extraction errors.

These issues motivated the SCT, which shifted to full-sentence, human-authored story continuations and a forced-choice format. The SCT’s design reduces both the payoff of shallow statistical tricks and the impact of annotation error, producing a more demanding benchmark for causal-temporal commonsense inference (Mostafazadeh et al., 2016, Liu et al., 2018).

3. Corpora and Evaluation Frameworks

The ROCStories corpus (Mostafazadeh et al., 2016) provides the canonical data resource for SCT. Its key properties:

49,255 five-sentence stories, crowdsourced for coherence, commonsense, and causal/temporal structure.
Each story forms a self-contained micro-narrative with clear protagonist and event progression.
For SCT, 3,742 stories, each paired with two candidate endings, are split evenly into development and test sets of 1,871 instances each; the remaining ~45K serve as LM or embedding model training data.

For event-based MCNC/NC benchmarks, the New York Times portion of the Gigaword corpus and similar newswire datasets are used to extract millions of event chains (Yao et al., 2018, Li et al., 2018). Test sets such as the Chambers & Jurafsky (2008) dataset, recast to MCNC format, provide 349 held-out instances with 5-way choices.
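As a rough illustration of recasting an event chain into a multiple-choice instance, the sketch below holds out the final event and samples distractors from a pool of other events. The function name `make_mcnc_instance`, the string event representation, and the rule of excluding every event in the chain from the distractor pool are all assumptions made here; the cited datasets may have been built differently.

```python
import random
from typing import List, Optional, Sequence, Tuple


def make_mcnc_instance(
    chain: Sequence[str],
    event_pool: Sequence[str],
    k: int = 5,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[str], int]:
    """Turn one event chain into (context, candidates, index of gold)."""
    rng = rng or random.Random()
    if len(chain) < 2:
        raise ValueError("need at least one context event and one held-out event")
    context, gold = list(chain[:-1]), chain[-1]
    # Assumption: distractors are drawn only from events outside this chain.
    distractor_pool = sorted(set(event_pool) - set(chain))
    if len(distractor_pool) < k - 1:
        raise ValueError("not enough distinct distractors in the pool")
    candidates = rng.sample(distractor_pool, k - 1) + [gold]
    rng.shuffle(candidates)  # hide the position of the gold event
    return context, candidates, candidates.index(gold)


pool = ["arrest", "charge", "convict", "eat", "pay", "leave", "sleep", "run", "call"]
context, candidates, gold_index = make_mcnc_instance(
    ["arrest", "charge", "convict"], pool, k=5, rng=random.Random(0)
)
print(context, candidates, gold_index)
```

Passing a seeded `random.Random` keeps the sampled distractors reproducible across runs.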

Evaluation metrics depend on test type:

Test Variant	Output	Metric
NCT (rank)	Event ID	Mean rank, recall@$N$
MCNC	Next event	Accuracy (%)
SCT	Ending sentence	Accuracy (%)
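The metrics in the table can be computed as in the following sketch. The helper names (`gold_rank`, `mean_rank`, `recall_at`, `accuracy`) are hypothetical, and ties in `gold_rank` are broken by candidate insertion order, which is a simplification rather than a convention taken from the cited papers.

```python
from typing import Dict, Hashable, Sequence


def gold_rank(scores: Dict[Hashable, float], gold: Hashable) -> int:
    """1-based rank of the gold event when candidates are sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)  # stable for ties
    return ordered.index(gold) + 1


def mean_rank(ranks: Sequence[int]) -> float:
    """Average rank of the gold event; lower is better."""
    return sum(ranks) / len(ranks)


def recall_at(ranks: Sequence[int], n: int) -> float:
    """Fraction of instances whose gold event is ranked within the top n."""
    return sum(r <= n for r in ranks) / len(ranks)


def accuracy(predictions: Sequence[Hashable], golds: Sequence[Hashable]) -> float:
    """Fraction of multiple-choice instances answered correctly (MCNC, SCT)."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)


ranks = [1, 3, 10, 2]                   # toy gold ranks for four NCT instances
print(mean_rank(ranks))                 # 4.0
print(recall_at(ranks, 3))              # 0.75
print(accuracy([0, 1, 1], [0, 1, 0]))   # two of three correct
```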

4. Modeling Approaches and Empirical Results

Several families of models are benchmarked on NCT, MCNC, and SCT:

Count- and Co-occurrence-Based: Early methods used unsupervised event clustering and PMI scoring of candidate events. In MCNC, pure PMI scoring using event pair statistics from automatically mined narratives achieves 48.83% accuracy, outperforming several LSTM and feedforward neural baselines on the same set (Yao et al., 2018).
Neural Language Model Predictors: LSTM or transformer models over event sequences, sometimes augmented with dynamic memory or attention, improve event prediction. PairLSTM models achieve up to 50.8% on NYT multiple-choice NC (Li et al., 2018).
Event Graph Models: The Narrative Event Evolutionary Graph (NEEG) represents events as nodes and event–event interactions as weighted edges. A Scaled Graph Neural Network (SGNN) propagates information among only the context and candidate nodes of each instance, yielding 52.45% test accuracy in MCNC with attention, roughly 1.6 points absolute above PairLSTM (Li et al., 2018).
Semantic Aspect Models for SCT: On SCT, models aggregate evidence from narrative sequence models (transformer or memory networks), sentiment trajectory predictors, and external commonsense knowledge (e.g., ConceptNet via Numberbatch embeddings). Modular fusion via softmax gating reached 87.6% on the ROCStories SCT, the state of the art at the time, compared to 86.5% for strong narrative-only models (FTLM) and 71.0% for the best 2016 non-LM model (Chen et al., 2018, Mostafazadeh et al., 2016).
Semantic Memory Chains: Recurrent Entity Networks augmented with external memory chains supervised to focus on events, sentiment, or topic yield test accuracy of 78.5% on SCT (Liu et al., 2018). Ablations reveal that sentiment tracking is particularly important for coherent ending prediction.
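To make the count-based family listed above concrete, here is a minimal sketch of PMI scoring for candidate events. It uses a generic chain-level PMI in which two events co-occur if they appear in the same chain. The counting scheme, the function names, and the choice to give unseen pairs a PMI of 0 are assumptions made for illustration and are not taken from the cited systems.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Iterable, Sequence, Tuple


def train_counts(chains: Iterable[Sequence[str]]) -> Tuple[Counter, Counter, int]:
    """Count, per chain, which events occur and which unordered pairs co-occur."""
    pair_counts, event_counts, n_chains = Counter(), Counter(), 0
    for chain in chains:
        events = sorted(set(chain))
        event_counts.update(events)
        pair_counts.update(combinations(events, 2))
        n_chains += 1
    return pair_counts, event_counts, n_chains


def pmi(a: str, b: str, pair_counts: Counter, event_counts: Counter, n: int) -> float:
    """log P(a, b) / (P(a) P(b)), with probabilities estimated over chains."""
    joint = pair_counts[tuple(sorted((a, b)))]
    if joint == 0:
        return 0.0  # simplification: unseen pairs contribute nothing
    return math.log((joint / n) / ((event_counts[a] / n) * (event_counts[b] / n)))


def pmi_score(candidate: str, context: Sequence[str], counts) -> float:
    """Score a candidate by summing its PMI with each context event."""
    return sum(pmi(e, candidate, *counts) for e in context)


toy_chains = [  # invented toy data, not drawn from any corpus
    ["arrest", "charge", "convict"],
    ["arrest", "charge", "acquit"],
    ["order", "eat", "pay"],
    ["order", "eat", "leave"],
]
counts = train_counts(toy_chains)
context = ["arrest", "charge"]
best = max(["convict", "pay", "eat"], key=lambda c: pmi_score(c, context, counts))
print(best)
```

Here "convict" co-occurs with both context events, so its summed PMI is positive, while the restaurant events never co-occur with the context and score 0.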

Empirical results consistently find that narrative modeling alone is insufficient: models integrating sentiment and explicit background knowledge or reasoning achieve higher accuracy, but all lag far behind near-perfect human performance.
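The idea of combining narrative, sentiment, and commonsense evidence through softmax gating can be sketched as a weighted sum of per-aspect scores. Everything concrete below is a placeholder: the aspect scores and gate logits are hand-set numbers, and the names `softmax` and `fuse` are hypothetical. How the cited models compute their gates is not described in this article.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(aspect_scores: Sequence[float], gate_logits: Sequence[float]) -> float:
    """Combine per-aspect scores with softmax-normalized gate weights."""
    weights = softmax(gate_logits)
    return sum(w * s for w, s in zip(weights, aspect_scores))


# Placeholder scores per ending: [narrative, sentiment, commonsense].
ending_a = [0.6, 0.9, 0.7]
ending_b = [0.7, 0.2, 0.3]
gate_logits = [0.0, 0.0, 0.0]  # equal logits give equal weights of 1/3

fused_a = fuse(ending_a, gate_logits)
fused_b = fuse(ending_b, gate_logits)
choice = "a" if fused_a >= fused_b else "b"
print(choice)
```

In this toy case the narrative score alone would favor ending b, while the fused score favors ending a, which mirrors the observation that narrative evidence by itself can be misleading.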

5. Analysis of Model Errors and Limitations

Analysis across multiple studies shows that most errors in SCT involve cases requiring multi-sentence, global causal inference—tracking protagonist goals, nested contingencies, or social reactions (Mostafazadeh et al., 2016, Chen et al., 2018). Shallow frequency statistics and basic LMs often prefer endings that are locally plausible but globally incoherent.

PMI- and co-occurrence-based approaches, still surprisingly competitive in MCNC, break down in SCT, where inputs are full sentences and the two endings are deliberately contrastive. Neural models can overfit to surface-level cues and struggle with negation and implicit background shifts (Chen et al., 2018). Semantic supervision is critical, but external resources (e.g., FrameNet, sentiment lexicons) introduce noise; event triggers from shallow semantic parses are particularly problematic (Liu et al., 2018).

A plausible implication is that robust script and story understanding requires architectures that integrate long-range discourse, world knowledge, and subtle pragmatic inference.

6. Implications for Commonsense Reasoning and Future Directions

SCT and its underlying corpora have become standard benchmarks for deeper story and script learning, question answering, and dialogue modeling (Mostafazadeh et al., 2016). The forced-choice design enables detailed diagnostic analysis of systems’ failures in causal ordering, temporal coherence, and social reasoning.

Research directions proposed include:

Increasing the number of distractors per context (moving beyond binary choice, e.g., 4-way SCT).
Introducing subtler contrastive endings (“hard negatives”) to test fine-grained inference.
Expanding to longer or more complex narratives (beyond five sentences), non-narrative genres, or dialogue-based story completion.
Employing architectures inspired by neuroscience (e.g., memory networks, attention mechanisms over explicit story states) (Mostafazadeh et al., 2016, Liu et al., 2018).

Identified challenges include the construction of more reliable event extraction and representation pipelines, principled integration of knowledge graphs (e.g., ConceptNet), and development of architectures that can dynamically allocate attention across narrative, emotional, and knowledge signals. Continued progress on the Cloze family of benchmarks is expected to drive advances in narrative reasoning, commonsense script learning, and large-scale language understanding.