NarraBench Taxonomy Framework

Updated 9 April 2026

NarraBench Taxonomy is a comprehensive, theory-informed framework for analyzing and benchmarking narrative understanding tasks in NLP, especially for large language models.
It organizes narrative phenomena into four dimensions (Story, Narration, Discourse, and Situatedness), with fifty defined aspects and three evaluation axes for precise mapping.
The framework’s survey of 78 benchmarks reveals significant gaps in narrative evaluation and guides future research towards multimodal and language-diverse metrics.

NarraBench Taxonomy provides a comprehensive, theory-informed framework for the analysis and benchmarking of narrative understanding tasks in NLP, with specific focus on LLMs. Drawing upon foundational work in classical narratology and contemporary narrative theory, NarraBench organizes narrative phenomena into distinct dimensions, enumerates fine-grained evaluative aspects, and introduces a formal schema for categorizing and aligning narrative benchmarks. Its accompanying survey covers 78 benchmarks and reveals substantial gaps in current evaluation of core narrative competencies, with notably low coverage for events, style, perspective, revelation, and subjective or perspectival judgment (Hamilton et al., 10 Oct 2025).

1. Theoretical Foundations and High-Level Organization

NarraBench derives its top-level taxonomy from four narratological dimensions, the “Big-4”: Story, Narration, Discourse, and Situatedness. The first three follow Genette’s (1980) structuralist “triangle” (story, discourse, narration); the fourth extends it with Herman’s (2009) “situatedness” construct to foreground social and paratextual context. The taxonomy thus distinguishes the content of events (“story”), the structure and means of telling (“narration”), the temporal and informational organization (“discourse”), and contextual and paratextual attributes (“situatedness”). Each dimension subsumes multiple primary features, which in turn decompose into fifty precisely defined aspects, reflecting both classical theory and task definitions from prior NLP work.

2. Taxonomic Structure: Dimensions, Features, and Aspects

Each of the four dimensions is expanded into a structured hierarchy of features and aspects:

Story:
- Agents: name, role, attributes, emotions, motivation
- Social Networks: interactions, connections, relationships
- Events: event identification, schemas, causal framing
- Plot: thematic concerns, summary, subplots, conflicts, archetypes
- Structure: arcs (fortune, reversal, denouement)
- Setting: local/global locations, chronotope
Narration:
- Perspective: point of view (1P/2P/3P), focalization, dialogue attribution
- Style: allusion detection, figurative language, imageability, syntactic complexity, evaluative language
Discourse:
- Time: duration, ordering (flashbacks, foreshadowing)
- Revelation: suspense, curiosity, surprise as functions of information flow
Situatedness:
- Paratext: genre, author, date, medium, platform
- Motivation: authorial intent, purpose

Each aspect is annotated along three orthogonal evaluation axes: Scale (local/meso/global), Mode (discrete/progressive/holistic), and Variance (deterministic/consensus/perspectival), enabling precise definition and discipline-wide alignment of narrative evaluation tasks (Hamilton et al., 10 Oct 2025).
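As a rough illustration, the dimension-to-feature hierarchy and the three axes listed above can be written down as plain Python mappings. The variable and function names here are illustrative assumptions, not part of NarraBench; the contents are transcribed from the lists in this section. Counting the features reproduces the twelve primary features reported for the taxonomy, and the axes admit 27 possible (Scale, Mode, Variance) profiles.

```python
# Dimension -> primary features, transcribed from the list in this section.
# (Names such as TAXONOMY and AXES are illustrative, not NarraBench identifiers.)
TAXONOMY = {
    "Story": ["Agents", "Social Networks", "Events", "Plot", "Structure", "Setting"],
    "Narration": ["Perspective", "Style"],
    "Discourse": ["Time", "Revelation"],
    "Situatedness": ["Paratext", "Motivation"],
}

# The three evaluation axes and their admissible values.
AXES = {
    "Scale": ("local", "meso", "global"),
    "Mode": ("discrete", "progressive", "holistic"),
    "Variance": ("deterministic", "consensus", "perspectival"),
}


def feature_count(taxonomy):
    """Total number of primary features across all dimensions."""
    return sum(len(features) for features in taxonomy.values())


def axis_profiles(axes):
    """Number of distinct (Scale, Mode, Variance) combinations."""
    total = 1
    for values in axes.values():
        total *= len(values)
    return total


print(feature_count(TAXONOMY))  # 12
print(axis_profiles(AXES))      # 27
```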

3. Formal Schema and Notation

NarraBench formalizes task specification as a 6-tuple:

$T = (\text{Dimension}, \text{Feature}, \text{Aspect}, \text{Scale}, \text{Mode}, \text{Variance})$

This schema captures the theoretical, structural, and practical parameters necessary for categorizing narrative understanding tasks. Event modeling utilizes event tuples of the form:

$E_i = (a_i,\, act_i,\, o_i,\, loc_i,\, t_i,\, c_i)$

where $a_i$ = agent, $act_i$ = action, $o_i$ = object, $loc_i$ = location, $t_i$ = time, $c_i$ = cause.
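The two tuples above map naturally onto small record types. The following is a minimal sketch, assuming string-valued fields and optional event slots; the class names, field names, and example values are hypothetical and do not come from the NarraBench paper.

```python
from dataclasses import dataclass, astuple
from typing import Optional


@dataclass(frozen=True)
class TaskSpec:
    """The 6-tuple T = (Dimension, Feature, Aspect, Scale, Mode, Variance)."""
    dimension: str
    feature: str
    aspect: str
    scale: str
    mode: str
    variance: str


@dataclass(frozen=True)
class Event:
    """The event tuple E_i = (a_i, act_i, o_i, loc_i, t_i, c_i)."""
    agent: str
    action: str
    obj: Optional[str] = None       # o_i; named `obj` to avoid shadowing `object`
    location: Optional[str] = None  # loc_i
    time: Optional[str] = None      # t_i
    cause: Optional[str] = None     # c_i; a free-text cause is an assumption here


# Hypothetical values for illustration only; the axis settings are not
# taken from the published taxonomy table.
task = TaskSpec("Story", "Events", "causal framing", "meso", "discrete", "consensus")
event = Event(agent="the detective", action="opens", obj="the letter",
              location="the study", cause="an anonymous tip")

print(astuple(task))
print(astuple(event))
```

Leaving the later event slots optional reflects that a given sentence may not state a location, time, or cause; whether NarraBench permits empty slots is not specified in this summary.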

Perspective is encoded as a function over segments $s$, $pov(s) \in \{1P, 2P, 3P\}$, together with a focalization function over the same segments that identifies the character whose perception governs $s$. Revelation-related phenomena are represented with three time-indexed scalar functions, one each for suspense, curiosity, and surprise, quantifying information delivery and withholding over a narrative timeline.
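A small sketch can make the per-segment perspective encoding and the time-indexed revelation functions concrete. It assumes that a narrative is split into segments, that each revelation function is sampled once per segment as a float, and that the names `Segment`, `focalizer`, and `peak_time` are hypothetical; none of this is prescribed by the source.

```python
from dataclasses import dataclass
from typing import List, Optional

POV_VALUES = ("1P", "2P", "3P")


@dataclass(frozen=True)
class Segment:
    text: str
    pov: str                  # value of pov(s)
    focalizer: Optional[str]  # character whose perception governs the segment

    def __post_init__(self):
        # Reject anything outside the set {1P, 2P, 3P}.
        if self.pov not in POV_VALUES:
            raise ValueError(f"pov must be one of {POV_VALUES}, got {self.pov!r}")


def peak_time(curve: List[float]) -> int:
    """Index of the time step at which a revelation curve is highest."""
    return max(range(len(curve)), key=curve.__getitem__)


segments = [
    Segment("I heard the door open.", "1P", "narrator"),
    Segment("She waited in the dark.", "3P", "she"),
]

# Made-up suspense scores, one per time step, for illustration only.
suspense = [0.1, 0.4, 0.9, 0.3]
print(peak_time(suspense))  # 2
```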

4. Construction Methodology

Taxonomy design follows a multi-step, theory-driven process:

Synthesis of core narrative theory (Piper 2021, Genette 1980, Herman 2009).
Extraction of twelve primary features and fifty fine-grained aspects from literary theory, narratology, and prior NLP task surveys.
Definition of three evaluation axes (Scale, Mode, Variance), informed by benchmark design best practices.
For each aspect-feature pair, creation of a canonical evaluation question to instantiate expected model behavior.
Publication of the comprehensive taxonomy as an appendix table and a persistent spreadsheet resource.

This construction approach ensures that each aspect is operationalizable for empirical evaluation and that the taxonomy supports fine-grained mapping from theoretical desiderata to practical benchmarks (Hamilton et al., 10 Oct 2025).

5. Benchmark Survey and Alignment to Taxonomy

A systematic literature review identified 78 narrative-relevant benchmarks from the prior twelve years that satisfy criteria of public code/data, model-agnostic API, and qualitative narrative focus. Of these, 39 offered open data. Each of these 39 was mapped to taxonomy aspects by computing an edit distance between the benchmark's properties (Scale, Mode, Variance) and the NarraBench ideal for that aspect:

Edit Distance	Alignment Category	Number of Benchmarks
0	Good	10
1	Decent	14
2	Poor	10
3	Bad (excluded)	5

Benchmarks with edit distance up to 2 are retained, yielding 34 usable resources. Overall, 27% of the 50-aspect taxonomy is covered. Per-dimension coverage is highly uneven: 19 benchmarks for Story, 2 for Narration, 5 for Discourse, and 5 for Situatedness.
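The alignment step can be sketched in a few lines. This assumes the edit distance is simply the number of axes on which a benchmark disagrees with the ideal profile, which is consistent with the 0 to 3 range in the table but is not confirmed by the source; the function names and the example profiles are hypothetical.

```python
AXES = ("scale", "mode", "variance")
CATEGORY = {0: "Good", 1: "Decent", 2: "Poor", 3: "Bad (excluded)"}


def edit_distance(benchmark, ideal):
    """Count the axes on which the benchmark differs from the ideal profile."""
    return sum(benchmark[axis] != ideal[axis] for axis in AXES)


def alignment(benchmark, ideal):
    """Map an edit distance of 0..3 to the alignment category in the table."""
    return CATEGORY[edit_distance(benchmark, ideal)]


# Hypothetical profiles for illustration only.
ideal = {"scale": "global", "mode": "progressive", "variance": "perspectival"}
bench = {"scale": "global", "mode": "discrete", "variance": "deterministic"}
print(edit_distance(bench, ideal), alignment(bench, ideal))  # 2 Poor

# Counts from the table: retaining distances 0-2 leaves 34 of 39 benchmarks.
counts = {"Good": 10, "Decent": 14, "Poor": 10, "Bad (excluded)": 5}
usable = sum(n for category, n in counts.items() if category != "Bad (excluded)")
print(usable, sum(counts.values()))  # 34 39
```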

6. Gaps, Limitations, and Recommendations

The survey uncovers substantial gaps in both theoretical coverage and methodological diversity:

Events: No benchmarks directly test event chain reconstruction or causal framing.
Style: Absence of benchmarks for allusion, figurative language, imageability, or syntactic complexity.
Perspective: Point-of-view and focalization detection are largely missing from current benchmarks.
Revelation: Only suspense (via ConflictBank) is evaluated; curiosity and surprise are not addressed.
Subjectivity: 37 of 39 available benchmarks use deterministic (single-answer) scoring; consensus and perspectival (open-ended) evaluation remain rare.
Token-level Annotation: Predominance of holistic/global tasks; per-token annotation (e.g., for entities, style) is underexplored.
Multilingual/Multimodal: Only 4 of 39 benchmarks are multilingual; approximately 5% accommodate images or video.

Recommended directions include developing event-structure benchmarks (requiring models to output event tuples or causal graphs), style-oriented tasks (e.g., allusion detection, imageability rating), open-ended generation for moral inference and authorial intent, time-series revelation scoring (suspense, surprise), release of code/data with non-deterministic scoring, and expansion to low-resource languages and multimodal narrative formats. A plausible implication is that comprehensive modeling of narrative understanding in LLMs necessitates resources that reflect the constitutive subjectivity and perspectival nature of narrative phenomena (Hamilton et al., 10 Oct 2025).

7. Significance and Future Development

NarraBench is provided as both a “fixed taxonomy” and a living, extensible resource (via GitHub spreadsheet), intended to be incrementally expanded by the research community. By systematically articulating theoretical desiderata, instantiating formal evaluation schemas, and empirically documenting current gaps, NarraBench aims to guide development of future benchmarks that comprehensively assess LLM narrative understanding. Its methodology foregrounds the need for theory-aligned, subjectivity-sensitive, and multimodal narrative evaluation, and establishes a unified foundation for subsequent work in this domain (Hamilton et al., 10 Oct 2025).


References (1)

NarraBench: A Comprehensive Framework for Narrative Benchmarking (2025)
