Narratives Dataset: Scope & Techniques
- Narratives datasets are large-scale corpora spanning text, video, and audio that support analysis of narrative structure, character, and emotion.
- They incorporate diverse annotation schemes, including fine-grained emotion and structural labels, enabling robust evaluation of reading comprehension and long-context reasoning.
- These resources drive advances in multimodal generation, retrieval benchmarks, and domain-specific tasks across narrative AI research.
A narratives dataset is a corpus constructed to support the computational analysis, modeling, and understanding of narrative text, video, or multimodal stories at scales and levels of complexity substantially beyond those of sentence-level or non-narrative collections. These resources provide annotated or semi-structured data enabling tasks such as reading comprehension over long contexts, character and emotion modeling, narrative structure identification, video–text alignment, and narrative-based recommendation or generation. Recent datasets span multiple modalities (text, vision, audio), genres (fiction, news, medical, conversational), languages, and annotation granularities, facilitating research on discourse, memory, reasoning, abstraction, causal/event structure, and affect in narrative media.
1. Types and Composition of Narratives Datasets
Narratives datasets are highly heterogeneous, varying across dimensions of medium, annotation, and purpose:
- Long-Form Narrative Text: Collections such as NarrativeQA (Kočiský et al., 2017), NarrativeXL (Moskvichev et al., 2023), and NarraSum (Zhao et al., 2022) contain novels, movie scripts, or episode summaries, often spanning thousands to hundreds of thousands of tokens, enabling evaluation of long-context reasoning and summarization.
- Multimodal and Visual Storytelling: Datasets like 2K-Characters-10K-Stories (Yin et al., 5 Dec 2025) and M-SyMoN (Sun et al., 2024) provide image or video sequences aligned with texts, labels, or control signals, supporting research on sequence-consistent generation and video-to-narrative alignment.
- Character and Emotion Analysis: FiSCU (Brahman et al., 2021), CHATTER (Baruah et al., 2024), and DENS (Liu et al., 2019) target the extraction of character attributes, roles, or emotions from multi-sentence to long-form passages, using taxonomy-based labels and crowd or expert annotation.
- Narrative Structure and State Modeling: CompRes (Levi et al., 2020) and PASTA (Ghosh et al., 2022) focus on element-wise structural annotation (e.g., Complication, Resolution, Success) or participant state inference and counterfactual revision.
- Domain-Specific Corpora: MedicalNarratives (Ikezogwo et al., 7 Jan 2025) operationalizes the "narratives dataset" for vision-language learning in medicine, pairing localized speech and cursor traces with dense region–text annotations in instructor videos.
- Cross-Lingual and Societal Discourse: StoryDB (Tikhonov et al., 2021) and PartisanLens (Maggini et al., 7 Jan 2026) aggregate narrative plots or headlines in dozens of languages, often with tags for genre, type, stance, or rhetorical strategy.
Dataset sizes now range from roughly a thousand units (CompRes: 1,099 sentences) to hundreds of thousands (StoryDB: ≈340,000 stories) and millions of samples (NarrativeXL: ≈1 million QA items; MedicalNarratives: 4.7 million image–text pairs).
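Most of these corpora are distributed as plain archives or through dataset hubs. A minimal sketch of inspecting the long-context scale of such a corpus, assuming NarrativeQA is mirrored on the Hugging Face Hub under the id `narrativeqa` with a `document.text` field (both the hub id and the schema are assumptions; check the hosting page for the actual layout):

```python
from datasets import load_dataset

# Hub id and field names are assumptions; adjust to the actual mirror.
ds = load_dataset("narrativeqa", split="validation")

# Whitespace word counts as a crude proxy for tokenizer length,
# illustrating why these documents stress long-context models.
lengths = [len(ex["document"]["text"].split()) for ex in ds.select(range(50))]
print(f"mean ≈ {sum(lengths) / len(lengths):,.0f} words; max ≈ {max(lengths):,} words")
```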
2. Annotation Schemes, Taxonomies, and Protocols
Narratives datasets are distinguished by advanced annotation designs tailored for long context and complex semantics:
- Emotion and Label Taxonomies: DENS uses a modified Plutchik wheel (Joy, Sadness, Anger, Fear, Anticipation, Surprise, Disgust, Love, Neutral), with consensus voting to ensure label quality (Fleiss’s κ > 0.4; see the sketch after this list) (Liu et al., 2019). CHATTER applies a fine-grained TVTropes taxonomy (N_a = 13,324 tropes), with human-validated positives and antonym/random negatives (Baruah et al., 2024).
- Structural and Event-Level Annotation: CompRes adapts Labov & Waletzky’s model to news (Complication, Resolution, Success), annotated at the sentence level with high inter-annotator agreement (e.g., κ = 0.82 for Complication) (Levi et al., 2020). PASTA (Ghosh et al., 2022) collects participant states, counterfactual states, and corresponding minimally edited stories, requiring workers to infer unstated properties entailed by the narrative.
- Multimodal Correspondence: 2K-Characters-10K-Stories builds multi-channel control annotations for visual story synthesis, disentangling identity (C_ID), pose (P_ID), expression (E_ID), and composition (C_TAG), with expert verification at each phase (Yin et al., 5 Dec 2025). M-SyMoN aligns video clips with narration sentences in 7 languages, combining automatic presegmentation with human annotation (average IoU: 83.1%) (Sun et al., 2024).
- News, Discourse, and Societal Dimensions: News-domain datasets (Navigating News Narratives (Raza, 2023), PartisanLens (Maggini et al., 7 Jan 2026), UKElectionNarratives (Haouari et al., 8 May 2025)) annotate multidimensional bias, conspiracy, stance, narrative labels (e.g., hyperpartisan, PRCT), and rhetorical strategy.
- Quality Control: Consensus and adjudication protocols are common (e.g., majority vote plus expert tie-break in DENS; batch-based Krippendorff's α computation in CHATTEREval (Baruah et al., 2024)). LLM-generated or LLM-rated annotation (e.g., REGEN's auto-rater (Su et al., 14 Mar 2025)) introduces automated scoring of text quality and relevance.
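Protocol components like majority voting and chance-corrected agreement are straightforward to reproduce. Below is a minimal sketch of Fleiss’s κ over an items × categories matrix of annotator vote counts (toy numbers, not drawn from any of the datasets above):

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an (items x categories) matrix of label counts,
    assuming the same number of raters per item."""
    n_items, _ = counts.shape
    n = counts.sum(axis=1)[0]  # raters per item
    # Observed per-item agreement, averaged over items.
    p_i = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category distribution.
    p_j = counts.sum(axis=0) / (n_items * n)
    p_e = (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 5 passages, 3 annotators, 3 emotion categories.
votes = np.array([[3, 0, 0], [2, 1, 0], [0, 3, 0], [1, 1, 1], [0, 2, 1]])
print(round(fleiss_kappa(votes), 3))  # ≈ 0.228 on these toy counts
```

A majority-vote consensus label then simply takes `counts.argmax(axis=1)` per item, with ties deferred to an adjudicator as in DENS.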
3. Tasks and Benchmark Definitions
Distinct narrative QA and understanding tasks have emerged, enabled by these datasets:
- Reading Comprehension (RC) and Question Answering: NarrativeQA (Kočiský et al., 2017) requires answer generation or selection over summaries or full books/scripts, operationalized via conditional likelihood maximization and assessed with BLEU, METEOR, ROUGE-L, and MRR. NarrativeXL (Moskvichev et al., 2023) introduces reading comprehension tasks with memory-retention control and hierarchical summary reconstruction.
- Abstractive Summarization: NarraSum (Zhao et al., 2022) aligns long plot documents with human abstracts, using ROUGE-N, SummaC faithfulness, and human evaluation for quality control.
- Character-Centric Modeling: Tasks such as Character Identification (rank c* ∈ C based on masked description and summary) and Character Description Generation (generate D̂ | S, c) appear in FiSCU (Brahman et al., 2021), with evaluation via accuracy, BLEU-4, ROUGE, and BERTScore-F1.
- Emotion and Narrative Structure Classification: DENS (Liu et al., 2019) frames multi-class emotion detection at passage level; CompRes (Levi et al., 2020) enables sentence-level multi-label prediction of narrative structure elements, reporting F1 up to 0.70.
- Participant State Inference and Counterfactual Generation: PASTA (Ghosh et al., 2022) formalizes state entailment (binary classification), minimal counterfactual revision (generation), and state-change explanation, with human-acceptability metrics and contrastive accuracy.
- Vision-Language and Sequence Consistency: 2K-Characters-10K-Stories (Yin et al., 5 Dec 2025) evaluates narrative coherence (CSD/CIDS metrics), control fidelity (per-frame alignment), and image quality, testing sequence-consistent multimodal generation.
- Narrative Discourse Alignment and Retrieval: FaNS (Akter et al., 2023) pairs news narratives for fine-grained similarity measurement across 5W1H facets.
- Medical Vision–Language Tasks: MedicalNarratives (Ikezogwo et al., 7 Jan 2025) provides CLIP-style semantic and dense region–text contrastive objectives, benchmarking zero-shot classification and cross-modal retrieval across modalities.
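The CLIP-style objectives used by MedicalNarratives pair each image (or region) with its narrated text and train the two encoders contrastively. The exact GenMedCLIP loss is not reproduced here; the following is a generic symmetric InfoNCE sketch over a batch of paired embeddings, the standard form such objectives take:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: the i-th image and i-th text are positives,
    all other in-batch pairs are negatives."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (B, B) cosine similarities
    targets = torch.arange(img.size(0))         # diagonal = matched pairs
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: a batch of 8 random 512-d image/text embedding pairs.
print(clip_style_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```

Dense region–text variants can apply the same loss at the region level, pairing cursor-localized patches with co-occurring speech spans.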
4. Methodological and Modeling Benchmarks
Narratives datasets support rigorous baseline and advanced modeling evaluation:
- Neural and Classical Baselines: DENS reports classical TF-IDF + SVM (micro-F1 0.450), lexicon approaches, RNN/BiLSTM, self-attention, ELMo, and pre-trained BERT_LARGE (0.604) (Liu et al., 2019). CompRes shows RoBERTa F1 ≈ 0.70 on structure labeling (Levi et al., 2020). NarrativeQA reveals that neural span prediction is effective for summaries (e.g., BiDAF BLEU-1: 33.5), but models perform poorly on long stories.
- Long-Context and Memory Benchmarks: NarrativeXL (Moskvichev et al., 2023) evaluates LLMs and humans on memory-intensive QA; accuracy drops with increased context and "retention demand." Even 100k-token models (Claude v1.3, GPT-4) do not fully bridge the gap on long-range questions.
- Multimodal Generation and Retrieval: In 2K-Characters-10K-Stories, fine-tuned OmniGen2 models surpass all prior open baselines on narrative coherence (+32.65 pose, +31.85 expression control fidelity) (Yin et al., 5 Dec 2025); MedicalNarratives’ GenMedCLIP substantially improves zero-shot retrieval/accuracy over all medical vision–language competitors (Ikezogwo et al., 7 Jan 2025).
- Cross-Lingual and Societal Task Benchmarks: PartisanLens (Maggini et al., 7 Jan 2026) and StoryDB (Tikhonov et al., 2021) support evaluation of multilingual representations and transfer. M-SyMoN (Sun et al., 2024) probes both intra-lingual and cross-lingual vision–language alignment; supervised fine-tuning raises F1 by ~1 point across all languages.
- Metrics: Besides standard classification and generation metrics (accuracy, F1, BLEU, ROUGE, BERTScore), narrative datasets often introduce custom or composite scores: e.g., weighted F1 in CHATTEREval (Baruah et al., 2024), control fidelity and sequential similarity in vision–language stories, facet-based narrative similarity (FaNS) (Akter et al., 2023).
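Of the standard generation metrics above, ROUGE-L is representative: it scores the longest common subsequence (LCS) between a generated answer or summary and the reference. A minimal sketch (sentence-level F1 with β = 1; published scores typically add stemming and a recall-weighted β, so exact numbers differ):

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l_f1(hypothesis: str, reference: str) -> float:
    """Sentence-level ROUGE-L F1 over whitespace tokens."""
    hyp, ref = hypothesis.split(), reference.split()
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the hero returns home at last",
                 "at last the hero returns home"))  # LCS = 4 -> 0.667
```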
5. Key Challenges and Insights from Empirical Results
Research using narratives datasets surfaces core difficulties and open questions:
- Long-Range Reasoning: Integrating events, resolving coreference, and aggregating information across >50,000 words (NarrativeXL (Moskvichev et al., 2023), NarrativeQA (Kočiský et al., 2017)) remain unsolved; neural models lag humans by wide margins in memory-intensive QA and contextually abstractive summarization.
- Entity, State, and Causality Modeling: Both character-level tasks (FiSCU (Brahman et al., 2021), CHATTER (Baruah et al., 2024)) and state-entailment/counterfactual editing (PASTA (Ghosh et al., 2022)) reveal substantial human–machine gaps in abstraction, trait inference, and logical/factual consistency.
- Annotation Quality and Agreement: Datasets employing detailed guidelines, multi-round consensus, and expert adjudication (e.g., DENS, CHATTER, M-SyMoN) achieve reliable label quality (e.g., DENS reaches majority agreement or better on 94.5% of cases; CHATTEREval Krippendorff's α = 0.448), though difficult labels remain (e.g., under-represented trope or emotion classes, and states requiring implicit knowledge).
- Multimodal and Cross-Lingual Constraints: Fine-grained video clip–text alignment is demanding for humans and models alike; even the human annotation in M-SyMoN reaches a mean IoU of only 83.1% (Sun et al., 2024) (see the IoU sketch after this list). Cross-lingual generalization is partial and variable, with gains concentrated in shared-genre or closely related language pairs (StoryDB, PartisanLens).
- Extractive Bias and Faithfulness: Generative models trained on narratives tend to copy source text verbatim (e.g., 85.9% of fine-tuned T5 answers on TellMeWhy (Lal et al., 2021) are copied from the source) and are prone to hallucinating, omitting, or misattributing events and attributes under long-range coherence constraints.
- Automated and Human-in-the-Loop Protocols: Recent datasets optimize data quality or annotation efficiency with hybrid pipelines: Quality-Gated MMLM loops (2K-Characters-10K-Stories), LLM auto-rating (REGEN (Su et al., 14 Mar 2025), MedicalNarratives), and iterative human–machine cycles for label correction and confidence assignment (UKElectionNarratives (Haouari et al., 8 May 2025)).
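Temporal alignment scores like M-SyMoN's IoU compare a predicted clip span against the annotated span for each narration sentence. A minimal sketch, assuming spans are (start, end) intervals in seconds:

```python
def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """Intersection-over-union of two time intervals (start, end)."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

# Predicted clip span vs. human-annotated span for one narration sentence.
print(temporal_iou((12.0, 20.0), (14.0, 22.0)))  # -> 0.6
```

A corpus-level score averages this over all aligned sentences; the 83.1% mean annotation IoU reported above illustrates how demanding exact span agreement is even for humans.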
6. Applications and Research Directions
Narratives datasets provide the foundation for a spectrum of research:
- Development of scalable LLMs and architectures for ultralong context and memory
- Narrative-driven dialogue and story generation, including character-, emotion-, and state-level planning
- Vision–language models with explicit control over identity, pose, and event sequencing
- Bias, framing, and stance analysis in news and media at both narrative and rhetorical levels
- Cross-cultural and cross-lingual story representation and transfer
- Commonsense inference, counterfactual reasoning, and narrative manipulation tasks
Potential future work includes multimodal and graded attribution modeling (CHATTER (Baruah et al., 2024)), memory-augmented architectures, integration of retrieval-based commonsense resources, chain-of-thought or multi-hop inference over event/character graphs, and fine-grained, temporally resolved video–sentence alignment.
7. Representative Datasets: Scope and Salient Properties
| Dataset | Domain/Modality | Scale | Key Annotation/Task | Notable Results/Benchmarks |
|---|---|---|---|---|
| NarrativeQA (Kočiský et al., 2017) | Books, scripts, QA | 1,567 docs / 46k QAs | RC, answer generation/selection (BiDAF, AS Reader) | Large human–machine gap on long docs |
| NarrativeXL (Moskvichev et al., 2023) | Books, QA, memory | 1,500 books / ≈1M QAs | Retention analysis, scene reconstr. | LLMs degrade on long context/memory |
| NarraSum (Zhao et al., 2022) | Film/TV plots, summarization | 122k doc–summary pairs | Abstractive summarization | SOTA transformers ≪ human quality |
| DENS (Liu et al., 2019) | Narratives, emotion | 9,710 passages | 9-class emotion labeling | BERT_LARGE: F1 = 0.604 |
| 2K-Characters-10K-Stories (Yin et al., 5 Dec 2025) | Vision-language, stylized | 2k char. × 10k stories, 75k images | Control (identity, pose, expr., comp.) | Outperforms open V+L baselines |
| FiSCU (Brahman et al., 2021) | Literary, characters | 9,499 char. descriptions | ID & Description Gen | SOTA models 74–83% acc. vs 92% human |
| CompRes (Levi et al., 2020) | News, structure | 1,099 sentences | Complication/Resolution/Success labeling | F1 = 0.70 (RoBERTa) |
| PASTA (Ghosh et al., 2022) | Short stories, state | 10,743 four-tuples | State entailment, counterfactual revision | RoBERTa: 88% acc., 54% accept. (T5-L) |
| MedicalNarratives (Ikezogwo et al., 7 Jan 2025) | Medical V+L, instruction | 4.7M I–T pairs (+1M dense annot.) | Vision–lang. contrastive pretraining | GenMedCLIP ↑~4.7% over SOTA CLIPs |
| StoryDB (Tikhonov et al., 2021) | Wiki-plots, cross-lingual | ≈340k stories, 42 languages | Tag/genre labeling, multilingual eval | AUC-ROC ≈ 0.75–0.80 (XLM-R) |
| M-SyMoN (Sun et al., 2024) | Video–text, multi-lang | 13,166 videos, 2,136 h; 101 h annot. | Clip–sentence alignment | +16 pp over prior best (ClipAcc/SentIoU) |
This collection of resources illustrates the methodological and technical landscape of narratives datasets, enabling a wide spectrum of research at the frontier of narrative, discourse, and story-based artificial intelligence.