Narrative Classification Task
- Narrative classification is the supervised assignment of discrete labels—such as clause roles, event scenarios, and narrative frames—to text units at various granularities.
- It employs diverse annotation schemes from Labovian methods to multimodal entity labeling, enabling structured narrative analysis across domains.
- Modern models integrate feature-based, neural, and hierarchical architectures, delivering robust performance in media framing, fact-checking, and sociolinguistic studies.
A narrative classification task is defined as the supervised assignment of discrete labels, typically reflecting structural, functional, or role-based categories, to narrative units of text at varying levels of granularity (clause, sentence, segment, paragraph, document, or multimodal artifact). Narrative classification frameworks are foundational for computational story analysis, media studies, sociolinguistics, and the design of fact-checking pipelines. Taxonomically, such tasks span narrative element detection (Labovian clause types, Complication/Resolution), role assignment (protagonist, antagonist), event scenario tagging, media framing schemas, and fine-grained ideological/factual stances. Recent narrative classification research emphasizes theory-blended annotation schemes, neural modeling (CNN, masked LMs, instruction-tuned LLMs), and quantitative analysis of cross-domain, multilingual, and hierarchical classification performance.
1. Task Definitions and Taxonomy
Narrative classification encompasses tasks that assign one or more discrete labels y ∈ Y to a given narrative unit x, formalized as a classifier f: X → Y (or f: X → 2^Y for multi-label tasks), where the label space Y may represent clause roles, event scenarios, narrative frames, entity roles, or ideological stances. These tasks operate at various text granularities (a minimal sketch of this interface follows the list below):
- Clause/Sentence level: e.g., identifying clause type (“action”, “orientation”, “evaluation”) in Labov-based annotation, or sequence labeling of narrative elements (Saldias et al., 2020).
- Span/Segment level: segmentation and scenario/role assignment (e.g., TopicTiling segmentation plus scenario classification) (Wanzare et al., 2019).
- Document/Article level: multi-label narrative frame or persuasion technique assignment, hierarchical ideological narrative classification, or claim-to-narrative mapping (Afroz et al., 3 Dec 2025, Frermann et al., 2023, Tyagi et al., 4 Sep 2025, Christensen et al., 2023).
- Multimodal/Entity-centric level: assigning narrative roles to entities in memes (Hero, Villain, Victim, Other) (Sharma et al., 29 Jun 2025), or spatial relation types between characters and places (Soni et al., 2023).
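As a minimal illustration of this task signature, the following Python sketch types a narrative unit and a multi-label classifier; the granularity enum, label inventory, and toy keyword rule are illustrative placeholders, not any published system.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Granularity(Enum):
    CLAUSE = auto()
    SENTENCE = auto()
    SEGMENT = auto()
    DOCUMENT = auto()

@dataclass
class NarrativeUnit:
    text: str
    granularity: Granularity

# Hypothetical label inventory (Labovian clause roles).
LABELS = {"action", "orientation", "evaluation"}

def classify(unit: NarrativeUnit) -> set:
    """Map a narrative unit to a subset of LABELS (multi-label allowed).

    Toy keyword rule standing in for a trained model f: X -> 2^Y.
    """
    labels = set()
    if any(w in unit.text.lower() for w in ("then", "suddenly", "next")):
        labels.add("action")
    return labels or {"orientation"}

print(classify(NarrativeUnit("Then the storm hit.", Granularity.CLAUSE)))
```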
The annotation taxonomies employed are typically adapted from discourse-analytic or sociolinguistic theory (Labovian schema, framing studies, narrative structures in communication science) or tailored for media analysis (propaganda narratives, causal micro-narratives, unsupported-claim clusters, scenario inventories) (Heddaya et al., 7 Oct 2024, Afroz et al., 3 Dec 2025, Frermann et al., 2023, Levi et al., 2020).
2. Annotation Schemes and Datasets
Robust narrative classification tasks depend on systematically annotated datasets, with frameworks chosen to optimize both construct validity and computational tractability:
- Labovian element annotation: Personal narratives are split into clauses annotated as action (chronological events), orientation (contextualizing background), or evaluation (speaker beliefs or affect), often using crowd or expert annotation with majority-vote gold labels (Saldias et al., 2020). Agreement is typically reported as the mean number of annotators (out of 3) who agree per clause, roughly 2.15–2.29, with Fleiss’ κ or Cohen’s κ for per-label reliability (a κ-computation sketch follows this list).
- Scenario/schematic annotation: Annotators assign scenario labels (from inventories of up to 200) to narrative segments corresponding to script knowledge (e.g., "eating in a restaurant"), allowing multi-label allocations reflecting overlapping everyday activities (Wanzare et al., 2019). Cohen’s κ ≈ 0.61; span-overlap agreement ≈ 67%.
- Role/situation annotation: Datasets record categorical roles for entities (protagonist, antagonist, victim) (Rønningstad et al., 6 Jun 2025), or modal/stance roles such as unreliable-narrator types spanning intra- and inter-textual unreliability (Brei et al., 11 Jun 2025). Agreement for roles in memes: Cohen’s κ ≈ 0.75.
- Framing and narrative schema: Articles are annotated for presence of frames (Conflict, Resolution, Economic, Moral, etc. (Frermann et al., 2023)), entity-based narrative roles, or fine-grained, event-specific ideological narratives (Afroz et al., 3 Dec 2025). Inter-annotator agreement for such schemas varies: Krippendorff’s α ≈ 0.52–0.61 (for frames); lower for role-level entity extraction.
- Unsupported claim mapping: Crowdsourced annotation assigns short, expert-curated narrative labels from inventories of 40 per topic to tens of thousands of social media posts; each post receives exactly one narrative (Christensen et al., 2023).
- Causal and spatial relationship annotation: Sentences annotated for narrative-level causal micro-narratives and causal ontology labels; or for character–place spatial relations (IN, NEAR, THRU, etc.), temporal span, and narrative tense (Heddaya et al., 7 Oct 2024, Soni et al., 2023).
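For concreteness, here is a minimal scikit-learn sketch of the Cohen's κ statistic quoted above, computed on invented clause labels from two annotators:

```python
from sklearn.metrics import cohen_kappa_score

# Toy clause-level labels from two annotators over the same ten clauses.
annotator_a = ["action", "orientation", "evaluation", "action", "action",
               "orientation", "evaluation", "action", "orientation", "action"]
annotator_b = ["action", "orientation", "action", "action", "action",
               "orientation", "evaluation", "evaluation", "orientation", "action"]

# Chance-corrected pairwise agreement in [-1, 1]; 0 means chance level.
print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```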
3. Modeling Architectures and Classification Objectives
Contemporary narrative classification employs both traditional feature-based models and large neural architectures:
- Feature-based models: Linear SVMs with POS, LIWC, and dependency features for clause/narrative-type classification; maximum-entropy models for paragraph-level narrativity (Saldias et al., 2020, Yao et al., 2018).
- Segmented pipelines: Hybrid approaches segment text (e.g., TopicTiling on LDA topic distributions), then classify segments using multi-layer perceptrons over tf-idf or lemmatized features (Wanzare et al., 2019).
- Neural sequence models: CNNs with GloVe embeddings and POS features for clause type prediction; BERT-/RoBERTa-based classifiers with sigmoid outputs for multi-label sentence-level narrative detection; Longformer/Transformer-based document models for frame/technique assignment (Saldias et al., 2020, Levi et al., 2022, Afroz et al., 3 Dec 2025, Frermann et al., 2023).
- Hierarchical architectures and LLM prompting: Hierarchical reasoning frameworks (e.g., FANTA and TPTC using GPT-4o-mini) perform multi-hop, prompt-guided narrative and persuasion-technique classification (Afroz et al., 3 Dec 2025). Hierarchical Three-Step Prompting (H3Prompt) chains domain, main-narrative, and sub-narrative LLM calls for multilingual news classification (Singh et al., 28 May 2025); a schematic prompt chain is sketched after this list.
- Context optimization for entity roles: Targeted context-window heuristics (e.g., ent2ent, sentence windows) combined with masked LLMs (XLM-RoBERTa) optimize entity-specific narrative role classification performance (Rønningstad et al., 6 Jun 2025).
- Fine-tuning and LoRA adaptation: Parameter-efficient fine-tuning (LoRA) on large LMs (Llama 3.1 8B, T0-3B, etc.) for causal micro-narratives and unsupported-claim classification (Heddaya et al., 7 Oct 2024, Christensen et al., 2023).
- Combination of classification and generation: Text-to-text models generate canonical narrative mappings; ReACT frameworks combine evidence retrieval and structured explanation (Tyagi et al., 4 Sep 2025).
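The following Python sketch illustrates the control flow of such a hierarchical three-step prompt chain; the label inventories, prompt wording, and the `call_llm` client are placeholders, not the published prompts of H3Prompt or FANTA.

```python
# Schematic three-step prompt chain (domain -> main narrative ->
# sub-narrative). All inventories and prompts are illustrative.

DOMAINS = ["Ukraine-Russia war", "Climate change"]
MAIN_NARRATIVES = {"Climate change": ["Criticism of institutions", "Downplaying"]}
SUB_NARRATIVES = {"Downplaying": ["Weather is normal", "CO2 is beneficial"]}

def call_llm(prompt: str) -> str:
    # Placeholder for any chat-completion client.
    raise NotImplementedError("plug in an actual LLM client here")

def h3_classify(article: str) -> tuple:
    domain = call_llm(
        f"Classify the article's domain as one of {DOMAINS}.\n\n{article}")
    main = call_llm(
        f"Given domain '{domain}', pick one main narrative from "
        f"{MAIN_NARRATIVES.get(domain, [])}.\n\n{article}")
    sub = call_llm(
        f"Given main narrative '{main}', pick one sub-narrative from "
        f"{SUB_NARRATIVES.get(main, [])}.\n\n{article}")
    return domain, main, sub
```

Conditioning each call on the previous step's output constrains the candidate label set at every level, which is the source of the consistency gains reported for hierarchical schemas.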
The standard classification objective is cross-entropy minimization (categorical for single-label tasks, binary for multi-label tasks), often weighted by class frequency to upweight rare narrative types. Some tasks further tune per-label decision thresholds to improve recall on subtle or minority narratives; see the weighted-loss sketch below.
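A minimal PyTorch sketch of this objective, with invented per-label weights and thresholds (the encoder head producing the logits is elided):

```python
import torch
import torch.nn as nn

# Toy multi-label setup: 4 narrative labels, logits from some encoder head.
num_labels = 4
logits = torch.randn(8, num_labels)             # batch of 8 documents
targets = torch.randint(0, 2, (8, num_labels)).float()

# Upweight rare labels: pos_weight_i ~ (#negatives_i / #positives_i),
# here set by hand purely for illustration.
pos_weight = torch.tensor([1.0, 3.5, 8.0, 2.0])
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
loss = loss_fn(logits, targets)

# At inference, per-label thresholds (tuned on dev data for F1 or recall)
# replace a blanket 0.5 cutoff on the sigmoid probabilities.
thresholds = torch.tensor([0.5, 0.35, 0.25, 0.45])
preds = (torch.sigmoid(logits) >= thresholds).int()
```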
4. Evaluation Protocols and Task-Specific Metrics
Evaluation is grounded in the metrics of contemporary NLP multi-label/multiclass classification:
- Precision, Recall, and F₁ (macro- and micro-averaged across classes) serve as the principal metrics. Tasks may also employ token/span-level accuracy for sequence tagging, sample-based F₁ for multi-label scenarios, and exact match for complete frame/role labeling (Saldias et al., 2020, Afroz et al., 3 Dec 2025, Levi et al., 2022, Frermann et al., 2023); a short metric-computation sketch follows this list.
- Cross-domain and multilingual validation: Assessment includes leave-one-domain-out and leave-one-language-out experiments (to test model generalization), as well as synthetic data augmentation ablations (Singh et al., 28 May 2025, Afroz et al., 3 Dec 2025); a schematic protocol appears at the end of this section.
- Transparent retrieval metrics: For retrieval-based classifiers, explicit reporting of evidence-support and per-frame retrieval effectiveness is used (Frermann et al., 2023).
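A short scikit-learn sketch of the headline metrics on toy multi-label predictions (all values invented):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label ground truth and predictions (5 docs x 3 narrative labels).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1], [1, 0, 0]])

print("macro F1:  ", f1_score(y_true, y_pred, average="macro"))    # per-class mean
print("micro F1:  ", f1_score(y_true, y_pred, average="micro"))    # pooled counts
print("samples F1:", f1_score(y_true, y_pred, average="samples"))  # per-document mean
```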
Performance benchmarks from prominent studies include:
- CNN clause-type prediction: F₁ = 84.7% on clauses with aligned annotator labels (Saldias et al., 2020).
- XLM-RoBERTa narrative role (ent2ent context): Micro F₁ = 47.75 (Rønningstad et al., 6 Jun 2025).
- Fine-tuned Llama 3.1 8B: causal micro-narrative detection F₁ = 0.87, multi-label classification F₁ = 0.71 (Heddaya et al., 7 Oct 2024).
- Multi-label RoBERTa sentence-level NEAT: avg F₁ = 0.77 (Levi et al., 2022).
- H3Prompt (LLaMA-3.2, ensemble): Macro F₁ = 0.623, sample-based F₁ = 0.516 on English dev (Singh et al., 28 May 2025).
- FANTA (GPT-4o-mini) micro F₁ = 0.724–0.767 over fine-grained narrative schema (Afroz et al., 3 Dec 2025).
- Retrieval-based frame prediction: macro F₁ = 0.61 (Frermann et al., 2023).
- Scenario segmentation/classification: segment-level F₁ up to 0.54 (with gold segmentation) (Wanzare et al., 2019).
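The leave-one-language-out protocol mentioned above can be sketched as follows; tf-idf plus logistic regression stands in for the actual models, and the documents, labels, and language tags are invented toys.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score

# Toy corpus: each fold holds out every document of one language.
texts = ["war narrative ...", "climate claim ...", "guerre recit ...",
         "climat affirmation ...", "krieg erzaehlung ...", "klima behauptung ..."]
labels = [0, 1, 0, 1, 0, 1]
groups = ["en", "en", "fr", "fr", "de", "de"]

X = TfidfVectorizer().fit_transform(texts)
y = np.array(labels)

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    f1 = f1_score(y[test_idx], clf.predict(X[test_idx]), average="macro")
    print(f"held-out language = {groups[test_idx[0]]}: macro F1 = {f1:.2f}")
```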
5. Error Analysis, Generalization, and Challenges
Principal challenges observed across narrative classification tasks include:
- Annotation subjectivity and ambiguity: Limited inter-annotator agreement on frame/role presence reflects underlying subjectivity; e.g., Krippendorff’s α ≈ 0.52 for frames and 0.40 for entity-role existence (Frermann et al., 2023). Errors in Success vs. Resolution labeling often arise from conflating partial with full narrative closure (Levi et al., 2020, Levi et al., 2022).
- Role and scenario confounds: Models frequently confuse thematically similar or hierarchical scenarios (e.g., “go shopping” vs. “shopping centre”); narrative roles often co-occur and require advanced context modeling (Wanzare et al., 2019, Afroz et al., 3 Dec 2025).
- Model robustness: Transformer models substantially outperform heuristics but maintain precision-recall gaps, most pronounced for rare or ambiguous classes (e.g., “NEAR”, “THRU” spatial relations, “Humor” in counter-narratives) (Soni et al., 2023, Chung et al., 2021).
- Domain and modality shift: Cross-domain degradation is modest for well-trained transformers but pronounced for content and language lacking in-domain pretraining (Rønningstad et al., 6 Jun 2025, Levi et al., 2022).
- Coverage limitations: Most benchmarks cover only 27% of the narrative classification taxonomy devised for NarraBench; style, event schema, and subjective/revelatory aspects are consistently underrepresented (Hamilton et al., 10 Oct 2025).
6. Recent Directions and Benchmark Recommendations
Innovations in the field point toward several emerging directions:
- Chain-of-Thought and Multi-hop Prompting: Hierarchical reasoning via guided LLM prompting substantially improves logical consistency in fine-grained and hierarchical narrative schemas (e.g., FANTA, TPTC, H3Prompt) (Afroz et al., 3 Dec 2025, Singh et al., 28 May 2025).
- Synthetic data augmentation: Generative augmentation (in-context LM synthesis of narrative instances) demonstrably boosts classifier F₁ by 4–10 points (Christensen et al., 2023, Singh et al., 28 May 2025).
- Retrieval-augmented and explainable predictions: Integration of evidence retrieval (sentence-level similarity or structured summary chains) with LLM explanation prompting addresses the need for interpretability and evidence grounding in narrative assignments (Tyagi et al., 4 Sep 2025, Frermann et al., 2023).
- Weak and semi-supervised expansions: Bootstrapped MaxEnt or Snippext-style consistency training extends small gold datasets via high-precision narrative-instance harvesting or pseudo-labeling (Yao et al., 2018, Frermann et al., 2023).
- Multimodal and multilingual adaptation: Classification of narrative roles in memes (text+image+code-mix), translation-first pipelines, and multilingual model fine-tuning adapt narrative tasks to contemporary, globalized media (Sharma et al., 29 Jun 2025, Singh et al., 28 May 2025).
- Benchmarking and taxonomy expansion: NarraBench recommends expanding coverage to event skeletons, style/stance phenomena, revelation, and subjective dimensions, emphasizing per-span annotation, multi-annotator soft labels, and distributional metrics such as KL divergence and expected calibration error (Hamilton et al., 10 Oct 2025); a toy metric sketch follows.
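A toy sketch of such distributional metrics (all numbers invented): KL divergence between annotator soft labels and model probabilities, plus a simple binned expected calibration error.

```python
import numpy as np

# Multi-annotator soft labels vs. model probabilities for one item,
# over three hypothetical frame classes.
human = np.array([0.6, 0.3, 0.1])   # e.g., 6/10, 3/10, 1/10 annotator votes
model = np.array([0.5, 0.4, 0.1])
kl = np.sum(human * np.log(human / model))  # KL(human || model)

# Expected calibration error over a toy batch: bin predictions by
# confidence and compare mean confidence to empirical accuracy per bin.
conf = np.array([0.9, 0.8, 0.55, 0.95, 0.6])
correct = np.array([1, 1, 0, 1, 1])
bins = np.linspace(0, 1, 6)  # five equal-width bins
ece = 0.0
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (conf > lo) & (conf <= hi)
    if mask.any():
        ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
print(f"KL = {kl:.4f}, ECE = {ece:.4f}")
```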
7. Application Domains and Future Prospects
Narrative classification is now foundational for:
- Media framing and propaganda analytics, with interpretable mappings of macro-level stories to micro-level rhetorical tactics (Afroz et al., 3 Dec 2025, Frermann et al., 2023).
- Causal inference and social science (detection of causal micro-narratives in economic texts, event progression analysis) (Heddaya et al., 7 Oct 2024, Mousavi et al., 2023).
- Fact-checking and misinformation analysis through claim-to-narrative prediction and counter-narrative typology (Christensen et al., 2023, Chung et al., 2021).
- Literary and narratological studies (e.g., historical shifts in narrative point-of-view, protagonist space usage, unreliability classification) (Underwood et al., 2013, Soni et al., 2023, Brei et al., 11 Jun 2025).
- Scenario extraction and script learning for language understanding and downstream datasets (Wanzare et al., 2019, Yao et al., 2018).
- Interactive and explainable AI, educational media analytics, and journalistic profiling via explanation-augmented narrative detection systems (Tyagi et al., 4 Sep 2025).
Ongoing challenges include the scale and subjectivity of narrative schemas, low annotation agreement, compositionality and generalization, and integration with multimodal, real-world context. As narrative understanding tasks gain prominence in benchmarking suites (e.g., SemEval, CLEF), future research will emphasize taxonomic breadth, interpretive richness, and evidence-grounded explanations across genres, languages, and media modalities.