STAGE: Screenplay Text, Agents, Graphs & Evaluation
- STAGE is a benchmark that unifies screenplay narrative understanding through tasks like knowledge graph construction, scene summarization, long-context QA, and in-script role-playing.
- It employs an LLM-driven extraction pipeline with reflection-based retries and human checks to ensure high-quality, structured narrative representations.
- Robust evaluation metrics such as KG F1 scores and Event-Structure Consistency highlight its impact on advancing long-context modeling and agent-centric generation.
STAGE (Screenplay Text, Agents, Graphs and Evaluation) is a comprehensive benchmark and methodological paradigm for narrative understanding, structured extraction, and agent-centric evaluation over full-length movie screenplays. It provides a unified set of tasks, resources, and evaluation metrics, designed to rigorously assess models’ ability to encode, reason about, and generate from structured narrative worlds (Tian et al., 13 Jan 2026). STAGE concretely addresses four core areas: knowledge graph construction, scene-level event summarization, long-context question answering, and in-script character role-playing—each grounded in curated screenplays and their associated world representations. The benchmark emphasizes clean, annotated scripts, multi-layered knowledge graph schemas, event- and character-centric annotation, and holistic evaluation across narrative abstraction and agent consistency.
1. Task Definitions and Narrative World Modeling
STAGE defines four primary tasks—each linked to the construction and exploitation of a shared, event-centric knowledge graph extracted from screenplay text:
- STAGE-KG (Knowledge Graph Construction): Input is a complete screenplay segmented into scenes. The output is an event-centric knowledge graph, with a fixed entity schema (Character, Event, Location, TimePoint, Object, Concept) and typed relation schema (including event-role, social, inter-event, object-related, semantic, and spatiotemporal relation types). Extraction involves staged LLM-driven parsing, schema-constrained relation detection, and reflection-based quality assurance. Merging and normalization are performed using name and description embeddings and k-NN graph clustering, with minimal human correction for low-confidence cases (Tian et al., 13 Jan 2026).
- STAGE-ES (Scene-Level Event Summarization): Each scene transcript is summarized into free-form event descriptions. There are no fixed templates or event counts; summaries may abstract across multiple gold events. Evaluation uses Event-Structure Consistency (ESC), separating coverage (proportion of gold events reflected in system output) and faithfulness (proportion of summaries judged non-contradictory to the source) as judged by LLMs (Tian et al., 13 Jan 2026).
- STAGE-QA (Long-Context Screenplay QA): Given a complete screenplay and a question (categorized into eight types: scene localization, character states, objects, dialogue/beliefs, temporal, terminology, causal/relational, and detailed description), models must generate free-form answers. Retrieval serves as a foundation, employing dense/sparse windowing and GraphRAG community summaries, with correctness judged by LLM-based multi-sample aggregation (Tian et al., 13 Jan 2026).
- STAGE-ICRP (In-Script Character Role-Playing): Models adopt the persona of a specified character, responding to prompts in a first-person style. Inputs include a persona specification, potentially narrative memory (episodic summaries of prior dialogue/action), and “hard” narrative facts. Evaluation measures persona consistency, style consistency, and narrative faithfulness, averaged across stochastic generations (Tian et al., 13 Jan 2026).
All tasks are grounded in a unified world representation, with cross-task dependencies—e.g., the KG provides event and entity information for QA retrieval and persona construction.
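The fixed entity and relation schema above can be made concrete as a small typed data model. The following sketch is illustrative only: the class names and the example entities are assumptions, not the benchmark's actual API; only the entity types, relation families, and the directed-typed-triple representation come from the task description.

```python
from dataclasses import dataclass
from enum import Enum

# Entity and relation families follow the fixed STAGE-KG schema described
# above; all class and variable names here are illustrative, not official.
class EntityType(Enum):
    CHARACTER = "Character"
    EVENT = "Event"
    LOCATION = "Location"
    TIMEPOINT = "TimePoint"
    OBJECT = "Object"
    CONCEPT = "Concept"

class RelationFamily(Enum):
    EVENT_ROLE = "event-role"
    SOCIAL = "social"
    INTER_EVENT = "inter-event"
    OBJECT_RELATED = "object-related"
    SEMANTIC = "semantic"
    SPATIOTEMPORAL = "spatiotemporal"

@dataclass(frozen=True)
class Entity:
    name: str
    etype: EntityType
    description: str = ""

@dataclass(frozen=True)
class Triple:
    subject: Entity
    predicate: str            # a typed relation label within one family
    family: RelationFamily
    obj: Entity

# A screenplay's world model is then a set of directed, typed triples.
kg: set = set()
neo = Entity("Neo", EntityType.CHARACTER)          # hypothetical example
rescue = Entity("Rescue of Morpheus", EntityType.EVENT)
kg.add(Triple(neo, "agent_of", RelationFamily.EVENT_ROLE, rescue))
```

Freezing the dataclasses makes triples hashable, so the graph deduplicates identical extractions for free when stored as a set.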
2. Dataset Construction and Annotation Protocol
The STAGE dataset comprises 150 full-length films (108 English, 42 Chinese), with scripts ranging from 2,381 to 83,562 words and scene counts spanning 12–373 per movie. English scripts are segmented by regex and LLM refinement; Chinese scripts use OCR and human-in-the-loop segmentation. Annotations are produced via a combination of LLM extraction under explicit schemas, reflection steps for auto-quality scoring (0–10), constrained retries, and targeted human rework on low-confidence or schema-violating cases (≈11.7%) (Tian et al., 13 Jan 2026).
Human annotators include industry professionals, and all scenes are globally indexed for downstream grounding. Agreement metrics are high: ESC coverage (Krippendorff’s α = 0.79); faithfulness (α = 0.84); QA correctness (α = 0.80).
3. Graphical Representation and Computational Pipelines
STAGE employs a fixed schema for knowledge graph construction, representing scripts as collections of directed, typed triples (subject, predicate, object), with binary relations. Graph assembly involves LLM-driven extraction, schema filtering, similarity-based entity clustering, and both automated and minimal human adjudication. Formal notation includes name and description vector embeddings, global similarity matrices, and k-NN clustering (using eigengap heuristics) (Tian et al., 13 Jan 2026).
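A simplified sketch of the entity-merging step: name and description embeddings are combined into a pairwise similarity, and connected components over a similarity threshold stand in for the paper's k-NN graph clustering with eigengap-selected cluster counts. All function names, weights, and embedding vectors below are hypothetical placeholders, not the benchmark's implementation.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def merge_entities(entities, name_emb, desc_emb, alpha=0.5, thresh=0.9):
    """entities: list of ids; *_emb: id -> vector.
    Returns clusters of ids judged to denote the same entity.
    Threshold-based union-find is a stand-in for k-NN + eigengap clustering."""
    def sim(i, j):
        return (alpha * cosine(name_emb[i], name_emb[j])
                + (1 - alpha) * cosine(desc_emb[i], desc_emb[j]))

    parent = {e: e for e in entities}
    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            a, b = entities[i], entities[j]
            if sim(a, b) >= thresh:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for e in entities:
        clusters[find(e)].append(e)
    return list(clusters.values())
```

In the full pipeline, clusters below a confidence margin would be routed to the human adjudication step rather than merged automatically.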
Entities and relations are scored for Precision, Recall, and F1 relative to gold KGs. Scene nodes, dialogue nodes, and character nodes—annotated and embedded—enable integration with downstream graph-based reasoning or summarization encoders as exemplified in the DiscoGraMS CaD Graph (Chitale et al., 2024).
The extraction pipeline is enriched with “reflection-driven retries,” where the LLM evaluates extraction quality and reruns the process if scores fall below a threshold, with human revision only where automated methods fail.
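The reflection-driven retry loop can be sketched as follows. The extractor and scorer below are stand-ins for the actual LLM calls; the 0–10 auto-quality scale comes from the annotation protocol, while the threshold, retry count, and function names are illustrative assumptions.

```python
def extract_with_reflection(scene_text, extract_fn, score_fn,
                            threshold=8, max_retries=3):
    """Rerun extraction until the reflection score clears the threshold.
    extract_fn(scene_text, attempt) -> candidate extraction (LLM stand-in)
    score_fn(scene_text, candidate) -> 0-10 auto-quality score (LLM stand-in)
    Returns (best_candidate, best_score, needs_human_review)."""
    best, best_score = None, -1
    for attempt in range(max_retries):
        candidate = extract_fn(scene_text, attempt)
        score = score_fn(scene_text, candidate)
        if score > best_score:
            best, best_score = candidate, score
        if score >= threshold:
            return best, best_score, False   # accepted automatically
    # All retries fell below threshold: flag for targeted human rework,
    # matching the ~11.7% of cases routed to annotators in the protocol.
    return best, best_score, True
```

Keeping the best-scoring candidate across attempts means a failed final retry does not discard an earlier, better extraction.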
4. Evaluation Metrics and Baseline Performance
Task-specific metrics are designed for narrative abstraction, entity-relation structure, and generative agent alignment:
| Task | Main Metric(s) | Key Results (Best/Noted) |
|---|---|---|
| STAGE-KG | Entity F1, Relation F1 | GPT-4o (EDC): 0.67 / 0.58 |
| STAGE-ES | Event-Structure Consistency (Faithfulness, Coverage) | GPT-4o: 0.72 / 0.78 |
| STAGE-QA | LLM-judged correctness (correct if any of 5 samples correct) | GPT-4o (hybrid retrieval): 65.4% |
| STAGE-ICRP | Persona, Style, Narrative Faithfulness (scores ∈ [0, 1]) | GPT-4o: 0.80 / 0.75 / 0.71 |
Precision, recall, and F1 scores are calculated per standard formulas, with additional metrics such as ESC–Coverage and ESC–Faithfulness for event summarization. LLM-based judgments (GPT-4o) are central to event fidelity and QA correctness. Notably, large models with memory/facts inputs surpass model size scaling alone for role-playing fidelity (Tian et al., 13 Jan 2026).
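Given binary judgments already produced by an LLM judge, the headline metrics reduce to simple aggregations. This is a minimal sketch: the judgment inputs are placeholders, and only the metric definitions (F1 over gold sets, ESC coverage/faithfulness as fractions, QA correct-if-any-of-5) come from the benchmark description.

```python
def precision_recall_f1(pred, gold):
    """Standard P/R/F1 over predicted vs. gold sets (entities or relations)."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def esc(covered_flags, faithful_flags):
    """ESC-Coverage: fraction of gold events reflected in the summary.
    ESC-Faithfulness: fraction of summary statements judged
    non-contradictory to the source. Flags are LLM-judge outputs."""
    cov = sum(covered_flags) / len(covered_flags)
    faith = sum(faithful_flags) / len(faithful_flags)
    return cov, faith

def qa_correct(sample_judgments):
    """STAGE-QA multi-sample aggregation: a question counts as correct
    if any of the (5) sampled answers is judged correct."""
    return any(sample_judgments)
```

Separating coverage from faithfulness is what lets ESC distinguish a summary that omits events from one that invents them.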
5. Methodological Innovations and Cross-Task Consistency
STAGE introduces several methodological advancements:
- Unified narrative world grounding: All tasks are anchored in a shared movie-level narrative world, facilitating cross-task integration.
- Reflection-based LLM extraction pipeline: Iterative quality control and schema-constrained retries balance scale and annotation quality, with minimal human involvement.
- Memory-grounded role-playing: For ICRP, structured narrative memory and facts yield superior coherence and persona fidelity compared to prompt-only or larger model baselines.
- Event-Structure Consistency (ESC): Evaluation decouples abstraction (coverage of gold events) from factuality (faithfulness), providing granular insight into summarization models’ narrative comprehension.
A plausible implication is that fine-grained, schema-controlled representation—leveraging both semantic and structural signals across scenes, events, and agent memory—enables models to transcend the “lost in the middle” limitations typical of long-form narrative tasks.
6. Comparative Context and Directions for Extension
STAGE builds directly on, and extends, prior work in screenplay graph representation and knowledge-centric evaluation. For example, DiscoGraMS constructs a character-aware discourse graph (CaD Graph) integrating scene, dialogue, and character nodes with graph-attention and text encoders for summarization tasks (Chitale et al., 2024). STAGE’s approach generalizes this by supporting more elaborate schemas, direct linkage to long-context QA and agent modeling, and unified evaluation across multiple tasks.
The ETVA paradigm in text-to-video alignment—featuring multi-agent scene graph construction, fine-grained atomic question generation, and knowledge-augmented QA—suggests further possibilities for STAGE-style frameworks. In particular, adapting ETVA’s fine-grained atomic QA (e.g., by mapping screenplay graphs to atomic questions over narrative events or agent beliefs) or its use of auxiliary knowledge sources could provide robust foundations for multimodal evaluation and narrative fact-checking in screenplay analysis (Guan et al., 21 Mar 2025).
STAGE’s restriction to binary relations, as opposed to higher-order event structures, and its current single-modality (text) focus point to future research integrating richer event graphs, discourse-level reasoning, multimodal (audio-visual) features, and adaptive question generation for high-level narrative constructs.
7. Significance and Research Impact
STAGE represents the first large-scale, multi-task benchmark for movie screenplay understanding that explicitly unifies knowledge extraction, abstraction, question answering, and agent-centered generation under a coherent world model. It enables fine-grained diagnostics of model performance across narrative, structural, and persona-driven axes, and its rigorous annotation and evaluation pipelines set a new standard for narrative understanding resources (Tian et al., 13 Jan 2026). This benchmark is poised to facilitate research in long-context language modeling, structured narrative reasoning, and agent-centric generation across screenwriting, creative AI, and media analytics.