DramaBench: Drama-Script Evaluation Framework

Updated 24 December 2025
  • DramaBench is a comprehensive framework for evaluating drama-script continuations using six orthogonal axes: format, narrative, character, emotion, logic, and conflict.
  • It combines deterministic rule-based parsing with LLM-driven annotations to compute precise metrics like Format Error Rate, Narrative Efficiency, and Logic Consistency.
  • The framework provides actionable diagnostic feedback and fair model benchmarking through aggregated scores such as DramaBenchScore and AvgRank on a large script dataset.

DramaBench is the first large-scale, multidimensional evaluation framework targeting drama-script continuation, establishing rigorous and reproducible metrics across six distinct axes of dramatic quality. It enables comprehensive assessment of generative LLMs’ abilities to synthesize contextually consistent, emotionally engaging, and structurally coherent continuations of professionally written scripts. Combining deterministic rule-based analysis with LLM-driven annotation and statistical metric aggregation, DramaBench supports both diagnostic feedback and benchmarking, setting a new standard for creative-writing model evaluation (Ma et al., 22 Dec 2025).

1. Evaluation Dimensions

DramaBench introduces six formally independent axes, each identified as necessary for evaluating drama continuation:

  1. Format Standards: Deterministic parsing enforces compliance with the industry-standard Fountain screenplay syntax and examines the balance of action and dialogue. Metrics include Format Error Rate (FER), Novelization Index (NI), and Dialogue–Action Ratio (DAR), with targets FER < 1%, NI < 0.35, and DAR ∈ [1.0, 2.0]; a metric-computation sketch follows this list.
  2. Narrative Efficiency: An LLM-based extractor identifies and classifies narrative “beats” as driver (plot-advancing), static (descriptive), or redundant. Effective Narrative Rate (ENR) and Beats-Per-Page (BPP) measure density of plot advancement.
  3. Character Consistency: The LLM profiles character personas from context, then labels each utterance for in-character fidelity versus out-of-character (OOC) violations. Metrics include OOC Rate and Voice Distinctiveness (VD).
  4. Emotional Depth: LLM labeling tracks per-scene protagonist emotion dynamics (valence and arousal), emotional arc shifts, and the presence of complex emotions. Arc Score and Complexity Ratio (CR) quantify these aspects.
  5. Logic Consistency: The LLM extracts atomic “hard” facts from the context, labeling continuation adherence or contradiction (violated/maintained). Logic Break Rate (LBR) and Context Coherence (CC) provide strict measures of factual coherence.
  6. Conflict Handling: The core conflict trajectory is classified as escalation, twist, pause, resolution, or dropped, with pointwise metrics (Conflict Score, Drop Rate) reflecting tension management.
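
As a concrete illustration of the Format Standards dimension, the sketch below derives FER, NI, and DAR from the output of a strict Fountain parser. The element labels, the NI proxy (share of action/prose lines), and the function name are illustrative assumptions, not DramaBench's released implementation.

```python
from collections import Counter

def format_metrics(elements, parse_errors):
    """Compute Format Standards metrics from strict-parser output.

    elements: element labels emitted by a Fountain parser, e.g. "scene_heading",
              "action", "character", "parenthetical", "dialogue".
    parse_errors: number of lines the parser rejected.
    """
    counts = Counter(elements)
    total = sum(counts.values()) + parse_errors
    fer = parse_errors / total if total else 0.0      # target: FER < 1%
    # NI approximated here as the share of action/prose lines; the exact
    # Novelization Index definition is not spelled out in this summary.
    ni = counts["action"] / total if total else 0.0   # target: NI < 0.35
    dar = counts["dialogue"] / counts["action"] if counts["action"] else float("inf")
    return {"FER": fer, "NI": ni, "DAR": dar}          # target: DAR in [1.0, 2.0]
```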

Each dimension is operationally independent, as shown by near-zero mean absolute inter-dimension correlations (mean |r| = 0.014 over 8,824 samples; maximum |r| = 0.035), confirming that no axis is redundant and that each domain is separately actionable for model development.

2. Methodology and Metric Computation

For each continuation task (defined by a script context and a model-generated continuation), DramaBench applies the following dimension-specific evaluation procedures:

  • Format Standards: A deterministic Fountain parser computes FER, NI, and DAR directly from output tokens. Any parse violation lowers the Format score.
  • Narrative Efficiency: An LLM segments the script into narrative beats and labels each; ENR is computed as the fraction of driver beats, and BPP scales driver beats by continuation length to standardize comparisons.
  • Character Consistency: The LLM synthesizes prototypical personas from the context and labels dialogue lines as In_Character, Neutral, or OOC. OOC Rate = N_OOC / (total dialogue lines); VD = N_in_character / (total lines).
  • Emotional Depth: For each scene, the LLM annotates protagonist opening/closing emotion, arc shifts, and complex emotions. Arc Score is per-scene binary (shift/static); CR reflects the frequency of complex emotions.
  • Logic Consistency: “Hard facts” are extracted from the context via prompting, and each fact is then checked for consistency in the continuation. LBR = N_violated / (N_violated + N_maintained); CC simply counts maintained facts.
  • Conflict Handling: The LLM classifies the continuation's conflict trajectory as escalation (+2 points), twist (+2), pause (+1), resolution (0), or dropped (–5). A sketch of these label-to-metric computations follows this list.
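
The per-dimension procedures above reduce to simple counting over the LLM-assigned labels. The following is a minimal sketch assuming label strings such as "driver", "OOC", "In_Character", "violated", and "maintained" arrive from the annotation passes; function names and label spellings are illustrative rather than the released DramaBench code.

```python
def narrative_metrics(beat_labels, n_pages):
    """ENR = driver beats / all beats; BPP = driver beats per page."""
    drivers = sum(1 for b in beat_labels if b == "driver")
    enr = drivers / len(beat_labels) if beat_labels else 0.0
    bpp = drivers / n_pages if n_pages else 0.0
    return enr, bpp

def character_metrics(line_labels):
    """OOC Rate and Voice Distinctiveness over labelled dialogue lines."""
    total = len(line_labels)
    ooc_rate = sum(1 for l in line_labels if l == "OOC") / total if total else 0.0
    vd = sum(1 for l in line_labels if l == "In_Character") / total if total else 0.0
    return ooc_rate, vd

def logic_metrics(fact_labels):
    """LBR = violated / (violated + maintained); CC = count of maintained facts."""
    violated = sum(1 for f in fact_labels if f == "violated")
    maintained = sum(1 for f in fact_labels if f == "maintained")
    checked = violated + maintained
    lbr = violated / checked if checked else 0.0
    return lbr, maintained

# Point values for the conflict-trajectory classes listed above.
CONFLICT_POINTS = {"escalation": 2, "twist": 2, "pause": 1, "resolution": 0, "dropped": -5}
```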

To compare models, DramaBench offers two overall aggregation strategies, sketched in code after the list below:

  • DramaBenchScore: Dimensional scores are rescaled to [0,1], then averaged.
  • AvgRank: Models are ranked per dimension (lower=better), then mean rank is reported.
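
A minimal sketch of both aggregations, assuming a table of per-model dimension scores oriented so that higher is better (the data layout and orientation are assumptions):

```python
import pandas as pd

def dramabench_score(scores: pd.DataFrame) -> pd.Series:
    """Rows = models, columns = dimension scores (higher = better).
    Min-max rescale each dimension to [0, 1], then average across dimensions;
    zero-variance columns (e.g. Format in the reported runs) need special-casing."""
    rescaled = (scores - scores.min()) / (scores.max() - scores.min())
    return rescaled.mean(axis=1).sort_values(ascending=False)

def avg_rank(scores: pd.DataFrame) -> pd.Series:
    """Rank models within each dimension (rank 1 = best), then report the mean rank."""
    ranks = scores.rank(axis=0, ascending=False)
    return ranks.mean(axis=1).sort_values()
```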

3. Dataset Construction and Experimental Protocol

DramaBench leverages a dataset of 1,103 English-language short scripts, each professionally written and formatted in Fountain. Each script is split at a scene boundary into a context segment (≈51 lines, 381 tokens) and a continuation segment (≈62 lines, 401 tokens). The benchmark encompasses 8,824 model-script evaluations (1,103 scripts × 8 models).
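
A scene-aware split can be approximated by cutting at the Fountain scene heading nearest a target context length; the regular expression and the 51-line default below are illustrative assumptions, not the benchmark's published preprocessing code.

```python
import re

SCENE_HEADING = re.compile(r"^(INT|EXT|EST|I/E)[.\s]|^\.", re.IGNORECASE)

def split_at_scene_boundary(script_text: str, target_context_lines: int = 51):
    """Split a Fountain script into (context, continuation) at the scene heading
    closest to the target context length."""
    lines = script_text.splitlines()
    boundaries = [i for i, line in enumerate(lines) if SCENE_HEADING.match(line.strip())]
    if len(boundaries) < 2:
        raise ValueError("need at least two scene headings for a scene-aware split")
    split = min(boundaries[1:], key=lambda i: abs(i - target_context_lines))
    return "\n".join(lines[:split]), "\n".join(lines[split:])
```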

Evaluated models include Claude Opus 4.5, DeepSeek v3.2, GLM-4.6, Gemini 3 Pro, Kimi K2, MiniMax M2, GPT-5.2, and Qwen3-Max. Statistical reliability is assessed with 252 Mann–Whitney U tests (28 model pairs × 9 metrics) under Benjamini–Hochberg FDR correction (q=0.05); 65.9% of comparisons yield significant differences. Human validation on 188 scripts (17%) quantifies LLM–human agreement via Pearson correlation (r) and Cohen’s κ, showing substantial concordance on Logic Consistency (r=0.48), Emotional Depth (κ=0.53), and Conflict Handling (κ=0.42), but revealing low agreement, attributed to evaluator biases, on Narrative Efficiency (r=0.07) and Character Consistency (r=–0.04).
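
The significance protocol can be reproduced with standard libraries; the sketch below assumes per-script metric values are collected per model (the data layout and function name are illustrative).

```python
from itertools import combinations
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def pairwise_significance(per_script_scores, alpha=0.05):
    """per_script_scores[model][metric] -> list of per-script values.
    Runs a Mann-Whitney U test for every model pair and metric, then applies
    Benjamini-Hochberg FDR correction (28 pairs x 9 metrics = 252 tests above)."""
    models = sorted(per_script_scores)
    metrics = sorted(next(iter(per_script_scores.values())))
    labels, pvals = [], []
    for a, b in combinations(models, 2):
        for m in metrics:
            _, p = mannwhitneyu(per_script_scores[a][m], per_script_scores[b][m],
                                alternative="two-sided")
            labels.append((a, b, m))
            pvals.append(p)
    reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return list(zip(labels, p_adj, reject))
```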

4. Dimension Independence and Diagnostic Power

Spearman correlation analysis across the five content dimensions (excluding Format, which exhibited zero variance) demonstrates strong independence (mean |r| = 0.014, maximum |r| = 0.035). This is consistent across the full sample of 8,824 evaluations and stable over different models (standard deviation σ_r = 0.0053). A direct implication is that each evaluation axis isolates a discrete quality of dramatic writing, supporting targeted error taxonomy and contrastive learning for model improvement.
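
The independence check itself is a short computation; the sketch below assumes a matrix of per-sample scores for the five content dimensions.

```python
import numpy as np
from scipy.stats import spearmanr

def inter_dimension_correlation(scores):
    """scores: array of shape (n_samples, 5) with the five content-dimension
    scores per evaluation (Format excluded for zero variance).  Returns the mean
    and maximum absolute off-diagonal Spearman correlation."""
    rho, _ = spearmanr(scores)                          # 5 x 5 correlation matrix
    upper = np.abs(rho[np.triu_indices_from(rho, k=1)])
    return upper.mean(), upper.max()
```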

5. Actionable Application Guidelines

The DramaBench protocol is designed for practical adoption in both research and production model assessment:

  1. Dataset Preparation: Curate scripts in Fountain or convert existing data; split at scene boundaries.
  2. Format Analysis: Apply a deterministic Fountain parser for FER, NI, and DAR; correct prompt templates or pre-/post-processing as required to enforce FER ≈ 0.
  3. Dimension Labeling: Deploy an LLM with structured prompts per dimension for scenario-specific annotation—beat identification, persona extraction, emotion arc tracing, fact consistency checking, and conflict mapping.
  4. Metric Aggregation: Use the provided formulas to compute per-dimension numeric scores from labeled data.
  5. Model Comparison: Normalize dimensional metrics and compute DramaBenchScore or AvgRank for system-level assessment and diagnostics.
  6. Fine-grained Error Analysis: Aggregate labeled errors by category (OOC lines, redundant beats, logic breaks, etc.) for contrastive example mining and targeted fine-tuning; see the sketch after this list.
  7. Iterative Development: Use negative labels for DPO or reward model fine-tuning; re-evaluate post-training iteration.
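
For step 6, a minimal error-aggregation sketch is shown below; the per-sample field names ("dialogue_labels", "beat_labels", "fact_labels", "conflict_label") are illustrative stand-ins for whatever schema the annotation step emits.

```python
from collections import defaultdict

def collect_errors(labelled_samples):
    """Bucket labelled errors by category for contrastive example mining."""
    buckets = defaultdict(list)
    for s in labelled_samples:
        buckets["ooc_lines"] += [(s["script_id"], line)
                                 for line, lab in s["dialogue_labels"] if lab == "OOC"]
        buckets["redundant_beats"] += [(s["script_id"], beat)
                                       for beat, lab in s["beat_labels"] if lab == "redundant"]
        buckets["logic_breaks"] += [(s["script_id"], fact)
                                    for fact, lab in s["fact_labels"] if lab == "violated"]
        if s["conflict_label"] == "dropped":
            buckets["dropped_conflicts"].append(s["script_id"])
    return buckets
```

The resulting buckets feed step 7 directly as negative examples for DPO or reward-model fine-tuning.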

This pipeline enables interpretable and actionable feedback, directly informing model-specific development priorities and evidencing strengths or failures in creative-writing capabilities.

6. Significance and Limitations

DramaBench addresses deficiencies of existing story-generation and continuation benchmarks by evaluating model outputs across multiple, rigorously orthogonal axes and providing both rule-based and LLM-powered annotation regimes. Existing evaluation frameworks typically neglect dimensions such as format conformance or multi-character emotional arcs. The substantial human–LLM agreement on key axes and the reproducibility of the metrics position DramaBench as a critical standard for research in story AI and creative-writing LLMs.

A plausible implication is that the fine-grained feedback structure supports not only fair ranking but also gradient-like, dimension-specific reward modeling. The finding that Narrative Efficiency and Character Consistency exhibit low annotator agreement highlights the inherent subjectivity of these axes, suggesting that future work may further standardize annotation protocols or employ model-ensemble annotation to mitigate bias.

7. Prospects for Future Research

DramaBench establishes a toolchain for reproducible, interpretable, and multidimensional script continuation assessment. Immediate areas for refinement include improving annotator consensus on subjective axes, extending the framework for multilingual or cross-genre scripts, and integrating adversarial challenge sets to probe model generalization. The framework is positioned to accelerate the development of LLMs exhibiting not only syntactic fluency or coherence, but also robust dramatic sensibility and adherence to creative-writing conventions (Ma et al., 22 Dec 2025).
