Chinese Film Script Continuation Benchmark
- The paper introduces a multi-metric evaluation framework that assesses LLM performance on Chinese film script continuation using 53 culturally significant films.
- It details a rigorous methodology involving dataset construction, script segmentation, automated metrics like ROUGE-L and structural similarity, and LLM-as-Judge evaluation.
- Experimental results highlight GPT-5.2’s superior structural fidelity and overall quality over Qwen-Max-Latest, with notable differences in composite scores.
The Chinese Film Script Continuation Benchmark is a systematic framework for evaluating the performance of LLMs on culturally specific narrative tasks in Chinese creative writing. As LLMs are increasingly deployed for creative and cinematic script generation, their capacity to maintain narrative, stylistic, and formal conventions of Chinese film scripts requires rigorous and reproducible assessment. The benchmark introduces a comprehensive, multi-metric evaluation paradigm for script continuation, leveraging 53 classic Chinese film scripts and comparative analyses of leading LLMs, specifically GPT-5.2 and Qwen-Max-Latest (Cao et al., 21 Jan 2026).
1. Dataset Construction and Preprocessing
1.1 Film Selection
The benchmark draws on 53 Chinese films spanning the period from 1922 to 2021, ensuring coverage of a wide chronological and generic spectrum, including comedy (Detective Chinatown I & II), drama (Happy Together), martial arts (Crouching Tiger, Hidden Dragon), historical (Bronze Swallow Terrace), and romance genres. Selection was based on cultural and historical significance (with exemplars such as Farewell My Concubine (1993), Chungking Express (1994), and Eat Drink Man Woman (1994)), as well as completeness and availability of scripts (no missing "first" or "second" half).
1.2 Script Segmentation and Cleaning
Preprocessing involved UTF-8 text normalization and removal of superfluous annotations. Each film script was then split at the character-count midpoint into two segments: a “first half” (U), serving as the prompt context, and a “second half” (D), used as a reference continuation. Quality filtering ensured that both halves of each script were intact.
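The normalization-and-split step can be sketched as follows. This is an illustrative reconstruction, not the paper's released code; the function name and the choice of NFC normalization are assumptions.

```python
import unicodedata

def split_script(text: str) -> tuple[str, str]:
    """Normalize a UTF-8 script and split it at the character-count midpoint.

    Returns (first_half, second_half) corresponding to the prompt context U
    and the reference continuation D described above.
    """
    text = unicodedata.normalize("NFC", text)
    mid = len(text) // 2
    return text[:mid], text[mid:]
```

A production pipeline would likely snap the split point to the nearest scene boundary rather than cutting mid-line, but the midpoint rule matches the description above.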
1.3 Sample Generation and Filtering
For each film, three independent script continuations were generated per model (GPT-5.2 and Qwen-Max-Latest) using a temperature parameter of 0.7. A chunked generation protocol enabled processing of long contexts (with a maximum of 10 API calls per sample). Continuation validity required: (1) output length between 60% and 90% of the first-half character count, (2) absence of meta-discourse, and (3) a successful API call whose output parsed as valid JSON. This yielded a total of 303 valid samples (157 from GPT-5.2, a 98.7% validity rate; 146 from Qwen-Max-Latest, a 91.8% validity rate).
| Model | Theoretical Samples (53 × 3) | Valid Samples | Validity Rate |
|---|---|---|---|
| GPT-5.2 | 159 | 157 | 98.7 % |
| Qwen-Max | 159 | 146 | 91.8 % |
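The three validity criteria above can be sketched as a filter function. The JSON field name `continuation` and the meta-discourse marker strings are illustrative assumptions, not taken from the paper.

```python
import json

# Illustrative meta-discourse patterns; the actual blocklist is not published.
META_MARKERS = ("作为AI", "作为一个语言模型", "以下是续写", "As an AI")

def is_valid_sample(first_half: str, raw_response: str) -> bool:
    """Apply the benchmark's three validity criteria to one model response."""
    # (3) The API response must parse as JSON with the expected field.
    try:
        continuation = json.loads(raw_response)["continuation"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
    # (1) Length must fall between 60% and 90% of the first-half length.
    if not 0.6 * len(first_half) <= len(continuation) <= 0.9 * len(first_half):
        return False
    # (2) No meta-discourse about the task or the model itself.
    return not any(marker in continuation for marker in META_MARKERS)
```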
2. Evaluation Methodology
2.1 Automated Metrics
- ROUGE-L quantifies lexical overlap by computing the F₁ score of the longest common subsequence (LCS) between the generated continuation $G$ and the reference continuation $R$ after Chinese tokenization. Formally, with $P_{\mathrm{lcs}} = \mathrm{LCS}(G,R)/|G|$ and $R_{\mathrm{lcs}} = \mathrm{LCS}(G,R)/|R|$,
  $$\text{ROUGE-L} = \frac{2\,P_{\mathrm{lcs}}\,R_{\mathrm{lcs}}}{P_{\mathrm{lcs}} + R_{\mathrm{lcs}}}.$$
- Structural Similarity operationalizes conformance to script format via five profile features: scene heading ratio, dialogue ratio, blank line ratio, stage direction ratio, and emphasis density. For each feature $i$, with proportions $p_i^{\text{gen}}$ and $p_i^{\text{ref}}$, the similarity score is:
  $$s_i = 1 - \bigl|p_i^{\text{gen}} - p_i^{\text{ref}}\bigr|, \qquad \text{StructSim} = \frac{1}{5}\sum_{i=1}^{5} s_i.$$
Scores in $[0, 1]$ indicate increasing format fidelity, with 1 denoting an identical format profile.
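Both automated metrics can be sketched in a few lines. The LCS-based F₁ is standard ROUGE-L; the per-feature absolute-difference form of the structural score and the profile keys are assumed instantiations of the description above.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f1(generated: list[str], reference: list[str]) -> float:
    """ROUGE-L F1 over token lists (Chinese tokenization done upstream)."""
    lcs = lcs_length(generated, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(generated), lcs / len(reference)
    return 2 * p * r / (p + r)

def structural_similarity(gen_profile: dict[str, float],
                          ref_profile: dict[str, float]) -> float:
    """Mean of per-feature scores 1 - |p_gen - p_ref| over the five
    format-profile features (an assumed instantiation of the metric)."""
    return sum(1 - abs(gen_profile[k] - ref_profile[k])
               for k in ref_profile) / len(ref_profile)
```

For Chinese scripts, a tokenizer such as jieba would typically supply the token lists fed to `rouge_l_f1`.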
2.2 LLM-as-Judge Evaluation
The DeepSeek-Reasoner LLM functions as an expert screenwriter evaluator. For each sample, it outputs a structured JSON with fields for overall similarity (0–100), plot-event alignment, character consistency, tone-style match, format match, ending closure, mechanism attribution (per Entman’s elements), and supporting evidence via text snippets.
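Judge responses must be parsed and sanity-checked before aggregation. The exact JSON field names below are hypothetical; they follow the dimensions listed above but are not taken from the paper's prompt templates.

```python
import json

# Hypothetical field names mirroring the judge dimensions described above.
REQUIRED_FIELDS = {
    "overall_similarity", "plot_event_alignment", "character_consistency",
    "tone_style_match", "format_match", "ending_closure", "evidence",
}

def parse_verdict(raw: str) -> dict:
    """Parse one judge response and validate its structure and score range."""
    verdict = json.loads(raw)
    missing = REQUIRED_FIELDS - verdict.keys()
    if missing:
        raise ValueError(f"judge output missing fields: {sorted(missing)}")
    if not 0 <= verdict["overall_similarity"] <= 100:
        raise ValueError("overall_similarity out of [0, 100] range")
    return verdict
```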
2.3 Composite Scoring
A composite metric aggregates the lexical, structural, and holistic aspects into a single score as a weighted sum, with the LLM-judged overall quality rescaled from [0, 100] to [0, 1]:
$$S_{\text{composite}} = w_1 \cdot \text{ROUGE-L} + w_2 \cdot \text{StructSim} + w_3 \cdot \frac{\text{OverallQuality}}{100}, \qquad \sum_i w_i = 1.$$
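A minimal sketch of the aggregation follows. The benchmark's actual weights are not reproduced in this summary, so equal weights are used here purely as a placeholder.

```python
def composite_score(rouge_l: float, structural: float, quality_0_100: float,
                    weights: tuple[float, float, float] = (1/3, 1/3, 1/3)) -> float:
    """Weighted aggregation of the three evaluation components.

    quality_0_100 is the LLM-judge overall quality on a 0-100 scale,
    rescaled to [0, 1] before weighting. The default equal weights are a
    placeholder, not the benchmark's published setting.
    """
    w1, w2, w3 = weights
    return w1 * rouge_l + w2 * structural + w3 * (quality_0_100 / 100)
```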
3. Experimental Results and Quantitative Comparison
3.1 Validity and Generation Stability
GPT-5.2 achieved a higher validity rate (98.7%) than Qwen-Max (91.8%). Qwen-Max’s generation was less stable, with 46% of invalid samples attributable to API timeouts and 31% to content moderation blocks.
3.2 Metric-Based Comparison
Paired-sample (n=144) mean scores and effect sizes:
| Metric | GPT-5.2 | Qwen-Max | Difference [95% CI] | Cohen’s d |
|---|---|---|---|---|
| ROUGE-L | 0.2114±0.0497 | 0.2230±0.0502 | –0.012 [–0.016, –0.007] | –0.43 (small) |
| Structural Similarity | 0.9299±0.1562 | 0.7473±0.3495 | +0.183 [+0.118, +0.247] | +0.46 (small) |
| Overall Quality | 44.79±14.42 | 25.72±12.06 | +19.07 [+16.08, +22.06] | +1.04 (large) |
| Composite Score | 0.4979±0.0702 | 0.3906±0.1156 | +0.107 [+0.087, +0.128] | +0.84 (large) |
Cohen's $d$ for paired samples was used for effect size, computed as
$$d = \frac{\bar{D}}{s_D},$$
where $\bar{D}$ is the mean of the per-sample paired differences and $s_D$ their standard deviation.
Qwen-Max achieved a marginally higher ROUGE-L (reflecting slightly greater lexical overlap), whereas GPT-5.2 outperformed in structural preservation, overall quality, and composite score, with large effect sizes for overall quality and composite score ($|d| \geq 0.8$).
4. Qualitative and Analytical Insights
4.1 Fine-Grained Analysis of GPT-5.2
GPT-5.2 demonstrated high character consistency—for example, faithfully preserving Cheng Dieyi’s psychological complexity in Farewell My Concubine, with Character Consistency over 90/100. The model also excelled in tone-style matching, imitating melancholic narrative registers and pacing characteristic of the respective films. Format preservation was robust, with structural similarity values reaching ≈0.99 in selected cases.
4.2 Diagnostic Error Analysis of Qwen-Max
Qwen-Max exhibited high structural variability (σ ≈ 0.35 versus 0.19 for GPT-5.2), compounded by generation instability. Incongruous shifts in tone and failure to maintain consistent dialogue markers were observed in several cases (e.g., Structural Similarity = 0.427 for Farewell My Concubine). This suggests a need for architectural or prompt-stability improvements in Qwen-Max for long-form creative generation.
5. Framework, Availability, and Limitations
5.1 Reproducibility and Code Access
Benchmark assets—including preprocessed scripts (JSONL), format profile detection modules, prompt templates, evaluation scripts, and metric implementations—are available upon request (contact: [email protected]).
5.2 Benchmark Constraints
- The DeepSeek-Reasoner judge may introduce style preference bias.
- Scope is limited by dataset size (53 films, 144 paired samples).
- The evaluation paradigm is restricted to "first half→second half" continuation.
- All reported results are specific to GPT-5.2 and Qwen-Max-Latest (as of 2025 model snapshots).
- No human expert evaluations were conducted; assessment relies solely on automated and LLM-judged metrics.
6. Prospects and Future Work
Recommendations include the introduction of blind human evaluations and cross-validation against LLM-predicted judgments to mitigate bias. Plans include expanding the film corpus in both scale and genre diversity, benchmarking additional LLMs (Claude, Gemini, etc.), and designing finer-grained generation tasks such as scene expansion, dialogue-specific continuation, and script revision. Additional experimental directions involve prompt, temperature, and context-length interventions, as well as hierarchical or outline-guided generation to improve ending closure (Cao et al., 21 Jan 2026). A plausible implication is that further advances in evaluation and model prompting could yield significant improvements in cultural and structural fidelity for LLM-driven Chinese creative writing.