Evaluating LLM Storytelling with WebNovelBench
The paper "WebNovelBench: Placing LLM Novelists on the Web Novel Distribution" introduces a novel benchmark for evaluating long-form novel generation capabilities of LLMs. This benchmark, named WebNovelBench, is constructed using a large-scale dataset of over 4,000 Chinese web novels, facilitating the assessment of models through a synopsis-to-story generation task. The framework encompasses eight narrative quality dimensions, evaluated automatically via an LLM-as-Judge approach. Scores from these evaluations are aggregated using Principal Component Analysis (PCA) and mapped to a percentile rank against comparable human-authored works.
Dataset and Evaluation Methodology
WebNovelBench draws on a diverse corpus of Chinese web novels spanning numerous genres and themes. This dataset anchors the synopsis-to-narrative challenge, in which LLMs generate stories from provided synopses. The evaluation protocol aggregates per-dimension scores with PCA, deriving dimension weights from the variance each component explains, so the final score reflects overall storytelling quality rather than treating every dimension as equally important.
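A minimal sketch of this aggregation and the percentile mapping is shown below. It assumes the eight per-dimension judge scores for human-authored works are collected in a matrix, derives dimension weights from PCA loadings weighted by explained variance, and ranks an LLM's aggregate score against the human distribution; the paper's exact formula may differ.

```python
# Sketch of PCA-based score aggregation and percentile mapping.
# `human_scores` is an (n_novels x 8) matrix of per-dimension judge scores
# for human-authored works. The weighting scheme below is one plausible
# reading of the paper's description, not necessarily its exact method.
import numpy as np
from sklearn.decomposition import PCA

def fit_dimension_weights(human_scores: np.ndarray) -> np.ndarray:
    pca = PCA()
    pca.fit(human_scores)
    # Weight each component's loadings by the share of variance it explains,
    # then normalize to obtain per-dimension weights that sum to 1.
    loadings = np.abs(pca.components_)                   # (k, 8)
    weights = pca.explained_variance_ratio_ @ loadings   # (8,)
    return weights / weights.sum()

def aggregate(scores: np.ndarray, weights: np.ndarray) -> float:
    return float(scores @ weights)

def percentile_rank(llm_score: float, human_aggregates: np.ndarray) -> float:
    # Percentage of human-authored works whose aggregate score the LLM matches or beats.
    return 100.0 * float(np.mean(human_aggregates <= llm_score))

# Placeholder data, for illustration only.
rng = np.random.default_rng(0)
human_scores = rng.uniform(1, 10, size=(4000, 8))
weights = fit_dimension_weights(human_scores)
human_aggregates = human_scores @ weights
llm_story_scores = rng.uniform(1, 10, size=8)  # one generated story
print(percentile_rank(aggregate(llm_story_scores, weights), human_aggregates))
```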
The evaluation dimensions are meticulously chosen to cover a wide array of narrative aspects, including literary device usage, sensory detail richness, character dialogue distinctiveness, and consistency of characterization. Each dimension is scored using an LLM-as-Judge method that offers a scalable and efficient alternative to human evaluation, aiming for alignment with human preferences while minimizing bias.
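The LLM-as-Judge step might look like the sketch below, where a judge model is asked to return JSON scores for each dimension. The judge model, prompt wording, 1-10 scale, and dimension identifiers beyond the four named above are assumptions for illustration, not the paper's rubric.

```python
# Illustrative LLM-as-Judge scoring sketch. Judge model, prompt, scale, and
# JSON output contract are assumptions; the paper's rubric covers eight
# dimensions, only four of which are listed in the text above.
import json
from openai import OpenAI

client = OpenAI()

DIMENSIONS = [
    "literary_device_usage",
    "sensory_detail_richness",
    "dialogue_distinctiveness",
    "characterization_consistency",
    # ...remaining dimensions from the paper's rubric
]

def judge_story(story: str, judge_model: str = "gpt-4o") -> dict[str, float]:
    prompt = (
        "Rate the following story from 1 to 10 on each dimension: "
        + ", ".join(DIMENSIONS)
        + ". Respond with a JSON object mapping dimension name to score.\n\n"
        + story
    )
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```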
The benchmark assesses the performance of 24 state-of-the-art LLMs and effectively differentiates human-written works from machine-generated narratives. Top models, such as Qwen3-235B-A22B and DeepSeek-R1, perform strongly across all dimensions, with Qwen3-235B-A22B achieving a norm score of 5.21, indicating close alignment with quality human writing. This suggests that, on these automated metrics, leading LLMs are approaching, and in some cases may exceed, typical human-authored web novel quality.
Interestingly, the analysis identifies a convergence in performance between proprietary and open-source models, indicating that open-source models are rapidly closing the gap with their closed-source counterparts. Beyond ranking individual models, the benchmark provides a broad view of current LLM storytelling capability, which can guide further work in machine-driven narrative generation.
Implications and Future Directions
WebNovelBench establishes a standardized methodology for appraising LLM storytelling, giving the field a consistent basis for model comparison. Because it is grounded in popular web novels, its rankings track broad human literary preferences, which keeps the benchmark relevant to the AI research community. It also provides a scalable framework for both practical applications (e.g., data quality assessment) and further research in narrative generation.
Future enhancements could include expansion to other languages and literary forms, integration of diverse LLM judges, and refinement of evaluation criteria. This would further leverage the benchmark's principles, potentially catalyzing innovations in AI-generated creative content across global platforms. Additionally, exploring the benchmark's utility in model training and dataset refinement could illuminate new pathways for improving LLM narrative fluency and richness.
Overall, WebNovelBench represents a significant contribution to the evaluation of LLM storytelling, offering a robust, reproducible, and insightful framework for guiding future advancements in AI literature creation.