What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation

Published 14 Dec 2025 in cs.CL (arXiv:2512.12839v1)

Abstract: In this work, we conduct systematic research in a challenging area: the automatic evaluation of book-length stories (>100K tokens). Our study focuses on two key questions: (1) understanding which evaluation aspects matter most to readers, and (2) exploring effective methods for evaluating lengthy stories. We introduce the first large-scale benchmark, LongStoryEval, comprising 600 newly published books with an average length of 121K tokens (maximum 397K). Each book includes its average rating and multiple reader reviews, presented as critiques organized by evaluation aspects. By analyzing all user-mentioned aspects, we propose an evaluation criteria structure and conduct experiments to identify the most significant aspects among the 8 top-level criteria. For evaluation methods, we compare the effectiveness of three types: aggregation-based, incremental-updated, and summary-based evaluations. Our findings reveal that aggregation- and summary-based evaluations perform better, with the former excelling in detail assessment and the latter offering greater efficiency. Building on these insights, we further propose NovelCritique, an 8B model that leverages the efficient summary-based framework to review and score stories across specified aspects. NovelCritique outperforms commercial models like GPT-4o in aligning with human evaluations. Our datasets and codes are available at https://github.com/DingyiYang/LongStoryEval.

Summary

  • The paper introduces LongStoryEval, a benchmark of 600 newly published novels whose reader reviews are organized under an eight-aspect taxonomy, addressing the annotation and computational challenges of book-length evaluation.
  • It compares aggregation-based, incremental-updated, and summary-based methodologies, finding that summary-based evaluation with NovelCritique aligns best with human ratings.
  • Empirical analysis reveals strong system-level correlations with human judgments and underlines the need for scalable, bias-aware long-form narrative evaluation.

Systematic Evaluation of Book-Length Stories: Methodology, Dataset, and Model Advances

Motivation and Challenges in Evaluating Long-Form Narratives

Automatic evaluation of lengthy narratives remains a critical yet largely unresolved problem in NLG, especially because stories exceeding 100K tokens surpass the practical context windows of current LLMs and outstrip the scalability of human annotation protocols. Short-form story evaluation has benefited from LLM-based prompting and weakly supervised ranking [Chhun et al., 2022; Yang & Jin, 2024; Yang et al., 2024], but directly transferring these evaluation paradigms to full-length books is hindered by (1) annotation cost, (2) ambiguous or inconsistent evaluation criteria, and (3) architectural and computational processing limits.

LongStoryEval: Curation and Structure

LongStoryEval constitutes a high-coverage benchmark tailored for this underexplored domain, featuring 600 novels with an average of 121K tokens each and comprehensive metadata (title, genres, premise, reviews, reviewer profiles). Critically, each novel is paired with both aggregate rating statistics and a large sample of reader reviews, providing a foundation for supervised and semi-supervised methodologies and analyses of evaluation subjectivity.
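
To make the dataset's organization concrete, the sketch below shows what a single benchmark record could look like. The class and field names (BookRecord, AspectCritique, and so on) are illustrative assumptions for exposition, not the released schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative sketch of one LongStoryEval record; field names are assumptions
# for exposition, not the dataset's released schema.
@dataclass
class AspectCritique:
    aspect: str             # one of the eight top-level criteria, e.g. "Plot"
    critique: str           # reviewer commentary restructured under that aspect
    score: Optional[float]  # per-aspect score, when the review supports one

@dataclass
class BookRecord:
    title: str
    genres: List[str]
    premise: str
    length_tokens: int      # ~121K on average, up to ~397K
    average_rating: float   # aggregate reader rating
    reviews: List[List[AspectCritique]] = field(default_factory=list)
```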

The data curation process ensures that all items are absent from the pretraining data of the tested LLMs, limiting potential contamination and evaluation leakage (Figure 1).

Figure 1: Pipeline for constructing LongStoryEval, from raw reader reviews to aspect-guided structured critiques suitable for modeling.

Reviews are automatically restructured using LLMs to extract salient evaluation aspects, leading to the identification and normalization of over 1,000 unique user-mentioned criteria. These aspects are mapped into an eight-aspect hierarchical taxonomy spanning objective structural concerns (plot, characters, writing, world-building, themes) as well as subjective reader experience facets (emotional impact, enjoyment/engagement, expectation fulfillment) (Figure 2).

Figure 2: Taxonomy of evaluation criteria and how readers interact with these during the immersive process of consuming a narrative.
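
The eight top-level criteria can be represented compactly as a small taxonomy structure. The sketch below mirrors the structural versus reader-experience split described above; the lower-level sub-aspects and the normalize_aspect helper are illustrative assumptions, not the paper's exact normalization procedure.

```python
from typing import Optional

# Sketch of the eight-aspect taxonomy described above. The top-level grouping
# mirrors the paper's structural vs. reader-experience split; the lower-level
# examples under each aspect are illustrative assumptions.
EVALUATION_TAXONOMY = {
    "structural": {
        "Plot": ["pacing", "coherence", "originality"],
        "Characters": ["development", "believability"],
        "Writing": ["prose style", "dialogue"],
        "World-Building": ["setting detail", "internal consistency"],
        "Themes": ["depth", "resonance"],
    },
    "reader_experience": {
        "Emotional Impact": [],
        "Enjoyment/Engagement": [],
        "Expectation Fulfillment": [],
    },
}

def normalize_aspect(user_mention: str) -> Optional[str]:
    """Map a free-form, user-mentioned criterion onto a top-level aspect."""
    mention = user_mention.strip().lower()
    for group in EVALUATION_TAXONOMY.values():
        for top_level, sub_aspects in group.items():
            if mention == top_level.lower() or mention in sub_aspects:
                return top_level
    return None  # unmapped mentions would need LLM-assisted normalization
```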

Distributional statistics confirm substantial review diversity and broad criteria coverage across genres (Figures 3 and 4).

Figure 3: Score and book length distributions reveal both evaluation diversity and a wide coverage of narrative lengths.

Figure 4: Empirical incidence of evaluation aspects as extracted from natural reader reviews, affirming the taxonomy coverage.

Decomposition of Evaluation Methodologies

Three evaluation paradigms for long-form stories are compared:

  • Aggregation-based: Evaluate and score each chapter or segment independently, then aggregate (e.g., by mean) into global scores. This preserves sensitivity to local anomalies but is computationally expensive.
  • Incremental-updated: Simulate a human reader's incremental process, updating scores and critiques as each chapter is processed, with access to prior summaries and judgments.
  • Summary-based: Employ incremental summarization (with chapter-level updates) to produce a compact book synopsis and character sketch, which serve as input for a single book-level evaluation (Figure 5; a code sketch of all three paradigms follows the figure).

    Figure 5: Architecture and data-flow overview for aggregation-based, incremental-updated, and summary-based evaluation approaches applied to book-length content.
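
The following minimal sketch contrasts the three paradigms as plain evaluation loops. The callables llm_score, llm_update, and llm_summarize stand in for LLM calls and are assumptions for exposition, not interfaces from the released code.

```python
from statistics import mean

# Minimal sketch of the three evaluation paradigms as plain loops. The
# callables `llm_score`, `llm_update`, and `llm_summarize` stand in for LLM
# calls and are assumptions, not interfaces from the paper's released code.

def aggregation_based(chapters, llm_score):
    # Score every chapter independently, then aggregate (mean here).
    return mean(llm_score(ch) for ch in chapters)

def incremental_updated(chapters, llm_update):
    # Carry a running critique/score forward as each chapter is read.
    state = {"summary": "", "critique": "", "score": None}
    for ch in chapters:
        state = llm_update(state, ch)
    return state["score"]

def summary_based(chapters, llm_summarize, llm_score):
    # Incrementally build a compact synopsis, then evaluate it once.
    synopsis = ""
    for ch in chapters:
        synopsis = llm_summarize(synopsis, ch)
    return llm_score(synopsis)
```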

Results indicate that a single-pass evaluation over the entire book is computationally infeasible or ineffective; aggregation-based and summary-based methods align best with human ratings, with aggregation-based evaluation excelling at detail assessment and summary-based evaluation offering the better cost profile.

Model Development: NovelCritique

NovelCritique is introduced as an 8B-parameter, Llama-3.1-based model, instruction-tuned on structured critiques and normalized scores from LongStoryEval. The model uses the summary-based framework exclusively, consuming both textually rich summaries and reviewer judgments aligned with the established aspect taxonomy. Bias mitigation and reviewer normalization are applied during training to counteract rating and review-selection biases in the Goodreads-derived review corpus (Figure 6).

Figure 6: Schematic for NovelCritique highlighting incremental summarization and the transformation of summary/context into aspect-structured critique and scoring.
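
A hedged sketch of how summary-based, aspect-guided scoring might be invoked at inference time is shown below; the prompt wording and the generate wrapper are assumptions, not the released NovelCritique interface.

```python
# Hedged sketch of summary-based, aspect-guided scoring at inference time.
# The prompt wording and the `generate` wrapper are assumptions, not the
# released NovelCritique interface.
ASPECTS = [
    "Plot", "Characters", "Writing", "World-Building", "Themes",
    "Emotional Impact", "Enjoyment/Engagement", "Expectation Fulfillment",
]

def build_critique_prompt(synopsis: str, character_sketch: str) -> str:
    aspect_list = "\n".join(f"- {a}" for a in ASPECTS)
    return (
        "You are a literary critic. Given the book synopsis and character "
        "sketch below, write a critique and assign a score for each aspect.\n"
        f"Aspects:\n{aspect_list}\n\n"
        f"Synopsis:\n{synopsis}\n\n"
        f"Characters:\n{character_sketch}\n"
    )

def evaluate_book(synopsis: str, character_sketch: str, generate) -> str:
    # `generate` wraps the instruction-tuned 8B model (e.g. via transformers).
    return generate(build_critique_prompt(synopsis, character_sketch))
```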

Ablation studies show that organizing reviews around explicit criteria, correcting review bias, and normalizing by user standards each meaningfully enhance model alignment with ground truth (Figure 7).

Figure 7: Decomposition of system performance gains attributable to core components of the NovelCritique pipeline.

Empirical Results and Analysis

NovelCritique achieves strong system-level Kendall-tau correlations with human ratings (up to 27.7 for overall scores), substantially surpassing GPT-4o and other LLMs on both holistic and aspect-specific axes. Aggregation- and summary-based strategies are consistently superior to incremental-updated or naive single-pass evaluations.
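
For reference, the snippet below illustrates how such a system-level Kendall-tau correlation can be computed with SciPy over per-book scores; all numbers are made up, and the reported 27.7 presumably corresponds to a tau value expressed on a 0-100 scale.

```python
from scipy.stats import kendalltau

# Illustrative system-level Kendall-tau computation over per-book scores;
# the values below are made up for demonstration only.
human_scores = [4.2, 3.1, 3.8, 2.5, 4.6]   # average reader ratings per book
model_scores = [4.0, 3.3, 3.5, 2.9, 4.4]   # model-predicted overall scores

tau, p_value = kendalltau(human_scores, model_scores)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```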

Analysis reveals:

  • Plot and character aspects are the primary contributors to final ratings; writing and world-building, while frequently mentioned, have lesser marginal impacts.
  • Subjective criteria (emotional impact, enjoyment, expectation fulfillment) are critical for distinguishing between works of similar technical merit.
  • High-quality input summaries marginally improve summary-based model performance over condensed/cheaper alternatives, but diminishing returns are observed beyond a certain summary granularity threshold.

Qualitative review examination exposes a tendency of strong LLMs to overlook story weaknesses unless explicitly prompted, often skewing towards leniency on low-rated works (Figure 8). In comparison, NovelCritique provides more nuanced and critical commentary, reflecting both strengths and weaknesses in proportion to the ground-truth reader assessments.

Figure 8: Diagnostic critique illustration for a poorly rated novel, with generated weaknesses explicitly highlighted.

Figure 9: Example detailed review for a highly rated novel, reflecting aspect guidance and alignment with human impressions.

Efficiency analysis confirms a major computational and monetary advantage for summary-based evaluation. With GPT-4o, summary-based evaluation (averaged over five runs) is an order of magnitude cheaper and faster than the aggregation or incremental approaches, confirming its suitability as the practical default for this domain.

Implications, Limitations, and Future Directions

This study delivers a methodological and empirical advance in the automatic evaluation of long-form narratives, offering a scalable, high-fidelity dataset and modeling pipeline. The findings challenge the implicit assumption that LLM-based evaluation reliability extends trivially to long-form narratives: inconsistencies and biases are exacerbated at book scale, requiring architecture, data, and training adaptations that short-form NLG evaluation does not need [Stureborg et al., 2024; Yang et al., 2024].

Ethical handling of reader reviews and protection of book copyright, enforced by releasing only summaries rather than full texts, ensure compliance with standard research norms [Wan et al., 2019].

Moving forward, reliable and interpretable aspect-based evaluation for long-form content is positioned to catalyze advances in narrative generation, personalized recommender systems, and creative AI co-authoring frameworks. Scalable pairwise or ranking-based protocols (including data-efficient sampling variants [Xu et al., 2024]) promise to overcome the inconsistency and calibration issues of direct numerical scoring. Personalized evaluation, conditioned on user history, genre preferences, and bias patterns, remains an open research target as more user metadata becomes available [Wang et al., 2023].

Conclusion

This paper establishes the LongStoryEval dataset and a reference methodology for evaluating book-length stories with high fidelity to real reader standards. Through a comprehensive aspect taxonomy, a comparative study of evaluation paradigms at book scale, and the NovelCritique model, the work both quantifies and addresses the limitations of LLM-based evaluators for long-form narratives. The demonstrated improvement over commercial and open-source baselines with aspect-guided, summary-based evaluation points to viable paths for robust, efficient evaluation in machine-assisted literary criticism, story-generation validation, and NLG research generally.

Reference: "What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation" (2512.12839)
