Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers (2403.01061v3)

Published 2 Mar 2024 in cs.CL

Abstract: We evaluate recent LLMs on the challenging task of summarizing short stories, which can be lengthy, and include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and LLama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle with specificity and interpretation of difficult subtext. We additionally demonstrate that LLM ratings and other automatic metrics for summary quality do not correlate well with the quality ratings from the writers.

Evaluating LLMs on the Subtle Task of Short Story Summarization: A Study with Unseen Data and Expert Writers

Introduction

Short story summarization presents a unique challenge for LLMs because of the inherent complexity of narrative structure, which can involve nuanced subtext, non-linear timelines, and a mix of abstract and concrete detail. The paper "Reading Subtext: Evaluating LLMs on Short Story Summarization with Writers" examines how well current LLMs (specifically GPT-4, Claude-2.1, and Llama-2-70B) summarize complex short stories that have never been shared online and are therefore unseen by the models prior to evaluation.

Methodology

The authors collaborate directly with experienced writers and use their unpublished short stories as test cases, ensuring that the stories do not appear in the models' training data. This preserves the originality of the data and allows summary quality to be judged by the people best placed to do so: the stories' own authors. The evaluation combines quantitative and qualitative analysis grounded in narrative theory, scoring summaries along four dimensions: coherence, faithfulness, coverage, and analysis, with the last capturing interpretation of themes and subtext, a notable addition to standard summarization rubrics. The authors also compare these expert ratings against LLM-based judgments and other automatic metrics, providing a pointed critique of current automatic evaluation methodology.
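
To make the comparison between writer ratings and automatic ratings concrete, the sketch below shows one way such an analysis could be set up. It is a minimal illustration rather than the authors' code: the `SummaryRatings` structure, the example scores, and the 1-4 scale are hypothetical, and only the four rubric dimensions (coherence, faithfulness, coverage, analysis) and the idea of rank-correlating writer and automatic ratings are drawn from the paper's framing.

```python
# Minimal sketch (not the paper's code): compare writer ratings with
# LLM/automatic ratings on the four rubric dimensions used in the study.
from dataclasses import dataclass
from scipy.stats import spearmanr

DIMENSIONS = ["coherence", "faithfulness", "coverage", "analysis"]

@dataclass
class SummaryRatings:
    summary_id: str
    writer: dict      # dimension -> rating from the story's author (illustrative 1-4 scale)
    automatic: dict   # dimension -> rating from an LLM judge or automatic metric

def correlation_by_dimension(ratings: list) -> dict:
    """Spearman rank correlation between writer and automatic ratings, per dimension."""
    results = {}
    for dim in DIMENSIONS:
        writer_scores = [r.writer[dim] for r in ratings]
        auto_scores = [r.automatic[dim] for r in ratings]
        rho, p_value = spearmanr(writer_scores, auto_scores)
        results[dim] = (rho, p_value)
    return results

# Hypothetical example data: three summaries rated on each dimension.
example = [
    SummaryRatings("s1",
                   writer={"coherence": 4, "faithfulness": 2, "coverage": 3, "analysis": 2},
                   automatic={"coherence": 4, "faithfulness": 4, "coverage": 4, "analysis": 3}),
    SummaryRatings("s2",
                   writer={"coherence": 3, "faithfulness": 3, "coverage": 2, "analysis": 1},
                   automatic={"coherence": 4, "faithfulness": 3, "coverage": 4, "analysis": 4}),
    SummaryRatings("s3",
                   writer={"coherence": 2, "faithfulness": 1, "coverage": 2, "analysis": 2},
                   automatic={"coherence": 3, "faithfulness": 4, "coverage": 3, "analysis": 4}),
]

for dim, (rho, p) in correlation_by_dimension(example).items():
    print(f"{dim}: rho={rho:.2f} (p={p:.2f})")
```

A weak or negative correlation on a dimension would indicate, consistent with the paper's finding, that the automatic judge is not a reliable stand-in for the writers' own assessments.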

Key Findings

The findings reveal mixed performance across the evaluated LLMs. All three models make faithfulness errors in more than 50% of their summaries and struggle to interpret difficult subtext, yet at their best they can produce insightful thematic analysis. GPT-4 emerges as the most capable, followed closely by Claude-2.1, with Llama-2-70B lagging behind, particularly on longer stories. The evaluation also shows a clear disparity between LLM-generated judgments of summary quality and the writers' own ratings, underscoring that LLMs cannot yet replace human expertise in nuanced tasks like narrative summarization.
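
As a concrete reading of the headline faithfulness number, the sketch below aggregates per-summary error flags into a per-model error rate, i.e. the fraction of a model's summaries that contain at least one unfaithful claim. This is a hypothetical reconstruction, not the authors' analysis code: the annotation format and the example records are invented for illustration.

```python
# Minimal sketch (illustrative, not the authors' analysis code): turn
# per-summary faithfulness annotations into a per-model error rate, i.e. the
# fraction of a model's summaries flagged as containing at least one
# unfaithful claim.
from collections import defaultdict

def faithfulness_error_rate(annotations):
    """annotations: iterable of dicts with 'model' and 'has_error' keys."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for row in annotations:
        total[row["model"]] += 1
        flagged[row["model"]] += int(row["has_error"])
    return {model: flagged[model] / total[model] for model in total}

# Hypothetical annotations for illustration only.
example = [
    {"model": "GPT-4", "summary_id": "s1", "has_error": True},
    {"model": "GPT-4", "summary_id": "s2", "has_error": False},
    {"model": "Claude-2.1", "summary_id": "s1", "has_error": True},
    {"model": "Claude-2.1", "summary_id": "s2", "has_error": True},
]

print(faithfulness_error_rate(example))  # {'GPT-4': 0.5, 'Claude-2.1': 1.0}
```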

Implications and Future Directions

This paper underlines several critical areas for future research, especially in improving LLMs’ understanding of narrative structures and subtext. The pronounced difficulty in summarizing stories with complex narratives, unreliable narrators, or detailed subplots suggests a need for models that can better grasp the subtleties of human storytelling. Furthermore, the mismatch between LLM and human evaluations of summaries prompts a reevaluation of current summary quality metrics, advocating for more human-centered approaches in assessing narrative understanding.

Moreover, the research methodology adopted here, specifically the direct engagement with creative communities and the use of unpublished stories, offers a valuable template for future studies aiming to challenge LLMs with genuinely unseen data. Such collaborations not only enrich the dataset diversity but also ensure a more contextually informed evaluation of LLM performance, a critical step toward models that can genuinely understand and generate human-like narrative content.

Conclusion

In conclusion, "Reading Subtext: Evaluating LLMs on Short Story Summarization with Writers" offers an insightful exploration of the current capabilities and limitations of LLMs on the complex task of narrative summarization. By leveraging expert human judgments and unseen, nuanced narrative texts, the paper gives an instructive picture of the depth of narrative comprehension current LLMs can achieve. As the field progresses, bridging the identified gaps and continuing to refine models' narrative understanding will be crucial for advances in AI-generated narrative content.

Authors (4)
  1. Melanie Subbiah (11 papers)
  2. Sean Zhang (6 papers)
  3. Lydia B. Chilton (26 papers)
  4. Kathleen McKeown (85 papers)