Art or Artifice? Large Language Models and the False Promise of Creativity (2309.14556v3)

Published 25 Sep 2023 in cs.CL, cs.AI, and cs.HC

Abstract: Researchers have argued that LLMs exhibit high-quality writing capabilities, from blogs to stories. However, objectively evaluating the creativity of a piece of writing is challenging. Inspired by the Torrance Test of Creative Thinking (TTCT), which measures creativity as a process, we use the Consensual Assessment Technique [3] and propose the Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product. TTCW consists of 14 binary tests organized into the original dimensions of Fluency, Flexibility, Originality, and Elaboration. We recruit 10 creative writers and implement a human assessment of 48 stories written either by professional authors or LLMs using TTCW. Our analysis shows that LLM-generated stories pass 3-10X fewer TTCW tests than stories written by professionals. In addition, we explore the use of LLMs as assessors to automate the TTCW evaluation, revealing that none of the LLMs positively correlate with the expert assessments.

This paper, titled "Art or Artifice? LLMs and the False Promise of Creativity," by Chakrabarty et al., investigates the creative writing capabilities of LLMs. It introduces a new evaluation framework called the Torrance Test of Creative Writing (TTCW), inspired by the Torrance Tests of Creative Thinking (TTCT), a well-established method for evaluating creativity as a process. The TTCW, however, evaluates creativity as a product, specifically short stories.

Here's a breakdown of the key aspects:

  1. Motivation: While LLMs have shown impressive writing abilities, objectively evaluating the creativity of that writing is difficult. Existing research often focuses on fluency and coherence, but not necessarily the core aspects of creative writing. This paper aims to address this gap.
  2. TTCW Framework: The TTCW is based on four core dimensions of creativity from the original TTCT:
    • Fluency: The quantity and flow of ideas (e.g., narrative pacing, coherence, use of literary devices). The TTCW includes five tests for Fluency.
    • Flexibility: The ability to shift perspectives and consider different viewpoints (e.g., perspective/voice flexibility, emotional flexibility, structural flexibility). The TTCW has three tests for Flexibility.
    • Originality: The novelty and uniqueness of ideas (e.g., originality in theme/content, thought, and form). The TTCW defines three tests for Originality.
    • Elaboration: The depth and detail provided (e.g., world-building, character development, rhetorical complexity). The TTCW uses three tests for Elaboration.

    The TTCW consists of 14 binary tests, each aligned with one of these dimensions. Each test is answered "Yes" (the story passes) or "No" (it fails), accompanied by a written justification.

  3. Formative Study (Developing the TTCW): The authors recruited eight creative writing experts (professors, MFA students, published authors, screenwriters) to propose measures for evaluating short stories aligned with the four Torrance dimensions. This resulted in 126 initial measures, which were then consolidated into the 14 TTCW tests using a qualitative inductive approach, with input from a novelist and creative writing professor.

  4. Design Principles: The TTCW is designed around four key principles:

    • Leveraging Torrance Test Metrics: Grounded in the four dimensions of the TTCT.
    • Artifact-centric Testing: Focuses on the final written product (the story) rather than the writing process.
    • Binary Questions with Open-Ended Rationales: Uses Yes/No questions for quantitative analysis, paired with justifications for qualitative insights.
    • Additive Nature of Tests: Creativity is assessed by the number of tests passed, not by any single test; all 14 tests should be considered together (a minimal scoring sketch appears after this list).
  5. Experimental Validation (Implementation with Experts as Assessors):
    • Data: The authors created a dataset of 48 short stories: 12 from The New Yorker (written by professional authors) and 36 generated by LLMs (GPT-3.5, GPT-4, and Claude 1.3). The LLM-generated stories were based on one-sentence plot summaries of the New Yorker stories, ensuring similar length and plot, to isolate the evaluation of creative writing from plot originality.
    • Participants: A new group of 10 creative writing experts (distinct from the formative-study group) was recruited to evaluate the stories.
    • Protocol: Each expert evaluated groups of four stories (one New Yorker story and three LLM-generated stories, anonymized and shuffled). They administered the 14 TTCW tests for each story, providing Yes/No answers and justifications. They also ranked the stories by preference and guessed the author (experienced writer, amateur writer, or AI). Each story group was evaluated by three different experts.
    • Research Questions and Results:
      • RQ1 (Pass Rates): New Yorker stories passed significantly more TTCW tests (84.7% on average) than LLM-generated stories (8.7% for GPT-3.5, 27.9% for GPT-4, and 30.0% for Claude 1.3). This indicates a substantial gap in evaluated creativity.
      • RQ2 (Reproducibility): Experts showed moderate agreement on individual tests (Fleiss' kappa 0.41) but strong agreement on the aggregate score (Pearson correlation 0.69), supporting the additive nature of the tests (see the agreement sketch after this list).
      • RQ3 (LLM Performance): Claude 1.3 performed slightly better than GPT-4 and GPT-3.5 overall, particularly in Fluency, Flexibility, and Elaboration. GPT-4 performed best on Originality.
  6. Analysis of Expert Explanations: The authors analyzed the justifications provided by the experts to identify common themes for passing and failing each test, yielding qualitative insight into why stories succeeded or failed. For example, failure on the Originality in Thought test was frequently attributed to the use of clichés.
  7. Implementation with LLMs as Assessors: The authors tested whether LLMs (GPT-3.5, GPT-4, and Claude 1.3) could administer the TTCW tests themselves. They provided the LLMs with the stories and expanded versions of the test questions, prompting for chain-of-thought reasoning (a sketch of this assessor loop appears after this list). The results showed no significant correlation between LLM and expert assessments (Cohen's kappa close to zero), suggesting that LLMs are currently not capable of reliably evaluating creative writing using the TTCW.
  8. How Experts Differentiated Human vs. AI Stories: The experts' responses showed that their decisions were rooted in creative nuances, not superficial markers. They noted AI tendencies such as weak narrative endings (forestalled or abruptly widening in scope), abstruse or clichéd metaphors (signaling poor language proficiency), lack of subtext (weak rhetorical complexity), underdeveloped or inconsistent characters, unusual syntax, and repetition.
  9. Discussion: The authors discuss the implications of their findings, including the potential use of the TTCW in future interactive writing-support tools, the study's limitations (e.g., potential biases in the expert pool and the focus on short fiction), and the challenge of defining "expert" and "amateur" writers. They also reflect on their own use of LLMs as a research tool.
  10. Contributions: The main contributions are:
    • The development of the TTCW, a novel evaluation framework for creative writing, grounded in established creativity research.
    • Empirical validation of the TTCW, demonstrating its consistency and reproducibility.
    • A comparative analysis of human-written and LLM-generated stories, revealing a significant creativity gap.
    • An investigation into LLMs' ability to assess creativity, finding them currently inadequate.
    • Release of the annotated dataset of 2,000+ TTCW assessments with expert justifications.

In essence, the paper presents a rigorous framework for evaluating the creative aspects of writing, demonstrates a clear gap between human and LLM capabilities in this area, and shows that LLMs are not yet capable of reliably assessing creative writing, even when provided with a structured framework like the TTCW.

References (73)
  1. Muhammad M Mahmoud Abdel Latif. 2013. What do we mean by writing fluency and how can it be validly measured? Applied linguistics 34, 1 (2013), 99–105.
  2. Joan Acocella. 2012. On Bad Endings. The New Yorker (2012). https://www.newyorker.com/books/page-turner/on-bad-endings
  3. Teresa M Amabile. 1982. Social psychology of creativity: A consensual assessment technique. Journal of personality and social psychology 43, 5 (1982), 997.
  4. Anthropic. 2022. Introducing Claude. (2022). https://www.anthropic.com/index/introducing-claude
  5. John Baer. 2014. Creativity and divergent thinking: A task-specific approach. Psychology Press.
  6. John Baer and Sharon S McKool. 2009. Assessing creativity using the consensual assessment technique. In Handbook of research on assessment technologies, methods, and applications in higher education. IGI Global, 65–77.
  7. Roger E Beaty and Dan R Johnson. 2021. Automating creativity assessment with SemDis: An open platform for computing semantic distance. Behavior research methods 53, 2 (2021), 757–780.
  8. Soylent: a word processor with a crowd inside. In Proceedings of the 23rd annual ACM symposium on User interface software and technology. 313–322.
  9. John B Biggs and Kevin F Collis. 1982. The psychological structure of creative writing. Australian Journal of Education 26, 1 (1982), 59–70.
  10. Michael M Boardman. 1992. Narrative Innovation and Incoherence: Ideology in Defoe, Goldsmith, Austen, Eliot, and Hemingway. Duke University Press.
  11. When Design Novices and LEGO® Meet: Stimulating Creative Thinking for Interface Design. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–14. https://doi.org/10.1145/3313831.3376495
  12. Writing fiction: A guide to narrative craft. University of Chicago Press.
  13. Rüdiger Campe and Julia Weber. 2014. Rethinking Emotion: Interiority and Exteriority in Premodern, Modern, and Contemporary Thought. Vol. 15. Walter de Gruyter GmbH & Co KG.
  14. How is ChatGPT’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023).
  15. Ambient Adventures: Teaching ChatGPT on Developing Complex Stories. arXiv preprint arXiv:2308.01734 (2023).
  16. The Intersection of Users, Roles, Interactions, and Technologies in Creativity Support Tools. In Proceedings of the 2021 ACM Designing Interactive Systems Conference (Virtual Event, USA) (DIS ’21). Association for Computing Machinery, New York, NY, USA, 1817–1833. https://doi.org/10.1145/3461778.3462050
  17. All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 7282–7296. https://doi.org/10.18653/v1/2021.acl-long.565
  18. Roy Peter Clark. 2008. Writing tools: 55 essential strategies for every writer. Little, Brown Spark.
  19. Gregory Currie. 1990. The nature of fiction. Cambridge University Press.
  20. Mark Doty. 2014. The art of description: World into word. Graywolf Press.
  21. David Fishelov. 1990. Types of character, characteristics of types. Style (1990), 422–439.
  22. Linda Flower and John R Hayes. 1981. A cognitive process theory of writing. College composition and communication 32, 4 (1981), 365–387.
  23. Edward Morgan Forster. 1927. Aspects of the Novel. Harcourt, Brace.
  24. Nigel Fountain. 2012. Clichés: Avoid them like the plague. Michael O’Mara Books.
  25. Norman Friedman. 1955. Point of view in fiction: the development of a critical concept. PMLA 70, 5 (1955), 1160–1184.
  26. Human-like summarization evaluation with ChatGPT. arXiv preprint arXiv:2304.02554 (2023).
  27. Social Dynamics of AI Support in Creative Writing. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 245, 15 pages. https://doi.org/10.1145/3544548.3580782
  28. Content Planning for Neural Story Generation with Aristotelian Rescoring. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 4319–4338. https://doi.org/10.18653/v1/2020.emnlp-main.351
  29. Joy Paul Guilford. 1967. The nature of human intelligence.
  30. Norman Norwood Holland. 2009. Literature and the Brain. PsyArt Foundation.
  31. Creative writing with an AI-powered writing assistant: Perspectives from professional writers. arXiv preprint arXiv:2211.05030 (2022).
  32. Fredric Jameson. 1991. Postmodernism, or, the cultural logic of late capitalism. Duke University Press.
  33. The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 1265–1285. https://doi.org/10.18653/v1/2021.emnlp-main.97
  34. Essentials of creativity assessment. John Wiley & Sons.
  35. The future of crowd work. In Proceedings of the 2013 conference on Computer supported cooperative work. 1301–1318.
  36. Maria Kochis. 2007. Baxter, Charles. The Art of Subtext: Beyond Plot. Library Journal 132, 14 (2007), 135–136.
  37. LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond. arXiv preprint arXiv:2305.14540 (2023).
  38. CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 388, 19 pages. https://doi.org/10.1145/3491102.3502030
  39. GPTEval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634 (2023).
  40. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651 (2023).
  41. Michael S. Matell and Jacob Jacoby. 1971. Is There an Optimal Number of Alternatives for Likert Scale Items? Study I: Reliability and Validity. Educational and Psychological Measurement 31, 3 (1971), 657–674. https://doi.org/10.1177/001316447103100307
  42. Tim Mayers. 2007. (Re)Writing craft: composition, creative writing, and the future of English studies. University of Pittsburgh Press.
  43. Individual characteristics and creativity in the marketing classroom: Exploratory insights. Journal of Marketing Education 25, 2 (2003), 143–149.
  44. Co-Writing Screenplays and Theatre Scripts with Language Models: Evaluation by Industry Professionals. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 355, 34 pages. https://doi.org/10.1145/3544548.3581225
  45. Donald M Murray. 2012. The craft of revision. Cengage Learning.
  46. WearWrite: Crowd-assisted writing from smartwatches. In Proceedings of the 2016 CHI conference on human factors in computing systems. 3834–3846.
  47. Collaborative Storytelling with Large-Scale Neural Language Models. In Proceedings of the 13th ACM SIGGRAPH Conference on Motion, Interaction and Games (Virtual Event, SC, USA) (MIG ’20). Association for Computing Machinery, New York, NY, USA, Article 17, 10 pages. https://doi.org/10.1145/3424636.3426903
  48. Martha Nussbaum. 1997. Poetic justice: The literary imagination and public life. Beacon Press.
  49. OpenAI. 2022. ChatGPT: Optimizing language models for dialogue. (2022). https://openai.com/blog/chatgpt/
  50. OpenAI. 2023. GPT-4 Technical Report. ArXiv abs/2303.08774 (2023).
  51. James Phelan. 1996. Narrative as rhetoric: Technique, audiences, ethics, ideology. Ohio State University Press.
  52. Assessment of creativity. The Cambridge handbook of creativity (2010), 48–73.
  53. Can foundation models label data like humans? Hugging Face Blog (2023). https://huggingface.co/blog/llm-leaderboard.
  54. PlotMachines: Outline-Conditioned Generation with Dynamic Plot State Tracking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 4274–4295. https://doi.org/10.18653/v1/2020.emnlp-main.349
  55. Alicia Rodríguez. 2008. The ‘problem’ of creative writing: using grading rubrics based on narrative theory as solution. New Writing 5, 3 (2008), 167–177.
  56. Melissa Roemmele and Andrew Gordon. 2018a. Linguistic Features of Helpfulness in Automated Support for Creative Writing. In Proceedings of the First Workshop on Storytelling. Association for Computational Linguistics, New Orleans, Louisiana, 14–19. https://doi.org/10.18653/v1/W18-1502
  57. Melissa Roemmele and Andrew S Gordon. 2018b. Automated assistance for creative writing with an rnn language model. In Proceedings of the 23rd international conference on intelligent user interfaces companion. 1–2.
  58. Assessing creativity with divergent thinking tasks: exploring the reliability and validity of new subjective scoring methods. Psychology of Aesthetics, Creativity, and the Arts 2, 2 (2008), 68.
  59. Learning to summarize with human feedback. Advances in Neural Information Processing Systems 33 (2020), 3008–3021.
  60. David R Thomas. 2006. A general inductive approach for analyzing qualitative evaluation data. American journal of evaluation 27, 2 (2006), 237–246.
  61. Ellis Paul Torrance. 1966. Torrance tests of creative thinking: Norms-technical manual: Verbal tests, forms a and b: Figural tests, forms a and b. Personnel Press, Incorporated.
  62. Development of Torrance test creativity thinking (TTCT) instrument in science learning. In AIP Conference Proceedings, Vol. 2194. AIP Publishing.
  63. Maryam Vaezi and Saeed Rezaei. 2019. Development of a rubric for evaluating creative writing: a multi-phase research. New Writing 16, 3 (2019), 303–317.
  64. Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks. arXiv preprint arXiv:2306.07899 (2023).
  65. Large Language Models Enable Few-Shot Clustering. arXiv preprint arXiv:2307.00524 (2023).
  66. Creative cognition. Handbook of creativity (1999), 189–212.
  67. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
  68. Sara Cushing Weigle. 2002. Assessing writing. Cambridge University Press.
  69. DOC: Improving Long Story Coherence With Detailed Outline Control. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 3378–3465. https://doi.org/10.18653/v1/2023.acl-long.190
  70. Re3: Generating Longer Stories With Recursive Reprompting and Revision. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 4393–4479. https://doi.org/10.18653/v1/2022.emnlp-main.296
  71. Plan-and-Write: Towards Better Automatic Storytelling. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence (Honolulu, Hawaii, USA) (AAAI’19/IAAI’19/EAAI’19). AAAI Press, Article 906, 8 pages. https://doi.org/10.1609/aaai.v33i01.33017378
  72. Wordcraft: story writing with large language models. In 27th International Conference on Intelligent User Interfaces. 841–852.
  73. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910 (2022).
Authors (5)
  1. Tuhin Chakrabarty
  2. Philippe Laban
  3. Divyansh Agarwal
  4. Smaranda Muresan
  5. Chien-Sheng Wu