This paper, titled "Art or Artifice? LLMs and the False Promise of Creativity," by Chakrabarty et al., investigates the creative writing capabilities of LLMs. It introduces a new evaluation framework called the Torrance Test of Creative Writing (TTCW), inspired by the Torrance Tests of Creative Thinking (TTCT), a well-established method for evaluating creativity as a process. The TTCW, however, evaluates creativity as a product, specifically short stories.
Here's a breakdown of the key aspects:
- Motivation: While LLMs have shown impressive writing abilities, objectively evaluating the creativity of that writing is difficult. Existing research often focuses on fluency and coherence, but not necessarily the core aspects of creative writing. This paper aims to address this gap.
- TTCW Framework: The TTCW is based on four core dimensions of creativity from the original TTCT:
- Fluency: The quantity and flow of ideas (e.g., narrative pacing, coherence, use of literary devices). The TTCW includes five tests for Fluency.
- Flexibility: The ability to shift perspectives and consider different viewpoints (e.g., perspective/voice flexibility, emotional flexibility, structural flexibility). The TTCW has three tests for Flexibility.
- Originality: The novelty and uniqueness of ideas (e.g., originality in theme/content, thought, and form). The TTCW defines three tests for Originality.
- Elaboration: The depth and detail provided (e.g., world-building, character development, rhetorical complexity). The TTCW uses three tests for Elaboration.
The TTCW consists of 14 binary tests, each aligned with one of these dimensions. For every test, an assessor answers "Yes" (the story passes) or "No" (it fails) and provides a written justification.
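To make the structure concrete, here is a minimal sketch (in Python) of how the 14 tests and a single expert verdict could be represented. The test identifiers are paraphrases of the paper's categories, not its exact test wording.

```python
from dataclasses import dataclass

# Illustrative grouping of the 14 TTCW tests by Torrance dimension.
# Identifiers are paraphrased; see the paper for the exact test questions.
TTCW_TESTS = {
    "Fluency": [
        "narrative_ending", "narrative_pacing", "scene_vs_exposition",
        "literary_devices", "understandability_coherence",
    ],
    "Flexibility": [
        "perspective_and_voice", "emotional_flexibility", "structural_flexibility",
    ],
    "Originality": [
        "originality_in_theme", "originality_in_thought", "originality_in_form",
    ],
    "Elaboration": [
        "world_building", "character_development", "rhetorical_complexity",
    ],
}
assert sum(len(tests) for tests in TTCW_TESTS.values()) == 14

@dataclass
class TTCWAnswer:
    """One assessor's verdict on one test for one story."""
    story_id: str
    test_id: str
    passes: bool        # True = "Yes" (passes), False = "No" (fails)
    justification: str  # the accompanying written rationale
```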
Formative Study (Developing the TTCW): The authors recruited eight creative writing experts (professors, MFA students, published authors, screenwriters) to propose measures for evaluating short stories aligned with the four Torrance dimensions. This resulted in 126 initial measures, which were then consolidated into the 14 TTCW tests using a qualitative inductive approach, with input from a novelist and creative writing professor.
Design Principles: The TTCW is designed around four key principles:
- Leveraging Torrance Test Metrics: Grounded in the four dimensions of the TTCT.
- Artifact-centric Testing: Focuses on the final written product (the story) rather than the writing process.
- Binary Questions with Open-Ended Rationales: Uses Yes/No questions for quantitative analysis, paired with justifications for qualitative insights.
- Additive Nature of Tests: Creativity is assessed by the number of tests passed, not by any single test. All 14 tests should be considered together.
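Following the additive principle, a story's TTCW measure is the count (or fraction) of the administered tests it passes, rather than the outcome of any single test. A minimal scoring sketch:

```python
from typing import Iterable

def ttcw_score(verdicts: Iterable[bool]) -> float:
    """Additive TTCW measure: fraction of administered tests the story passes."""
    verdicts = list(verdicts)
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# Example: a story judged to pass 10 of the 14 tests scores ~0.71.
print(ttcw_score([True] * 10 + [False] * 4))
```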
- Experimental Validation (Implementation with Experts as Assessors):
- Data: The authors created a dataset of 48 short stories: 12 from The New Yorker (written by professional authors) and 36 generated by LLMs (GPT-3.5, GPT-4, and Claude 1.3). The LLM-generated stories were conditioned on one-sentence plot summaries of the New Yorker stories and matched them in length and plot, isolating the evaluation of creative writing from plot originality (a prompting sketch follows this list).
- Participants: A new group of 10 creative writing experts (different from the formative study group) was recruited to evaluate the stories.
- Protocol: Each expert evaluated groups of four stories (one New Yorker story and three LLM-generated stories, anonymized and shuffled). They administered the 14 TTCW tests for each story, providing Yes/No answers and justifications. They also ranked the stories by preference and guessed the author (experienced writer, amateur writer, or AI). Each story group was evaluated by three different experts.
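As an illustration of the generation setup, the sketch below prompts an LLM for a length-matched story from a one-sentence plot summary. It uses the OpenAI chat API as an example; the prompt wording and generation parameters are assumptions, not the paper's exact setup.

```python
from openai import OpenAI  # assumes the openai Python package and an API key

client = OpenAI()

def generate_story(plot_summary: str, target_words: int, model: str = "gpt-4") -> str:
    """Hypothetical sketch: ask an LLM for a story matching a given plot and length."""
    prompt = (
        f"Write a short story of roughly {target_words} words based on the "
        f"following one-sentence plot summary:\n{plot_summary}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```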
- Research Questions and Results:
- RQ1 (Pass Rates): New Yorker stories passed significantly more TTCW tests (84.7% on average) than LLM-generated stories (8.7% for GPT-3.5, 27.9% for GPT-4, and 30.0% for Claude 1.3). This indicates a substantial gap in evaluated creativity.
- RQ2 (Reproducibility): Experts showed moderate agreement on individual tests (Fleiss kappa of 0.41) but strong agreement on the aggregate score (Pearson correlation of 0.69), supporting the additive nature of the tests (an agreement-analysis sketch follows this list).
- RQ3 (LLM Performance): Claude 1.3 performed slightly better than GPT-4 and GPT-3.5 overall, particularly in Fluency, Flexibility, and Elaboration. GPT-4 performed best on Originality.
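For readers who want to run this kind of agreement analysis themselves (e.g., on the released dataset), here is a minimal sketch with toy labels, using statsmodels for Fleiss' kappa and scipy for the Pearson correlation; it is not the paper's analysis code.

```python
import numpy as np
from scipy.stats import pearsonr
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Item-level agreement: one row per (story, test) pair, one column per expert, values 0/1.
# Toy data for illustration only.
ratings = np.array([
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
])
table, _ = aggregate_raters(ratings)  # per-item counts for each category (No/Yes)
kappa = fleiss_kappa(table)

# Aggregate-level agreement: correlate per-story totals between two experts.
scores_expert_a = np.array([10, 3, 12, 5])  # number of tests judged passed
scores_expert_b = np.array([11, 2, 13, 6])
r, _ = pearsonr(scores_expert_a, scores_expert_b)

print(f"Fleiss kappa = {kappa:.2f}, Pearson r = {r:.2f}")
```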
- Analysis of Expert Explanations: The authors analyzed the justifications provided by the experts to identify common themes behind passing and failing each test, yielding qualitative insight into why stories succeeded or failed. For example, failures on the Originality in Thought test were often attributed to the use of clichés.
- Implementation with LLMs as Assessors: The authors tested whether LLMs (GPT-3.5, GPT-4, and Claude) could administer the TTCW tests themselves. They provided the LLMs with the stories and expanded versions of the test questions, prompting for chain-of-thought reasoning. The results showed no significant correlation between LLM assessments and expert assessments (Cohen's Kappa close to zero). This suggests LLMs are currently not capable of reliably evaluating creative writing using the TTCW.
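A hedged sketch of this setup: an illustrative chain-of-thought prompt for administering a single TTCW test, and agreement with expert verdicts computed via scikit-learn's cohen_kappa_score on toy labels. The prompt wording is an assumption, not the paper's.

```python
from sklearn.metrics import cohen_kappa_score

def ttcw_prompt(story: str, expanded_test_question: str) -> str:
    """Illustrative prompt for having an LLM administer one TTCW test."""
    return (
        f"{expanded_test_question}\n\nStory:\n{story}\n\n"
        "Reason step by step, then answer with a final 'Yes' or 'No'."
    )

# Toy per-(story, test) verdicts after parsing the LLM's final answers (1 = Yes, 0 = No).
expert_labels = [1, 0, 1, 1, 0, 0, 1, 0]
llm_labels    = [1, 1, 0, 1, 1, 0, 0, 1]
print(f"Cohen's kappa: {cohen_kappa_score(expert_labels, llm_labels):.2f}")
```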
- How Experts Differentiated Human vs. AI Stories: The expert responses showed their decisions were rooted in creative nuances rather than superficial markers. They noted AI tendencies such as weak narrative endings (e.g., telegraphed or abruptly widening in scope), clichéd or abstruse metaphors (weak language proficiency), a lack of subtext (limited rhetorical complexity), underdeveloped or inconsistent characters, unusual syntax, and repetition.
- Discussion: The authors discuss the implications of their findings, including the potential use of TTCW in future interactive writing support tools, the limitations of the paper (e.g., potential biases in the expert pool, focus on short fiction), and the challenges of defining "expert" and "amateur" writers. They also reflect on the use of LLMs as a research tool in their own work.
- Contributions: The main contributions are:
- The development of the TTCW, a novel evaluation framework for creative writing, grounded in established creativity research.
- Empirical validation of the TTCW, demonstrating its consistency and reproducibility.
- A comparative analysis of human-written and LLM-generated stories, revealing a significant creativity gap.
- An investigation into LLMs' ability to assess creativity, finding them currently inadequate.
- Release of the annotated dataset of 2,000+ TTCW assessments with expert justifications.
In essence, the paper presents a rigorous framework for evaluating the creative aspects of writing, demonstrates a clear gap between human and LLM capabilities in this area, and shows that LLMs are not yet capable of reliably assessing creative writing, even when provided with a structured framework like the TTCW.