Analysis of "Pron vs Prompt: Can LLMs Already Challenge a World-Class Fiction Author at Creative Text Writing?"
The paper, authored by Guillermo Marco, Julio Gonzalo, Ramón del Castillo, and María Teresa Mateo Girona, presents an intriguing empirical study of the creative abilities of LLMs in comparison with a world-renowned human writer, Patricio Pron. This essay provides an expert analysis of the paper, spotlighting its significant findings and their implications for the field of AI and creative writing.
Research Questions and Methodology
The research aims to answer several critical questions:
- Competitiveness of LLMs with Top Human Authors: Can current LLMs match the creative writing skills of distinguished human authors?
- Impact of Prompts on Creativity: How does the origin of the prompt affect the creativity of the generated text?
- Multilingual Creativity Assessment: Are LLMs less proficient in creative writing in languages other than English?
- Recognizability of LLM-Generated Texts: Do LLMs produce text with a recognizable style that literary experts can identify?
- Operational Validity of Boden's Creativity Framework: Can Boden's creativity framework be effectively used to measure AI-generated text?
To address these questions, the authors designed a contest between GPT-4 Turbo and the acclaimed novelist Patricio Pron. Both were asked to generate 30 movie titles and to write synopses for all 60 titles. Each synopsis was manually assessed by literary critics and scholars, resulting in a robust dataset of 5,400 evaluations based on a rubric grounded in Boden's definition of creativity, covering dimensions such as novelty, surprise, and value.
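The rubric-based evaluation described above amounts to aggregating many expert judgments per author and dimension. The following minimal sketch, using entirely synthetic scores and hypothetical field names (not the paper's actual data or schema), illustrates the kind of aggregation involved:

```python
from statistics import mean

# Synthetic, illustrative ratings: each record is one expert judgment of one
# synopsis on one rubric dimension (hypothetical 1-10 scale).
ratings = [
    {"author": "Pron",  "dimension": "originality",    "score": 9},
    {"author": "Pron",  "dimension": "attractiveness", "score": 8},
    {"author": "GPT-4", "dimension": "originality",    "score": 6},
    {"author": "GPT-4", "dimension": "attractiveness", "score": 7},
    {"author": "Pron",  "dimension": "originality",    "score": 8},
    {"author": "GPT-4", "dimension": "originality",    "score": 5},
]

def mean_scores(records):
    """Average score per (author, dimension) pair."""
    groups = {}
    for r in records:
        groups.setdefault((r["author"], r["dimension"]), []).append(r["score"])
    return {key: mean(scores) for key, scores in groups.items()}

print(mean_scores(ratings))
```

With thousands of such judgments, per-dimension means like these form the basis for the author-versus-author comparisons reported in the paper.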
Findings and Discussions
Competitiveness of LLMs with Top Human Authors
Results from the paper indicate a substantial gap between the creative abilities of GPT-4 Turbo and Patricio Pron. Pron consistently received higher scores across all measured dimensions, indicating that LLMs are not yet at a level where they can compete with world-class writers in creative writing tasks. This aligns with the broader understanding that while LLMs can match average human performance on many tasks, matching top-tier human creativity remains elusive.
Impact of Prompts on Creativity
The paper reveals that prompts significantly influence the creativity of LLM-generated content. Synopses generated by GPT-4 for titles provided by Pron received higher ratings than those generated for its own titles. This highlights the pivotal role of human-provided prompts in enhancing the creative outputs of LLMs and suggests that human-AI collaborative writing might yield more creative results than fully autonomous AI writing.
Multilingual Creativity Assessment
The performance of GPT-4 was notably better in English than in Spanish, which suggests that the model's vast training on predominantly English-language data results in superior creative writing capabilities in English. The authors used the Wilcoxon signed-rank test to compare the differences, finding significant disparities in style attractiveness, originality, and overall creativity.
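The Wilcoxon signed-rank test used here is a paired, non-parametric test, appropriate when the same items are rated under two conditions and the ratings need not be normally distributed. A minimal sketch with simulated data (not the paper's ratings) shows how such a comparison works in practice:

```python
import random
from scipy.stats import wilcoxon

# Simulated paired ratings (1-10 scale) of the same synopses in two languages.
# The English/Spanish gap here is invented purely for illustration.
random.seed(0)
english = [random.gauss(6.5, 1.0) for _ in range(30)]
spanish = [e - random.gauss(0.8, 0.5) for e in english]  # simulate lower Spanish scores

# Each synopsis contributes one paired difference; the test ranks the
# absolute differences and checks whether their signs are balanced.
stat, p = wilcoxon(english, spanish)
print(f"W = {stat:.1f}, p = {p:.4g}")
```

A small p-value, as the paper reports for style attractiveness, originality, and overall creativity, indicates the paired differences are systematically in one direction rather than noise.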
Recognizability of LLM-Generated Texts
Expert evaluators were increasingly able to distinguish between AI-generated and human-written texts as the evaluation progressed, suggesting that GPT-4 produces text with consistent, recognizable patterns. This insight is critical for understanding the limitations of current LLMs in producing truly indistinguishable human-like creative writing.
Operational Validity of Boden's Creativity Framework
The paper effectively operationalized Boden's framework to evaluate creativity, and statistical analysis confirmed a strong correlation between attributed creativity and the dimensions of originality and attractiveness. Mixed-effects models showed that originality had a stronger effect on perceived creativity than attractiveness did, validating the application of Boden's principles to the evaluation of AI-generated creative text.
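A mixed-effects model of this kind treats each expert judge as a random-effects grouping factor (so per-judge leniency is absorbed as a random intercept) while estimating fixed effects for the rubric dimensions. The sketch below uses fully synthetic data and hypothetical variable names, not the paper's dataset or exact model specification:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: each row is one judge's ratings of one synopsis.
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "judge": rng.integers(0, 6, n),          # 6 hypothetical expert judges
    "originality": rng.uniform(1, 10, n),
    "attractiveness": rng.uniform(1, 10, n),
})
# Construct creativity so originality weighs more than attractiveness,
# with per-judge leniency and noise added on top.
judge_bias = rng.normal(0, 0.5, 6)
df["creativity"] = (0.6 * df["originality"]
                    + 0.3 * df["attractiveness"]
                    + judge_bias[df["judge"]]
                    + rng.normal(0, 1.0, n))

# Random intercept per judge, fixed effects for the two rubric dimensions.
model = smf.mixedlm("creativity ~ originality + attractiveness",
                    df, groups=df["judge"])
result = model.fit()
print(result.fe_params)
```

On data built this way, the fitted fixed-effect coefficient for originality exceeds that for attractiveness, mirroring the pattern the paper reports for perceived creativity.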
Implications and Future Directions
The findings of this paper have significant theoretical and practical implications:
- Theoretical Implications: The substantial gap in creativity between LLMs and top human writers suggests limitations inherent in the probabilistic nature of LLM text generation. This points to a need for a more nuanced approach that incorporates elements of human creativity beyond data-driven patterns.
- Practical Implications: For practical applications, these results advocate for human-AI collaboration in creative industries, where AI-generated content could benefit from human refinement in terms of prompting and editing to enhance creativity and originality.
Future research could benefit from exploring other models and architectures and expanding the scope of creative writing tasks beyond synopses to include more extensive and varied literary forms. Moreover, assessing the perspectives of general readers as well as literary experts could provide a more holistic understanding of AI's role in creative writing.
Conclusion
The paper provides comprehensive evidence that current state-of-the-art LLMs, like GPT-4 Turbo, fall short of competing with top human authors in creative writing tasks. The influence of prompts and the language bias in LLM training data are critical factors affecting AI creativity. This paper is a foundational step in understanding the dynamics of human-AI interaction in creative fields and sets the stage for future research to bridge the observed gaps.