
CS4: Measuring the Creativity of Large Language Models Automatically by Controlling the Number of Story-Writing Constraints (2410.04197v1)

Published 5 Oct 2024 in cs.CL

Abstract: Evaluating the creativity of LLMs in story writing is difficult because LLM-generated stories could seemingly look creative but be very similar to some existing stories in their huge and proprietary training corpus. To overcome this challenge, we introduce a novel benchmark dataset with varying levels of prompt specificity: CS4 (Comparing the Skill of Creating Stories by Controlling the Synthesized Constraint Specificity). By increasing the number of requirements/constraints in the prompt, we can increase the prompt specificity and hinder LLMs from retelling high-quality narratives in their training data. Consequently, CS4 empowers us to indirectly measure the LLMs' creativity without human annotations. Our experiments on LLaMA, Gemma, and Mistral not only highlight the creativity challenges LLMs face when dealing with highly specific prompts but also reveal that different LLMs perform very differently under different numbers of constraints and achieve different balances between the model's instruction-following ability and narrative coherence. Additionally, our experiments on OLMo suggest that Learning from Human Feedback (LHF) can help LLMs select better stories from their training data but has limited influence in boosting LLMs' ability to produce creative stories that are unseen in the training corpora. The benchmark is released at https://github.com/anirudhlakkaraju/cs4_benchmark.

Summary

  • The paper introduces the CS4 benchmark, a framework that quantitatively measures creative storytelling in LLMs by varying prompt constraints.
  • It employs both instruction- and story-based constraint synthesis with metrics like QUC and RCS to evaluate narrative quality and diversity.
  • Experiments demonstrate that higher prompt specificity challenges narrative coherence and highlights limitations in LLM-derived originality despite LHF support.

Assessing Creativity in LLMs: The CS4 Benchmark

The paper "CS4^4: Measuring the Creativity of LLMs Automatically by Controlling the Number of Story-Writing Constraints" introduces an innovative framework for evaluating the creativity of LLMs in the domain of story generation. The paper is predicated on the hypothesis that heightened prompt specificity can obstruct the tendency of LLMs to regurgitate existing narratives from their training data, thereby providing a more accurate measure of their creative capabilities.

Methodology Overview

The authors propose the CS4 benchmark, which stands for "Comparing the Skill of Creating Stories by Controlling the Synthesized Constraint Specificity." The central concept is to introduce constraints within prompts and examine how LLMs compose narratives under these conditions. Constraints range from basic to highly specific, with the number increasing up to 39. By varying constraint levels, the authors aim to indirectly gauge creativity without relying on human annotations, reducing evaluative bias and expense.
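
To make this concrete, the minimal sketch below shows one way prompts of increasing specificity could be assembled. The base task, the constraint wording, and the exact constraint counts are illustrative assumptions, not the paper's templates; the actual prompts and constraints ship with the released benchmark.

```python
# Illustrative sketch of CS4-style prompt construction (assumed template).

BASE_TASK = "Write a short story about a lighthouse keeper."  # hypothetical base task

# Hypothetical constraint pool (the paper synthesizes constraints with GPT-4).
CONSTRAINT_POOL = [
    "The story must take place during a single night.",
    "The protagonist must never speak aloud.",
    "A letter must be delivered but never opened.",
] + [f"Additional illustrative constraint #{i}" for i in range(4, 40)]  # pad to 39

def build_prompt(num_constraints: int) -> str:
    """Assemble a story-writing prompt using the first num_constraints constraints."""
    selected = CONSTRAINT_POOL[:num_constraints]
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(selected))
    return (
        f"{BASE_TASK}\n"
        f"Your story must satisfy all of the following constraints:\n{numbered}"
    )

# Prompts of increasing specificity; more constraints leave less room to
# retell a memorized narrative.
prompts = {k: build_prompt(k) for k in (7, 15, 23, 31, 39)}
```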

Two approaches to constraint synthesis are employed: instruction-based and story-based. Both use GPT-4 to generate constraints, requiring that they remain non-trivial yet satisfiable. For story generation, a base narrative is crafted with GPT-4, and the evaluated LLMs are tasked with adapting this story to meet the constraints. Evaluation uses metrics such as constraint satisfaction, coherence, diversity, and two creativity-specific measures, Quality Under Constraints (QUC) and Relative Creativity Score (RCS).
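
The paper defines QUC and RCS precisely; the sketch below is only a rough illustration of the kind of aggregation involved, under the assumption that each generated story receives a judge-assigned coherence score and a count of satisfied constraints. The record fields, the per-level grouping, and the difference-of-curves computation are assumptions for exposition, not the paper's formulas.

```python
from collections import defaultdict
from statistics import mean

# Each record: a coherence score (e.g., from an LLM judge) and the number of
# prompt constraints the generated story actually satisfied.
# These records and field names are illustrative assumptions.
stories = [
    {"model": "model_A", "coherence": 4.2, "satisfied": 7},
    {"model": "model_A", "coherence": 3.6, "satisfied": 15},
    {"model": "model_B", "coherence": 4.0, "satisfied": 7},
    {"model": "model_B", "coherence": 3.1, "satisfied": 15},
]

def quality_under_constraints(records):
    """Average quality per satisfied-constraint level (a QUC-like summary, not the paper's exact metric)."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["satisfied"]].append(r["coherence"])
    return {k: mean(v) for k, v in sorted(buckets.items())}

def relative_creativity(model_curve, reference_curve):
    """Average gap between a model's quality curve and a reference curve (an RCS-like summary)."""
    common = sorted(set(model_curve) & set(reference_curve))
    return mean(model_curve[k] - reference_curve[k] for k in common)

curve_a = quality_under_constraints([r for r in stories if r["model"] == "model_A"])
curve_b = quality_under_constraints([r for r in stories if r["model"] == "model_B"])
print(relative_creativity(curve_a, curve_b))
```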

Key Findings

Through experiments on LLaMA, Gemma, and Mistral, the paper shows that performance varies considerably across LLMs as constraints are added. Notably:

  • Impact of Specificity: Higher prompt specificity generally poses significant challenges for LLMs, revealing gaps in their narrative originality and coherence.
  • Coherence vs. Constraint Satisfaction: The tension between maintaining narrative coherence and adhering to constraints emerges starkly as constraints increase.
  • Role of Learning from Human Feedback (LHF): While LHF assists in selecting superior narratives from training data, its utility diminishes with more constraints, suggesting a limited role in enhancing genuine creativity.

Implications for AI and Future Directions

This research matters for both theoretical understanding and practical application. By delineating the bounds of LLM-derived creativity, the paper clarifies the capabilities and limitations of current models, which is directly relevant to industries that rely on narrative creation, such as publishing and entertainment, where bespoke and intricate storytelling is paramount.

Looking forward, future developments could explore automating the constraint design process and extending the methodology to other creative domains. Furthermore, the paper indicates a necessity for innovations in LLM training methods that enhance intrinsic creativity rather than memorization capacity.

Conclusion

The CS4 benchmark represents a significant step towards a nuanced understanding of LLMs' creative output. By isolating creativity through constraint manipulation, the paper offers a novel perspective in evaluating AI models, positioning CS4 as a valuable tool for researchers and developers seeking to augment the narrative dexterity of LLMs.
