Benchmarking LLM Creativity: A Case Study on Code Generation
The paper "Benchmarking LLM Creativity: A Case Study on Code Generation" by Yining Lu et al. introduces a novel framework and evaluation methodology for quantifying the creativity of LLMs. By addressing the dual facets of cognitive creativity, namely, convergent and divergent thinking, the paper proposes a tailored approach to eliciting and measuring creative outputs from LLMs, specifically in the context of code generation challenges from Codeforces.
Framework and Methodology
The authors pioneer a dual-component framework to quantify LLM creativity:
- Denial Prompting: A systematic prompting method where the LLM is asked to solve coding problems under progressively restrictive constraints. By iteratively prohibiting the use of specific programming techniques used in the previous solution (e.g., for-loops, if-statements), Denial Prompting coerces the model to explore less conventional solutions, thus stimulating out-of-the-box thinking.
- NeoGauge Metric: A comprehensive creativity score that captures both convergent and divergent thinking. Convergent thinking is assessed by the correctness of the generated solutions and their adherence to the imposed constraints, while divergent thinking is evaluated by comparing the techniques each solution uses against those found in a historical set of human-contributed solutions. A minimal code sketch of how the two components fit together follows this list.
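To make the framework concrete, here is a minimal Python sketch of both components. The technique detector, prompt wording, round count, and the equal weighting of the convergent and divergent terms are illustrative assumptions and do not reproduce the paper's exact implementation.

```python
import ast

# Illustrative sketch of Denial Prompting and a NeoGauge-style score.
# The technique detector, prompt wording, round count, and the equal
# weighting of the two terms are assumptions, not the paper's exact setup.

TECHNIQUE_NODES = {            # assumed mapping from AST node types to "techniques"
    ast.For: "for-loop",
    ast.While: "while-loop",
    ast.If: "if-statement",
    ast.ListComp: "list-comprehension",
    ast.Lambda: "lambda",
}

def detect_techniques(code: str) -> set[str]:
    """Return the set of programming techniques used in a Python solution."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return set()
    return {name for node in ast.walk(tree)
            for node_type, name in TECHNIQUE_NODES.items()
            if isinstance(node, node_type)}

def denial_prompting(generate, statement: str, rounds: int = 5):
    """Re-prompt `generate` (any LLM call) while banning the techniques
    observed in the previous round's solution."""
    banned: set[str] = set()
    history = []
    for _ in range(rounds):
        ban = f" Do not use: {', '.join(sorted(banned))}." if banned else ""
        solution = generate(f"Solve in Python.{ban}\n\n{statement}")
        history.append((solution, frozenset(banned)))
        banned |= detect_techniques(solution)
    return history

def neogauge(history, is_correct, human_techniques: set[str]) -> float:
    """Average a convergent term (correct and constraint-following) with a
    divergent term (fraction of techniques absent from human solutions)."""
    convergent = divergent = 0.0
    for solution, banned in history:
        used = detect_techniques(solution)
        convergent += float(is_correct(solution) and not (used & banned))
        divergent += len(used - human_techniques) / max(1, len(used))
    n = max(1, len(history))
    return 0.5 * convergent / n + 0.5 * divergent / n
```

Here `generate` and `is_correct` are injected callables standing in for the model API and a Codeforces-style judge; the paper's actual technique taxonomy and scoring rule are richer than this toy version.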
Experimentation and Dataset
The framework is employed to examine multiple state-of-the-art LLMs, including proprietary models such as GPT-4, GPT-3.5, and Claude-3, as well as open-source models like Llama3-70B and CodeGemma-7B. The evaluation uses a newly constructed dataset, NeoCoder, which consists of 199 recent Codeforces problems and 30 curated human solutions per problem. This dataset, along with the accompanying constraints formulated through Denial Prompting, serves as a benchmark for future research on machine creativity.
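As a rough illustration of what one benchmark item could look like, the sketch below defines a hypothetical record type; the field names and types are assumptions and may not match the released NeoCoder format.

```python
from dataclasses import dataclass, field

# Hypothetical shape of a single NeoCoder item; field names and types are
# illustrative assumptions, not the benchmark's actual schema.

@dataclass
class NeoCoderProblem:
    problem_id: str                               # Codeforces problem identifier
    statement: str                                # full problem statement
    tests: list[tuple[str, str]]                  # (stdin, expected stdout) pairs
    human_solutions: list[str] = field(default_factory=list)  # ~30 curated per problem
    constraints: list[str] = field(default_factory=list)      # Denial Prompting bans

def human_technique_pool(problem: NeoCoderProblem, detect) -> set[str]:
    """Union of the techniques found in the curated human solutions, where
    `detect` is a code -> set-of-techniques function (e.g. detect_techniques
    from the sketch above); used as the reference set for divergent scoring."""
    pool: set[str] = set()
    for code in problem.human_solutions:
        pool |= detect(code)
    return pool
```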
Key Findings
- Performance Disparity: GPT-4 demonstrated the highest overall creativity (NeoGauge) among the evaluated models, especially in more complex, constraint-heavy scenarios. While smaller models exhibited some ability to generate novel techniques, their solutions often lacked correctness or violated the imposed constraints, underscoring why the metric must weigh both correctness and novelty rather than either alone.
- Human-Like Creativity: Despite advancements, even the most capable LLMs, including GPT-4, fell short of human-level creativity. The paper reports that humans significantly outperform LLMs in convergent thinking, as evidenced by higher correctness and constraint-following ratios in their solutions.
- Divergent Thinking Challenges: The divergent thinking scores of LLMs, although reasonable, suggest that models frequently generate novel but impractical or incorrect solutions. This underscores the need for more robust mechanisms to guide LLMs in producing functional, innovative solutions.
Reasoning Strategies and Creativity
The paper also evaluates several reasoning strategies, including Monte Carlo Tree Search (MCTS), self-correction, planning, and sampling, to understand their impact on creativity; a brief sketch of two of these strategies follows the list:
- MCTS showed the best improvement in divergent creativity but did not enhance overall creativity due to a decline in the correctness of solutions.
- Self-Correction and Planning significantly improved convergent thinking but had negligible or even negative effects on divergent thinking.
- Sampling showed minimal impact on either aspect of creativity, suggesting that drawing additional samples offers little benefit under the evaluated conditions.
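As a rough idea of how two of these strategies can wrap the base generation step, consider the sketch below. The feedback prompt, attempt counts, and selection rule are assumptions rather than the paper's configuration, and `generate`/`is_correct` are the same injected callables as in the earlier sketch.

```python
# Illustrative wrappers for two of the evaluated strategies; the feedback
# prompt, attempt counts, and selection rule are assumptions only.

def with_self_correction(generate, is_correct, prompt: str, attempts: int = 3) -> str:
    """Re-prompt with an error notice until a solution passes the judge."""
    solution = generate(prompt)
    for _ in range(attempts - 1):
        if is_correct(solution):
            break
        solution = generate(
            f"{prompt}\n\nYour previous solution was incorrect:\n{solution}\n"
            "Fix it and return a corrected Python solution."
        )
    return solution

def with_sampling(generate, is_correct, prompt: str, n: int = 5) -> str:
    """Draw up to n independent samples and keep the first one that passes,
    falling back to the last sample if none does."""
    solution = ""
    for _ in range(n):
        solution = generate(prompt)
        if is_correct(solution):
            break
    return solution
```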
Implications and Future Work
The implications of this work are multifaceted, providing both practical and theoretical insights:
- Practical: The NeoGauge metric and the NeoCoder dataset give researchers and developers a reusable benchmark for evaluating LLM creativity, making it easier to identify and address weaknesses in model outputs.
- Theoretical: The detailed analysis of LLMs versus human creativity provides a basis for future research into enhancing machine creativity, potentially guiding the development of new architectures or training methodologies aimed at better mimicking human creative processes.
The paper opens several avenues for further investigation. Future research may focus on exploring more sophisticated reasoning strategies to boost both convergent and divergent thinking, examining the impact of larger and more diverse datasets, or tailoring pre-training objectives specifically for creativity enhancement.
In summary, this paper makes a significant contribution to the field of AI by providing a rigorous and nuanced framework for assessing and improving the creative capabilities of LLMs, particularly in the domain of code generation.