Benchmarking Language Model Creativity: A Case Study on Code Generation

Published 12 Jul 2024 in cs.CL (arXiv:2407.09007v2)

Abstract: As LLMs become increasingly prevalent, it is interesting to consider how "creative" these models can be. From cognitive science, creativity consists of at least two key characteristics: convergent thinking (purposefulness to achieve a given goal) and divergent thinking (adaptability to explore new environments or constraints) (Runco, 2003). In this work, we introduce a framework for quantifying LLM creativity that incorporates two design ingredients: (1) We introduce Denial Prompting, which pushes LLMs to develop more creative solutions to a given problem by incrementally imposing new constraints on the previous solution, compelling LLMs to adopt new strategies. (2) We define NeoGauge, a metric that quantifies both convergent and divergent thinking in the creative responses generated by LLMs. We test the proposed framework on Codeforces problems, which serve as both a natural dataset for coding tasks and a collection of prior human solutions. We compute NeoGauge for various proprietary and open-source models and find that even the most creative model, GPT-4, still falls short of demonstrating human-like creativity. We also experiment with advanced reasoning strategies (MCTS, self-correction, etc.) and observe no significant improvement in creativity. As a by-product of our analysis, we release the NeoCoder dataset for reproducing our results on future models.

Summary

  • The paper introduces a dual-component framework using Denial Prompting and the NeoGauge metric to measure LLM creativity in code generation.
  • The paper shows that GPT-4 outperforms other models but still falls short of human ingenuity in convergent thinking.
  • The paper analyzes various reasoning strategies, highlighting trade-offs between divergent creativity and solution correctness.

Benchmarking LLM Creativity: A Case Study on Code Generation

The paper "Benchmarking LLM Creativity: A Case Study on Code Generation" by Yining Lu et al. introduces a novel framework and evaluation methodology for quantifying the creativity of LLMs. By addressing the dual facets of cognitive creativity, namely, convergent and divergent thinking, the study proposes a tailored approach to eliciting and measuring creative outputs from LLMs, specifically in the context of code generation challenges from Codeforces.

Framework and Methodology

The authors pioneer a dual-component framework to quantify LLM creativity:

  1. Denial Prompting: A systematic prompting method where the LLM is asked to solve coding problems under progressively restrictive constraints. By iteratively prohibiting the use of specific programming techniques used in the previous solution (e.g., for-loops, if-statements), Denial Prompting coerces the model to explore less conventional solutions, thus stimulating out-of-the-box thinking (a minimal sketch of this loop follows the list).
  2. NeoGauge Metric: A comprehensive creativity score that evaluates both convergent and divergent thinking. Convergent thinking is assessed based on the correctness and adherence to constraints of the generated solutions, while divergent thinking is evaluated by comparing the novelty of the techniques used in the solutions against a historical set of human-contributed solutions (an illustrative scoring sketch also follows the list).
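
To make item 1 concrete, here is a minimal sketch of a Denial Prompting loop. It is illustrative rather than the authors' released code: `query_model` is a hypothetical stand-in for whatever LLM is under test, and the AST-based technique detector covers only a handful of constructs, whereas the paper works with a richer constraint taxonomy.

```python
import ast


def query_model(prompt: str) -> str:
    """Hypothetical stand-in: send the prompt to the LLM under test, return its code answer."""
    raise NotImplementedError


def extract_techniques(code: str) -> set[str]:
    """Toy technique detector over the Python AST (the paper uses a richer taxonomy)."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.For):
            found.add("for-loops")
        elif isinstance(node, ast.While):
            found.add("while-loops")
        elif isinstance(node, ast.If):
            found.add("if-statements")
    return found


def denial_prompting(problem: str, n_rounds: int = 5) -> list[dict]:
    """Re-solve `problem` repeatedly, banning techniques observed in earlier solutions."""
    banned: list[str] = []
    solutions: list[dict] = []
    for _ in range(n_rounds):
        active = list(banned)
        prompt = problem
        if active:
            prompt += "\nYou must NOT use: " + ", ".join(active)
        code = query_model(prompt)
        solutions.append({"code": code, "constraints": active})
        # Techniques seen in this solution become constraints for the next round.
        banned += [t for t in sorted(extract_techniques(code)) if t not in banned]
    return solutions
```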
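
Likewise for item 2, the sketch below only mirrors the verbal description of NeoGauge rather than the paper's exact formula: convergent thinking is scored as the fraction of solutions that pass the tests while obeying their active constraints, and divergent thinking as the fraction that use at least one technique absent from the pool of human solutions. `passes_tests` and `violates_constraints` are assumed helpers, `extract_techniques` is the toy detector from the previous sketch, and how the two components are aggregated into a single NeoGauge score should be taken from the paper itself.

```python
from typing import Callable


def convergent_score(
    solutions: list[dict],
    passes_tests: Callable[[str], bool],
    violates_constraints: Callable[[str, list[str]], bool],
) -> float:
    """Fraction of solutions that are both correct and obey their active constraints."""
    if not solutions:
        return 0.0
    ok = sum(
        1 for s in solutions
        if passes_tests(s["code"]) and not violates_constraints(s["code"], s["constraints"])
    )
    return ok / len(solutions)


def divergent_score(solutions: list[dict], human_techniques: set[str]) -> float:
    """Fraction of solutions using at least one technique unseen in the human solution pool."""
    if not solutions:
        return 0.0
    novel = sum(
        1 for s in solutions
        if extract_techniques(s["code"]) - human_techniques
    )
    return novel / len(solutions)
```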

Experimentation and Dataset

The framework is employed to examine multiple state-of-the-art LLMs, including proprietary models such as GPT-4, GPT-3.5, and Claude-3, as well as open-source models like Llama3-70B and CodeGemma-7B. The evaluation uses a newly constructed dataset, NeoCoder, which consists of 199 recent Codeforces problems and 30 curated human solutions per problem. This dataset, along with the accompanying constraints formulated through Denial Prompting, serves as a benchmark for future research on machine creativity.

Key Findings

  1. Performance Disparity: GPT-4 demonstrated superior overall creativity (NeoGauge) relative to other models, especially in more complex, constraint-heavy scenarios. While smaller models exhibited some capability in generating novel techniques, their solutions often lacked correctness and adherence to constraints, highlighting the necessity of a balanced approach to evaluating creativity.
  2. Human-Like Creativity: Despite advancements, even the most capable LLMs, including GPT-4, fell short of human-level creativity. The study reports that humans significantly outperform LLMs in convergent thinking, as evidenced by higher correctness and constraint-following ratios in their solutions.
  3. Divergent Thinking Challenges: The divergent thinking scores of LLMs, although reasonable, suggest that models frequently generate novel but impractical or incorrect solutions. This underscores the need for more robust mechanisms to guide LLMs in producing functional, innovative solutions.

Reasoning Strategies and Creativity

The paper also evaluates several reasoning strategies – including MCTS, self-correction, planning, and sampling – to understand their impact on creativity:

  • MCTS showed the best improvement in divergent creativity but did not enhance overall creativity due to a decline in the correctness of solutions.
  • Self-Correction and Planning significantly improved convergent thinking but had negligible or even negative effects on divergent thinking.
  • Sampling showed minimal impact on both aspects of creativity, suggesting a lack of substantial benefit from additional generated samples under the evaluated conditions.

Implications and Future Work

The implications of this work are multifaceted, providing both practical and theoretical insights:

  • Practical: The introduction of NeoGauge and the NeoCoder dataset sets a new standard for evaluating LLM creativity, aiding developers in identifying and addressing weaknesses in model outputs.
  • Theoretical: The detailed analysis of LLMs versus human creativity provides a basis for future research into enhancing machine creativity, potentially guiding the development of new architectures or training methodologies aimed at better mimicking human creative processes.

The study opens several avenues for further investigation. Future research may focus on exploring more sophisticated reasoning strategies to boost both convergent and divergent thinking, examining the impact of larger and more diverse datasets, or tailoring pre-training objectives specifically for creativity enhancement.

In summary, this paper makes a significant contribution to the field of AI by providing a rigorous and nuanced framework for assessing and improving the creative capabilities of LLMs, particularly in the domain of code generation.
