Benchmarking LLM Creativity: A Case Study on Code Generation
The paper "Benchmarking LLM Creativity: A Case Study on Code Generation" by Yining Lu et al. introduces a novel framework and evaluation methodology for quantifying the creativity of LLMs. By addressing the dual facets of cognitive creativity, namely, convergent and divergent thinking, the paper proposes a tailored approach to eliciting and measuring creative outputs from LLMs, specifically in the context of code generation challenges from Codeforces.
Framework and Methodology
The authors pioneer a dual-component framework to quantify LLM creativity:
- Denial Prompting: A systematic prompting method where the LLM is asked to solve coding problems under progressively restrictive constraints. By iteratively prohibiting the use of specific programming techniques used in the previous solution (e.g., for-loops, if-statements), Denial Prompting coerces the model to explore less conventional solutions, thus stimulating out-of-the-box thinking.
- NeoGauge Metric: A comprehensive creativity score that captures both convergent and divergent thinking. Convergent thinking is assessed by the correctness of the generated solutions and their adherence to the imposed constraints, while divergent thinking is evaluated by comparing the techniques each solution uses against those found in a historical set of human-contributed solutions. A minimal code sketch of how the two components fit together follows this list.
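To make the framework concrete, here is a minimal Python sketch of both components. The technique detector, prompt wording, round count, and the equal weighting of the convergent and divergent terms are illustrative assumptions and do not reproduce the paper's exact implementation.

```python
import ast

# Illustrative sketch of Denial Prompting and a NeoGauge-style score.
# The technique detector, prompt wording, round count, and the equal
# weighting of the two terms are assumptions, not the paper's exact setup.

TECHNIQUE_NODES = {            # assumed mapping from AST node types to "techniques"
    ast.For: "for-loop",
    ast.While: "while-loop",
    ast.If: "if-statement",
    ast.ListComp: "list-comprehension",
    ast.Lambda: "lambda",
}

def detect_techniques(code: str) -> set[str]:
    """Return the set of programming techniques used in a Python solution."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return set()
    return {name for node in ast.walk(tree)
            for node_type, name in TECHNIQUE_NODES.items()
            if isinstance(node, node_type)}

def denial_prompting(generate, statement: str, rounds: int = 5):
    """Re-prompt `generate` (any LLM call) while banning the techniques
    observed in the previous round's solution."""
    banned: set[str] = set()
    history = []
    for _ in range(rounds):
        ban = f" Do not use: {', '.join(sorted(banned))}." if banned else ""
        solution = generate(f"Solve in Python.{ban}\n\n{statement}")
        history.append((solution, frozenset(banned)))
        banned |= detect_techniques(solution)
    return history

def neogauge(history, is_correct, human_techniques: set[str]) -> float:
    """Average a convergent term (correct and constraint-following) with a
    divergent term (fraction of techniques absent from human solutions)."""
    convergent = divergent = 0.0
    for solution, banned in history:
        used = detect_techniques(solution)
        convergent += float(is_correct(solution) and not (used & banned))
        divergent += len(used - human_techniques) / max(1, len(used))
    n = max(1, len(history))
    return 0.5 * convergent / n + 0.5 * divergent / n
```

Here `generate` and `is_correct` are injected callables standing in for the model API and a Codeforces-style judge; the paper's actual technique taxonomy and scoring rule are richer than this toy version.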
Experimentation and Dataset
The framework is employed to examine multiple state-of-the-art LLMs, including proprietary models such as GPT-4, GPT-3.5, and Claude-3, as well as open-source models like Llama3-70B and CodeGemma-7B. The evaluation uses a newly constructed dataset, NeoCoder, which consists of 199 recent Codeforces problems and 30 curated human solutions per problem. This dataset, along with the accompanying constraints formulated through Denial Prompting, serves as a benchmark for future research on machine creativity.
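As a rough illustration of what one benchmark item could look like, the sketch below defines a hypothetical record type; the field names and types are assumptions and may not match the released NeoCoder format.

```python
from dataclasses import dataclass, field

# Hypothetical shape of a single NeoCoder item; field names and types are
# illustrative assumptions, not the benchmark's actual schema.

@dataclass
class NeoCoderProblem:
    problem_id: str                               # Codeforces problem identifier
    statement: str                                # full problem statement
    tests: list[tuple[str, str]]                  # (stdin, expected stdout) pairs
    human_solutions: list[str] = field(default_factory=list)  # ~30 curated per problem
    constraints: list[str] = field(default_factory=list)      # Denial Prompting bans

def human_technique_pool(problem: NeoCoderProblem, detect) -> set[str]:
    """Union of the techniques found in the curated human solutions, where
    `detect` is a code -> set-of-techniques function (e.g. detect_techniques
    from the sketch above); used as the reference set for divergent scoring."""
    pool: set[str] = set()
    for code in problem.human_solutions:
        pool |= detect(code)
    return pool
```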
Key Findings
- Performance Disparity: GPT-4 demonstrated the highest overall creativity (NeoGauge) among the evaluated models, especially in more complex, constraint-heavy scenarios. While smaller models exhibited some ability to generate novel techniques, their solutions often lacked correctness or violated the imposed constraints, underscoring why the metric must weigh both correctness and novelty rather than either alone.
- Human-Like Creativity: Despite advancements, even the most capable LLMs, including GPT-4, fell short of human-level creativity. The paper reports that humans significantly outperform LLMs in convergent thinking, as evidenced by higher correctness and constraint-following ratios in their solutions.
- Divergent Thinking Challenges: The divergent thinking scores of LLMs, although reasonable, suggest that models frequently generate novel but impractical or incorrect solutions. This underscores the need for more robust mechanisms to guide LLMs in producing functional, innovative solutions.
Reasoning Strategies and Creativity
The paper also evaluates several reasoning strategies, including Monte Carlo Tree Search (MCTS), self-correction, planning, and sampling, to understand their impact on creativity; a brief sketch of two of these strategies follows the list:
- MCTS showed the best improvement in divergent creativity but did not enhance overall creativity due to a decline in the correctness of solutions.
- Self-Correction and Planning significantly improved convergent thinking but had negligible or even negative effects on divergent thinking.
- Sampling showed minimal impact on either aspect of creativity, suggesting that drawing additional samples offers little benefit under the evaluated conditions.
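As a rough idea of how two of these strategies can wrap the base generation step, consider the sketch below. The feedback prompt, attempt counts, and selection rule are assumptions rather than the paper's configuration, and `generate`/`is_correct` are the same injected callables as in the earlier sketch.

```python
# Illustrative wrappers for two of the evaluated strategies; the feedback
# prompt, attempt counts, and selection rule are assumptions only.

def with_self_correction(generate, is_correct, prompt: str, attempts: int = 3) -> str:
    """Re-prompt with an error notice until a solution passes the judge."""
    solution = generate(prompt)
    for _ in range(attempts - 1):
        if is_correct(solution):
            break
        solution = generate(
            f"{prompt}\n\nYour previous solution was incorrect:\n{solution}\n"
            "Fix it and return a corrected Python solution."
        )
    return solution

def with_sampling(generate, is_correct, prompt: str, n: int = 5) -> str:
    """Draw up to n independent samples and keep the first one that passes,
    falling back to the last sample if none does."""
    solution = ""
    for _ in range(n):
        solution = generate(prompt)
        if is_correct(solution):
            break
    return solution
```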
Implications and Future Work
The implications of this work are multifaceted, providing both practical and theoretical insights:
- Practical: The NeoGauge metric and the NeoCoder dataset give researchers and developers a reusable benchmark for evaluating LLM creativity, making it easier to identify and address weaknesses in model outputs.
- Theoretical: The detailed analysis of LLMs versus human creativity provides a basis for future research into enhancing machine creativity, potentially guiding the development of new architectures or training methodologies aimed at better mimicking human creative processes.
The paper opens several avenues for further investigation. Future research may focus on exploring more sophisticated reasoning strategies to boost both convergent and divergent thinking, examining the impact of larger and more diverse datasets, or tailoring pre-training objectives specifically for creativity enhancement.
In summary, this paper makes a significant contribution to the field of AI by providing a rigorous and nuanced framework for assessing and improving the creative capabilities of LLMs, particularly in the domain of code generation.