Evaluation of Code Generation Benchmarks: Exploring Mutation Strategies
The paper "Benchmarking and Revisiting Code Generation Assessment: A Mutation-Based Approach" addresses the limitations of current methods for evaluating Code LLMs (CLLMs) in the context of program synthesis. The authors critique the prevalent use of static, manually curated benchmarks that employ singular input prompts for each coding problem, which inadequately reflect the variety of real-world scenarios where problems can be presented in diverse ways. This inherent limitation can lead to discrepancies between reported and actual performance of CLLMs when confronted with practical usage.
The authors introduce a novel evaluation framework that uses mutation strategies to simulate real-world variations and perturbations in input prompts. Their approach creates prompt variants through methods such as typo simulation, synonym substitution, paraphrasing, summarization, and example manipulation. In total, the paper introduces ten mutation strategies and three new metrics, Correctness Variability (CV), Mutation Bias (MB), and Best Pass@k (Pass@k_b), to capture how these mutations affect CLLM performance in a more nuanced way than existing benchmarks such as HumanEval.
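To make the mutation idea concrete, the sketch below shows two illustrative prompt-mutation operators in the spirit of the typo-simulation and synonym-substitution strategies. The function names, the synonym table, and the example prompt are assumptions made for illustration and are not the authors' implementation.

```python
import random

# Illustrative prompt-mutation operators in the spirit of the paper's
# typo-simulation and synonym-substitution strategies. The names, the
# synonym table, and the example prompt are assumptions for illustration.

SYNONYMS = {
    "string": "text",
    "list": "sequence",
    "return": "give back",
}

def mutate_typo(prompt: str, rate: float = 0.2, seed: int = 0) -> str:
    """Swap adjacent characters in a random subset of words to simulate typos."""
    rng = random.Random(seed)
    words = prompt.split(" ")
    for i, word in enumerate(words):
        if len(word) > 3 and rng.random() < rate:
            j = rng.randrange(len(word) - 1)
            words[i] = word[:j] + word[j + 1] + word[j] + word[j + 2:]
    return " ".join(words)

def mutate_synonyms(prompt: str) -> str:
    """Substitute known words with synonyms while preserving the task intent."""
    return " ".join(SYNONYMS.get(w.lower(), w) for w in prompt.split(" "))

original = "Write a function that returns the longest string in a given list"
variants = [mutate_typo(original), mutate_synonyms(original)]
print(variants)
```

Each such variant would then be given to the model under evaluation and scored against the same test suite as the original problem, so that any performance shift is attributable to the prompt perturbation alone.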
Key findings reveal significant inconsistencies between CLLM performance on traditional benchmarks and on mutated input prompts. Even slight changes in the phrasing of a problem description can produce substantial differences across models, with individual CLLMs improving or degrading depending on the type of mutation. For instance, the paper reports that typos in variable names can sometimes enhance, rather than harm, model performance.
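The summary above does not reproduce the paper's exact formulas, but a minimal sketch can convey how mutation-aware metrics compare behavior across prompt variants. The definitions below are assumptions chosen for illustration: the spread of per-mutation pass rates as a proxy for Correctness Variability, the average signed shift from the original prompt as a proxy for Mutation Bias, and the best pass rate over all variants as a proxy for Best Pass@k; the paper's actual definitions may differ.

```python
from statistics import mean, pstdev

# Hedged sketch of mutation-aware metrics; the exact formulas in the paper may
# differ. Here CV is approximated by the spread of per-mutation pass rates,
# MB by the average signed shift relative to the unmutated prompt, and
# Pass@k_b by the best pass rate over all prompt formulations.

def correctness_variability(pass_rates: dict) -> float:
    """Spread of pass rates across mutation strategies (higher = less stable)."""
    return pstdev(pass_rates.values())

def mutation_bias(pass_rates: dict, original: float) -> float:
    """Average signed change in pass rate caused by the mutations."""
    return mean(r - original for r in pass_rates.values())

def best_pass_at_k(pass_rates: dict, original: float) -> float:
    """Best pass rate achieved by any prompt formulation, original included."""
    return max([original, *pass_rates.values()])

# Hypothetical pass@1 values per mutation strategy for a single model.
rates = {"typo": 0.31, "synonym": 0.28, "paraphrase": 0.35, "summarize": 0.22}
print(correctness_variability(rates), mutation_bias(rates, 0.30), best_pass_at_k(rates, 0.30))
```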
The analysis covers five popular CLLMs (DeepSeek, Llama3.1, CodeLlama, CodeGen, and InCoder), 12,834 prompt variants, and the ten mutation strategies, and it yields several insights. A key observation is that higher-performing models are more susceptible to performance fluctuations when problem descriptions are mutated, indicating a reliance on specific input formulations. Conversely, weaker models appear more sensitive to changes in function and variable names, suggesting differing degrees of semantic understanding.
The paper's implications are far-reaching for both the practical application and the theoretical understanding of CLLMs. By exposing biases embedded in existing evaluation methodologies and highlighting the need for benchmarks that cover a wider range of prompt variations, the research argues for evolving how code synthesis tasks are evaluated. Future work could adopt an iterative protocol in which a model is asked to synthesize code only after its understanding of the task has been confirmed through corrective feedback loops, yielding assessments that better mirror real-world conditions.
This paper is an important contribution to the field, underscoring the need for rigorous assessment techniques that account for the variability of natural language and the ambiguities of code that CLLMs encounter. As research in artificial intelligence and code synthesis progresses, adopting these more nuanced evaluation methodologies could enable fairer model comparisons and encourage development across diverse model architectures and training paradigms.