Benchmarking and Revisiting Code Generation Assessment: A Mutation-Based Approach (2505.06880v1)

Published 11 May 2025 in cs.SE

Abstract: Code LLMs (CLLMs) have exhibited outstanding performance in program synthesis, attracting the focus of the research community. The evaluation of CLLM's program synthesis capability has generally relied on manually curated benchmarks. However, there is a substantial gap between real-world scenarios and benchmark settings. Existing benchmarks typically provide only a single input prompt for the evaluation of each synthesis problem. However, in practice, a problem can be described in various ways, including with typos, where developers may struggle to understand certain descriptions and seek clarification to find more suitable wording. Such various descriptions may lead to variations in the performance of CLLMs on the same question, resulting in a biased evaluation when using existing benchmarks. In this paper, we aim to explore these pitfalls with the goal of revisiting and enhancing future benchmark designs. To simulate real-world variations in problem descriptions, we propose 10 mutation strategies and introduce three new metrics to evaluate their impact on code generation. We then assess five popular CLLMs using 12,834 generated prompt variants, and found a significant performance discrepancy between the results from existing benchmarks and those from mutated benchmarks containing perturbations and variations. This finding underscores the need for more robust evaluation methods and benchmarks.

Summary

Evaluation of Code Generation Benchmarks: Exploring Mutation Strategies

The paper "Benchmarking and Revisiting Code Generation Assessment: A Mutation-Based Approach" addresses the limitations of current methods for evaluating Code LLMs (CLLMs) in the context of program synthesis. The authors critique the prevalent use of static, manually curated benchmarks that employ singular input prompts for each coding problem, which inadequately reflect the variety of real-world scenarios where problems can be presented in diverse ways. This inherent limitation can lead to discrepancies between reported and actual performance of CLLMs when confronted with practical usage.

The authors introduce an evaluation framework built on mutation strategies that simulate real-world variations and perturbations in input prompts. Their approach creates prompt variants through methods such as typo simulation, synonym substitution, paraphrasing, summarization, and example manipulation. In total, the paper introduces ten mutation strategies and three new metrics: Correctness Variability (CV), Mutation Bias (MB), and Best Pass@k (Pass@k_b). These metrics quantify how the mutations affect CLLM performance, an effect that existing benchmarks such as HumanEval do not capture.
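To make the mutation idea concrete, here is a minimal, illustrative Python sketch of two such operators (typo injection and naive synonym substitution) applied to a HumanEval-style prompt. The function names, the synonym table, and the mutation rate are hypothetical choices for illustration, not the paper's implementation.

```python
import random

def inject_typos(prompt: str, rate: float = 0.3, seed: int = 0) -> str:
    """Swap two adjacent characters in a random subset of words (illustrative typo mutation)."""
    rng = random.Random(seed)
    words = prompt.split(" ")
    for i, word in enumerate(words):
        if len(word) > 3 and rng.random() < rate:
            j = rng.randrange(len(word) - 1)
            words[i] = word[:j] + word[j + 1] + word[j] + word[j + 2:]
    return " ".join(words)

# A tiny, hand-picked synonym table; a real mutator might use a thesaurus or an LLM paraphraser.
SYNONYMS = {"longest": "lengthiest", "string": "text", "list": "sequence"}

def substitute_synonyms(prompt: str) -> str:
    """Replace selected words with synonyms (illustrative synonym-substitution mutation)."""
    for word, synonym in SYNONYMS.items():
        prompt = prompt.replace(word, synonym)
    return prompt

original = "Write a function that finds the longest string in a list."
print(inject_typos(original))         # typo variant; which words mutate depends on rate and seed
print(substitute_synonyms(original))  # "... finds the lengthiest text in a sequence."
```

Each operator keeps the task semantically recoverable while changing its surface form, which is exactly the property the benchmark mutations are meant to exercise.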

Key findings highlight significant inconsistencies between CLLM performance on traditional benchmarks and on mutated input prompts. Even slight changes in the phrasing of a problem description can produce substantial differences in results, with individual CLLMs improving or declining depending on the type of mutation. For instance, the paper reports that typos in variable names can sometimes enhance, rather than degrade, model performance.

The analysis covers five popular CLLMs (DeepSeek, Llama3.1, CodeLlama, CodeGen, and InCoder), 12,834 prompt variants, and 10 mutation strategies, and yields several insights. A key observation is that higher-performing models are more susceptible to performance fluctuations when problem descriptions are mutated, indicating a reliance on specific input formulations. Conversely, lower-performing models appear more sensitive to changes in function and variable names, suggesting differing degrees of semantic understanding.
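The aggregation behind such an analysis can be sketched as follows. The data layout (per-problem, per-variant lists of pass/fail outcomes) is assumed for illustration, and the "variability" and "best_variant" statistics are only rough analogues of the paper's CV and Pass@k_b metrics, whose exact definitions are given in the paper itself.

```python
from statistics import mean, pstdev

# Assumed layout: results[problem_id][variant_id] is a list of booleans recording
# whether each sampled completion passed the problem's unit tests.
def summarize(results):
    summary = {}
    for problem_id, variants in results.items():
        rates = [sum(passed) / len(passed) for passed in variants.values()]
        summary[problem_id] = {
            "mean_rate": mean(rates),      # average pass rate over all prompt variants
            "variability": pstdev(rates),  # spread across variants (rough analogue of CV)
            "best_variant": max(rates),    # best-case phrasing (rough analogue of Pass@k_b)
            "worst_variant": min(rates),
        }
    return summary

toy = {
    "HumanEval/0": {
        "original":   [True, False, True],
        "typo":       [False, False, True],
        "paraphrase": [True, True, True],
    }
}
print(summarize(toy))
```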

The paper's implications are far-reaching for both the practical application and the theoretical understanding of CLLMs. By exposing the biases embedded in existing evaluation methodologies and highlighting the need for benchmarks that incorporate a wider range of prompt variations, the research argues for evolving how code synthesis is evaluated. Future work could adopt an iterative approach in which a model is asked to synthesize code only after confirming its understanding of the task through a corrective feedback loop, potentially yielding assessments that better mirror real-world usage.
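As a thought experiment only, such a clarify-then-generate loop might look like the sketch below. The `model.ask` and `model.generate_code` calls are hypothetical interfaces; nothing here is prescribed by the paper.

```python
def synthesize_with_clarification(model, problem: str, max_rounds: int = 2) -> str:
    """Hypothetical loop: have the model restate the task and confirm it before generating code."""
    description = problem
    for _ in range(max_rounds):
        restatement = model.ask(f"Restate this task in your own words:\n{description}")
        verdict = model.ask(
            "Does the restatement match the task? Answer yes or no.\n"
            f"Task: {description}\nRestatement: {restatement}"
        )
        if verdict.strip().lower().startswith("yes"):
            break
        # If understanding was not confirmed, fold the restatement back in as a point to clarify.
        description = f"{description}\n\nClarification: the task is not '{restatement}'."
    return model.generate_code(description)
```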

The paper is an important contribution to the field, underscoring the need for rigorous assessment techniques that account for the variability and ambiguity of the natural-language and code descriptions that CLLMs encounter. As research in AI and code synthesis progresses, such evaluation methodologies could enable fairer model comparisons and guide development efforts across diverse model architectures and training paradigms.
