CodeScore: Evaluating Code Generation by Learning Code Execution
The paper proposes an evaluation metric for code generation, named CodeScore, which aims to overcome the limitations of traditional match-based code evaluation metrics (CEMs) that focus on surface-level differences and are restricted to specific input formats. CodeScore employs LLMs to assess the functional correctness of generated code across three input types: Ref-only, NL-only, and Ref&NL. The authors introduce a unified code generation learning framework, UniCE, to train LLMs to predict PassRatio and Executability by learning to simulate code execution.
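As a rough illustration of the three input formats, the sketch below assembles an evaluation input from whatever is available for a sample; the function name, field labels, and prompt layout are hypothetical placeholders, not the paper's actual templates.

```python
from typing import Optional

def build_unified_input(generated_code: str,
                        reference_code: Optional[str] = None,
                        nl_description: Optional[str] = None) -> str:
    """Assemble an evaluation input for one of three settings.

    Ref-only : only reference code is available.
    NL-only  : only the natural language task description is available.
    Ref&NL   : both are available.
    (Labels below are illustrative, not the paper's formatting.)
    """
    parts = []
    if nl_description is not None:
        parts.append(f"Task description:\n{nl_description}")
    if reference_code is not None:
        parts.append(f"Reference solution:\n{reference_code}")
    parts.append(f"Generated code:\n{generated_code}")
    return "\n\n".join(parts)
```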
Motivation and Challenges
The automatic evaluation of code generation is of substantial interest to both the NLP and software engineering communities. Existing match-based CEMs such as BLEU and CodeBLEU primarily emphasize lexical features and fail to account for functional equivalence, an essential criterion for code evaluation. Moreover, these metrics handle only the Ref-only input format, limiting their applicability when natural language descriptions (NL) or additional context are available. A concrete illustration of the lexical-overlap problem follows.
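Two functionally equivalent implementations can share almost no tokens. The unigram-overlap score below is a deliberately simplified stand-in for BLEU-style matching, used only to show why lexical similarity misses functional equivalence; it is not any of the metrics discussed in the paper.

```python
def unigram_overlap(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference
    (a crude proxy for match-based scoring, for illustration only)."""
    cand, ref = candidate.split(), set(reference.split())
    return sum(tok in ref for tok in cand) / max(len(cand), 1)

# Two functionally equivalent ways to sum the squares of a list.
reference = "def f(xs): return sum(x * x for x in xs)"
candidate = (
    "def f(xs):\n"
    "    total = 0\n"
    "    for v in xs:\n"
    "        total += v ** 2\n"
    "    return total"
)

print(unigram_overlap(candidate, reference))  # low score despite identical behavior
```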
CodeScore and UniCE Framework
CodeScore, as described in the paper, is an LLM-based metric that measures functional correctness in terms of agreement between execution outputs and test cases. The UniCE framework finetunes LLMs to learn code execution from unified inputs. The model is trained to predict PassRatio, the fraction of test cases the generated code passes, and binary Executability, which indicates whether the code runs without error. Across multiple experiments, the approach achieves up to 58.87% stronger correlation with functional correctness than other CEMs.
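The two training targets can be obtained by actually running code against its test suite. The helper below sketches how PassRatio and Executability labels might be computed; the execution harness, timeout handling, and output comparison are assumptions for illustration, not the paper's implementation.

```python
import os
import subprocess
import tempfile

def pass_ratio_and_executability(code: str, test_cases, timeout: float = 3.0):
    """Run `code` against (stdin, expected_stdout) test cases.

    Returns (PassRatio, Executability):
      PassRatio     - fraction of test cases whose output matches the expectation
      Executability - 1 if the program runs without crashing on every case, else 0
    Simplified harness; it does not reproduce the paper's UniCE setup.
    """
    passed, crashed = 0, False
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        for stdin_data, expected in test_cases:
            try:
                result = subprocess.run(
                    ["python", path], input=stdin_data,
                    capture_output=True, text=True, timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                crashed = True
                continue
            if result.returncode != 0:
                crashed = True
                continue
            if result.stdout.strip() == expected.strip():
                passed += 1
    finally:
        os.remove(path)
    ratio = passed / len(test_cases) if test_cases else 0.0
    return ratio, 0 if crashed else 1
```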
Experimental Validation
Empirical results demonstrate CodeScore's efficacy on three constructed datasets: APPS-Eval, MBPP-Eval, and HE-Eval. Notably, CodeScore outperforms both traditional match-based metrics and other LLM-based CEMs, showing stronger correlation with functional correctness and lower mean absolute error. The paper also highlights CodeScore's versatility across the three input formats, and its evaluation is far faster than execution-based CEMs, drastically lowering computational cost.
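For reference, agreement statistics of this kind can be computed as in the sketch below. The specific choice of Kendall's tau, Spearman correlation, and mean absolute error against ground-truth PassRatio is an assumption about the evaluation setup, shown only to make the reported quantities concrete.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def agreement_with_ground_truth(predicted_scores, true_pass_ratios):
    """Correlation and error between a metric's scores and ground-truth PassRatio."""
    pred = np.asarray(predicted_scores, dtype=float)
    true = np.asarray(true_pass_ratios, dtype=float)
    tau, _ = kendalltau(pred, true)
    rho, _ = spearmanr(pred, true)
    mae = float(np.mean(np.abs(pred - true)))
    return {"kendall_tau": tau, "spearman": rho, "mae": mae}
```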
Implications and Future Directions
The paper provides a pathway toward more accurate and computationally efficient code evaluation metrics. This research potentially facilitates the advancement of code generation technologies by improving feedback accuracy for model training, revolutionizing programming paradigms, and cutting development costs. Future work might expand CodeScore's capabilities to encompass broader programming scenarios and refine its efficiency further.
In conclusion, CodeScore presents a robust approach to measuring code functional correctness, addressing longstanding inefficiencies in match-based CEMs. This advancement supports more holistic and practical code evaluation, paving the path for future innovations in AI-driven coding solutions.