Overview of ICE-Score: Instructing LLMs to Evaluate Code
The paper "ICE-Score: Instructing LLMs to Evaluate Code" addresses the complexities inherent in evaluating the performance of code generation systems. It underscores the inadequacies of traditional evaluation metrics, which often rely on simplistic token-matching techniques and are poorly aligned with human judgment. In contrast, ICE-Score offers a novel framework for assessing code using LLMs, poised to improve the assessment of both functional correctness and subjective human preferences.
Motivation and Novelty
Traditional metrics such as BLEU and ROUGE, while prevalent in natural language processing, have significant limitations for code evaluation. They fail to capture semantic equivalence and depend on high-quality references, which are scarce in low-resource domains. Evaluating code is further complicated by its syntactic complexity and the need to understand intricate programming concepts.
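To make the limitation concrete, here is a toy sketch (not from the paper) showing how a token-matching metric can penalize functionally equivalent code; NLTK's `sentence_bleu` is used as one common BLEU implementation, and the snippets are illustrative:

```python
# Illustrative only: token-matching metrics such as BLEU can score
# functionally equivalent code poorly when surface tokens differ.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "def add(a, b):\n    return a + b".split()
candidate = "def add(x, y):\n    result = x + y\n    return result".split()

smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")  # low score despite identical behavior
```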
ICE-Score introduces a multi-dimensional evaluation paradigm that leverages LLMs, circumventing the need for reference materials or test oracles. By instructing LLMs with specific evaluation criteria, ICE-Score aligns closely with both functional correctness and subjective human preferences.
Methodology and Implementation
ICE-Score's implementation comprises two components: crafting task-specific evaluation instructions and scoring generated code snippets against predefined criteria. The approach uses GPT-3.5 as its backbone model and, unlike previous LLM-based metrics, relies only on the evaluation criteria and a fixed prompt template, requiring no instruction-generation step or weighted scoring function.
Key aspects of ICE-Score include:
- Task Instructions: Detailed instructions guide LLMs in assessing code usefulness and correctness, encompassing multiple dimensions of evaluation.
- Evaluation Inputs: The model receives the task description and the generated code snippet, allowing the LLM to judge the expected functionality without any special input formatting (a minimal sketch of this setup appears after this list).
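The sketch below illustrates how such an evaluation prompt might be assembled and sent to an LLM. The prompt wording, the 0-to-4 scale, the `ice_score` function name, the `gpt-3.5-turbo` model string, and the use of the `openai` Python client are assumptions for illustration, not the paper's exact template or code:

```python
# Sketch of an ICE-Score-style evaluation call (illustrative, not the paper's code).
from openai import OpenAI

EVAL_TEMPLATE = """You will be given a problem description and a generated code snippet.
Rate the {aspect} of the code on a scale from 0 to 4, where 0 means
"not {aspect} at all" and 4 means "fully {aspect}".

Problem:
{problem}

Generated code:
{code}

Respond with a single integer score."""

def ice_score(problem: str, code: str, aspect: str = "useful") -> int:
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    prompt = EVAL_TEMPLATE.format(aspect=aspect, problem=problem, code=code)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # the paper builds on GPT-3.5-class models
        messages=[{"role": "user", "content": prompt}],
        temperature=0,           # deterministic scoring
    )
    return int(response.choices[0].message.content.strip())
```

Because the criteria live entirely in the prompt, switching from usefulness to functional correctness only changes the instruction text, not the scoring machinery.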
Experimental Evaluation
The efficacy of ICE-Score was validated through comprehensive experiments across multiple programming languages, including Java, Python, C++, and JavaScript. The evaluation focused on two aspects: human-judged usefulness and execution-based functional correctness, and compared ICE-Score against extensive baselines, including string-matching metrics such as BLEU and neural-model-based metrics such as CodeBERTScore.
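As a rough illustration of how such a meta-evaluation is typically computed, the sketch below correlates metric scores with human ratings using rank correlations such as Kendall's tau and Spearman's rho; the numbers are placeholders, not the paper's data:

```python
# Example-level correlation between a metric's scores and human ratings.
# The values below are placeholders for illustration only.
from scipy.stats import kendalltau, spearmanr

metric_scores = [3.0, 1.0, 4.0, 2.0, 0.0]   # e.g. per-snippet metric outputs
human_ratings = [4, 1, 3, 2, 0]              # human usefulness judgments

tau, _ = kendalltau(metric_scores, human_ratings)
rho, _ = spearmanr(metric_scores, human_ratings)
print(f"Kendall tau: {tau:.3f}, Spearman rho: {rho:.3f}")
```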
Results
ICE-Score outperformed established metrics, achieving stronger correlations with human judgment and functional correctness at both the example and corpus levels. Notably, reference-free ICE-Score exceeded the performance of neural-model-based, reference-based techniques, and when given reference solutions it aligned even more closely with human evaluators.
Implications and Future Directions
ICE-Score not only provides an empirically robust framework for evaluating code generation but also suggests potential applications in other code intelligence tasks, including debugging and optimization. The paper advances understanding of LLM capabilities and points to future work on more human-aligned evaluation for additional tasks such as code summarization and translation.
The proposed contribution extends beyond its immediate application, encouraging future exploration into optimizing LLMs for code evaluation through the development of more nuanced, context-sensitive models. Furthermore, this work could inspire the design of low-cost, high-efficiency evaluation frameworks suitable for broader AI research domains.
In conclusion, ICE-Score constitutes a significant step forward in the quest for reliable, scalable, and human-aligned metrics for code evaluation, thus broadening the horizons of both theoretical research and practical applications within AI-driven software engineering.