Evaluating Code Summarization with LLMs
The paper "Can LLMs Serve as Evaluators for Code Summarization?" investigates the utilizability of LLMs as evaluators in the domain of automatic code summarization. Given the critical role of code summarization in enhancing program comprehension and facilitating software maintenance, accurate and scalable evaluation methods are pivotal. Traditionally, human evaluation has been considered the benchmark for assessing code summarization quality. However, this method is inherently labor-intensive and lacks scalability, thus inciting reliance on automatic metrics such as BLEU, ROUGE-L, METEOR, and BERTScore. Despite their popularity, these metrics frequently display weak correlation with human evaluations, as they predominantly focus on n-gram overlaps and fail to capture the semantic essence of generated summaries.
In response to these challenges, the paper introduces C<sub>ODE</sub>RPE (Role-Player for Code Summarization Evaluation), a reference-free evaluation approach built on LLMs. The core idea is to prompt an LLM to adopt different roles - code reviewer, original code author, code editor, and systems analyst - and to evaluate generated summaries along four dimensions: coherence, consistency, fluency, and relevance.
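A rough sketch of what such role-conditioned, reference-free scoring might look like is shown below. The role descriptions, dimension wording, rating scale, and the `query_llm` helper are illustrative assumptions, not the paper's exact prompts or interface:

```python
# Illustrative sketch of role-player, reference-free evaluation in the spirit
# of CODERPE. Role texts, prompt wording, and the query_llm helper are
# hypothetical placeholders rather than the paper's actual implementation.

ROLES = {
    "code reviewer": "You routinely review code and judge whether its documentation is accurate.",
    "original code author": "You wrote this code and know exactly what it is intended to do.",
    "code editor": "You polish code comments and summaries for clarity and style.",
    "systems analyst": "You assess whether documentation conveys the code's purpose to stakeholders.",
}
DIMENSIONS = ["coherence", "consistency", "fluency", "relevance"]

def build_prompt(role: str, persona: str, code: str, summary: str, dimension: str) -> str:
    """Compose a role-conditioned prompt asking for a 1-5 rating on one dimension."""
    return (
        f"Act as a {role}. {persona}\n\n"
        f"Code:\n{code}\n\nCandidate summary:\n{summary}\n\n"
        f"Rate the summary's {dimension} on a scale of 1 (poor) to 5 (excellent). "
        f"Think step by step, then answer with a single integer on the last line."
    )

def score_summary(code: str, summary: str, query_llm) -> dict:
    """Average each dimension's rating over all roles; query_llm is assumed to
    take a prompt string and return the model's text response."""
    scores = {}
    for dim in DIMENSIONS:
        ratings = []
        for role, persona in ROLES.items():
            reply = query_llm(build_prompt(role, persona, code, summary, dim))
            ratings.append(int(reply.strip().splitlines()[-1]))
        scores[dim] = sum(ratings) / len(ratings)
    return scores
```

Because no reference summary appears in the prompt, the approach sidesteps the need for ground-truth summaries that overlap-based metrics require.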
Empirically, the paper reports that C<sub>ODE</sub>RPE achieves a Spearman correlation of 81.59% with human judgments, an improvement of 17.27% over existing metrics such as BERTScore. This is achieved through a combination of prompting strategies, including chain-of-thought (CoT) reasoning, in-context learning demonstrations, and tailored rating forms, and the paper compares different prompting configurations to identify the most effective setup. Notably, the role-player strategy lets the LLM assess the multi-faceted quality of code summaries more effectively than conventional metrics do.
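The headline number is a rank correlation between the evaluator's scores and human ratings over the same set of summaries. A minimal sketch of that meta-evaluation step, with made-up scores purely for illustration:

```python
# How a correlation figure like the reported 81.59% is typically computed:
# Spearman's rank correlation between metric scores and human ratings.
# Both score lists below are hypothetical illustrations, not the paper's data.
from scipy.stats import spearmanr

human_ratings = [4.5, 2.0, 3.5, 5.0, 1.5, 4.0]   # hypothetical human judgments
metric_scores = [4.2, 2.5, 3.0, 4.8, 1.0, 4.1]   # hypothetical LLM evaluator scores

rho, p_value = spearmanr(human_ratings, metric_scores)
print(f"Spearman rho: {rho:.4f} (p = {p_value:.3g})")
```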
The findings carry several implications. Practically, LLM-based evaluation of code summarization could substantially reduce reliance on costly, hard-to-scale human evaluation. Theoretically, the work demonstrates a promising direction for using LLMs to evaluate code-related tasks, which may extend to other code intelligence areas such as code generation and bug detection. It also raises a broader question about the future role of LLMs as evaluation oracles across software engineering.
Looking ahead, this research invites further work on refining the role-play prompting strategy and optimizing instructions to maximize LLM performance as evaluators, as well as experiments with a wider range of LLMs and new prompting techniques. Overall, the paper makes a compelling case for using LLMs in evaluation tasks where quality is subjective and has traditionally been assessed through labor-intensive human studies.