
Can Large Language Models Serve as Evaluators for Code Summarization? (2412.01333v1)

Published 2 Dec 2024 in cs.SE

Abstract: Code summarization facilitates program comprehension and software maintenance by converting code snippets into natural-language descriptions. Over the years, numerous methods have been developed for this task, but a key challenge remains: effectively evaluating the quality of generated summaries. While human evaluation is effective for assessing code summary quality, it is labor-intensive and difficult to scale. Commonly used automatic metrics, such as BLEU, ROUGE-L, METEOR, and BERTScore, often fail to align closely with human judgments. In this paper, we explore the potential of LLMs for evaluating code summarization. We propose CODERPE (Role-Player for Code Summarization Evaluation), a novel method that leverages role-player prompting to assess the quality of generated summaries. Specifically, we prompt an LLM agent to play diverse roles, such as code reviewer, code author, code editor, and system analyst. Each role evaluates the quality of code summaries across key dimensions, including coherence, consistency, fluency, and relevance. We further explore the robustness of LLMs as evaluators by employing various prompting strategies, including chain-of-thought reasoning, in-context learning, and tailored rating form designs. The results demonstrate that LLMs serve as effective evaluators for code summarization methods. Notably, our LLM-based evaluator, CODERPE, achieves an 81.59% Spearman correlation with human evaluations, outperforming the existing BERTScore metric by 17.27%.

Authors (8)
  1. Yang Wu (175 papers)
  2. Yao Wan (70 papers)
  3. Zhaoyang Chu (7 papers)
  4. Wenting Zhao (44 papers)
  5. Ye Liu (153 papers)
  6. Hongyu Zhang (147 papers)
  7. Xuanhua Shi (20 papers)
  8. Philip S. Yu (592 papers)

Summary

Evaluating Code Summarization with LLMs

The paper "Can LLMs Serve as Evaluators for Code Summarization?" investigates the utilizability of LLMs as evaluators in the domain of automatic code summarization. Given the critical role of code summarization in enhancing program comprehension and facilitating software maintenance, accurate and scalable evaluation methods are pivotal. Traditionally, human evaluation has been considered the benchmark for assessing code summarization quality. However, this method is inherently labor-intensive and lacks scalability, thus inciting reliance on automatic metrics such as BLEU, ROUGE-L, METEOR, and BERTScore. Despite their popularity, these metrics frequently display weak correlation with human evaluations, as they predominantly focus on n-gram overlaps and fail to capture the semantic essence of generated summaries.

In response to these evaluation challenges, the paper introduces CODERPE (Role-Player for Code Summarization Evaluation), a novel approach that leverages LLMs for reference-free evaluation. The core idea is to prompt an LLM to adopt various roles, such as code reviewer, original code author, code editor, and systems analyst, and to have each role evaluate generated summaries across four dimensions: coherence, consistency, fluency, and relevance.
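
To make the role-player scheme concrete, here is a minimal Python sketch of how such prompts could be assembled. The roles and dimensions come from the paper, but the prompt wording, the 1-5 rating scale, and the example code and summary are illustrative assumptions, not the paper's actual templates.

```python
# A minimal sketch of role-player prompting in the spirit of CODERPE.
ROLES = ["code reviewer", "original code author", "code editor", "systems analyst"]
DIMENSIONS = ["coherence", "consistency", "fluency", "relevance"]

def build_role_prompt(role: str, dimension: str, code: str, summary: str) -> str:
    """Compose a reference-free evaluation prompt for one role/dimension pair."""
    return (
        f"You are a {role}. Rate the following code summary for {dimension} "
        f"on a scale of 1-5, then briefly justify your score.\n\n"
        f"Code:\n{code}\n\nSummary:\n{summary}\n\nScore:"
    )

code_snippet = "def add(a, b):\n    return a + b"
summary = "Adds two numbers and returns the result."

# One prompt per (role, dimension) pair; the LLM's numeric answers would then
# be aggregated (e.g., averaged) into a final quality score for the summary.
prompts = [build_role_prompt(r, d, code_snippet, summary) for r in ROLES for d in DIMENSIONS]
print(prompts[0])
```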

In empirical evaluations, the paper reports a significant improvement with CODERPE: its Spearman correlation with human judgments reaches 81.59%, a gain of 17.27% over the existing BERTScore metric. This is achieved through a combination of prompting strategies, including chain-of-thought (CoT) reasoning, in-context learning demonstrations, and tailored rating forms; different configurations are evaluated to identify the most effective setup for the task. Notably, the role-player prompting strategy allows LLMs to interpret and assess the multi-faceted nature of code summaries more effectively than conventional metrics.
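
For context, agreement with human judgment is typically quantified with Spearman rank correlation. The sketch below (using SciPy and made-up toy scores, not data from the paper) shows how such a correlation between an evaluator's scores and human ratings would be computed.

```python
# Spearman rank correlation between metric scores and human ratings.
# The values below are toy numbers for illustration only.
from scipy.stats import spearmanr

human_ratings = [4.5, 2.0, 3.5, 5.0, 1.5, 4.0]  # e.g., averaged annotator scores per summary
llm_scores    = [4.0, 2.5, 3.0, 5.0, 1.0, 4.5]  # e.g., LLM-evaluator scores for the same summaries

rho, p_value = spearmanr(human_ratings, llm_scores)
print(f"Spearman correlation: {rho:.4f} (p={p_value:.4f})")
# A higher rho means the evaluator ranks summaries more like humans do;
# the paper reports 81.59% for its LLM-based evaluator versus BERTScore.
```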

The paper's findings carry several implications. Practically, LLM-based evaluation systems for code summarization could substantially reduce dependence on labor-intensive human evaluation. Theoretically, the research demonstrates a promising direction for using LLMs to evaluate code-related tasks, with potential extensions to other code intelligence areas such as code generation and bug detection. It also raises a broader question about the future roles of LLMs as evaluation oracles across software engineering domains.

Looking ahead, this research invites further refinement of the role-player prompting strategy and of the evaluation instructions to maximize LLM performance as evaluators. It also encourages experiments with a wider range of LLMs and with new prompting techniques. Overall, the paper makes a compelling case for using LLMs in evaluation tasks where quality is subjective and has traditionally been assessed through labor-intensive human studies.
