
Deep Assessment of Code Review Generation Approaches: Beyond Lexical Similarity (2501.05176v1)

Published 9 Jan 2025 in cs.SE

Abstract: Code review is a standard practice for ensuring the quality of software projects, and recent research has focused extensively on automated code review. While significant advancements have been made in generating code reviews, the automated assessment of these reviews remains less explored, with existing approaches and metrics often proving inaccurate. Current metrics, such as BLEU, primarily rely on lexical similarity between generated and reference reviews. However, such metrics tend to underestimate reviews that articulate the expected issues in ways different from the references. In this paper, we explore how semantic similarity between generated and reference reviews can enhance the automated assessment of code reviews. We first present a benchmark called GradedReviews, which is constructed by collecting real-world code reviews from open-source projects, generating reviews using state-of-the-art approaches, and manually assessing their quality. We then evaluate existing metrics for code review assessment using this benchmark, revealing their limitations. To address these limitations, we propose two novel semantic-based approaches for assessing code reviews. The first approach involves converting both the generated review and its reference into digital vectors using a deep learning model and then measuring their semantic similarity through Cosine similarity. The second approach generates a prompt based on the generated review and its reference, submits this prompt to ChatGPT, and requests ChatGPT to rate the generated review according to explicitly defined criteria. Our evaluation on the GradedReviews benchmark indicates that the proposed semantic-based approaches significantly outperform existing state-of-the-art metrics in assessing generated code review, improving the correlation coefficient between the resulting scores and human scores from 0.22 to 0.47.

Deep Assessment of Code Review Generation Approaches: Beyond Lexical Similarity

The paper "Deep Assessment of Code Review Generation Approaches: Beyond Lexical Similarity" presents a critical analysis of existing methods for assessing the quality of automatically generated code reviews, emphasizing the inadequacies of traditional lexical similarity-based metrics, such as BLEU, and advocating for a shift towards semantic-based evaluation methodologies. Recognizing the limitations of BLEU and similar metrics—chiefly their reliance on n-gram overlap which neglects semantic understanding—the authors propose novel semantic-based methods that aim to offer a more nuanced and accurate assessment of code reviews.

Contributions

The authors introduce a new benchmark, GradedReviews, which addresses the lack of high-quality datasets for evaluating code review generation approaches. GradedReviews is composed of 5,164 automatically generated code reviews paired with manual scores, derived from real-world data collected from open-source projects. The benchmark enables a comprehensive evaluation of existing metrics, revealing their deficiencies in capturing semantic equivalence between generated and reference code reviews.

To overcome these limitations, the authors propose two innovative approaches:

  1. Embedding-Based Scoring: This method uses deep embedding models to convert the generated and reference reviews into numerical vectors and measures their semantic similarity via cosine similarity. It captures semantic relationships between texts more effectively than metrics based solely on lexical overlap (a minimal sketch follows this list).
  2. LLM-Based Scoring: By leveraging LLMs such as ChatGPT, this approach evaluates generated reviews directly. It constructs a prompt containing the generated review and its reference and asks the LLM to rate the generated review against explicitly defined criteria, relying on the model's natural language understanding (also sketched after this list).
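
To make the embedding-based idea concrete, here is a minimal sketch, assuming the sentence-transformers library and an off-the-shelf embedding model; the paper's actual embedding model and preprocessing may differ.

```python
# Minimal sketch of embedding-based scoring; the model name below is an
# illustrative assumption, not necessarily the one used in the paper.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def embedding_score(generated_review: str, reference_review: str) -> float:
    """Cosine similarity between the embeddings of a generated review and its reference."""
    embeddings = model.encode([generated_review, reference_review], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

# Example: two reviews raising the same issue with different wording
# score high despite little lexical (n-gram) overlap.
print(embedding_score(
    "Consider checking `conn` for null before calling close().",
    "This can throw a NullPointerException; add a null check on the connection.",
))
```

Because the comparison happens in embedding space, paraphrased reviews that raise the same issue can receive high scores even when they share few n-grams.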
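The LLM-based scorer can be sketched as follows, assuming the openai Python client; the prompt wording, rating rubric, and model name are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal sketch of LLM-based scoring; the prompt, rubric, and model name
# are illustrative assumptions, not the paper's exact configuration.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = """You are assessing an automatically generated code review.
Reference review: {reference}
Generated review: {generated}
Rate how well the generated review raises the same issues as the reference,
from 1 (unrelated) to 5 (semantically equivalent). Reply with the number only."""

def llm_score(generated_review: str, reference_review: str) -> int:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed; the paper uses ChatGPT
        temperature=0,
        messages=[{"role": "user",
                   "content": PROMPT.format(reference=reference_review,
                                            generated=generated_review)}],
    )
    return int(response.choices[0].message.content.strip())
```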

Evaluation and Results

The evaluation conducted using the GradedReviews benchmark demonstrates the superiority of the proposed semantic-based approaches over traditional metrics. The embedding-based scoring method significantly improves the correlation coefficient between the generated scores and human-assigned scores from 0.22 (BLEU) to 0.38. The LLM-based scoring method further enhances this correlation to 0.47, showcasing its efficacy in aligning closely with human evaluations.
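
As a rough illustration of how such agreement with human judgment can be quantified, the snippet below computes a rank correlation between toy metric scores and human grades; the specific coefficient used in the paper (e.g., Pearson vs. Spearman) and the numbers here are assumptions for illustration only.

```python
# Toy illustration of measuring metric/human agreement; the data are made up.
from scipy.stats import spearmanr

human_grades  = [1, 2, 2, 3, 4, 5, 5, 3]                           # manual quality scores
metric_scores = [0.10, 0.30, 0.25, 0.40, 0.70, 0.90, 0.85, 0.45]   # automated metric outputs

rho, p_value = spearmanr(human_grades, metric_scores)
print(f"rank correlation = {rho:.2f} (p = {p_value:.3f})")
```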

A notable finding from the qualitative and quantitative analysis is that the embedding-based approach markedly reduces the overlap between the score distributions of different quality levels compared to BLEU, indicating a better ability to distinguish between subpar, average, and high-quality reviews.

Implications and Future Work

The findings suggest that semantic-based assessments offer a compelling advancement in the evaluation of code review generation approaches, potentially extending their applicability beyond code reviews to other areas like code generation, text summarization, and natural language translation. Given the rapid evolution of LLMs and embedding models, this research sets the stage for further exploration of semantic similarity metrics in various domains of software engineering.

Future work could include expanding the GradedReviews benchmark by incorporating more diverse datasets and tools, as well as exploring the potential of these semantic-based approaches in related tasks. The comparative analysis between embedding and direct LLM-based assessment provides a foundation for optimizing automated evaluation systems across different applications, inviting further inquiry into balancing accuracy and computational efficiency.

Through a rigorous empirical study and the introduction of innovative semantic-based evaluation methods, this paper advances automated assessment techniques and encourages researchers and practitioners to reconsider the metrics used to evaluate generated content so that they align with human judgment.

Authors (7)
  1. Yanjie Jiang (7 papers)
  2. Hui Liu (481 papers)
  3. Tianyi Chen (139 papers)
  4. Fu Fan (1 paper)
  5. Chunhao Dong (1 paper)
  6. Kui Liu (55 papers)
  7. Lu Zhang (373 papers)