
On the Evaluation of Neural Code Summarization (2107.07112v2)

Published 15 Jul 2021 in cs.SE and cs.AI

Abstract: Source code summaries are important for program comprehension and maintenance. However, there are plenty of programs with missing, outdated, or mismatched summaries. Recently, deep learning techniques have been exploited to automatically generate summaries for given code snippets. To achieve a profound understanding of how far we are from solving this problem and provide suggestions to future research, in this paper, we conduct a systematic and in-depth analysis of 5 state-of-the-art neural code summarization models on 6 widely used BLEU variants, 4 pre-processing operations and their combinations, and 3 widely used datasets. The evaluation results show that some important factors have a great influence on the model evaluation, especially on the performance of models and the ranking among the models. However, these factors might be easily overlooked. Specifically, (1) the BLEU metric widely used in existing work of evaluating code summarization models has many variants. Ignoring the differences among these variants could greatly affect the validity of the claimed results. Furthermore, we conduct human evaluations and find that the metric BLEU-DC is most correlated to human perception; (2) code pre-processing choices can have a large (from -18% to +25%) impact on the summarization performance and should not be neglected. We also explore the aggregation of pre-processing combinations and boost the performance of models; (3) some important characteristics of datasets (corpus sizes, data splitting methods, and duplication ratios) have a significant impact on model evaluation. Based on the experimental results, we give actionable suggestions for evaluating code summarization and choosing the best method in different scenarios. We also build a shared code summarization toolbox to facilitate future research.

Citations (78)

Summary

  • The paper reveals that BLEU-DC, a sentence-level BLEU metric with smoothing, best aligns with human judgment in evaluating neural code summaries.
  • The paper finds that differing pre-processing strategies can shift model performance by -18% to +25%, with ensemble methods showing promise.
  • The paper demonstrates that dataset characteristics significantly affect model evaluations, highlighting the need for diverse testing to ensure fairness.

An Evaluation of Neural Code Summarization Techniques

This paper offers a comprehensive examination of the methodologies and evaluation metrics applied to neural code summarization. This matters because code summaries play a significant role in program comprehension and maintenance, yet writing and maintaining them by hand remains labor-intensive.

Key Findings

The authors conduct a systematic evaluation of five state-of-the-art neural code summarization models—CodeNN, Deepcom, Astattgru, Rencos, and NCS—across multiple BLEU variants, code pre-processing combinations, and datasets. The paper reveals several critical insights into the evaluation of code summarization models:

  1. Evaluation Metrics Impact: The paper identifies substantial variability in the BLEU scores produced by different variants of the metric and highlights that ignoring the differences among these variants can lead to inconsistent and misleading results. The authors find that BLEU-DC, a sentence-level BLEU variant with smoothing, aligns most closely with human judgment (a minimal sketch contrasting two variants follows this list).
  2. Pre-processing Variations: Different code pre-processing operations have a significant impact on summarization performance, shifting it by anywhere from -18% to +25%. No single pre-processing strategy universally dominated across all models; however, combining multiple pre-processing operations through ensemble methods showed promise for improving model performance.
  3. Dataset Characteristics: The results demonstrate that the performance of code summarization models varies significantly across datasets due to factors such as corpus size, data splitting method, and code duplication. The paper underscores the necessity of considering these factors during model evaluation to ensure fairness and generalizability (a simple duplication-ratio check is sketched after this list).
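
To make the first point concrete, the sketch below (not taken from the paper's shared toolbox) uses NLTK to score the same generated summary with an unsmoothed corpus-level BLEU and a smoothed sentence-level BLEU, the style of metric the paper refers to as BLEU-DC. The example summaries and the choice of NLTK's `method4` smoothing are illustrative assumptions.

```python
# Illustrative comparison of BLEU variants (assumed example, not the authors' toolbox code).
# Requires NLTK: pip install nltk
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction

# Hypothetical reference and generated summaries, pre-tokenized into words.
references = [["returns", "the", "maximum", "value", "in", "the", "list"]]
hypothesis = ["return", "the", "maximum", "of", "the", "list"]

# Unsmoothed corpus-level BLEU: with no matching 4-grams, the score collapses to 0.
corpus_score = corpus_bleu([references], [hypothesis])

# Sentence-level BLEU with smoothing (method4 is used here as an assumed stand-in
# for the smoothed, sentence-level variant the paper calls BLEU-DC).
smooth = SmoothingFunction().method4
sentence_score = sentence_bleu(references, hypothesis, smoothing_function=smooth)

print(f"unsmoothed corpus BLEU: {corpus_score:.4f}")
print(f"smoothed sentence BLEU: {sentence_score:.4f}")
```

On this toy pair, the unsmoothed variant reports zero while the smoothed sentence-level variant returns a non-zero score; divergences of exactly this kind are what the paper warns can flip conclusions when the variant in use is not reported.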

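For the third point, one dataset characteristic is easy to check directly: the duplication ratio between training and test splits. The following sketch is a simplified, assumed check (exact match after whitespace normalization), not the paper's measurement procedure.

```python
# Simplified train/test duplication check (illustrative; not the paper's exact procedure).

def duplication_ratio(train_code: list[str], test_code: list[str]) -> float:
    """Fraction of test snippets whose code also appears verbatim in the training split."""
    normalize = lambda s: " ".join(s.split())  # collapse whitespace differences
    train_set = {normalize(c) for c in train_code}
    duplicated = sum(1 for c in test_code if normalize(c) in train_set)
    return duplicated / len(test_code) if test_code else 0.0

# Toy example: one of the two test snippets also appears in training.
train = ["int add(int a, int b) { return a + b; }",
         "int sub(int a, int b) { return a - b; }"]
test = ["int add(int a, int b) { return a + b; }",
        "int mul(int a, int b) { return a * b; }"]
print(f"duplication ratio: {duplication_ratio(train, test):.2f}")  # 0.50
```
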
Implications for Future Research

The findings from this paper imply that future research on code summarization should take a multi-faceted approach to evaluating models. In particular, researchers should:

  • Utilize a comprehensive set of evaluation metrics, ensuring that the specifics of each metric variant are clearly stated to avoid misinterpretations.
  • Experiment with various pre-processing strategies to identify optimal configurations, and consider ensemble methods that leverage the strengths of different approaches (a small pre-processing sketch follows this list).
  • Select multiple datasets with varied characteristics to assess the robustness and generalizability of code summarization models.
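
As an illustration of the pre-processing recommendation, the sketch below applies a configurable combination of operations commonly studied in this setting (identifier splitting, lowercasing, and replacing string literals). The helper names and the exact set of operations are assumptions for illustration, not the paper's toolbox API.

```python
# Illustrative code-token pre-processing with configurable operations (assumed helpers).
import re

def split_identifier(token: str) -> list[str]:
    """Split camelCase and snake_case identifiers into sub-tokens."""
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", token)
    return [p for p in parts if p]

def preprocess(tokens: list[str],
               split_identifiers: bool = True,
               lowercase: bool = True,
               replace_strings: bool = True) -> list[str]:
    """Apply a chosen combination of pre-processing operations to code tokens."""
    out: list[str] = []
    for tok in tokens:
        if replace_strings and tok.startswith('"') and tok.endswith('"'):
            tok = "<STR>"  # collapse string literals to a placeholder token
        subs = split_identifier(tok) if split_identifiers else [tok]
        if lowercase:
            subs = [s.lower() for s in subs]
        out.extend(subs)
    return out

code_tokens = ["public", "int", "getMaxValue", "(", "String", "msg", ")", "{", '"done"', "}"]
print(preprocess(code_tokens))
# ['public', 'int', 'get', 'max', 'value', '(', 'string', 'msg', ')', '{', '<str>', '}']
```

Toggling the keyword arguments yields the different pre-processing combinations whose interaction with each model the paper evaluates.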

Conclusion

This paper provides actionable insights that enhance the integrity of evaluations in the domain of neural code summarization. It charts a course for improving the reliability of these models by emphasizing the importance of metric precision, diversified pre-processing, and comprehensive dataset selection. Looking ahead, integrating these practices can lead to more effective and generalizable code summarization solutions, ultimately facilitating better program comprehension and maintenance.
