Evaluating Code Summarization Techniques: A New Metric and an Empirical Characterization (2312.15475v1)

Published 24 Dec 2023 in cs.SE

Abstract: Several code summarization techniques have been proposed in the literature to automatically document a code snippet or a function. Ideally, software developers should be involved in assessing the quality of the generated summaries. In most cases, however, researchers rely on automatic evaluation metrics such as BLEU, ROUGE, and METEOR. These metrics all rest on the same assumption: the higher the textual similarity between the generated summary and a reference summary written by developers, the higher its quality. This assumption falls short for two reasons: (i) reference summaries, e.g., code comments collected by mining software repositories, may be of low quality or even outdated; (ii) a generated summary, while worded differently from the reference, could be semantically equivalent to it and thus still suitable to document the code snippet. In this paper, we perform a thorough empirical investigation of the complementarity of different types of metrics in capturing the quality of a generated summary. We also propose to address the limitations of existing metrics by considering a new dimension: the extent to which the generated summary aligns with the semantics of the documented code snippet, independently of the reference summary. To this end, we present a new metric based on contrastive learning to capture this aspect. We empirically show that including this novel dimension enables a more effective representation of developers' evaluations of the quality of automatically generated summaries.
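The failure mode described in point (ii) can be sketched with a toy version of the textual-similarity assumption. Real metrics like BLEU are more involved (brevity penalty, multiple n-gram orders, smoothing), and the paper's own setup is not reproduced here; this hypothetical unigram precision only illustrates how a semantically equivalent paraphrase can score well below an exact match:

```python
# Toy illustration (not the paper's metric): reference-based similarity
# scoring penalizes paraphrases even when the meaning is preserved.
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference,
    clipped by reference counts (as in BLEU's modified precision)."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    matched = sum(
        min(count, ref_counts[token])
        for token, count in Counter(cand_tokens).items()
    )
    return matched / len(cand_tokens)

reference = "returns the index of the first matching element"
paraphrase = "gets the position of the element that matches first"
identical = "returns the index of the first matching element"

# The paraphrase documents the code equally well, yet scores far below 1.0.
print(unigram_precision(paraphrase, reference))  # well below 1.0
print(unigram_precision(identical, reference))   # exactly 1.0
```

This is the gap the proposed metric targets: instead of comparing against a possibly low-quality reference, it checks (via contrastive learning) whether the summary aligns with the semantics of the code itself.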

Authors (4)
  1. Antonio Mastropaolo (25 papers)
  2. Matteo Ciniselli (11 papers)
  3. Massimiliano Di Penta (31 papers)
  4. Gabriele Bavota (60 papers)
Citations (11)