Evaluation of large language models for assessing code maintainability (2401.12714v1)
Abstract: Increased availability of open-source software repositories and recent advances in code analysis using LLMs have triggered a wave of new work to automate software engineering tasks that were previously very difficult to automate. In this paper, we investigate a recent line of work that hypothesises that comparing the probability of code generated by LLMs with the probability the current code would have had can indicate potential quality problems. We investigate the association between the cross-entropy of code generated by ten different models (based on GPT2 and Llama2) and the following quality aspects: readability, understandability, complexity, modularisation, and overall maintainability, as assessed by experts and available in a benchmark dataset. Our results show that, controlling for the number of logical lines of code (LLOC), cross-entropy computed by LLMs is indeed a predictor of maintainability at the class level (the higher the cross-entropy, the lower the maintainability). However, this relation is reversed when one does not control for LLOC (e.g., when comparing small classes with longer ones). Furthermore, while the complexity of LLMs affects the range of cross-entropy (smaller models tend to have a wider range of cross-entropy), this plays a significant role in predicting maintainability aspects. Our study is limited to ten different pretrained models (based on GPT2 and Llama2) and to the maintainability aspects collected by Schnappinger et al. When controlling for logical lines of code (LLOC), cross-entropy is a predictor of maintainability. However, while related work has shown the potential usefulness of cross-entropy at the level of tokens or short sequences, at the class level this criterion alone may prove insufficient to predict maintainability, and further research is needed to make best use of this information in practice.
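As a rough illustration of the quantity the abstract refers to, the sketch below computes a cross-entropy score for a piece of code from the probabilities a language model assigns to each of its tokens. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `cross_entropy`, the use of base-2 logarithms (bits per token), and the two hard-coded probability lists are all hypothetical and chosen only to make the arithmetic easy to follow. In practice the probabilities would come from a model such as GPT2 or Llama2.

```python
import math
from typing import Sequence


def cross_entropy(token_probs: Sequence[float]) -> float:
    """Mean negative log2-probability per token (bits per token).

    Assumption: token_probs holds the probability the model assigned to
    each token that actually occurs in the code, in order.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


# Hypothetical per-token probabilities for two classes (illustrative only).
predictable_class = [0.5, 0.5, 0.25, 0.5]       # model finds the code unsurprising
surprising_class = [0.125, 0.25, 0.125, 0.0625]  # model finds the code surprising

print(cross_entropy(predictable_class))
print(cross_entropy(surprising_class))
```

A higher value means the model found the code less predictable. Because the score is averaged per token, comparing classes of very different size can still be misleading, which is why the abstract stresses controlling for LLOC before interpreting it.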
- A survey of machine learning for big code and naturalness. ACM Comput. Surv., 51(4), Jul. 2018.
- Exploring the relationships between design measures and software quality in object-oriented systems. Journal of Systems and Software, 51(3):245–273, 2000.
- Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006. ROC Analysis in Pattern Recognition.
- A critique of software defect prediction models. IEEE Transactions on Software Engineering, 25(5):675–689, 1999.
- Measuring code maintainability with deep neural networks. Frontiers of Computer Science, 17, Jan. 2023.
- Judea Pearl. Understanding Simpson’s paradox. Technical report, Computer Science Department, University of California, Los Angeles, 2013.
- Defining a software maintainability dataset: Collecting, aggregating and analysing expert evaluations of software maintainability. In ICSME 2020. IEEE International Conference on Software Maintenance and Evolution. IEEE, 2020.
- A software maintainability dataset, Sep. 2020.
- A preliminary study on using text- and image-based machine learning to predict software maintainability. In Daniel Mendez, Manuel Wimmer, Dietmar Winkler, Stefan Biffl, and Johannes Bergsmann, editors, Software Quality: The Next Big Thing in Software Engineering and Quality, pages 41–60, Cham, 2022. Springer International Publishing.
- Neural language models for code quality identification. In Proceedings of the 6th International Workshop on Machine Learning Techniques for Software Quality Evaluation, pages 5–10. ACM, November 2022.
- Martin Shepperd. A critique of cyclomatic complexity as a software metric. Software Engineering Journal, 3(2):30–36, 1988.
- Data-driven technical debt management: Software engineering or data science challenge? IEEE Software, 38(6):59–64, 2021.
- A survey of deep learning models for structural code understanding. arXiv preprint arXiv:2205.01293, 2022.
- A survey on deep learning for software engineering. ACM Computing Surveys (CSUR), 54(10s):1–73, 2022.