Evaluation of large language models for assessing code maintainability (2401.12714v1)

Published 23 Jan 2024 in cs.SE and cs.AI

Abstract: The increased availability of open-source software repositories and recent advances in code analysis using LLMs have triggered a wave of new work to automate software engineering tasks that were previously very difficult to automate. In this paper, we investigate a recent line of work that hypothesises that comparing the probability of code generated by LLMs with the probability the current code would have had can indicate potential quality problems. We investigate the association between the cross-entropy of code generated by ten different models (based on GPT2 and Llama2) and the following quality aspects: readability, understandability, complexity, modularisation, and overall maintainability, as assessed by experts and available in a benchmark dataset. Our results show that, controlling for the number of logical lines of code (LLOC), cross-entropy computed by LLMs is indeed a predictor of maintainability at the class level (the higher the cross-entropy, the lower the maintainability). However, this relation is reversed when one does not control for LLOC (e.g., when comparing short classes with longer ones). Furthermore, while the complexity of LLMs affects the range of cross-entropy (smaller models tend to have a wider range of cross-entropy), it plays no significant role in predicting maintainability aspects. Our study is limited to ten different pretrained models (based on GPT2 and Llama2) and to the maintainability aspects collected by Schnappinger et al. When controlling for logical lines of code (LLOC), cross-entropy is a predictor of maintainability. However, while related work has shown the potential usefulness of cross-entropy at the level of tokens or short sequences, at the class level this criterion alone may prove insufficient to predict maintainability, and further research is needed to make the best use of this information in practice.
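The core measurement behind these results is straightforward to reproduce. Below is a minimal sketch, assuming a Hugging Face causal language model (GPT-2 here as a stand-in for the paper's GPT2- and Llama2-based models): it scores a class by the mean per-token cross-entropy the model assigns to its source code, then relates that score to an expert maintainability rating while controlling for LLOC via a simple regression. The function names, the 1024-token truncation, the data-frame column names, and the use of ordinary least squares are illustrative assumptions, not the authors' exact pipeline.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def class_cross_entropy(source: str) -> float:
    # Mean per-token cross-entropy (in nats) the model assigns to the class source.
    enc = tokenizer(source, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()

# Relate the score to expert ratings while controlling for LLOC.
# Assumed (hypothetical) columns: "cross_entropy", "lloc", "maintainability".
import pandas as pd
import statsmodels.api as sm

def fit_controlled(df: pd.DataFrame):
    X = sm.add_constant(df[["cross_entropy", "lloc"]])
    return sm.OLS(df["maintainability"], X).fit()  # inspect .params and .pvalues

Including LLOC as a covariate is the crucial step: as the abstract notes, the sign of the association between cross-entropy and maintainability flips when class length is not controlled for.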

References (14)
  1. A survey of machine learning for big code and naturalness. ACM Computing Surveys, 51(4), July 2018.
  2. Exploring the relationships between design measures and software quality in object-oriented systems. Journal of Systems and Software, 51(3):245–273, 2000.
  3. Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.
  4. A critique of software defect prediction models. IEEE Transactions on Software Engineering, 25(5):675–689, 1999.
  5. Measuring code maintainability with deep neural networks. Frontiers of Computer Science, 17, January 2023.
  6. Judea Pearl. Understanding Simpson's paradox. Technical report, Computer Science Department, University of California, Los Angeles, 2013.
  7. Defining a software maintainability dataset: Collecting, aggregating and analysing expert evaluations of software maintainability. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2020.
  8. A software maintainability dataset, September 2020.
  9. A preliminary study on using text- and image-based machine learning to predict software maintainability. In Daniel Mendez, Manuel Wimmer, Dietmar Winkler, Stefan Biffl, and Johannes Bergsmann, editors, Software Quality: The Next Big Thing in Software Engineering and Quality, pages 41–60, Cham, 2022. Springer International Publishing.
  10. Neural language models for code quality identification. In Proceedings of the 6th International Workshop on Machine Learning Techniques for Software Quality Evaluation, pages 5–10. ACM, November 2022.
  11. Martin Shepperd. A critique of cyclomatic complexity as a software metric. Software Engineering Journal, 3(2):30–36, 1988.
  12. Data-driven technical debt management: Software engineering or data science challenge? IEEE Software, 38(6):59–64, 2021.
  13. A survey of deep learning models for structural code understanding. arXiv preprint arXiv:2205.01293, 2022.
  14. A survey on deep learning for software engineering. ACM Computing Surveys, 54(10s):1–73, 2022.