A Performance Study of LLM-Generated Code on Leetcode (2407.21579v1)
Abstract: This study evaluates the efficiency of code generated by LLMs and measures its performance against human-crafted solutions, using a dataset of Leetcode problems. We compare 18 LLMs, considering factors such as model temperature and success rate, and their impact on code performance. We introduce a novel method for measuring and comparing the speed of LLM-generated code, revealing that the generated solutions perform comparably regardless of which LLM produced them. We also find that LLMs are capable of generating code that is, on average, more efficient than code written by humans. The paper further discusses the use of Leetcode as a benchmarking dataset, the limitations imposed by potential data contamination, and the reliability of the platform's runtime measurements. We believe that our findings contribute to a better understanding of LLM capabilities in code generation and set the stage for future optimizations in the field.
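The paper's actual benchmarking pipeline is not reproduced on this page. As a purely illustrative sketch (the sample problem, function names, input sizes, and repetition counts below are assumptions, not the authors' method), timing an LLM-generated solution against a human-written one for the same Leetcode-style task could look like this:

```python
import timeit
import statistics

# Two implementations of the same Leetcode-style task ("two sum").
# Which one is "LLM-generated" vs. "human-written" is purely illustrative.
def two_sum_generated(nums, target):
    # Hash-map approach: O(n)
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    return []

def two_sum_human(nums, target):
    # Brute-force approach: O(n^2)
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return [i, j]
    return []

def benchmark(func, nums, target, repeats=5):
    """Return the median wall-clock time (seconds) of a batch of calls."""
    times = timeit.repeat(lambda: func(nums, target), number=20, repeat=repeats)
    return statistics.median(times)

if __name__ == "__main__":
    nums = list(range(500))
    target = nums[-1] + nums[-2]  # forces the worst case for the brute-force variant
    for name, func in [("generated", two_sum_generated), ("human", two_sum_human)]:
        print(f"{name}: {benchmark(func, nums, target):.4f}s (median batch time)")
```

Relative timings from such a harness are only meaningful when both solutions run on the same machine under comparable load, which is one of the reliability concerns the paper raises about Leetcode's own runtime measurements.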
Authors: Tristan Coignion, Clément Quinton, Romain Rouvoy