A Performance Study of LLM-Generated Code on Leetcode (2407.21579v1)

Published 31 Jul 2024 in cs.SE and cs.AI

Abstract: This study evaluates the efficiency of code generation by LLMs and measures their performance against human-crafted solutions using a dataset from Leetcode. We compare 18 LLMs, considering factors such as model temperature and success rate, and their impact on code performance. This research introduces a novel method for measuring and comparing the speed of LLM-generated code, revealing that LLMs produce code with comparable performance, irrespective of the adopted LLM. We also find that LLMs are capable of generating code that is, on average, more efficient than the code written by humans. The paper further discusses the use of Leetcode as a benchmarking dataset, the limitations imposed by potential data contamination, and the platform's measurement reliability. We believe that our findings contribute to a better understanding of LLM capabilities in code generation and set the stage for future optimizations in the field.
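The abstract's central idea, timing LLM-generated solutions against human-written ones on the same problems, can be illustrated with a small benchmark harness. The sketch below is a minimal, hypothetical Python example and not the paper's actual methodology: the two `two_sum` implementations stand in for an LLM-generated and a human-written Leetcode solution, and the mean wall-clock runtime over repeated runs stands in for the platform's performance measurement.

```python
# Hypothetical timing harness sketch: compare the mean runtime of an
# "LLM-generated" solution against a "human-written" one on shared inputs.
# All function names and test data here are illustrative, not from the paper.
import statistics
import time
from typing import Callable, List


def llm_two_sum(nums: List[int], target: int) -> List[int]:
    # Stand-in for an LLM-generated solution: hash-map lookup, O(n).
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
    return []


def human_two_sum(nums: List[int], target: int) -> List[int]:
    # Stand-in for a human-written solution: brute force, O(n^2).
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return [i, j]
    return []


def mean_runtime(solution: Callable, args: tuple, repetitions: int = 20) -> float:
    # Run the solution several times and return its mean runtime in seconds,
    # smoothing out measurement noise as a benchmarking platform would.
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        solution(*args)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


if __name__ == "__main__":
    nums = list(range(2_000))
    target = nums[-1] + nums[-2]
    t_llm = mean_runtime(llm_two_sum, (nums, target))
    t_human = mean_runtime(human_two_sum, (nums, target))
    print(f"LLM-generated: {t_llm:.6f}s, human-written: {t_human:.6f}s, "
          f"speedup: {t_human / t_llm:.1f}x")
```

In practice a study like this repeats such measurements across many problems, models, and temperature settings before comparing aggregate results; the single-problem comparison above only shows the shape of the measurement.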

Authors (3)
  1. Tristan Coignion
  2. Clément Quinton
  3. Romain Rouvoy