
EffiBench: Benchmarking the Efficiency of Automatically Generated Code

Published 3 Feb 2024 in cs.SE and cs.CL (arXiv:2402.02037v6)

Abstract: Code generation models have become increasingly integral to software development. Although current research has thoroughly examined the correctness of the code these models produce, a vital aspect of green computing and sustainability efforts has often been neglected: efficiency. This paper presents EffiBench, a benchmark of 1,000 efficiency-critical coding problems for assessing the efficiency of code generated by code generation models. EffiBench contains a diverse set of LeetCode coding problems, each paired with an executable human-written canonical solution that achieves state-of-the-art efficiency on the LeetCode solution leaderboard. With EffiBench, we empirically examine the ability of 42 LLMs (35 open-source and 7 closed-source) to generate efficient code. Our evaluation results demonstrate that the code generated by LLMs is generally less efficient than the human-written canonical solutions. For example, the average execution time of GPT-4-generated code is 3.12 times that of the human-written canonical solutions. In the most extreme cases, the execution time and total memory usage of GPT-4-generated code are 13.89 and 43.92 times those of the canonical solutions, respectively. The source code of EffiBench is released at https://github.com/huangd1999/EffiBench. We also provide a leaderboard at https://huggingface.co/spaces/EffiBench/effibench-leaderboard.
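The kind of comparison the abstract reports (execution-time and memory ratios between a generated solution and a canonical one) can be sketched with Python's standard `timeit` and `tracemalloc` modules. This is an illustrative harness under assumed conventions, not the authors' actual benchmark code, and the `canonical`/`generated` functions below are a made-up example problem (sum of the first n squares), not drawn from EffiBench.

```python
import timeit
import tracemalloc

def measure(fn, *args, repeats=5):
    """Return (best wall-clock time in seconds, peak traced memory in bytes)
    for one call of fn(*args). Illustrative only, not the EffiBench harness."""
    elapsed = min(timeit.repeat(lambda: fn(*args), number=1, repeat=repeats))
    tracemalloc.start()
    fn(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak

# Hypothetical problem: sum of squares 0^2 + 1^2 + ... + n^2.
def canonical(n):
    # Closed-form solution, O(1) time.
    return n * (n + 1) * (2 * n + 1) // 6

def generated(n):
    # Naive loop a model might produce, O(n) time.
    return sum(i * i for i in range(n + 1))

t_canon, m_canon = measure(canonical, 10**6)
t_gen, m_gen = measure(generated, 10**6)
print(f"execution-time ratio (generated / canonical): {t_gen / t_canon:.1f}x")
```

Reporting the ratio of the generated solution's cost to the canonical solution's cost, as above, is what makes numbers like "3.12 times the execution time" comparable across problems of very different absolute sizes.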

