
EffiBench: Benchmarking the Efficiency of Automatically Generated Code (2402.02037v5)

Published 3 Feb 2024 in cs.SE and cs.CL

Abstract: Code generation models have increasingly become integral to aiding software development. Although current research has thoroughly examined the correctness of the code produced by code generation models, efficiency, a vital aspect that plays a pivotal role in green computing and sustainability efforts, has often been neglected. This paper presents EffiBench, a benchmark of 1,000 efficiency-critical coding problems for assessing the efficiency of code generated by code generation models. EffiBench contains a diverse set of LeetCode coding problems, each paired with an executable human-written canonical solution that achieves state-of-the-art efficiency on the LeetCode solution leaderboard. With EffiBench, we empirically examine the ability of 42 LLMs (35 open-source and 7 closed-source) to generate efficient code. Our evaluation results demonstrate that the efficiency of code generated by LLMs is generally worse than that of human-written canonical solutions. For example, GPT-4-generated code has an average execution time 3.12 times that of the human-written canonical solutions. In the most extreme cases, the execution time and total memory usage of GPT-4-generated code are 13.89 and 43.92 times those of the canonical solutions. The source code of EffiBench is released at https://github.com/huangd1999/EffiBench. We also provide a leaderboard at https://huggingface.co/spaces/EffiBench/effibench-leaderboard.
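
For intuition, the comparison EffiBench reports can be pictured as timing and memory-profiling a model-generated solution against the human-written canonical one on the same input. The sketch below illustrates that idea in Python; the two Two Sum implementations and the `profile` harness are illustrative assumptions for a single problem, not EffiBench's actual evaluation pipeline.

```python
import time
import tracemalloc


def profile(solution, test_input):
    """Run one solution on a test input; return (elapsed seconds, peak bytes allocated during the run)."""
    tracemalloc.start()
    start = time.perf_counter()
    solution(*test_input)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak


# Hypothetical solutions to LeetCode's "Two Sum", used purely for illustration.
def canonical_two_sum(nums, target):
    # Efficient O(n) solution using a hash map, standing in for a canonical solution.
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i


def generated_two_sum(nums, target):
    # A plausible but less efficient O(n^2) solution, standing in for model-generated code.
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return [i, j]


if __name__ == "__main__":
    case = (list(range(5000)), 9997)  # worst case: the matching pair is at the end
    t_can, m_can = profile(canonical_two_sum, case)
    t_gen, m_gen = profile(generated_two_sum, case)
    # Ratios analogous to the execution-time and memory comparisons EffiBench reports.
    print(f"execution time ratio: {t_gen / t_can:.2f}x")
    print(f"peak memory ratio:    {m_gen / m_can:.2f}x")
```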

Authors (5)
  1. Dong Huang (102 papers)
  2. Jie M. Zhang (39 papers)
  3. Yuhao Qing (11 papers)
  4. Heming Cui (29 papers)
  5. Weiyi Shang (17 papers)
Citations (16)