EffiBench: Benchmarking the Efficiency of Automatically Generated Code (2402.02037v5)
Abstract: Code generation models have become increasingly integral to software development. Although current research has thoroughly examined the correctness of the code these models produce, a vital aspect, the efficiency of the generated code, plays a pivotal role in green computing and sustainability efforts yet has often been neglected. This paper presents EffiBench, a benchmark of 1,000 efficiency-critical coding problems for assessing the efficiency of code generated by code generation models. EffiBench contains a diverse set of LeetCode coding problems, each paired with an executable human-written canonical solution that achieves state-of-the-art (SOTA) efficiency on the LeetCode solution leaderboard. With EffiBench, we empirically examine the ability of 42 LLMs (35 open-source and 7 closed-source) to generate efficient code. Our evaluation results demonstrate that code generated by LLMs is generally less efficient than the human-written canonical solutions. For example, GPT-4-generated code requires on average **3.12** times the execution time of the human-written canonical solutions. In the most extreme cases, the execution time and total memory usage of GPT-4-generated code are **13.89** and **43.92** times those of the canonical solutions. The source code of EffiBench is released at https://github.com/huangd1999/EffiBench. We also provide a leaderboard at https://huggingface.co/spaces/EffiBench/effibench-leaderboard.
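The core comparison the benchmark performs, profiling a generated solution and its canonical counterpart on identical inputs and reporting time and memory ratios, can be illustrated with a small Python sketch. This is a minimal, hypothetical stand-in rather than EffiBench's actual harness: the names `profile_solution` and `efficiency_ratio` are illustrative, and it times code with `time.perf_counter` and tracks peak memory with `tracemalloc`.

```python
import time
import tracemalloc

def profile_solution(solution_fn, test_inputs):
    """Return (total execution time in seconds, peak memory in bytes)
    for running `solution_fn` over every input in `test_inputs`.
    Hypothetical sketch, not EffiBench's actual API."""
    tracemalloc.start()
    start = time.perf_counter()
    for args in test_inputs:
        solution_fn(*args)
    elapsed = time.perf_counter() - start
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak

def efficiency_ratio(generated_fn, canonical_fn, test_inputs):
    """Profile both solutions on the same inputs and return
    (time ratio, memory ratio); a value above 1 means the
    generated code is less efficient than the canonical one."""
    gen_time, gen_mem = profile_solution(generated_fn, test_inputs)
    can_time, can_mem = profile_solution(canonical_fn, test_inputs)
    return gen_time / can_time, gen_mem / can_mem

# Example: two implementations of LeetCode's "Two Sum".
def two_sum_naive(nums, target):   # O(n^2): what a model might emit
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return [i, j]

def two_sum_hash(nums, target):    # O(n): the efficient canonical style
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i

inputs = [(list(range(5_000)), 9_997)]
print(efficiency_ratio(two_sum_naive, two_sum_hash, inputs))
```

A production harness would additionally run each solution in an isolated subprocess with timeouts and repeated trials to reduce measurement noise; this sketch keeps everything in-process for brevity.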
- LeetCode. https://leetcode.com/. Accessed: January 31, 2024.
- Unified pre-training for program understanding and generation. ArXiv, abs/2103.06333, 2021. URL https://api.semanticscholar.org/CorpusID:232185260.
- Program synthesis with large language models. ArXiv, abs/2108.07732, 2021. URL https://api.semanticscholar.org/CorpusID:237142385.
- Hiring is broken: What do developers say about technical interviews? In 2019 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 1–9. IEEE, 2019.
- Bell, B. A. Understanding the Preparation Phase of Technical Interviews. PhD thesis, Virginia Tech, 2023.
- MultiPL-E: A scalable and extensible approach to benchmarking neural code generation. 2022. URL https://api.semanticscholar.org/CorpusID:254854172.
- Evaluating large language models trained on code. ArXiv, abs/2107.03374, 2021. URL https://api.semanticscholar.org/CorpusID:235755472.
- PaLM: Scaling language modeling with pathways. ArXiv, abs/2204.02311, 2022. URL https://api.semanticscholar.org/CorpusID:247951931.
- PanGu-Coder: Program synthesis with function-level language modeling. ArXiv, abs/2207.11280, 2022. URL https://api.semanticscholar.org/CorpusID:251040785.
- CodeBERT: A pre-trained model for programming and natural languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1536–1547, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.139. URL https://aclanthology.org/2020.findings-emnlp.139.
- InCoder: A generative model for code infilling and synthesis. ArXiv, abs/2204.05999, 2022. URL https://api.semanticscholar.org/CorpusID:248157108.
- Gemini Team, Google. Gemini: A family of highly capable multimodal models. ArXiv, abs/2312.11805, 2023. URL https://api.semanticscholar.org/CorpusID:266361876.
- GraphCodeBERT: Pre-training code representations with data flow. ArXiv, abs/2009.08366, 2020.
- AixBench: A code generation benchmark dataset. ArXiv, abs/2206.13179, 2022. URL https://api.semanticscholar.org/CorpusID:250072468.
- Harper, J. Interview insight: How to get the job. In A Software Engineer’s Guide to Seniority: A Guide to Technical Leadership, pp. 19–28. Springer, 2022.
- Measuring coding challenge competence with APPS. NeurIPS, 2021.
- Mistral 7B. ArXiv, abs/2310.06825, 2023. URL https://api.semanticscholar.org/CorpusID:263830494.
- DS-1000: A natural and reliable benchmark for data science code generation. ArXiv, abs/2211.11501, 2022. URL https://api.semanticscholar.org/CorpusID:253734939.
- StarCoder: May the source be with you! ArXiv, abs/2305.06161, 2023a. URL https://api.semanticscholar.org/CorpusID:258588247.
- TACO: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852, 2023b.
- Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=1qvx610Cu7.
- WizardCoder: Empowering code large language models with Evol-Instruct. ArXiv, abs/2306.08568, 2023. URL https://api.semanticscholar.org/CorpusID:259164815.
- CodeGen: An open large language model for code with multi-turn program synthesis. ICLR, 2023.
- OpenAI. GPT-4 technical report. ArXiv, abs/2303.08774, 2023. URL https://api.semanticscholar.org/CorpusID:257532815.
- Code Llama: Open foundation models for code. ArXiv, abs/2308.12950, 2023. URL https://api.semanticscholar.org/CorpusID:261100919.
- Shyamasundar, R. K. Introduction to algorithms. Resonance, 1:14–24, 1996. URL https://api.semanticscholar.org/CorpusID:123556377.
- ReCode: Robustness evaluation of code generation models. ArXiv, abs/2212.10264, 2022. URL https://api.semanticscholar.org/CorpusID:254877229.
- CodeT5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922, 2023.
- Magicoder: Source code is all you need. ArXiv, abs/2312.02120, 2023. URL https://api.semanticscholar.org/CorpusID:265609970.
- CoderEval: A benchmark of pragmatic code generation with generative pre-trained models. ArXiv, abs/2302.00288, 2023. URL https://api.semanticscholar.org/CorpusID:256459413.
- CERT: Continual pre-training on sketches for library-oriented code generation. In International Joint Conference on Artificial Intelligence (IJCAI), 2022.
- CodeGeeX: A pre-trained model for code generation with multilingual evaluations on HumanEval-X. ArXiv, abs/2303.17568, 2023. URL https://api.semanticscholar.org/CorpusID:257834177.
Authors: Dong Huang, Jie M. Zhang, Yuhao Qing, Heming Cui, Weiyi Shang