On Evaluating the Efficiency of Source Code Generated by LLMs (2404.06041v1)
Published 9 Apr 2024 in cs.SE
Abstract: Recent years have seen the remarkable capabilities of LLMs for code generation. Unlike existing work, which evaluates the correctness of LLM-generated code, we propose to further evaluate its efficiency. More efficient generated code leads to better runtime performance in programs and software built with LLM-assisted programming. First, we evaluate the efficiency of the code generated by LLMs on two benchmarks, HumanEval and MBPP. Then, we select a set of programming problems from the online judge platform LeetCode to conduct a more challenging evaluation. Finally, we explore several prompts that enable LLMs to generate more efficient code.
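The kind of efficiency evaluation the abstract describes can be illustrated with a minimal sketch: check that a candidate solution is functionally correct against a reference, then compare wall-clock runtimes. The function names, the Fibonacci task, and the timing protocol below are illustrative assumptions, not the paper's actual benchmark harness.

```python
import time
import statistics

def fib_naive(n):
    # Exponential-time recursion, as an LLM might plausibly generate.
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

def fib_iterative(n):
    # Linear-time reference implementation.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def measure(func, arg, repeats=5):
    """Return the median wall-clock time of func(arg) over several runs."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# Correctness first (as in correctness-only benchmarks), then efficiency.
assert fib_naive(20) == fib_iterative(20)
t_naive = measure(fib_naive, 25)
t_iter = measure(fib_iterative, 25)
print(f"naive: {t_naive:.6f}s, iterative: {t_iter:.6f}s")
```

Both solutions would pass a correctness-only benchmark; only a timing comparison like this distinguishes them, which is the gap the paper's evaluation targets.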
- Changan Niu
- Ting Zhang
- Chuanyi Li
- Bin Luo
- Vincent Ng