
Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM (2403.19114v1)

Published 28 Mar 2024 in cs.SE, cs.CL, cs.LG, and cs.PL

Abstract: LLMs have become the go-to choice for code generation tasks, with an exponential increase in the training, development, and usage of LLMs specifically for code generation. To evaluate the ability of LLMs on code, both academic and industry practitioners rely on popular handcrafted benchmarks. However, prior benchmarks contain only a very limited set of problems, both in quantity and variety. Further, due to popularity and age, many benchmarks are prone to data leakage where example solutions can be readily found on the web and thus potentially in training data. Such limitations inevitably lead us to inquire: Is the leaderboard performance on existing benchmarks reliable and comprehensive enough to measure the program synthesis ability of LLMs? To address this, we introduce EvoEval -- a program synthesis benchmark suite created by evolving existing benchmarks into different targeted domains for a comprehensive evaluation of LLM coding abilities. Our study on 51 LLMs shows that compared to the high performance obtained on standard benchmarks like HumanEval, there is a significant drop in performance (on average 39.4%) when using EvoEval. Additionally, the decrease in performance can range from 19.6% to 47.7%, leading to drastic ranking changes amongst LLMs and showing potential overfitting of existing benchmarks. Furthermore, we showcase various insights, including the brittleness of instruction-following models when encountering rewording or subtle changes as well as the importance of learning problem composition and decomposition. EvoEval not only provides comprehensive benchmarks, but can be used to further evolve arbitrary problems to keep up with advances and the ever-changing landscape of LLMs for code. We have open-sourced our benchmarks, tools, and complete LLM generations at https://github.com/evo-eval/evoeval
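The core loop the abstract describes — evolving an existing benchmark problem into a targeted variant via an LLM, then checking candidate solutions against tests — can be sketched roughly as below. This is a minimal illustration under assumptions, not the EvoEval implementation: the prompt template, the `query_llm` callable, and the `generated_tests_for` helper are placeholders introduced here for illustration only (the actual benchmarks and tooling are at https://github.com/evo-eval/evoeval).

```python
# Minimal sketch of "evolve a seed problem, then test LLM solutions".
# `query_llm` and `generated_tests_for` are hypothetical placeholders,
# not the EvoEval API (see https://github.com/evo-eval/evoeval).
from typing import Callable

EVOLVE_PROMPT = (
    "Rewrite the following programming problem into a harder variant in the "
    "'{domain}' domain. Keep the function signature self-contained.\n\n{seed}"
)

def evolve_problem(seed_prompt: str, domain: str,
                   query_llm: Callable[[str], str]) -> str:
    """Ask an LLM to transform a seed benchmark problem into a targeted variant."""
    return query_llm(EVOLVE_PROMPT.format(domain=domain, seed=seed_prompt))

def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Run a candidate solution against assert-based tests; report pass/fail."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function
        exec(test_code, namespace)       # execute asserts against it
        return True
    except Exception:
        return False

# Example usage with any LLM wrapper supplying `query_llm`:
# new_problem = evolve_problem(seed_task_prompt, "creative", query_llm)
# solution    = query_llm(new_problem)
# print(passes_tests(solution, generated_tests_for(new_problem)))
```

Aggregating such pass/fail results over many evolved problems is what produces the per-domain scores and the ranking shifts the abstract reports.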

Authors (3)
  1. Chunqiu Steven Xia (13 papers)
  2. Yinlin Deng (11 papers)
  3. Lingming Zhang (48 papers)
Citations (12)