LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code (2403.07974v2)

Published 12 Mar 2024 in cs.SE, cs.CL, and cs.LG

Abstract: LLMs applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces. Notably, our benchmark also focuses on a broader range of code related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation. Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and May 2024. We have evaluated 18 base LLMs and 34 instruction-tuned LLMs on LiveCodeBench. We present empirical findings on contamination, holistic performance comparisons, potential overfitting in existing benchmarks as well as individual model comparisons. We will release all prompts and model completions for further community analysis, along with a general toolkit for adding new scenarios and models.

Analysis of "LiveCodeBench: Holistic and Contamination-Free Evaluation of LLMs for Code"

"LiveCodeBench: Holistic and Contamination-Free Evaluation of LLMs for Code" presents a novel benchmarking framework tailored to assess the coding capabilities of LLMs. The benchmark aims to address several limitations identified in existing evaluation methods, particularly contamination and limited scope in assessing code-related tasks.

Core Contributions

This paper introduces LiveCodeBench, a benchmark that tackles two main challenges: first, contamination of evaluation datasets through overlap with training data, and second, the narrow focus of existing benchmarks on code generation alone. LiveCodeBench proposes a continuous evaluation framework that adds problems from LeetCode, AtCoder, and CodeForces as new contests are published, and scores each model only on problems released after its training cutoff. This ensures that LLMs are evaluated on problems they are unlikely to have encountered during training, mitigating contamination risks. Furthermore, evaluation extends beyond code generation to self-repair, code execution, and test output prediction, tasks that reflect the multifaceted programming abilities required in real-world scenarios.
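
The contamination mitigation rests on a simple time-window rule: a model is scored only on problems released after its training cutoff. Below is a minimal sketch of that filtering step, assuming hypothetical problem records; the IDs, field names, and cutoff date are placeholders, not the paper's actual data schema.

```python
from datetime import date

# Hypothetical problem records; LiveCodeBench tags each problem with the
# contest date on the source platform (field names here are assumptions).
problems = [
    {"id": "lc-example-1", "platform": "LeetCode",   "released": date(2024, 1, 14)},
    {"id": "ac-example-2", "platform": "AtCoder",    "released": date(2023, 12, 16)},
    {"id": "cf-example-3", "platform": "CodeForces", "released": date(2023, 5, 20)},
]

# Assumed training cutoff of the model under evaluation.
model_cutoff = date(2023, 9, 1)

# Keep only problems released after the cutoff, so neither the problems nor
# their editorial solutions can have appeared in the model's training data.
eval_set = [p for p in problems if p["released"] > model_cutoff]

for p in eval_set:
    print(f"{p['platform']}: {p['id']} (released {p['released'].isoformat()})")
```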

Empirical Findings and Benchmarks

The authors evaluate 18 base and 34 instruction-tuned LLMs across the different task scenarios, revealing several insightful observations. Notably, likely contamination of training data is demonstrated through a marked drop in the performance of DeepSeek models on coding problems released after the models' training cutoff dates, which reinforces the need for LiveCodeBench's continuously refreshed test set. The diverse task scenarios also expose differences in model capabilities: Claude-3-Opus, for example, shows particular strength on code execution, suggesting variation in aptitude across code-related tasks that standard generation-only benchmarks fail to capture.
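
The contamination finding is easiest to see as a before/after comparison of pass rates around a model's cutoff date. The sketch below uses the standard unbiased pass@k estimator (the paper reports pass@1 computed this way) with hypothetical per-problem counts; the cutoff date and the numbers are made up for illustration.

```python
import math
from datetime import date

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations of which c are correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical per-problem results: (release date, generations sampled, correct).
results = [
    (date(2023, 6, 10), 10, 7),
    (date(2023, 7, 2), 10, 6),
    (date(2023, 10, 5), 10, 2),
    (date(2023, 11, 19), 10, 1),
]

cutoff = date(2023, 9, 1)  # assumed training cutoff of the model under test

def mean_pass1(rows):
    return sum(pass_at_k(n, c, 1) for _, n, c in rows) / len(rows)

before = [r for r in results if r[0] <= cutoff]
after = [r for r in results if r[0] > cutoff]

print(f"pass@1 on problems before the cutoff: {mean_pass1(before):.2f}")
print(f"pass@1 on problems after the cutoff:  {mean_pass1(after):.2f}")
# A sharp drop on post-cutoff problems is the contamination signal the paper
# reports for the DeepSeek models.
```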

Furthermore, performance on HumanEval, a commonly used coding benchmark, appears to overestimate the capabilities of several models relative to their LiveCodeBench results. This discrepancy suggests that HumanEval does not fully assess the broader coding capabilities required in practice, underscoring the importance of LiveCodeBench's holistic evaluation approach.
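
One way to surface such overfitting is to compare how models rank on HumanEval versus LiveCodeBench and to inspect the score gap between the two. The sketch below does this with hypothetical scores; the model names and numbers are placeholders, not figures from the paper.

```python
# Hypothetical pass@1 scores (percent); the paper's tables hold the real numbers.
scores = {
    "model-A": {"humaneval": 82.0, "livecodebench": 34.0},
    "model-B": {"humaneval": 75.0, "livecodebench": 41.0},
    "model-C": {"humaneval": 88.0, "livecodebench": 30.0},
}

def ranking(benchmark: str) -> list[str]:
    """Models ordered best-to-worst on the given benchmark."""
    return sorted(scores, key=lambda m: scores[m][benchmark], reverse=True)

print("HumanEval ranking:    ", ranking("humaneval"))
print("LiveCodeBench ranking:", ranking("livecodebench"))

# A large positive gap (HumanEval score far above LiveCodeBench score) is one
# signal of possible overfitting to HumanEval-style problems.
for model, s in scores.items():
    gap = s["humaneval"] - s["livecodebench"]
    print(f"{model}: HumanEval - LiveCodeBench gap = {gap:.1f} points")
```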

Implications and Future Work

The proposed LiveCodeBench has several implications for future research and development of code LLMs. By maintaining a live, continuously updated test set, it keeps evaluations aligned with current and realistic coding challenges, encouraging model development focused on general and robust code understanding rather than narrow task specialization or overfitting to specific datasets. The observed performance differences across scenarios also point to promising directions for tuning models on varied facets of programming, potentially leading to more comprehensive programming assistance tools.

Looking forward, extending LiveCodeBench to include more diverse problem domains—beyond competition programming—and supporting multiple programming languages can further enhance its utility and applicability in diverse software engineering contexts. Integrating these aspects would provide a more nuanced understanding of LLMs' capabilities and push for developments that more closely align with the real-world utility of such models in complex, multi-language coding environments.

In conclusion, LiveCodeBench sets a new precedent for evaluating code LLMs by addressing critical evaluation challenges and extending the scope of assessment. This provides a stronger foundation for understanding model strengths and gaps, guiding future advancements in AI that better meet the nuanced demands of software development.

Authors (10)
  1. Naman Jain
  2. King Han
  3. Alex Gu
  4. Wen-Ding Li
  5. Fanjia Yan
  6. Tianjun Zhang
  7. Sida Wang
  8. Armando Solar-Lezama
  9. Koushik Sen
  10. Ion Stoica