LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Published 12 Mar 2024 in cs.SE, cs.CL, and cs.LG (arXiv:2403.07974v2)

Abstract: LLMs applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces. Notably, our benchmark also focuses on a broader range of code-related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation. Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and May 2024. We have evaluated 18 base LLMs and 34 instruction-tuned LLMs on LiveCodeBench. We present empirical findings on contamination, holistic performance comparisons, potential overfitting in existing benchmarks as well as individual model comparisons. We will release all prompts and model completions for further community analysis, along with a general toolkit for adding new scenarios and models.


Summary

  • The paper introduces LiveCodeBench, a comprehensive benchmark that uses new coding challenges to prevent evaluation contamination.
  • The methodology involves temporal segmentation and holistic testing across tasks like code generation, self-repair, code execution, and test output prediction.
  • Empirical findings show that open-source models improve with instruction tuning but still lag behind closed models like GPT-4, and that contamination and benchmark overfitting can inflate reported results.

LiveCodeBench: Holistic and Contamination Free Evaluation of LLMs for Code

Introduction

The use of LLMs for code generation and related applications has risen sharply, attracting interest from both academia and industry. Existing benchmarks such as HumanEval and MBPP fall short in evaluating the diverse capabilities of modern LLMs because of their limited scope and susceptibility to contamination. "LiveCodeBench: Holistic and Contamination Free Evaluation of LLMs for Code" introduces LiveCodeBench, a comprehensive benchmark that addresses these shortcomings by continuously incorporating new coding problems from LeetCode, AtCoder, and CodeForces, enabling contamination-free evaluation.

Benchmark Design and Implementation

Contamination-Free Evaluation

LiveCodeBench systematically avoids data contamination by evaluating each model only on problems released after its training-data cutoff date. This temporal segmentation ensures that evaluations are not biased by prior exposure to benchmark problems, as evidenced by the performance discrepancies of the DeepSeek-Instruct model on pre- versus post-September 2023 problems (Figure 1).

Figure 1: LiveCodeBench detects performance drops on problems released after a model's cutoff, as with DeepSeek-Instruct, indicating possible contamination on earlier problems.
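As a rough illustration of this temporal-segmentation idea, the sketch below filters a problem pool against a model's training cutoff before scoring. It is a minimal sketch, not the authors' released toolkit; the `Problem` record, field names, and dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    slug: str           # hypothetical identifier, e.g. "atcoder/abc300_c"
    release_date: date  # contest date on LeetCode, AtCoder, or CodeForces

def contamination_free_subset(problems: list[Problem], model_cutoff: date) -> list[Problem]:
    """Keep only problems released strictly after the model's training-data cutoff."""
    return [p for p in problems if p.release_date > model_cutoff]

# Example: a model with a (hypothetical) September 2023 cutoff is scored only on newer problems.
pool = [
    Problem("A", date(2023, 6, 1)),
    Problem("B", date(2023, 11, 15)),
    Problem("C", date(2024, 2, 20)),
]
print([p.slug for p in contamination_free_subset(pool, model_cutoff=date(2023, 9, 1))])
# -> ['B', 'C']
```

Rolling this filter forward as new contest problems are collected is what keeps the benchmark contamination-free for newly released models.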

Holistic Evaluation Framework

LiveCodeBench extends beyond traditional code generation benchmarks by including additional scenarios that reflect the multifaceted nature of programming:

  • Code Generation: Standard task of generating code from natural language descriptions.
  • Self-Repair: Evaluates a model's ability to debug and fix incorrect code based on execution errors.
  • Code Execution: Assesses program comprehension by asking the model to predict the output of a given code snippet on a given input.
  • Test Output Prediction: Challenges the model's reasoning by asking it to predict expected test outputs from the problem statement and input alone, without an implementation.

These scenarios, visualized in a radial plot, highlight the variable performance of models across different tasks (Figure 2).

Figure 2: Evaluation of LLMs across diverse coding-related scenarios, underscoring performance variability across tasks.
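The sketch below illustrates how the four scenarios differ in what the model is given and what it must produce. The prompt templates and function names are illustrative assumptions, not the benchmark's actual prompts.

```python
def code_generation_prompt(statement: str) -> str:
    # Natural-language problem statement -> program.
    return f"Solve the following problem in Python.\n\nProblem:\n{statement}\n"

def self_repair_prompt(statement: str, faulty_code: str, error: str) -> str:
    # Faulty program plus execution feedback -> corrected program.
    return ("The solution below is incorrect. Use the error message to fix it.\n\n"
            f"Problem:\n{statement}\n\nCode:\n{faulty_code}\n\nError:\n{error}\n")

def code_execution_prompt(code: str, test_input: str) -> str:
    # Concrete program plus input -> predicted output (tests code comprehension).
    return ("What does this program output for the given input?\n\n"
            f"Code:\n{code}\n\nInput:\n{test_input}\n")

def test_output_prediction_prompt(statement: str, test_input: str) -> str:
    # Problem statement plus input, with no implementation -> expected output.
    return ("Predict the expected output for the input, given only the problem statement.\n\n"
            f"Problem:\n{statement}\n\nInput:\n{test_input}\n")
```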

Quality and Diversity of Problems

LiveCodeBench hosts roughly 400 problems published between May 2023 and May 2024, with a balanced difficulty distribution. The benchmark benefits from the quality filtering inherent to competitive programming platforms, delivering reliable assessment metrics for LLMs.
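For instance, a difficulty-balanced split could be derived from platform ratings roughly as follows; the ratings, thresholds, and bucket names here are assumptions chosen purely for illustration.

```python
from collections import Counter

# Hypothetical platform difficulty ratings for collected problems.
ratings = {"p1": 900, "p2": 1400, "p3": 1900, "p4": 1250, "p5": 2100}

def bucket(rating: int) -> str:
    # Thresholds are illustrative assumptions, not the paper's.
    if rating < 1200:
        return "easy"
    if rating < 1800:
        return "medium"
    return "hard"

print(Counter(bucket(r) for r in ratings.values()))
# -> Counter({'medium': 2, 'hard': 2, 'easy': 1})
```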

Empirical Findings

Detecting Contamination

Empirical analysis reveals significant contamination effects in models such as DeepSeek, which show stark performance drops on problems released after their training cutoff, confirming LiveCodeBench's ability to surface contaminated evaluations.

Holistic Performance Insights

Overall, LiveCodeBench results show that open-source models, despite recent progress, still lag behind closed-access counterparts such as GPT-4 and Claude-3-Opus across the benchmark's scenarios. Notably, instruction-tuned variants improve markedly over their base models, underscoring the importance of fine-tuning data for LLM performance.

Comparison with HumanEval+ reveals potential overfitting, particularly in instruction-tuned models that perform well on isolated problems but falter on broader, varied tasks like those in LiveCodeBench (Figure 3).

Figure 3: Scatter plot contrasts model performance on HumanEval+ and LCB-Easy, indicating possible overfitting in open-access models.
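As a hypothetical illustration of such an overfitting check, the snippet below contrasts per-model pass@1 on the two benchmarks and flags unusually large gaps. The counts and scores are made up, and the pass@k estimator shown is the standard unbiased one commonly used for code benchmarks; treating it as the exact metric here is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn from n generations of which c are correct, passes all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical (n generations, c correct) counts per model and benchmark.
# (In practice pass@1 is averaged over problems; a single aggregate is used here for brevity.)
counts = {
    "model_a": {"humaneval_plus": (10, 8), "lcb_easy": (10, 5)},
    "model_b": {"humaneval_plus": (10, 7), "lcb_easy": (10, 7)},
}

# Flag models whose HumanEval+ pass@1 far exceeds their LCB-Easy pass@1.
for name, per_bench in counts.items():
    he = pass_at_k(*per_bench["humaneval_plus"], k=1)
    lcb = pass_at_k(*per_bench["lcb_easy"], k=1)
    gap = he - lcb
    verdict = "possible overfitting" if gap > 0.15 else "consistent"
    print(f"{name}: HumanEval+={he:.2f}, LCB-Easy={lcb:.2f}, gap={gap:.2f} ({verdict})")
```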

Conclusion

LiveCodeBench sets a new standard for evaluating code-focused LLMs through its dynamic and contamination-free framework. By leveraging real-world coding scenarios across multiple platforms, the benchmark provides invaluable insights into both the strengths and limitations of contemporary LLMs, fostering advancements in the development and fine-tuning of these models. The holistic evaluation not only facilitates fair model comparisons but also guides future research directions in refining code generation capabilities.
