LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Published 12 Mar 2024 in cs.SE, cs.CL, and cs.LG (arXiv:2403.07974v2)

Abstract: LLMs applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces. Notably, our benchmark also focuses on a broader range of code-related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation. Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and May 2024. We have evaluated 18 base LLMs and 34 instruction-tuned LLMs on LiveCodeBench. We present empirical findings on contamination, holistic performance comparisons, potential overfitting in existing benchmarks as well as individual model comparisons. We will release all prompts and model completions for further community analysis, along with a general toolkit for adding new scenarios and models.


Summary

  • The paper introduces LiveCodeBench, a comprehensive benchmark that uses new coding challenges to prevent evaluation contamination.
  • The methodology involves temporal segmentation and holistic testing across tasks like code generation, self-repair, code execution, and test output prediction.
  • Empirical findings show that open-source models improve with instruction tuning but still lag behind closed models like GPT-4, and that contamination and benchmark overfitting can inflate reported results.

LiveCodeBench: Holistic and Contamination Free Evaluation of LLMs for Code

Introduction

The use of LLMs for code generation and related applications has risen sharply, attracting interest from both academia and industry. Existing benchmarks such as HumanEval and MBPP fall short in evaluating the diverse capabilities of modern LLMs because of their limited scope and susceptibility to contamination. "LiveCodeBench: Holistic and Contamination Free Evaluation of LLMs for Code" introduces LiveCodeBench, a comprehensive benchmark that addresses these shortcomings by continuously incorporating new coding problems from LeetCode, AtCoder, and CodeForces, enabling contamination-free evaluation.

Benchmark Design and Implementation

Contamination-Free Evaluation

LiveCodeBench systematically avoids data contamination by evaluating each model only on problems released after its training-data cutoff date. This temporal segmentation ensures that evaluations are not biased by prior exposure to benchmark problems, as evidenced by the performance discrepancies of the DeepSeek-Instruct model on pre- versus post-September 2023 problems (Figure 1).

Figure 1: LiveCodeBench detects performance drops on problems released after a model's cutoff, as with DeepSeek-Instruct, indicating possible contamination on earlier problems.
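As a rough illustration of this temporal-segmentation idea, the sketch below filters a problem pool against a model's training cutoff before scoring. It is a minimal sketch, not the authors' released toolkit; the `Problem` record, field names, and dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    slug: str           # hypothetical identifier, e.g. "atcoder/abc300_c"
    release_date: date  # contest date on LeetCode, AtCoder, or CodeForces

def contamination_free_subset(problems: list[Problem], model_cutoff: date) -> list[Problem]:
    """Keep only problems released strictly after the model's training-data cutoff."""
    return [p for p in problems if p.release_date > model_cutoff]

# Example: a model with a (hypothetical) September 2023 cutoff is scored only on newer problems.
pool = [
    Problem("A", date(2023, 6, 1)),
    Problem("B", date(2023, 11, 15)),
    Problem("C", date(2024, 2, 20)),
]
print([p.slug for p in contamination_free_subset(pool, model_cutoff=date(2023, 9, 1))])
# -> ['B', 'C']
```

Rolling this filter forward as new contest problems are collected is what keeps the benchmark contamination-free for newly released models.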

Holistic Evaluation Framework

LiveCodeBench extends beyond traditional code generation benchmarks by including additional scenarios that reflect the multifaceted nature of programming:

  • Code Generation: Standard task of generating code from natural language descriptions.
  • Self-Repair: Evaluates a model's ability to debug and fix incorrect code based on execution errors.
  • Code Execution: Assesses program comprehension by asking the model to predict the output of a given code snippet on a given input.
  • Test Output Prediction: Challenges the model's reasoning by asking it to predict expected test outputs from the problem statement and input alone, without an implementation.

These scenarios, visualized in a radial plot, highlight the variable performance of models across different tasks (Figure 2).

Figure 2: Evaluation of LLMs across diverse coding-related scenarios, underscoring performance variability across tasks.
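The sketch below illustrates how the four scenarios differ in what the model is given and what it must produce. The prompt templates and function names are illustrative assumptions, not the benchmark's actual prompts.

```python
def code_generation_prompt(statement: str) -> str:
    # Natural-language problem statement -> program.
    return f"Solve the following problem in Python.\n\nProblem:\n{statement}\n"

def self_repair_prompt(statement: str, faulty_code: str, error: str) -> str:
    # Faulty program plus execution feedback -> corrected program.
    return ("The solution below is incorrect. Use the error message to fix it.\n\n"
            f"Problem:\n{statement}\n\nCode:\n{faulty_code}\n\nError:\n{error}\n")

def code_execution_prompt(code: str, test_input: str) -> str:
    # Concrete program plus input -> predicted output (tests code comprehension).
    return ("What does this program output for the given input?\n\n"
            f"Code:\n{code}\n\nInput:\n{test_input}\n")

def test_output_prediction_prompt(statement: str, test_input: str) -> str:
    # Problem statement plus input, with no implementation -> expected output.
    return ("Predict the expected output for the input, given only the problem statement.\n\n"
            f"Problem:\n{statement}\n\nInput:\n{test_input}\n")
```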

Quality and Diversity of Problems

LiveCodeBench hosts roughly 400 problems published between May 2023 and May 2024, with a balanced difficulty distribution. The benchmark benefits from the quality filtering inherent to competitive programming platforms, delivering reliable assessment metrics for LLMs.
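For instance, a difficulty-balanced split could be derived from platform ratings roughly as follows; the ratings, thresholds, and bucket names here are assumptions chosen purely for illustration.

```python
from collections import Counter

# Hypothetical platform difficulty ratings for collected problems.
ratings = {"p1": 900, "p2": 1400, "p3": 1900, "p4": 1250, "p5": 2100}

def bucket(rating: int) -> str:
    # Thresholds are illustrative assumptions, not the paper's.
    if rating < 1200:
        return "easy"
    if rating < 1800:
        return "medium"
    return "hard"

print(Counter(bucket(r) for r in ratings.values()))
# -> Counter({'medium': 2, 'hard': 2, 'easy': 1})
```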

Empirical Findings

Detecting Contamination

Empirical analysis reveals significant contamination effects in models such as DeepSeek, which show stark performance drops on problems released after their training cutoff, confirming LiveCodeBench's ability to surface contaminated evaluations.

Holistic Performance Insights

Overall, LiveCodeBench results show that open-source models, despite recent progress, still lag behind closed-access counterparts such as GPT-4 and Claude-3-Opus across the benchmark's scenarios. Notably, instruction-tuned variants improve markedly over their base models, underscoring the importance of fine-tuning data for LLM performance.

Comparison with HumanEval+ reveals potential overfitting, particularly in instruction-tuned models that perform well on isolated problems but falter on broader, varied tasks like those in LiveCodeBench (Figure 3).

Figure 3: Scatter plot contrasts model performance on HumanEval+ and LCB-Easy, indicating possible overfitting in open-access models.
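As a hypothetical illustration of such an overfitting check, the snippet below contrasts per-model pass@1 on the two benchmarks and flags unusually large gaps. The counts and scores are made up, and the pass@k estimator shown is the standard unbiased one commonly used for code benchmarks; treating it as the exact metric here is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn from n generations of which c are correct, passes all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical (n generations, c correct) counts per model and benchmark.
# (In practice pass@1 is averaged over problems; a single aggregate is used here for brevity.)
counts = {
    "model_a": {"humaneval_plus": (10, 8), "lcb_easy": (10, 5)},
    "model_b": {"humaneval_plus": (10, 7), "lcb_easy": (10, 7)},
}

# Flag models whose HumanEval+ pass@1 far exceeds their LCB-Easy pass@1.
for name, per_bench in counts.items():
    he = pass_at_k(*per_bench["humaneval_plus"], k=1)
    lcb = pass_at_k(*per_bench["lcb_easy"], k=1)
    gap = he - lcb
    verdict = "possible overfitting" if gap > 0.15 else "consistent"
    print(f"{name}: HumanEval+={he:.2f}, LCB-Easy={lcb:.2f}, gap={gap:.2f} ({verdict})")
```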

Conclusion

LiveCodeBench sets a new standard for evaluating code-focused LLMs through its dynamic and contamination-free framework. By leveraging real-world coding scenarios across multiple platforms, the benchmark provides invaluable insights into both the strengths and limitations of contemporary LLMs, fostering advancements in the development and fine-tuning of these models. The holistic evaluation not only facilitates fair model comparisons but also guides future research directions in refining code generation capabilities.
