
ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code (2409.10280v1)

Published 16 Sep 2024 in cs.SE

Abstract: In recent years, the application of large language models (LLMs) to code-related tasks has gained significant attention. However, existing evaluation benchmarks often focus on limited scenarios, such as code generation or completion, which do not reflect the diverse challenges developers face in real-world contexts. To address this, we introduce ComplexCodeEval, a benchmark designed to assess large code models (LCMs) on various development tasks, including code generation, completion, API recommendation, and test case generation. It includes 3,897 Java samples and 7,184 Python samples from high-star GitHub repositories, each annotated with function signatures, docstrings, and API references to simulate real development environments. Our experiments across ten LCMs reveal that context improves performance and that data leakage can lead to overestimation, highlighting the need for more accurate evaluations.
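
The abstract describes each sample as a function annotated with its signature, docstring, and API references. As a rough illustration only, a record in such a benchmark might be modeled as below; the field names and values here are assumptions for the sketch, not the actual ComplexCodeEval schema.

```python
# Hypothetical sketch of a ComplexCodeEval-style sample record.
# Field names are illustrative assumptions, not the benchmark's actual schema.
from dataclasses import dataclass, field
from typing import List


@dataclass
class BenchmarkSample:
    repo: str                      # source GitHub repository
    language: str                  # "java" or "python"
    function_signature: str        # signature the model must complete or implement
    docstring: str                 # natural-language description of the function
    api_references: List[str] = field(default_factory=list)  # APIs the reference code uses
    reference_solution: str = ""   # ground-truth function body used for evaluation


# Example usage with placeholder values (not taken from the dataset).
sample = BenchmarkSample(
    repo="example-org/example-repo",
    language="python",
    function_signature="def parse_config(path: str) -> dict:",
    docstring="Parse a YAML configuration file and return its contents as a dict.",
    api_references=["yaml.safe_load", "pathlib.Path.read_text"],
)
print(sample.function_signature)
```

Structuring samples this way lets an evaluation harness vary the context given to a model (signature only, signature plus docstring, or with API references), which is how context effects like those reported in the abstract can be measured.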

Authors (7)
  1. Jia Feng (4 papers)
  2. Jiachen Liu (45 papers)
  3. Cuiyun Gao (97 papers)
  4. Chun Yong Chong (18 papers)
  5. Chaozheng Wang (28 papers)
  6. Shan Gao (70 papers)
  7. Xin Xia (171 papers)