
CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding? (2408.10718v2)

Published 20 Aug 2024 in cs.SE and cs.CL

Abstract: Recent advancements in LLMs have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities. Our code and benchmark are available at https://github.com/CodeLLM-Research/CodeJudge-Eval.
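The core task flips the usual language-to-code setup: instead of generating a solution, the model must assign a verdict to an existing submission, and its accuracy is measured against ground-truth judgments. The sketch below illustrates what such a judge-style evaluation loop could look like; the verdict labels, prompt wording, and the `query_llm` helper are illustrative assumptions, not the exact prompts, label set, or scoring used by CJ-Eval.

```python
# Minimal sketch of a code-judging evaluation loop (illustrative only).
# The verdict labels, prompt format, and query_llm() callable are assumptions,
# not the benchmark's actual prompts or fine-grained judging scheme.

# Hypothetical verdict set, modeled on typical online-judge outcomes.
VERDICTS = [
    "Accepted",
    "Wrong Answer",
    "Runtime Error",
    "Time Limit Exceeded",
    "Compile Error",
]


def build_judge_prompt(problem_statement: str, candidate_code: str) -> str:
    """Ask the model to classify a candidate solution instead of writing one."""
    options = "\n".join(f"- {v}" for v in VERDICTS)
    return (
        "You are a code judge. Read the problem and the submitted solution, "
        "then output exactly one verdict from the list below.\n\n"
        f"Problem:\n{problem_statement}\n\n"
        f"Submission:\n{candidate_code}\n\n"
        f"Possible verdicts:\n{options}\n\nVerdict:"
    )


def judge_accuracy(samples, query_llm) -> float:
    """Fraction of samples where the model's verdict matches the ground truth.

    `samples` is an iterable of dicts with keys 'problem', 'code', 'verdict';
    `query_llm` is any callable mapping a prompt string to the model's reply.
    """
    correct = 0
    total = 0
    for sample in samples:
        prompt = build_judge_prompt(sample["problem"], sample["code"])
        reply = query_llm(prompt).strip()
        # Accept the first verdict label that appears verbatim in the reply.
        predicted = next((v for v in VERDICTS if v in reply), None)
        correct += int(predicted == sample["verdict"])
        total += 1
    return correct / max(total, 1)
```

This sketch scores plain verdict-match accuracy; the paper's fine-grained judging system may weight or categorize errors differently.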

Authors (7)
  1. Yuwei Zhao (13 papers)
  2. Ziyang Luo (35 papers)
  3. Yuchen Tian (12 papers)
  4. Hongzhan Lin (33 papers)
  5. Weixiang Yan (11 papers)
  6. Annan Li (14 papers)
  7. Jing Ma (136 papers)
Citations (1)