
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models (2502.16614v1)

Published 23 Feb 2025 in cs.CL

Abstract: The critique capacity of LLMs is essential for their reasoning abilities, providing necessary suggestions such as detailed analysis and constructive feedback. Consequently, how to evaluate the critique capacity of LLMs has drawn great attention, and several critique benchmarks have been proposed. However, existing critique benchmarks usually have the following limitations: (1) they focus on diverse reasoning tasks in general domains while evaluating code tasks insufficiently (e.g., covering only the code generation task), and their queries are relatively easy (e.g., the code queries of CriticBench are drawn from HumanEval and MBPP); (2) they lack comprehensive evaluation across different dimensions. To address these limitations, we introduce CodeCriticBench, a holistic code critique benchmark for LLMs. Specifically, CodeCriticBench covers two mainstream code tasks (i.e., code generation and code QA) at different difficulty levels. In addition, its evaluation protocols include basic critique evaluation and advanced critique evaluation, with fine-grained evaluation checklists carefully designed for the advanced setting. Finally, we conduct extensive experiments on existing LLMs, and the results demonstrate the effectiveness of CodeCriticBench.

Summary

  • The paper introduces CodeCriticBench, a comprehensive benchmark evaluating LLMs' code critique capabilities via dual tasks of code generation and code-based question answering.
  • It presents a dataset of 4,300 diverse code critique samples featuring fine-grained evaluation checklists and difficulty labels based on multi-model correctness statistics.
  • The benchmark utilizes both basic binary and advanced continuous metrics (a minimal scoring sketch follows this list), and the accompanying experiments on 38 LLMs analyze performance, scaling trends, and the benefits of chain-of-thought (CoT) evaluation.
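
To make the two evaluation protocols concrete, the sketch below shows one way the basic binary accuracy and the advanced checklist-based MSE could be computed. This is a minimal illustration under assumed data layouts (per-criterion score dictionaries), not the benchmark's actual implementation; the function names `basic_accuracy` and `advanced_mse` are placeholders.

```python
from typing import Dict, List


def basic_accuracy(predicted_verdicts: List[bool], gold_verdicts: List[bool]) -> float:
    """Basic critique evaluation (sketch): fraction of samples where the model's
    correct/incorrect verdict on a candidate solution matches the gold label."""
    assert len(predicted_verdicts) == len(gold_verdicts) > 0
    hits = sum(p == g for p, g in zip(predicted_verdicts, gold_verdicts))
    return hits / len(gold_verdicts)


def advanced_mse(model_scores: List[Dict[str, float]],
                 human_scores: List[Dict[str, float]]) -> float:
    """Advanced critique evaluation (sketch): mean squared error between the model's
    per-checklist-item scores and human-annotated scores; lower is better."""
    errors = []
    for model_item, human_item in zip(model_scores, human_scores):
        for criterion, human_value in human_item.items():
            errors.append((model_item.get(criterion, 0.0) - human_value) ** 2)
    return sum(errors) / len(errors)


# Toy usage with hypothetical data.
print(basic_accuracy([True, False, True], [True, True, True]))    # 0.666...
print(advanced_mse([{"correctness": 8.0, "readability": 6.0}],
                   [{"correctness": 9.0, "readability": 7.0}]))    # 1.0
```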

This paper presents CodeCriticBench, a comprehensive benchmark that evaluates the ability of LLMs to critique code through the dual tasks of code generation and code-based question answering.

  • It introduces a dataset comprising 4,300 diverse code critique samples with fine-grained evaluation checklists and difficulty labels based on multi-model correctness statistics (an illustrative sample schema follows this list).
  • The benchmark employs both basic binary correctness metrics (ACC, Pass@1) and an advanced continuous measure (mean squared error, MSE, computed against human annotations) to assess the nuanced critique capabilities of LLMs.
  • Extensive experiments across 38 LLMs, spanning parameter scales from 0.5B to over 70B and including both open- and closed-source models, reveal scaling trends, the benefits of CoT evaluation, and detailed performance analyses on identifying code errors and optimizing critique feedback.
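
Since the dataset's storage format is not reproduced here, the following sketch shows what a CodeCriticBench-style record might look like and how a difficulty label could be derived from multi-model correctness statistics (the fewer reference models that solve a query, the harder it is labeled). All field names and thresholds are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CritiqueSample:
    """Illustrative record for one benchmark item (field names are assumed)."""
    task: str                    # "code_generation" or "code_qa"
    query: str                   # problem statement or question
    candidate_solution: str      # code (or answer) to be critiqued
    is_correct: bool             # gold verdict used by the basic evaluation
    checklist: Dict[str, float] = field(default_factory=dict)  # human scores for the advanced evaluation
    difficulty: str = "unlabeled"


def label_difficulty(sample: CritiqueSample, model_correctness: List[bool]) -> str:
    """Assign an easy/medium/hard label from how many reference models solved
    the underlying query; the thresholds below are purely illustrative."""
    solve_rate = sum(model_correctness) / max(len(model_correctness), 1)
    if solve_rate >= 0.7:
        sample.difficulty = "easy"
    elif solve_rate >= 0.3:
        sample.difficulty = "medium"
    else:
        sample.difficulty = "hard"
    return sample.difficulty


# Toy usage: only 2 of 8 reference models solved this query, so it is labeled "hard".
sample = CritiqueSample(
    task="code_generation",
    query="Return the indices of two numbers in a list that add up to a target.",
    candidate_solution="def two_sum(nums, target): ...",
    is_correct=False,
    checklist={"correctness": 2.0, "readability": 7.0},
)
print(label_difficulty(sample, [True, False, False, True, False, False, False, False]))
```

The three-way split mirrors the paper's idea of deriving difficulty from how many models answer a query correctly; the exact binning used in CodeCriticBench may differ.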