
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models (2502.16614v1)

Published 23 Feb 2025 in cs.CL

Abstract: The critique capacity of LLMs is essential for their reasoning abilities, providing necessary suggestions such as detailed analysis and constructive feedback. Consequently, how to evaluate the critique capacity of LLMs has drawn great attention, and several critique benchmarks have been proposed. However, existing critique benchmarks usually have the following limitations: (1) they focus on diverse reasoning tasks in general domains while evaluating code tasks insufficiently (e.g., covering only the code generation task), and their queries are relatively easy (e.g., the code queries of CriticBench are drawn from HumanEval and MBPP); (2) they lack comprehensive evaluation across different dimensions. To address these limitations, we introduce CodeCriticBench, a holistic code critique benchmark for LLMs. Specifically, CodeCriticBench covers two mainstream code tasks (i.e., code generation and code QA) at different difficulty levels. In addition, its evaluation protocols include basic critique evaluation and advanced critique evaluation, with fine-grained evaluation checklists carefully designed for the advanced setting. Finally, we conduct extensive experiments on existing LLMs, and the results demonstrate the effectiveness of CodeCriticBench.

Summary

  • The paper introduces CodeCriticBench, a comprehensive benchmark evaluating LLMs' code critique capabilities via dual tasks of code generation and code-based question answering.
  • It presents a dataset of 4,300 diverse code critique samples featuring fine-grained evaluation checklists and difficulty labels based on multi-model correctness statistics.
  • The benchmark utilizes both basic binary and advanced continuous metrics (a minimal scoring sketch follows this list), and the accompanying experiments on 38 LLMs analyze performance, scaling trends, and the benefits of chain-of-thought (CoT) evaluation.
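
To make the two evaluation protocols concrete, the sketch below shows one way the basic binary accuracy and the advanced checklist-based MSE could be computed. This is a minimal illustration under assumed data layouts (per-criterion score dictionaries), not the benchmark's actual implementation; the function names `basic_accuracy` and `advanced_mse` are placeholders.

```python
from typing import Dict, List


def basic_accuracy(predicted_verdicts: List[bool], gold_verdicts: List[bool]) -> float:
    """Basic critique evaluation (sketch): fraction of samples where the model's
    correct/incorrect verdict on a candidate solution matches the gold label."""
    assert len(predicted_verdicts) == len(gold_verdicts) > 0
    hits = sum(p == g for p, g in zip(predicted_verdicts, gold_verdicts))
    return hits / len(gold_verdicts)


def advanced_mse(model_scores: List[Dict[str, float]],
                 human_scores: List[Dict[str, float]]) -> float:
    """Advanced critique evaluation (sketch): mean squared error between the model's
    per-checklist-item scores and human-annotated scores; lower is better."""
    errors = []
    for model_item, human_item in zip(model_scores, human_scores):
        for criterion, human_value in human_item.items():
            errors.append((model_item.get(criterion, 0.0) - human_value) ** 2)
    return sum(errors) / len(errors)


# Toy usage with hypothetical data.
print(basic_accuracy([True, False, True], [True, True, True]))    # 0.666...
print(advanced_mse([{"correctness": 8.0, "readability": 6.0}],
                   [{"correctness": 9.0, "readability": 7.0}]))    # 1.0
```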

This paper presents CodeCriticBench, a comprehensive benchmark that evaluates the ability of LLMs to critique code through the dual tasks of code generation and code-based question answering.

  • It introduces a dataset comprising 4,300 diverse code critique samples with fine-grained evaluation checklists and difficulty labels based on multi-model correctness statistics (an illustrative sample schema follows this list).
  • The benchmark employs both basic binary correctness metrics (ACC, Pass@1) and an advanced continuous measure (mean squared error, MSE, computed against human annotations) to assess the nuanced critique capabilities of LLMs.
  • Extensive experiments across 38 LLMs, spanning parameter scales from 0.5B to over 70B and including both open- and closed-source models, reveal scaling trends, the benefits of CoT evaluation, and detailed performance analyses on identifying code errors and optimizing critique feedback.
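
Since the dataset's storage format is not reproduced here, the following sketch shows what a CodeCriticBench-style record might look like and how a difficulty label could be derived from multi-model correctness statistics (the fewer reference models that solve a query, the harder it is labeled). All field names and thresholds are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CritiqueSample:
    """Illustrative record for one benchmark item (field names are assumed)."""
    task: str                    # "code_generation" or "code_qa"
    query: str                   # problem statement or question
    candidate_solution: str      # code (or answer) to be critiqued
    is_correct: bool             # gold verdict used by the basic evaluation
    checklist: Dict[str, float] = field(default_factory=dict)  # human scores for the advanced evaluation
    difficulty: str = "unlabeled"


def label_difficulty(sample: CritiqueSample, model_correctness: List[bool]) -> str:
    """Assign an easy/medium/hard label from how many reference models solved
    the underlying query; the thresholds below are purely illustrative."""
    solve_rate = sum(model_correctness) / max(len(model_correctness), 1)
    if solve_rate >= 0.7:
        sample.difficulty = "easy"
    elif solve_rate >= 0.3:
        sample.difficulty = "medium"
    else:
        sample.difficulty = "hard"
    return sample.difficulty


# Toy usage: only 2 of 8 reference models solved this query, so it is labeled "hard".
sample = CritiqueSample(
    task="code_generation",
    query="Return the indices of two numbers in a list that add up to a target.",
    candidate_solution="def two_sum(nums, target): ...",
    is_correct=False,
    checklist={"correctness": 2.0, "readability": 7.0},
)
print(label_difficulty(sample, [True, False, False, True, False, False, False, False]))
```

The three-way split mirrors the paper's idea of deriving difficulty from how many models answer a query correctly; the exact binning used in CodeCriticBench may differ.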