Critique-Coder: Self-Evaluation in Code LLMs
- Critique-Coder is a paradigm that embeds explicit self-critique and binary judgment into LLM code generation, review, and revision workflows.
- It leverages multi-stage training methods, including supervised fine-tuning and reinforcement learning, to jointly optimize generation and critique outputs.
- Empirical results demonstrate improved code pass rates and error reduction through self-critique selection, iterative revision, and adaptive refinement cycles.
A Critique-Coder is an LLM system that explicitly incorporates critique and self-evaluation capabilities into code generation, code review, suggestion, or refinement workflows. Unlike traditional code LLMs optimized solely for accuracy or reward (e.g., pass@k via RLHF), Critique-Coder models are trained, prompted, or deployed to produce detailed critical analyses, binary judgments, or actionable feedback on code artifacts, making critique an integral part of the learning objective, the inference-time procedure, or both. The Critique-Coder paradigm spans reinforcement learning (RL), supervised learning, evaluation, iterative revision, and retrieval-augmented generation (RAG) strategies in contemporary LLM research on code and general reasoning.
1. High-Capacity Dataset Construction for Critique Coding
A central enabler for Critique-Coder systems is the assembly of large, high-quality datasets pairing problems, code solutions, and detailed critiques. OpenCodeReasoning-II introduces a 2.5M triple (question, solution, critique) dataset across ≈35,000 unique problems (≈1.4M Python, ≈1.1M C++), nearly doubling prior datasets in scale. Each record comprises:
- Full problem statement (with unit tests);
- Solution “chain-of-thought” trace in <think>…</think> plus a final code block;
- Critique “chain-of-thought” in <think>…</think> and a binary <judgment>right/wrong</judgment> verdict;
- For ≈60% of examples, execution-based pass rates measured over 5–50 test cases.
Deduplication is performed via cosine similarity with Llama-3.3-70B validation to prevent overlap with evaluation benchmarks (Ahmad et al., 11 Jul 2025).
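The record layout described above can be sketched as a simple dataclass; the field names here are illustrative, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class CritiqueTriple:
    """One (question, solution, critique) record; field names are illustrative."""
    question: str        # full problem statement, including unit tests
    solution_think: str  # solution chain-of-thought (<think>...</think>)
    solution_code: str   # final code block
    critique_think: str  # critique chain-of-thought
    judgment: str        # binary verdict: "right" or "wrong"

record = CritiqueTriple(
    question="Given an integer n, return n squared. Test: f(3) == 9",
    solution_think="Squaring n means multiplying it by itself.",
    solution_code="def f(n):\n    return n * n",
    critique_think="f(3) returns 9, matching the test, so the solution is correct.",
    judgment="right",
)
```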
Similarly, evaluation datasets such as CriticBench (Luo et al., 2023) and CodeCriticBench (Zhang et al., 23 Feb 2025) provide large-scale, adversarially constructed triplets for robust measurement of model critique ability, with CodeCriticBench spanning both code generation and code-QA, difficulty-stratified, and annotated with fine-grained critique dimensions.
2. Fine-Tuning and Critique-Integrated Training Objectives
Critique-Coder systems adopt multi-stage fine-tuning or hybrid RL variants specifically optimized to develop both generative and critical faculties. The most salient paradigms include:
- Two-Stage Supervised Fine-Tuning: Stage I trains purely for code generation; Stage II jointly fine-tunes generation and critique outputs with a multi-task loss
$$\mathcal{L} = \mathcal{L}_{\text{gen}} + \lambda\,\mathcal{L}_{\text{critic}},$$
where $\mathcal{L}_{\text{gen}}$ and $\mathcal{L}_{\text{critic}}$ are next-token cross-entropies over solution and critique traces, respectively, and $\lambda$ weights the critique term (Ahmad et al., 11 Jul 2025).
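A minimal sketch of the Stage II multi-task objective, assuming per-token cross-entropies are already computed; the helper names and the default weight `lam=1.0` are illustrative, not values from the paper:

```python
import math

def cross_entropy(probs, target_idx):
    """Next-token cross-entropy at one position: -log p(target token)."""
    return -math.log(probs[target_idx])

def multitask_loss(gen_token_losses, critique_token_losses, lam=1.0):
    """Stage II objective: mean solution loss plus lam times mean critique loss."""
    l_gen = sum(gen_token_losses) / len(gen_token_losses)
    l_critic = sum(critique_token_losses) / len(critique_token_losses)
    return l_gen + lam * l_critic
```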
- Critique Reinforcement Learning (CRL): Models are prompted with a (question, solution) pair and generate a critique concluding with a binary judgment $j \in \{\text{right}, \text{wrong}\}$. The reward is $r = \mathbb{1}[j = j^{*}]$, where $j^{*}$ is the ground-truth verdict, uniquely incentivizing correct self-evaluation. Standard RL samples are mixed in (typically 80% RL / 20% CRL), and the combination is optimized via Group Relative Policy Optimization (GRPO) (Ruan et al., 26 Sep 2025).
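The binary reward and the RL/CRL data mixing can be sketched as follows; the function names and sampling scheme are illustrative, not the paper's implementation:

```python
import random

def crl_reward(predicted_judgment, ground_truth):
    """Binary CRL reward: 1 when the model's verdict matches ground truth."""
    return 1.0 if predicted_judgment == ground_truth else 0.0

def mixed_batch(rl_pool, crl_pool, batch_size, crl_frac=0.2):
    """Mix standard RL samples with CRL samples at roughly 80/20."""
    n_crl = int(batch_size * crl_frac)
    return random.sample(rl_pool, batch_size - n_crl) + random.sample(crl_pool, n_crl)
```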
- Iterative Self-Improvement via ACR: Adaptive Critique Refinement (ACR) cycles supervised fine-tuning with model-generated responses, LLM-based judging (Elo aggregation plus executor), and selective critique of suboptimal completions, followed by re-finetuning on the enhanced or criticized outputs (Zhou et al., 13 Feb 2025).
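The Elo aggregation step used for LLM-based judging might look like the following standard pairwise update; the k-factor of 32 is a conventional default, not a value from the ACR paper:

```python
def elo_update(r_winner, r_loser, k=32.0):
    """Standard pairwise Elo update over two judged completions."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta
```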
- CTRL for Decoupled Critic Training: In CTRL, a critic LLM is trained to generate critiques maximizing downstream correction performance by a fixed generator. This is achieved with GRPO, aligning the critic’s feedback with the likelihood of subsequent model-corrected solutions passing tests (Xie et al., 5 Feb 2025).
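GRPO, which both CRL and CTRL rely on, computes advantages relative to a group of sampled rollouts rather than a learned value function; a simplified illustration of that normalization:

```python
def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each rollout's reward by the mean and
    standard deviation of its sampling group (simplified illustration)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against constant-reward groups
    return [(r - mean) / std for r in rewards]
```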
3. Self-Critique, Test-Time Scaling, and Inference-Time Procedures
Critique-Coder architectures operationalize critique during inference/testing in several modes:
- Self-Critique Selection: Multiple solution+critique pairs are generated in parallel. Only solutions whose critiques are judged "right" are considered, with a scoring heuristic favoring shorter critique traces—motivated by the empirical observation that brevity in correct explanations correlates with actual validity (Ahmad et al., 11 Jul 2025).
```python
# Self-critique selection: sample k solution+critique pairs,
# keep only those judged "right", and prefer shorter critique traces.
def select_by_self_critique(x, k, sample_solution_and_critique):
    best_score, best_solution = float("-inf"), None
    for _ in range(k):  # sampled in parallel in practice
        y, c = sample_solution_and_critique(x)
        score = -len(c.think_trace) if c.judgment == "right" else float("-inf")
        if score > best_score:
            best_score, best_solution = score, y
    return best_solution
```
- Iterative Critique–Revision: Generator and critic are alternated: solution generation→critique→guided revision, repeated until a terminal judgment of "correct" is issued or a fixed iteration limit is reached. This structure is used in tool-interactive frameworks such as CRITIC (Gou et al., 2023) and CTRL (Xie et al., 5 Feb 2025). Dynamic tool feedback (execution traces, error messages, test-case outputs) is piped into the critique prompt.
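The critique–revision alternation can be sketched as below; all callables are hypothetical interfaces standing in for the generator, critic, reviser, and test harness, not an API from the cited frameworks:

```python
def critique_revise_loop(generate, critique, revise, run_tests, x, max_iters=4):
    """Alternate generation, tool-grounded critique, and guided revision until
    the critic issues a "correct" verdict or the iteration budget runs out."""
    y = generate(x)
    for _ in range(max_iters):
        feedback = run_tests(y)  # execution traces, errors, test-case outputs
        verdict, comments = critique(x, y, feedback)
        if verdict == "correct":
            break
        y = revise(x, y, comments)
    return y
```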
- Self-Check and Critique-Filtering: Candidate solutions are filtered by an internal or external critic, with only those passing a correctness threshold considered. A simple “mode of filtered set” selection leverages the SC² baseline for error reduction (Luo et al., 2023).
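A minimal sketch of critique-filtering followed by mode selection over the survivors; the threshold and the fallback-to-all behavior are illustrative choices:

```python
from collections import Counter

def filtered_mode(candidates, critic_scores, threshold=0.5):
    """Drop candidates the critic scores below threshold, then return the most
    common answer among survivors; falls back to plain self-consistency if the
    critic rejects everything."""
    kept = [c for c, s in zip(candidates, critic_scores) if s >= threshold]
    return Counter(kept or candidates).most_common(1)[0][0]
```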
- Retrieval/Generation Adapters: CARD, as an adaptive critique model, decides at inference time whether to query the retrieval module based on a learned necessity score, balancing retrieval coverage and efficiency. It further reranks multiple candidates by a convex combination of necessity and generation confidence (Zhang et al., 2024).
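The necessity gate and convex reranking might be sketched as follows; `alpha` and the gate value are illustrative hyperparameters, not CARD's trained values:

```python
def should_retrieve(necessity_score, gate=0.5):
    """Gate the retrieval module on the learned necessity score."""
    return necessity_score >= gate

def rerank(candidates, necessity, confidence, alpha=0.5):
    """Rerank by a convex combination of necessity and generation confidence."""
    scores = [alpha * n + (1 - alpha) * c for n, c in zip(necessity, confidence)]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]
```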
4. Evaluation Protocols and Benchmarks for Critique Ability
Critique accuracy is defined as the probability that the model’s critique verdict matches the ground-truth label:
$$\text{Acc} = \Pr\big[\hat{j}(x, y) = j^{*}(x, y)\big]$$
Benchmarks deploy a balanced mix of correct/incorrect code, sampled or filtered adversarially (e.g., most unit tests passed but not all), ensuring that evaluation captures both “easy” and subtly “wrong yet plausible” code (Luo et al., 2023, Zhang et al., 23 Feb 2025).
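Operationally, binary critique accuracy reduces to a match rate over verdict/label pairs:

```python
def critique_accuracy(predicted_verdicts, gold_labels):
    """Fraction of examples whose predicted verdict matches the gold label."""
    matches = sum(p == g for p, g in zip(predicted_verdicts, gold_labels))
    return matches / len(gold_labels)
```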
In advanced critique assessment (CodeCriticBench), feedback is scored along 10 axes (e.g., correctness, time complexity, readability, robustness, maintainability). Aggregate scores are computed by averaging the dimensional ratings, with mean squared error (MSE) against human-calibrated ratings as the primary fidelity metric (Zhang et al., 23 Feb 2025).
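Both the aggregation and the fidelity metric are simple averages and can be sketched directly:

```python
def aggregate_score(dim_ratings):
    """Average the fine-grained dimension ratings into one critique score."""
    return sum(dim_ratings) / len(dim_ratings)

def mse(model_scores, human_scores):
    """Mean squared error against human-calibrated ratings."""
    return sum((m - h) ** 2 for m, h in zip(model_scores, human_scores)) / len(human_scores)
```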
Across all regimes, scaling laws are observed: critique ability and score consistency improve monotonically with model size, with top-tier open or distilled models (DeepSeek-R1, o1-preview) achieving ≈75% binary accuracy, but performance degrades on hard or adversarially bug-injected samples (≈50%) and for nuanced error types.
5. Comparative Empirical Performance and Ablation
The incorporation of critique-auxiliary objectives yields consistent, quantifiable gains:
- OpenCodeReasoning-II’s OCR-2-32B achieves pass@1 = 61.3% and critique@10 = 67.4% on LiveCodeBench (Python), a +6.1-point gain over raw pass@1 after self-critique filtering, with a similar delta on C++ (Ahmad et al., 11 Jul 2025).
- Critique-Coder-8B, trained with 20% CRL data, obtains 60.8% pass@1 on LiveCodeBench v5, outperforming DeepCoder-14B (60.6%) and GPT-o1 (59.5%), with marked transfer gains on logic reasoning suites from BIG-Bench Extra Hard (e.g., +6.1 points on average) (Ruan et al., 26 Sep 2025).
- Iterative ACR cycles in RefineCoder-7B drive a 3-point absolute pass@1 gain across four benchmarks starting from only 20K instruction examples (Zhou et al., 13 Feb 2025).
- In tool-interactive CRITIC, code correctness improves 3–6 percentage points with only 2–4 critique–revision rounds (Gou et al., 2023).
- Supervised and RL critics (CTRL) double to triple pass@1 on challenging code problems compared to zero-shot or self-critique without tool augmentation. Critic-guided revision delivers up to 106.1% relative improvement when coupled with strong generators (Xie et al., 5 Feb 2025).
Critique accuracy for self-generated code (self-critique) is lower than for third-party code critiques, with even large models (PaLM-2-L/CodeLlama-34B) struggling to rise above 54% on HumanEval (Luo et al., 2023, Zhang et al., 23 Feb 2025).
6. Critique-Coder in Human-in-the-Loop and Hybrid Contexts
Critique-Coder models can be tightly integrated into human annotation, review, or labeling workflows. For annotation (e.g., qualitative text coding), a high-recall first-pass LLM annotator is filtered by a secondary LLM critic employing explicit error-type decision policies. This setup delivers F₁ score gains up to +0.25 on poorly performing codes while keeping compute costs modest (second-stage critic invoked only on ~15% of instances) (Dunivin et al., 14 Jan 2026).
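A minimal sketch of the two-stage annotate-then-critique pipeline, with all callables as hypothetical interfaces standing in for the first-pass annotator, the error-type flagging policy, and the secondary critic:

```python
def two_stage_label(items, annotate, needs_review, critic):
    """High-recall first-pass annotator; a second-stage critic re-checks only
    flagged instances (roughly 15% in the source), keeping compute modest."""
    labels = {}
    for item in items:
        label = annotate(item)
        if needs_review(item, label):  # explicit error-type decision policy
            label = critic(item, label)  # the critic may overturn the label
        labels[item] = label
    return labels
```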
For code review, LLM critics (CriticGPT) trained with RLHF on contractor-inserted bugs and synthetic/preferred feedback can surpass human bug detection rates (preferred 63% of the time, ~2.7× bug inclusion rate over unaided human review), with hybrid (human+critic) teams reducing hallucinated bug reports to practical levels (McAleese et al., 2024).
7. Limitations, Open Problems, and Recommendations
Notwithstanding robust empirical gains and improved interpretability, several limitations persist:
- Self-critique accuracy remains low relative to externally-supervised or tool-augmented critique; attempts to aggregate self-judgments at test time yield no further improvements (Ruan et al., 26 Sep 2025).
- Excessive critique objective emphasis (e.g., >50% CRL data) degrades generation performance, suggesting the need for careful hybridization (Ruan et al., 26 Sep 2025).
- High-quality critique signals (ground-truth judgments, gold critiques for RLHF, adversarial tampering for code review) are expensive and can bottleneck scaling.
- Error coverage is uneven, with rare bug types and nuanced error classes still eluding both critics and generators (Zhang et al., 23 Feb 2025).
Proven recommendations include:
- Controlled hybridization of RL and critique objectives (≈20% CRL / 80% RL found optimal) (Ruan et al., 26 Sep 2025);
- Adversarial tampering of code+bug distributions to robustify critics against unforeseen error modes (McAleese et al., 2024);
- Modular and minimal-compute secondary critic pipelines for high-precision annotation or quality assurance (Dunivin et al., 14 Jan 2026);
- Fine-grained, LLM-generated checklists and chain-of-thought exemplars in critique prompts to boost fidelity and error coverage (Zhang et al., 23 Feb 2025, Luo et al., 2023).
Critique-Coder frameworks represent a convergence of LLM reasoning, critique, agentic self-correction, and large-scale human-computer collaboration, with emerging evidence of broad transfer not only to code domains but to logic, planning, and text annotation. The paradigm is positioned as a core ingredient for scalable, trustworthy, and robust AI-assisted code development and quality assurance.