DeepCritic: Deliberate Critique with Large Language Models (2505.00662v1)

Published 1 May 2025 in cs.CL, cs.AI, and cs.LG

Abstract: As LLMs are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial on each step, leading to low judgment accuracy and struggling to offer sufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework to develop LLM critics that are capable of deliberately critiquing each reasoning step of math solutions. In the first stage, we utilize Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of deliberate step-wise critiques that include multi-perspective verifications as well as in-depth critiques of initial critiques for each reasoning step. Then, we perform reinforcement learning on the fine-tuned model with either existing human-labeled data from PRM800K or our automatically annotated data obtained via Monte Carlo sampling-based correctness estimation, to further incentivize its critique ability. Our developed critique model built on Qwen2.5-7B-Instruct not only significantly outperforms existing LLM critics (including the same-sized DeepSeek-R1-distill models and GPT-4o) on various error identification benchmarks, but also more effectively helps the LLM generator refine erroneous steps through more detailed feedback.

An Overview of the DeepCritic Framework: Enhancing Math Critique Abilities in LLMs

The paper, "DeepCritic: Deliberate Critique with LLMs," addresses the problem of shallow and superficial critiques generated by current LLMs in the domain of mathematical reasoning. As LLMs evolve, the scalability and effectiveness of human-like supervision become challenging due to costs and complexity. This work introduces DeepCritic, a two-stage framework designed to enhance the critique capabilities of LLMs to deliver deliberate and thoughtful critiques, particularly in mathematical reasoning tasks.

Motivation and Problem Statement

With the rapid progression of LLMs, providing accurate feedback on their outputs is critical. Existing LLMs, when utilized as critique models, often produce analyses that lack depth, leading to low judgment accuracy. This inadequacy makes it difficult for LLM generators to refine solutions based on these critiques, primarily affecting complex domains like mathematical reasoning.

Methodology

The DeepCritic framework consists of two key stages:

  1. Supervised Fine-Tuning with Deliberate Critique Data:

In the first stage, the authors generate a dataset comprising approximately 4.5K long-form critiques using Qwen2.5-72B-Instruct. The deliberate critiques are structured to include multi-perspective verifications and in-depth evaluations of each reasoning step. This involves two crucial components:

  • Initial Critique: For each reasoning step, initial critiques are generated that consider logical consistency and accuracy within the problem context.
  • In-Depth Critique: Following the initial evaluation, an in-depth critique reassesses the step from different perspectives or critiques the initial evaluation itself.

This process ensures that the critiques are comprehensive; the initial and in-depth critiques for all steps are then merged into a single long-form critique used to fine-tune the LLM.
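
The seed-data construction can be pictured as a two-pass prompting loop over the solution steps. The sketch below is only illustrative: `generate`, the prompt wording, and the merge format are assumptions standing in for the paper's actual templates and its calls to Qwen2.5-72B-Instruct.

```python
# Hypothetical sketch of two-pass seed-critique generation (not the paper's exact prompts).
def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM inference call here")


def build_seed_critique(problem: str, steps: list[str]) -> str:
    """Combine an initial critique and an in-depth (meta) critique of every
    reasoning step into one long-form critique usable as SFT data."""
    sections = []
    for i, step in enumerate(steps, start=1):
        context = "\n".join(steps[:i - 1])  # steps preceding the current one
        initial = generate(
            f"Problem: {problem}\nPrevious steps:\n{context}\n"
            f"Step {i}: {step}\n"
            "Critique this step for logical consistency and correctness "
            "within the context of the problem."
        )
        in_depth = generate(
            f"Problem: {problem}\nStep {i}: {step}\n"
            f"Initial critique: {initial}\n"
            "Re-examine the step from a different perspective (e.g. by "
            "re-deriving it) and point out any flaw in the initial critique."
        )
        sections.append(f"## Step {i}\n{initial}\n\n### Deeper check\n{in_depth}")
    return "\n\n".join(sections)
```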

  2. Reinforcement Learning (RL):

The second stage employs RL to further incentivize the critique capabilities of the LLM. Two settings are explored for achieving this:

  • Using Human-Labeled Data: PRM800K serves as the dataset, leveraging human annotations for RL.
  • Utilizing Automatically Annotated Data: Problems are sampled from GSM8K, MATH, and Olympiads, and step-level correctness labels are estimated automatically via Monte Carlo sampling of solution rollouts (see the sketch after this list). This setting enables scalable oversight where human annotation is impractical.
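
A rough sketch of how such Monte Carlo sampling-based correctness labels could be computed is shown below; `sample_completion` (a rollout from a step prefix) and `is_correct` (an answer checker) are hypothetical helpers, and the number of rollouts and the labeling rule are illustrative rather than the paper's exact settings.

```python
# Illustrative Monte Carlo step-correctness labeling (not the paper's exact recipe).
def mc_step_labels(problem, steps, gold_answer,
                   sample_completion, is_correct, n_rollouts=8):
    """For each step prefix, estimate the probability that a rollout from that
    prefix reaches the gold answer; a step whose estimate drops to zero is
    treated as erroneous."""
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = problem + "\n" + "\n".join(steps[:i])
        hits = sum(
            is_correct(sample_completion(prefix), gold_answer)
            for _ in range(n_rollouts)
        )
        labels.append(hits > 0)  # True = step still judged correct
    return labels
```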

Experimental Results

The evaluation compares the DeepCritic models with baseline process reward models (PRMs) and various instruction-following LLMs used as critique models, across error identification benchmarks such as MR-GSM8K, PRM800K, and ProcessBench. The DeepCritic models achieve significant accuracy improvements, surpassing the judgment performance of existing models, including advanced reasoning LLMs such as the DeepSeek-R1-Distill models and GPT-4o.
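
As a simplified picture of how such error-identification judgments can be scored, the snippet below computes exact-match accuracy on the index of the first erroneous step; the field names and the `critic_predict_first_error` callable are hypothetical and do not reproduce each benchmark's exact metric.

```python
# Simplified error-identification scoring: exact match on the first-error index
# (-1 conventionally denotes a fully correct solution).
def judgment_accuracy(examples, critic_predict_first_error):
    """`examples`: iterable of dicts with 'problem', 'steps', and 'label'
    (index of the first wrong step, or -1 if the solution is correct)."""
    correct = total = 0
    for ex in examples:
        pred = critic_predict_first_error(ex["problem"], ex["steps"])
        correct += int(pred == ex["label"])
        total += 1
    return correct / max(total, 1)
```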

Furthermore, experiments highlight promising test-time scaling properties. The critics improve verified majority voting by filtering candidate solutions according to their judged correctness, enhancing generator outputs through more accurate solution assessment (a minimal sketch follows). Additionally, critique-based refinement helps LLM generators correct erroneous steps at test time.
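
A minimal sketch of critique-verified majority voting under these assumptions: `critic_judges_correct` returns the critic's binary verdict on a candidate solution and `extract_answer` pulls out its final answer; the fallback to plain majority voting when every candidate is rejected is an assumption, not necessarily the paper's procedure.

```python
from collections import Counter

def verified_majority_vote(problem, candidate_solutions,
                           critic_judges_correct, extract_answer):
    """Vote only over candidates the critic judges correct; fall back to plain
    majority voting if the critic rejects every candidate."""
    kept = [s for s in candidate_solutions if critic_judges_correct(problem, s)]
    pool = kept or candidate_solutions
    answers = Counter(extract_answer(s) for s in pool)
    return answers.most_common(1)[0][0]
```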

Implications and Future Directions

The implications of DeepCritic are twofold. Practically, it enhances the detail of the feedback LLM critics can provide, improving oversight of mathematical reasoning. Theoretically, because the deliberate critique approach demonstrates scalability, it can plausibly be adapted to other complex domains. The framework also encourages further exploration of self-evolving critic models, presenting an avenue for weak-to-strong supervision that could be pivotal in shaping next-generation scalable LLM oversight.

In conclusion, DeepCritic demonstrates substantial improvement in LLM critique capability through structured, deliberate analysis, yielding more accurate judgments and more comprehensive feedback for refining LLM outputs. The work contributes a concrete methodology for critique-based evaluation and points toward scalable, automated LLM supervision.

Authors (4)
  1. Wenkai Yang (24 papers)
  2. Jingwen Chen (21 papers)
  3. Yankai Lin (125 papers)
  4. Ji-Rong Wen (299 papers)