An Overview of the DeepCritic Framework: Enhancing Math Critique Abilities in LLMs
The paper, "DeepCritic: Deliberate Critique with LLMs," addresses the problem of shallow and superficial critiques generated by current LLMs in the domain of mathematical reasoning. As LLMs evolve, the scalability and effectiveness of human-like supervision become challenging due to costs and complexity. This work introduces DeepCritic, a two-stage framework designed to enhance the critique capabilities of LLMs to deliver deliberate and thoughtful critiques, particularly in mathematical reasoning tasks.
Motivation and Problem Statement
With the rapid progress of LLMs, providing accurate feedback on their outputs is critical. When used as critique models, existing LLMs often produce shallow analyses of each reasoning step, which leads to low judgment accuracy and gives LLM generators little actionable signal for refining their solutions. The problem is most acute in complex domains such as mathematical reasoning.
Methodology
The DeepCritic framework consists of two key stages:
- Supervised Fine-Tuning with Deliberate Critique Data:
In the first stage, the authors generate a dataset comprising approximately 4.5K long-form critiques using Qwen2.5-72B-Instruct. The deliberate critiques are structured to include multi-perspective verifications and in-depth evaluations of each reasoning step. This involves two crucial components:
- Initial Critique: For each reasoning step, initial critiques are generated that consider logical consistency and accuracy within the problem context.
- In-Depth Critique: Following the initial evaluation, an in-depth critique reassesses the step from different perspectives or critiques the initial evaluation itself.
This process ensures that the critiques are comprehensive: the initial and in-depth critiques for all steps are merged into a single detailed, long-form critique, which serves as the supervised fine-tuning target for the LLM. A minimal sketch of this generation loop follows.
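To make the data-generation stage concrete, the sketch below shows how such a long-form critique could be assembled step by step. It is a minimal illustration under stated assumptions, not the authors' exact prompts or pipeline: the `chat` helper and the prompt wording are placeholders standing in for calls to Qwen2.5-72B-Instruct (or any capable chat model).

```python
def chat(prompt: str) -> str:
    """Placeholder for a chat-completion call to Qwen2.5-72B-Instruct
    (or another chat model); replace with a real client."""
    raise NotImplementedError("plug in an LLM client here")


def build_deliberate_critique(problem: str, steps: list[str]) -> str:
    """Assemble a long-form, step-by-step critique of a solution."""
    sections = []
    for i, step in enumerate(steps, start=1):
        context = "\n".join(steps[: i - 1]) or "(none)"
        # 1) Initial critique: check the step's logic and correctness in context.
        initial = chat(
            f"Problem: {problem}\nPrevious steps:\n{context}\n"
            f"Current step {i}: {step}\n"
            "Critique this step: is it logically consistent with the previous "
            "steps and correct within the problem context?"
        )
        # 2) In-depth critique: re-verify from another perspective, or
        #    critique the initial evaluation itself.
        in_depth = chat(
            f"Step {i}: {step}\nInitial critique: {initial}\n"
            "Re-examine this step from a different perspective (e.g., an "
            "alternative derivation), or point out flaws in the initial critique."
        )
        sections.append(
            f"Step {i}\nInitial critique: {initial}\nIn-depth critique: {in_depth}"
        )
    # Merge the per-step critiques into one detailed critique used as an SFT target.
    return "\n\n".join(sections)
```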
- Reinforcement Learning (RL):
The second stage employs RL to further strengthen the critique capabilities of the LLM. Two data settings are explored:
- Using Human-Labeled Data: PRM800K serves as the dataset, leveraging human annotations for RL.
- Utilizing Automatically Annotated Data: Problems are sampled from GSM8K, MATH, and Olympiads, and step-level correctness labels are estimated automatically via Monte Carlo sampling of solution rollouts. This setting targets cases where human annotation is impractical, supporting scalable oversight (a sketch of this labeling scheme follows).
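The sketch below illustrates one common recipe for producing such Monte Carlo labels: continue a partial solution with several sampled rollouts and mark a step correct if any rollout reaches the gold answer. The helper names, rollout count, and thresholding here are assumptions for illustration rather than the paper's exact procedure.

```python
import re


def sample_completions(problem: str, prefix_steps: list[str], n: int) -> list[str]:
    """Placeholder for sampling n solution rollouts from a generator LLM,
    continuing from the given prefix steps; replace with a real client."""
    raise NotImplementedError("plug in a generator LLM here")


def extract_answer(solution_text: str) -> str:
    # Naive final-answer extraction; real pipelines use stricter parsing.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution_text)
    return numbers[-1] if numbers else ""


def label_step(problem: str, steps: list[str], k: int, gold_answer: str, n: int = 8) -> int:
    """Label step k (0-indexed) as correct (1) if any of n rollouts continuing
    from steps[:k + 1] reaches the gold answer, else incorrect (0)."""
    completions = sample_completions(problem, steps[: k + 1], n)
    reached = sum(extract_answer(c) == gold_answer for c in completions)
    return 1 if reached > 0 else 0
```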
Experimental Results
The evaluation compares DeepCritic models with baseline PRMs and various instruction-following LLMs configured as critique models, across error identification benchmarks like MR-GSM8K, PRM800K, and ProcessBench. The DeepCritic models exhibit significant accuracy improvements, surpassing the judgment performance of existing models, including advanced reasoning LLMs like DeepSeek-R1-Distill models and GPT-4o.
Furthermore, the experiments highlight promising test-time scaling properties. Using the critic to verify candidate solutions improves majority voting over generator outputs, since votes are restricted to solutions the critic assesses as correct; a sketch of this verified voting appears below. In addition, critique-based refinement helps LLM generators correct errors at test time.
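As an illustration of the verified-voting mechanism, the sketch below filters candidate solutions with the critic before taking a majority vote over their final answers. The interfaces (`critic_judges_correct`, the fallback to plain voting) are assumptions about the general mechanism, not the authors' exact procedure.

```python
import re
from collections import Counter


def critic_judges_correct(problem: str, solution: str) -> bool:
    """Placeholder for querying the trained critic on a candidate solution;
    replace with a real call that parses the critic's final verdict."""
    raise NotImplementedError("plug in the trained critic here")


def extract_answer(solution_text: str) -> str:
    # Naive final-answer extraction, as in the labeling sketch above.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution_text)
    return numbers[-1] if numbers else ""


def verified_majority_vote(problem: str, candidates: list[str]) -> str:
    """Majority-vote over final answers, restricted to candidates the critic
    judges correct; fall back to plain voting if the critic rejects all."""
    verified = [s for s in candidates if critic_judges_correct(problem, s)]
    pool = verified if verified else candidates
    answers = [extract_answer(s) for s in pool]
    return Counter(answers).most_common(1)[0][0]
```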
Implications and Future Directions
The implications of DeepCritic are twofold. Practically, it enhances the detailed feedback LLMs can provide, improving the oversight of mathematical reasoning. Theoretically, as the deliberate critique capabilities of DeepCritic demonstrate scalability, this approach can be adapted to other complex domains, providing a pathway for future developments in AI. The robust framework encourages further exploration into self-evolving critic models, presenting an avenue for weak-to-strong supervision that could be pivotal in shaping next-generation scalable LLM oversight.
In conclusion, DeepCritic sets a precedent for substantially improving LLM critique capability through structured, deliberate analysis, yielding more accurate judgments and more useful feedback for refining LLM outputs. The work contributes a concrete methodology for critique-based evaluation and points toward more scalable, automated LLM supervision.