
Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision (2411.16579v1)

Published 25 Nov 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Training LLMs to spend more time thinking and reflection before responding is crucial for effectively solving complex reasoning tasks in fields such as science, coding, and mathematics. However, the effectiveness of mechanisms like self-reflection and self-correction depends on the model's capacity to accurately assess its own performance, which can be limited by factors such as initial accuracy, question difficulty, and the lack of external feedback. In this paper, we delve into a two-player paradigm that separates the roles of reasoning and critique models, where the critique model provides step-level feedback to supervise the reasoning (actor) model during both test-time and train-time. We first propose AutoMathCritique, an automated and scalable framework for collecting critique data, resulting in a dataset of $76,321$ responses paired with step-level feedback. Fine-tuning LLMs with this dataset enables them to generate natural language feedback for mathematical reasoning. We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time, especially when scaling up inference-time computation. Motivated by these findings, we introduce the critique-based supervision to the actor's self-training process, and propose a critique-in-the-loop self-improvement method. Experiments show that the method improves the actor's exploration efficiency and solution diversity, especially on challenging queries, leading to a stronger reasoning model. Lastly, we take the preliminary step to explore training self-talk reasoning models via critique supervision and showcase its potential. Our code and datasets are at \href{https://mathcritique.github.io/}{https://mathcritique.github.io/}.

Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision

The paper under discussion presents a method to augment the reasoning capabilities of LLMs through the implementation of critique models providing feedback during both test-time and training-time. The approach distinguishes itself by employing a two-player paradigm, consisting of a reasoning (actor) model and a critique model, where the latter offers step-level supervision to refine complex reasoning tasks, particularly within domains such as science, coding, and mathematics.

Overview of Methodology

The authors introduce a framework called AutoMathCritique, designed to automate the synthesis of critique data. This framework is pivotal in generating a dataset of 76,321 samples with step-level feedback for mathematical reasoning tasks. AutoMathCritique operates without human supervision, constructing flawed reasoning paths via controlled error synthesis to ensure diversity and accuracy of feedback. Critique generation follows, where annotator models label the flaws and provide constructive feedback. A filtering process further ensures only high-quality critiques are retained.
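The three stages described above (controlled error synthesis, critique generation, and filtering) can be sketched as a toy pipeline. This is a minimal illustration, not the paper's implementation: the actor and annotator are trivial stand-ins for LLM calls, and every function name here is hypothetical.

```python
import random

def actor_solve(question):
    # Stand-in for sampling a correct step-by-step solution from the actor.
    a, b = question
    return [f"add {a} and {b}", f"result is {a + b}"]

def inject_error(steps):
    # Controlled error synthesis: corrupt one randomly chosen step.
    idx = random.randrange(len(steps))
    flawed = steps.copy()
    flawed[idx] = flawed[idx] + " (off by one)"
    return flawed, idx

def annotate(question, flawed_steps):
    # Stand-in for the annotator model: locate the flaw, give feedback.
    for i, step in enumerate(flawed_steps):
        if "off by one" in step:
            return {"error_step": i, "feedback": f"Step {i} is incorrect; redo it."}
    return None

def build_dataset(questions, seed=0):
    random.seed(seed)
    data = []
    for q in questions:
        steps = actor_solve(q)
        flawed, idx = inject_error(steps)          # stage 1: error synthesis
        critique = annotate(q, flawed)             # stage 2: critique generation
        # Stage 3: filtering -- keep only critiques that localize the injected error.
        if critique and critique["error_step"] == idx:
            data.append({"question": q, "steps": flawed, "critique": critique})
    return data
```

In the actual framework the annotator is itself an LLM and filtering checks critique quality, but the control flow follows the same shape.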

To integrate critique models into training, the paper presents a critique-in-the-loop self-improvement procedure aimed at enhancing the exploration efficiency of the actor model. By supervising the actor's reasoning at both test-time and training-time, the critique models improve solution diversity and optimization, particularly on complex queries. The paper also evaluates test-time scaling and compute-allocation strategies, which consistently improve majority-voting and final-answer accuracy.
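The critique-in-the-loop procedure can be sketched as follows: the actor samples a solution, the critique model supplies step-level feedback on flawed attempts, the actor retries under that supervision, and only verified solutions enter the next fine-tuning round. All model calls below are trivial stand-ins and the names are illustrative, not the paper's API.

```python
def actor_sample(question, hint=None):
    a, b = question
    # With a critique hint the actor "corrects" itself; without one it may err.
    return a + b if hint else a + b + (1 if a % 2 else 0)

def critique(question, answer):
    # Stand-in critique model: returns step-level feedback, or None if correct.
    a, b = question
    return None if answer == a + b else "re-check the addition step"

def self_improve_round(questions):
    """One round of critique-in-the-loop self-improvement."""
    training_pairs = []
    for q in questions:
        ans = actor_sample(q)
        fb = critique(q, ans)
        if fb is not None:                   # feedback triggers a supervised retry
            ans = actor_sample(q, hint=fb)
        if critique(q, ans) is None:         # keep only verified solutions
            training_pairs.append((q, ans))
    return training_pairs                    # data for the next fine-tuning step
```

The key design choice mirrored here is that critique feedback steers exploration toward hard queries the actor initially fails, rather than merely filtering its outputs.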

Key Findings and Implications

A significant finding is that integrating critique models at test-time not only aids in correcting errors but also enhances the reasoning performance ceiling when scaling inference-time computation. This suggests a potential trajectory for refining reasoning models to tackle queries with varying difficulty levels more efficiently. The experimental results demonstrate that feedback from critique models aids in overcoming reasoning bottlenecks experienced with more complex queries, which is pivotal for tasks demanding higher accuracy in real-world applications.
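One way to picture critique-guided test-time scaling is majority voting over sampled candidates, with the critique model pruning answers it rejects before the vote. This is a schematic sketch under assumed stand-in functions (a deterministic sampler and a toy verifier), not the paper's exact inference procedure.

```python
from collections import Counter

def sample_answer(question, i):
    # Deterministic stand-in for stochastic sampling: wrong on every third draw.
    a, b = question
    return a + b + (1 if i % 3 == 0 else 0)

def critique_accepts(question, answer):
    # Stand-in for the critique model acting as a verifier.
    a, b = question
    return answer == a + b

def vote_with_critique(question, n=8):
    candidates = [sample_answer(question, i) for i in range(n)]
    # Prune candidates the critique rejects; fall back to all if none survive.
    kept = [c for c in candidates if critique_accepts(question, c)] or candidates
    return Counter(kept).most_common(1)[0][0]
```

Scaling `n` corresponds to allocating more inference-time computation; the critique filter is what raises the achievable ceiling beyond plain majority voting.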

The development and deployment of automated critique frameworks such as AutoMathCritique represent significant strides in scalability and in reducing the human labor required for dataset curation. Applying such models with step-level supervision during the actor's self-improvement process promises more robust, generalized reasoning capabilities. This has profound implications for AI applications where machine reasoning must be both accurate and deeply insightful, such as automated problem-solving, decision-making, and creative processes.

Future Directions

Future work could emphasize the scalability and adaptability of critique models in reasoning domains beyond mathematics; extending the framework to other areas would further test its efficacy. A deeper exploration of model parameters and architectures tailored specifically to critique tasks could also yield significant performance gains.

Additionally, while the work largely focuses on interactions within the two-player framework, future research could explore the synergistic potential of multi-agent or ensemble-based reasoning frameworks. These could leverage diverse critique perspectives, further improving accuracy and reasoning robustness. Exploring the implications of critique models for collaborative tasks and agentic decision-making could also be a valuable avenue of research, given the increasing deployment of AI in socially interactive and cooperative environments.

Overall, the paper represents a significant step towards the advancement of reasoning models via automated critique and feedback mechanisms, providing valuable insights and frameworks for future innovations in AI reasoning technologies.

Authors (24)
  1. Zhiheng Xi (37 papers)
  2. Dingwen Yang (2 papers)
  3. Jixuan Huang (1 paper)
  4. Jiafu Tang (3 papers)
  5. Guanyu Li (10 papers)
  6. Yiwen Ding (35 papers)
  7. Wei He (188 papers)
  8. Boyang Hong (6 papers)
  9. Shihan Do (1 paper)
  10. Wenyu Zhan (5 papers)
  11. Xiao Wang (507 papers)
  12. Rui Zheng (78 papers)
  13. Tao Ji (28 papers)
  14. Xiaowei Shi (4 papers)
  15. Yitao Zhai (3 papers)
  16. Rongxiang Weng (26 papers)
  17. Jingang Wang (71 papers)
  18. Xunliang Cai (63 papers)
  19. Tao Gui (127 papers)
  20. Zuxuan Wu (144 papers)
Citations (2)