LLM Self-Training via Process Reward Guided Tree Search
The paper introduces an innovative self-training mechanism for LLMs that integrates process reward guidance with tree search. The approach, referred to as ReST-MCTS*, aims to improve the reasoning capabilities of LLMs by automatically generating high-quality reasoning traces and per-step values, which are then used to train both the policy and the reward model. This diverges from traditional self-training methods, which typically rely on manual per-step annotations to train process reward models, making ReST-MCTS* a more scalable and efficient solution.
Methodology
The proposed framework leverages a variant of Monte Carlo Tree Search (MCTS), named MCTS*, to explore reasoning paths in a structured and strategic manner. Its primary components are an initial policy model (the LLM), a process reward model (PRM), and iterative self-training of both models on the generated high-quality data. The key steps in this methodology are outlined as follows:
- Tree Search with Process Reward Model (PRM) Guidance: The main innovation lies in using the PRM to guide the tree search. The search generates multiple candidate solution steps and evaluates them with the PRM, which assigns a reward reflecting the quality of each step (see the search sketch after this list). The PRM is initially trained on a dataset of correct step-by-step solutions to questions.
- Reward Inference: Once sufficient data has been collected from the explored reasoning traces, the PRM infers a reward for each step within those traces. This inferred data is then used to further refine the PRM, improving its accuracy in evaluating future reasoning steps.
- Self-Critic Mechanism: To enhance the decision-making process during tree search, a self-critic mechanism is employed. The mechanism allows the LLM to provide feedback and suggestions for the next steps or signal the end of the reasoning process, ensuring the tree search remains efficient and focused.
- Backpropagation and Update: As in standard MCTS, the values obtained from the PRM are backed up through the tree, updating node statistics and adjusting the subsequent search accordingly.
- Iterative Self-Training: This process iteratively improves both the policy and the reward model. In each iteration, the models are trained on the newly generated data, continuously raising their performance and the quality of subsequent reasoning traces (a sketch of this outer loop follows the search example below).
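To make the search loop above concrete, here is a minimal Python sketch of a process-reward-guided tree search. It is an illustration under simplifying assumptions, not the paper's implementation: `policy_propose_steps`, `prm_score`, and `is_terminal` are hypothetical stand-ins for the policy LLM, the PRM, and the self-critic's stop signal, and the UCT-style selection rule is a generic choice rather than the paper's exact formula.

```python
import math
import random
from dataclasses import dataclass, field

# Hypothetical stand-ins (assumptions, not the paper's API):
# - policy_propose_steps: the policy LLM proposing candidate next reasoning steps
# - prm_score: the process reward model scoring a partial trace in [0, 1]
# - is_terminal: the self-critic deciding whether the trace is complete

def policy_propose_steps(trace, k=3):
    return [f"{trace[-1] if trace else 'start'} -> step{random.randint(0, 999)}" for _ in range(k)]

def prm_score(trace):
    return random.random()  # placeholder for the PRM's per-step quality value

def is_terminal(trace):
    return len(trace) >= 5  # placeholder for the self-critic's stop signal

@dataclass
class Node:
    trace: list                      # reasoning steps from the root to this node
    parent: "Node" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0           # accumulated PRM values backed up through this node

    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select(node, c=1.4):
    """Descend the tree with a UCT-style rule until reaching an unexpanded node."""
    while node.children:
        node = max(
            node.children,
            key=lambda ch: ch.value() + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1)),
        )
    return node

def expand(node):
    """Ask the policy for candidate next steps and attach them as children."""
    if is_terminal(node.trace):
        return node
    for step in policy_propose_steps(node.trace):
        node.children.append(Node(trace=node.trace + [step], parent=node))
    return random.choice(node.children)

def backpropagate(node, reward):
    """Back up the PRM value along the path to the root, as in standard MCTS."""
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent

def search(question, n_simulations=50):
    root = Node(trace=[question])
    for _ in range(n_simulations):
        leaf = select(root)
        child = expand(leaf)
        reward = prm_score(child.trace)   # PRM-guided evaluation of the partial trace
        backpropagate(child, reward)
    best = max(root.children, key=lambda ch: ch.visits) if root.children else root
    return best.trace
```

The essential point is that node values come from the PRM's per-step scores rather than only from final-answer correctness, so the backup in `backpropagate` steers the search toward high-quality intermediate steps.

The iterative self-training loop that consumes the search output can be summarized as below; `run_guided_search`, `finetune_policy`, and `finetune_prm` are hypothetical helpers, and the actual filtering thresholds, losses, and training recipe are defined in the paper, not in this sketch.

```python
def self_training_iteration(questions, policy, prm,
                            run_guided_search, finetune_policy, finetune_prm):
    """One round of the iterative scheme: collect traces, then update both models."""
    policy_data, prm_data = [], []
    for q in questions:
        # Tree search with PRM guidance yields a solution trace plus per-step values.
        trace, step_values = run_guided_search(q, policy, prm)
        policy_data.append((q, trace))               # high-quality traces for the policy
        prm_data.extend(zip(trace, step_values))     # (step, inferred value) pairs for the PRM
    policy = finetune_policy(policy, policy_data)
    prm = finetune_prm(prm, prm_data)
    return policy, prm
```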
Numerical Results and Comparisons
The paper provides empirical evidence of the efficacy of ReST-MCTS* across several benchmarks, including the SciBench and MATH datasets. Key findings include:
- ReST-MCTS* significantly outperforms existing self-training methods such as ReST and Self-Rewarding LM in accuracy on complex reasoning tasks.
- Under the same search budget, the MCTS* search delivers higher accuracy than baselines such as Self-Consistency (SC) and Best-of-N (BoN); a brief sketch of these baselines follows this list.
- The iterative self-training approach demonstrates consistent improvements in model performance over multiple iterations.
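For reference, the two sampling baselines mentioned above can be sketched generically. `sample_solutions`, `extract_answer`, and `score_solution` are hypothetical callables for sampling complete solutions from the LLM, parsing final answers, and scoring a solution with a reward model; the snippet reflects the standard definitions of SC and BoN rather than the paper's exact setup.

```python
from collections import Counter

def self_consistency(question, sample_solutions, extract_answer, n=16):
    """Self-Consistency (SC): sample n full solutions and majority-vote on the final answer."""
    answers = [extract_answer(sol) for sol in sample_solutions(question, n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(question, sample_solutions, score_solution, n=16):
    """Best-of-N (BoN): sample n full solutions and keep the highest-scoring one."""
    solutions = sample_solutions(question, n)
    return max(solutions, key=score_solution)
```

Under a fixed budget of n sampled solutions, SC aggregates by majority vote over final answers while BoN keeps the single highest-scoring solution; the comparison in the paper holds the search budget constant across methods.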
Practical and Theoretical Implications
The approach discussed in the paper holds considerable potential for advancing the self-training mechanisms of LLMs. The following implications can be inferred:
- Scalability: By automating reward generation and incorporating a robust tree search strategy, ReST-MCTS* offers a scalable way to improve LLMs, especially on tasks requiring deep reasoning and complex problem-solving.
- Enhanced Reasoning: The integration of process reward guidance ensures that the intermediate steps in reasoning traces contribute meaningfully to the final solution, reducing the occurrence of incorrect traces that might still lead to correct answers by chance.
- Generalizability: While the paper primarily focuses on science and math problems, the underlying principles of this approach are applicable to a broader range of reasoning tasks, potentially including code generation, theorem proving, and even more abstract domains such as conversational agents and strategic planning.
Future Developments
There are several avenues for future research and development that can build upon the findings of this paper:
- Broader Application: Extending the methodology to other reasoning-intensive tasks beyond math and science to assess its universality and adaptability.
- Larger and Diverse Models: Exploring the impact of scaling up the models involved in the process, as well as incorporating a wider variety of initial training datasets to cover diverse domains more effectively.
- Real-time Adaptation: Implementing online learning techniques that allow the models to adapt in real time to changing data and tasks, enhancing the practical usability of ReST-MCTS* in real-world applications.
In conclusion, the paper presents a robust and scalable approach to self-training LLMs, demonstrating superior performance over traditional methods and highlighting promising directions for future research in enhancing the cognitive and reasoning capabilities of LLMs.