
Enhancing LLM Reasoning with Reward-guided Tree Search

Published 18 Nov 2024 in cs.CL and cs.AI | arXiv:2411.11694v4

Abstract: Recently, test-time scaling has garnered significant attention from the research community, largely due to the substantial advancements of the o1 model released by OpenAI. By allocating more computational resources during the inference phase, large language models (LLMs) can extensively explore the solution space by generating more thought tokens or diverse solutions, thereby producing more accurate responses. However, developing an o1-like reasoning approach is challenging, and researchers have been making various attempts to advance this open area of research. In this paper, we present a preliminary exploration into enhancing the reasoning abilities of LLMs through reward-guided tree search algorithms. This framework is implemented by integrating the policy model, reward model, and search algorithm. It is primarily constructed around a tree search algorithm, where the policy model navigates a dynamically expanding tree guided by a specially trained reward model. The implemented framework is denoted as **STILL-1**. We thoroughly explore various design considerations necessary for implementing this framework and provide a detailed report of the technical aspects. To assess the effectiveness of our approach, we focus on mathematical reasoning tasks and conduct extensive evaluations on four challenging datasets, significantly enhancing the reasoning abilities of LLMs.

Summary

  • The paper introduces a reward-guided tree search framework that enhances reasoning in LLMs for mathematical tasks.
  • It integrates policy models, reward feedback, and Monte Carlo Tree Search to optimize exploration of reasoning paths.
  • Experimental evaluations on multiple benchmarks show significant improvements in reasoning accuracy over baseline methods.

The paper discusses a framework that improves reasoning in LLMs using a reward-guided tree search mechanism. This approach offers a structured method to expand the reasoning capabilities of LLMs by leveraging search algorithms, a carefully designed reward model, and a policy model. The study concentrates on mathematical reasoning tasks, evaluating the framework on multiple challenging datasets.

Introduction

Advancements in LLMs have significantly improved their capabilities across various domains. However, they still face challenges in complex reasoning tasks prevalent in STEM fields. The paper introduces a framework integrating a policy model, a reward model, and a search algorithm centered around tree search to enhance reasoning. This approach allows LLMs to explore reasoning spaces more comprehensively by guiding the policy model through reward feedback within a tree structure.

Methodology

The framework consists of three parts:

  1. Policy Model: The LLM adapts to tree-structured reasoning formats through supervised instruction tuning. Data synthesized from a high-capacity LLM refines the reasoning process of the policy model under a reward-guided mechanism.
  2. Reward Model: The reward model evaluates reasoning paths to guide the policy model. It employs outcome-supervised, generative, and scoring-based methodologies to provide feedback critical for enhancing reasoning trajectories.
  3. Search Algorithm: Employing Monte Carlo Tree Search (MCTS), the model navigates dynamically through reasoning spaces, balancing exploration and exploitation. Enhancement strategies, such as pre-expansion and self-consistency checks, optimize search performance.
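The interplay of the three components can be sketched as a reward-guided MCTS loop: the policy proposes candidate next steps, the reward model scores partial reasoning paths, and the search balances exploration against exploitation. The sketch below is a toy illustration, not the paper's implementation; `expand` and `reward` are hypothetical stand-ins for the trained policy and reward models.

```python
import math
import random

class Node:
    """A node in the reasoning tree: a partial path of reasoning steps."""
    def __init__(self, state, parent=None):
        self.state = state          # list of reasoning steps so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0            # accumulated reward

def ucb(node, c=1.4):
    """Upper-confidence bound: trade off average reward vs. under-exploration."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root, expand, reward, iterations=300):
    for _ in range(iterations):
        # 1. Selection: descend via UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expansion: ask the (stand-in) policy for candidate next steps.
        for step in expand(node.state):
            node.children.append(Node(node.state + [step], parent=node))
        leaf = random.choice(node.children) if node.children else node
        # 3. Evaluation: score the partial path with the (stand-in) reward model.
        r = reward(leaf.state)
        # 4. Backpropagation: push the reward up to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += r
            leaf = leaf.parent
    # Return the most-visited first step's path prefix.
    return max(root.children, key=lambda n: n.visits).state
```

In the paper's setting the expansion would sample reasoning steps from the tuned policy model and the evaluation would query the trained reward model; here both are left as plain callables so the control flow stays visible.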

Experimental Evaluation

The framework was tested on four benchmarks: MATH-OAI, GSM-Hard, OlympiadBench, and College Math. Results demonstrate notable improvements over baseline methods, particularly in reasoning accuracy. The iterative refinement of both the policy and reward models through selection strategies resulted in mutual enhancement, confirming the effectiveness of the framework.

Figure 1: An overview of the implemented reasoning framework in this work.

Detailed Analysis

Policy Model Analysis

Data synthesis from high-capacity LLMs significantly improved model adaptation to the reasoning format. The adequacy and quality of synthetic data from domain-specific and general-purpose models were critical. Policy model training with optimized preference data further enhanced reasoning accuracy, verified through diverse reasoning paths generated during test phases.
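One simple way to exploit the diverse reasoning paths generated at test time is best-of-N selection: sample several candidate solutions and keep the one the reward model scores highest. The sketch below is a generic illustration, not the paper's pipeline; the sampler and scorer are hypothetical stand-ins.

```python
import random

def best_of_n(sample_path, score_path, n=8):
    """Sample n candidate reasoning paths; return the highest-scoring one."""
    candidates = [sample_path() for _ in range(n)]
    return max(candidates, key=score_path)

# Toy usage: "paths" are random floats and the reward is the value itself.
random.seed(0)
best = best_of_n(lambda: random.random(), lambda p: p, n=8)
```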

Reward Model Design

Implementing an outcome-supervised, generative model proved effective in both outcome-level and step-level supervision. Active learning strategies allowed the model to discern significant positive and negative samples, crucial for training efficacy. Various design considerations, such as discriminative versus generative approaches and ranking-based versus scoring-based models, were explored to optimize feedback quality.

Figure 2: Performance of policy model on four benchmarks by using different amounts of training data for reasoning format adaptation. "CoT" means the CoT generation with greedy search.
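The contrast between outcome-level and step-level supervision can be made concrete with a small aggregation sketch. Assuming a hypothetical scorer that assigns each reasoning step a correctness probability, outcome-level supervision looks only at the final step, while one common step-level choice penalizes the weakest step (the paper's exact aggregation may differ).

```python
def outcome_reward(step_scores):
    """Outcome-level supervision: judge the path by its final answer only."""
    return step_scores[-1]

def step_reward(step_scores):
    """Step-level supervision: a path is only as strong as its weakest step."""
    return min(step_scores)

# Hypothetical per-step correctness scores for one reasoning path.
scores = [0.9, 0.3, 0.8]
```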

Search Algorithm Optimization

The choice of search algorithm significantly impacted reasoning performance. MCTS and its guided variant MCTS_G were particularly effective due to their exploration capabilities. The integration of self-consistency mechanisms and domain-specific tools, such as calculators for arithmetic error checks, helped maintain logical consistency in the reasoning paths.

Figure 3: Self-consistency results with and without optimization using our training method.
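The self-consistency mechanism can be sketched as a majority vote over the final answers of independently sampled reasoning paths; this is a minimal illustration, not the paper's exact check.

```python
from collections import Counter

def self_consistency(final_answers):
    """Return the final answer most of the sampled reasoning paths agree on."""
    return Counter(final_answers).most_common(1)[0][0]

# Toy usage: three of four sampled paths end in "42".
majority = self_consistency(["42", "41", "42", "42"])
```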

Figure 4: Effect of search algorithms and pre-expansion depth on model performance.

Conclusion

The reward-guided tree search framework offers a structured way to enhance reasoning in LLMs. By integrating advanced training, analysis, and search capabilities, this framework successfully addresses complex reasoning challenges in mathematical domains. Future work should focus on refining this framework to extend its applicability and efficiency in real-world reasoning tasks while further exploring scalable strategies for LLM refinement.
