Enhancing Reasoning in LLMs through Iterative Preference Learning and Monte Carlo Tree Search
In the paper "Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning," the authors propose a methodology for advancing the reasoning capabilities of LLMs. The work integrates Monte Carlo Tree Search (MCTS) with an iterative preference learning framework, drawing inspiration from AlphaZero's use of search-guided self-improvement. The distinctive element of the strategy is that preference data is gathered through MCTS itself, with the aim of improving LLMs' alignment with human-like reasoning and decision making.
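At a high level, the method alternates between MCTS-based collection of step-level preference data and a preference-optimization update of the policy. The sketch below illustrates only that outer loop under assumed interfaces; run_mcts, extract_step_preferences, and dpo_update are hypothetical placeholders for the paper's components, not its actual code.

```python
from typing import Callable, Iterable, List, Tuple

def iterative_preference_learning(
    policy,
    prompts: Iterable[str],
    run_mcts: Callable,                  # (policy, prompt) -> search tree
    extract_step_preferences: Callable,  # tree -> [(preferred_step, dispreferred_step), ...]
    dpo_update: Callable,                # (policy, preference pairs) -> updated policy
    num_iterations: int = 3,
):
    """Outer loop: gather on-policy step-level preferences with MCTS,
    update the policy with DPO, then repeat with the improved policy."""
    for _ in range(num_iterations):
        pairs: List[Tuple[str, str]] = []
        for prompt in prompts:
            tree = run_mcts(policy, prompt)               # look-ahead exploration
            pairs.extend(extract_step_preferences(tree))  # step-level preference signals
        policy = dpo_update(policy, pairs)                # on-policy preference update
    return policy
```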
Methodology and Implementation
The authors introduce an approach that diverges from offline training against a static reward model by employing a dynamic, iterative process. The methodology decomposes instance-level rewards into granular step-level signals, enabled by the look-ahead and exploratory capabilities of MCTS. The framework then uses Direct Preference Optimization (DPO) to update the LLM policy on the newly collected preference data. This marks a shift from conventional Reinforcement Learning from Human Feedback (RLHF) pipelines toward continual improvement through on-policy data sampling.
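For the update step, the standard DPO objective can be applied directly to the (preferred, dispreferred) step pairs produced by the search. The snippet below is a minimal PyTorch sketch of that objective, not the authors' implementation; the per-sequence log-probabilities are assumed to be precomputed and summed over tokens.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over batches of (chosen, rejected) sequences.

    Each tensor holds the summed token log-probabilities of a sequence
    (e.g. a reasoning prefix plus one candidate step) under either the
    current policy or the frozen reference model; beta controls how far
    the policy may drift from the reference.
    """
    # Implicit rewards are the log-ratios between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred step's implicit reward above the dispreferred one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```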
The iterative preference learning process refines the model's intermediate reasoning steps by combining outcome validation with self-evaluation. On the theoretical side, the paper emphasizes the need for continuous policy refinement, arguing that both ongoing preference data acquisition and dynamic policy adaptation are critical to improving LLM performance.
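One simple way to picture the combination of outcome validation and self-evaluation is as a blended score for each candidate step that the search can back up through the tree. The sketch below is a hypothetical illustration: the blending weight and the clamping are assumptions, not values taken from the paper.

```python
from typing import Optional

def step_value(outcome_correct: Optional[bool],
               self_eval_score: float,
               mix_weight: float = 0.5) -> float:
    """Illustrative value of a reasoning step for use inside the search.

    outcome_correct: whether the final answer reached from this step matches
        the reference answer (None if no rollout has terminated yet).
    self_eval_score: the model's own confidence that the step is correct,
        e.g. the probability it assigns to a "correct" verdict when asked
        to critique the step.
    mix_weight: assumed blending coefficient between the two signals.
    """
    self_eval = max(0.0, min(1.0, self_eval_score))   # keep the score in [0, 1]
    if outcome_correct is None:
        return self_eval                              # fall back to self-evaluation
    outcome = 1.0 if outcome_correct else 0.0
    return mix_weight * outcome + (1.0 - mix_weight) * self_eval
```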
Experimental Evaluation and Results
The paper reports comprehensive experiments on arithmetic and commonsense reasoning benchmarks, including GSM8K, MATH, and SciQ, showing consistent accuracy gains over existing models. Notably, the proposed approach improves accuracy by 4.8% on GSM8K, 3.3% on MATH, and 7.7% on SciQ over the Mistral-7B supervised fine-tuning (SFT) baseline. These results underscore the efficacy of iterative preference learning with MCTS acting as a policy improvement operator.
Additionally, the research examines the trade-off between spending compute at training time and at inference time, offering guidance on how to allocate resources to maximize performance gains.
Theoretical Implications and Future Directions
The research makes a compelling case that on-policy sampled data is essential for self-improving training. The paper underscores the theoretical advantages of online learning frameworks over methods that depend on static, pre-collected data, motivating a shift toward more adaptive and evolving preference learning paradigms for LLMs.
The paper points toward several avenues for future research, including exploring diverse policy-checkpoint sampling strategies and refining MCTS parameters to increase the diversity of training data. Future work could also investigate integrating reward model signals within MCTS to further improve performance.
Conclusion
Overall, this paper contributes significantly to the field of AI and LLMs by illustrating the benefits of integrating MCTS into iterative preference learning. Through a methodical and theoretically grounded approach, the authors enhance the reasoning capabilities of LLMs, setting a foundation for future explorations in dynamically aligning AI systems with complex human preferences and decision-making processes. The findings hold promise for advancing LLMs' efficacy in diverse reasoning tasks, steering the AI community towards more sophisticated and human-aligned models.