Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning (2410.06508v1)

Published 9 Oct 2024 in cs.LG and cs.CL

Abstract: Monte Carlo Tree Search (MCTS) has recently emerged as a powerful technique for enhancing the reasoning capabilities of LLMs. Techniques such as SFT or DPO have enabled LLMs to distill high-quality behaviors from MCTS, improving their reasoning performance. However, existing distillation methods underutilize the rich trajectory information generated by MCTS, limiting the potential for improvements in LLM reasoning. In this paper, we propose AlphaLLM-CPL, a novel pairwise training framework that enables LLMs to self-improve through MCTS behavior distillation. AlphaLLM-CPL efficiently leverages MCTS trajectories via two key innovations: (1) AlphaLLM-CPL constructs stepwise trajectory pairs from child nodes sharing the same parent in the search tree, providing step-level information for more effective MCTS behavior distillation. (2) AlphaLLM-CPL introduces curriculum preference learning, dynamically adjusting the training sequence of trajectory pairs in each offline training epoch to prioritize critical learning steps and mitigate overfitting. Experimental results on mathematical reasoning tasks demonstrate that AlphaLLM-CPL significantly outperforms previous MCTS behavior distillation methods, substantially boosting the reasoning capabilities of LLMs.

Leveraging Stepwise Knowledge with Curriculum Preference Learning for Self-Improvement in LLMs

The paper introduces AlphaLLM-CPL, a pairwise training framework for LLM self-improvement that distills the behavior of Monte Carlo Tree Search (MCTS) back into the model. It aims to improve reasoning by training on stepwise trajectory pairs and by ordering those pairs with a strategy termed Curriculum Preference Learning (CPL). The work targets a limitation of existing distillation methods, which underutilize the trajectory information MCTS generates, and reports improved performance on mathematical reasoning tasks.

Key Innovations

Stepwise Trajectory Pairs: Pairs are constructed from child nodes that share the same parent in the search tree, which captures step-level information. This makes fuller use of the data generated during MCTS and addresses the underutilization noted in previous methods.
Curriculum Preference Learning (CPL): CPL dynamically adjusts the training sequence of trajectory pairs in each offline training epoch to prioritize those that are more critical to learning. Pairs are prioritized using both the preference reward gap, derived from the MCTS reward model, and the policy prediction gap, based on LLM output likelihoods.
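
The two ideas above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The Node and Pair classes, the value field, the logprob callback, and the alpha weight are all hypothetical names introduced here. In particular, the priority score (a weighted sum of the two gaps, with larger scores trained first) is an assumption, since this summary does not state how the paper combines them.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable, List


@dataclass
class Node:
    step: str                      # text of the reasoning step at this node
    value: float                   # search value estimate (hypothetical field)
    children: List["Node"] = field(default_factory=list)


@dataclass
class Pair:
    prefix: List[str]              # steps shared by both candidates
    chosen: str                    # sibling step with the higher value
    rejected: str                  # sibling step with the lower value
    reward_gap: float              # value difference between the siblings


def sibling_pairs(root: Node, min_gap: float = 0.0) -> List[Pair]:
    """Build one preference pair per pair of siblings under each parent."""
    pairs = []
    stack = [(root, [root.step])]
    while stack:
        node, prefix = stack.pop()
        for a, b in combinations(node.children, 2):
            hi, lo = (a, b) if a.value >= b.value else (b, a)
            gap = hi.value - lo.value
            if gap > min_gap:      # skip siblings the search cannot separate
                pairs.append(Pair(prefix, hi.step, lo.step, gap))
        for child in node.children:
            stack.append((child, prefix + [child.step]))
    return pairs


def curriculum_order(
    pairs: List[Pair],
    logprob: Callable[[List[str], str], float],
    alpha: float = 0.5,
) -> List[Pair]:
    """Sort pairs by an assumed priority: a weighted sum of the reward gap
    and the gap between the policy's log-likelihoods of the two steps."""
    def priority(p: Pair) -> float:
        policy_gap = abs(logprob(p.prefix, p.chosen) - logprob(p.prefix, p.rejected))
        return alpha * p.reward_gap + (1.0 - alpha) * policy_gap

    return sorted(pairs, key=priority, reverse=True)
```

In an offline training loop, curriculum_order would presumably be called again at the start of each epoch with likelihoods from the current policy, so the ordering changes as the model learns. The logprob callback stands in for whatever scoring routine the training stack provides.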

Experimental Results

Experiments on GSM8K and MATH show that AlphaLLM-CPL delivers substantial improvements over previous MCTS behavior distillation methods. For instance, it improves the base models LLaMA2-7B and Mistral-7B on GSM8K by 150% and 48.8%, respectively. On the more challenging MATH dataset, the framework achieves up to a 17.4% improvement, indicating that it also helps on harder reasoning problems.
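
The summary does not say whether these percentages are relative or absolute. A 150% gain cannot be an absolute accuracy difference, so the sketch below assumes all three are relative improvements over each base model's accuracy. The symbols a_base and a_CPL are notation introduced here, not taken from the paper.

```latex
\[
\text{relative gain}
  = \frac{a_{\mathrm{CPL}} - a_{\mathrm{base}}}{a_{\mathrm{base}}} \times 100\%
\]
```

Under this reading, a 150% gain means accuracy after training is 2.5 times the base accuracy, and a 48.8% gain means it is 1.488 times the base accuracy.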

Theoretical and Practical Implications

AlphaLLM-CPL demonstrates that integrating stepwise knowledge and CPL into LLM training can markedly improve reasoning abilities. Theoretically, it suggests a shift from relying only on final trajectories to also learning from intermediate decision steps, which gives a finer-grained training signal. Practically, it reduces dependence on expensive, large-scale supervised data or on stronger LLMs for fine-tuning, offering a cost-effective path toward model improvement.

Future Directions

The success of AlphaLLM-CPL opens avenues for future research in several areas:

Extended Applications: While current experiments focus on mathematical reasoning, the framework's applicability to broader NLP tasks involving complex decision-making can be explored.
Optimized Search Algorithms: Further refining the process of MCTS or integrating it with other heuristic search methods could yield additional performance gains.
Dynamic Curriculum Strategies: Developing more sophisticated CPL algorithms that dynamically adapt to LLM learning stages might provide deeper insights into the optimization of model training.

In summary, this research contributes an effective approach for enhancing LLMs by focusing on stepwise learning and curriculum-based preference training. Its findings are pertinent for advancing self-improvement in AI models, facilitating their application in increasingly complex reasoning tasks.

Authors (8)

Xiyao Wang (26 papers)
Linfeng Song (76 papers)
Ye Tian (190 papers)
Dian Yu (78 papers)
Baolin Peng (72 papers)
Haitao Mi (56 papers)
Furong Huang (150 papers)
Dong Yu (328 papers)
