LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning (2410.02884v2)

Published 3 Oct 2024 in cs.AI and cs.CL

Abstract: This paper presents an advanced mathematical problem-solving framework, LLaMA-Berry, for enhancing the mathematical reasoning ability of LLMs. The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path and utilizes a pairwise reward model to evaluate different paths globally. By leveraging the self-critic and rewriting capabilities of LLMs, Self-Refine applied to MCTS (SR-MCTS) overcomes the inefficiencies and limitations of conventional step-wise and greedy search algorithms by fostering a more efficient exploration of solution spaces. Pairwise Preference Reward Model (PPRM), inspired by Reinforcement Learning from Human Feedback (RLHF), is then used to model pairwise preferences between solutions, utilizing an Enhanced Borda Count (EBC) method to synthesize these preferences into a global ranking score to find better answers. This approach addresses the challenges of scoring variability and non-independent distributions in mathematical reasoning tasks. The framework has been tested on general and advanced benchmarks, showing superior performance in terms of search efficiency and problem-solving capability compared to existing methods like ToT and rStar, particularly in complex Olympiad-level benchmarks, including GPQA, AIME24 and AMC23.

LLaMA-Berry: Advancements in Olympiad-Level Mathematical Reasoning

The paper "LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning" presents a comprehensive framework designed to enhance the mathematical reasoning capabilities of LLMs, emphasizing the resolution of complex, Olympiad-caliber mathematical problems. In essence, the work elaborates on the integration of Monte Carlo Tree Search (MCTS) with an iterative Self-Refine mechanism to optimize and refine the reasoning paths employed by LLMs.

Key Contributions

The framework, known as LLaMA-Berry, combines MCTS with a novel Self-Refine (SR) methodology. SR-MCTS improves upon traditional step-wise and greedy search paradigms by exploring the solution space more efficiently: rather than extending a partial chain of reasoning one step at a time, each search node holds a complete candidate solution that is critiqued and rewritten. This addresses inefficiencies commonly encountered in conventional search algorithms.
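To make the solution-level search concrete, here is a minimal sketch of an MCTS loop whose expansion step is a Self-Refine rewrite. It is illustrative only: `critique_and_rewrite` and `evaluate` are hypothetical stand-ins for the LLM's critique-and-rewrite call and a PPRM-derived quality score, and the standard UCT selection rule is used as a generic choice rather than the paper's exact formulation.

```python
# A minimal sketch of SR-MCTS: each tree node holds a *complete* candidate
# solution, and "expansion" rewrites it via Self-Refine rather than appending
# a single reasoning step.
import math
import random

class Node:
    def __init__(self, solution, parent=None):
        self.solution = solution   # full solution text, not a partial step
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0           # accumulated quality estimates

def critique_and_rewrite(solution):
    """Hypothetical LLM Self-Refine call: critique the solution, then rewrite it."""
    return solution + " [refined]"

def evaluate(solution):
    """Hypothetical PPRM-derived quality score in [0, 1]; stubbed here."""
    return random.random()

def uct(child, parent, c=1.4):
    # Standard upper-confidence bound balancing exploitation and exploration.
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def sr_mcts(initial_solution, iterations=50, max_children=3):
    root = Node(initial_solution)
    for _ in range(iterations):
        # 1. Selection: descend by UCT until a node with room to expand.
        node = root
        while len(node.children) >= max_children:
            node = max(node.children, key=lambda ch: uct(ch, node))
        # 2. Expansion: Self-Refine produces a new full solution as a child.
        child = Node(critique_and_rewrite(node.solution), parent=node)
        node.children.append(child)
        # 3. Evaluation: score the rewritten solution.
        reward = evaluate(child.solution)
        # 4. Backpropagation: update statistics up to the root.
        while child is not None:
            child.visits += 1
            child.value += reward
            child = child.parent
    # Return the most-visited solution as the final answer.
    best = max(root.children, key=lambda ch: ch.visits, default=root)
    return best.solution

print(sr_mcts("Initial attempt at the problem."))
```

The key departure from step-wise search is visible in the expansion step: a child node is a full rewritten solution, so every node in the tree is a scoreable answer in its own right.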

Additional Methodologies

A significant enhancement in the LLaMA-Berry framework is the Pairwise Preference Reward Model (PPRM). Inspired by Reinforcement Learning from Human Feedback (RLHF), PPRM evaluates solutions by modeling pairwise preferences between them rather than assigning each an absolute score. An Enhanced Borda Count (EBC) method then aggregates these local preferences into a global ranking, which sidesteps the scoring variability and non-independent output distributions that complicate absolute scoring in mathematical reasoning tasks.
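As a rough illustration, the sketch below aggregates pairwise preferences into a global ranking with a plain Borda count. The `prefer` function is a hypothetical stand-in for the trained PPRM (stubbed here with a trivial length heuristic), and the paper's Enhanced variant additionally completes and reconciles the preference matrix (e.g., resolving intransitive cycles), which this basic version omits.

```python
# A minimal sketch of pairwise ranking with Borda-style aggregation, in the
# spirit of PPRM + Enhanced Borda Count.
from itertools import combinations

def prefer(a, b):
    """Hypothetical PPRM call: returns P(solution a is better than b).
    Stubbed with a length heuristic purely for the sketch."""
    return 1.0 if len(a) > len(b) else 0.0

def borda_rank(solutions):
    # Each pairwise "win" awards one Borda point; a solution's global score
    # is its total number of wins across all comparisons.
    wins = {s: 0 for s in solutions}
    for a, b in combinations(solutions, 2):
        if prefer(a, b) >= 0.5:
            wins[a] += 1
        else:
            wins[b] += 1
    return sorted(solutions, key=lambda s: wins[s], reverse=True)

candidates = [
    "x = 2",
    "x = 2 because 2 + 2 = 4",
    "x = 2; substituting back, 2 + 2 = 4, so it checks out",
]
print(borda_rank(candidates))
```

Under the stub heuristic this ranks the most elaborated candidate first; with the real reward model, the ranking would instead reflect learned solution quality, and the global score can guide the tree search toward better answers.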

Evaluation and Benchmarking

The paper provides empirical evidence of LLaMA-Berry's performance on both general and advanced mathematical problem-solving benchmarks. Compared with established methods such as Tree of Thoughts (ToT) and rStar, LLaMA-Berry demonstrates notable improvements in search efficiency and problem-solving accuracy. This is particularly evident on Olympiad-level benchmarks such as GPQA, AIME24, and AMC23, where its approach yielded more efficient and precise outcomes.

Implications and Future Directions

The enhancements introduced by LLaMA-Berry have significant theoretical and practical implications. The development of a robust framework for Olympiad-level mathematical problem-solving not only advances the utility of LLMs in specialized domains but also contributes to the broader body of knowledge on applying AI to complex reasoning tasks.

Future work could explore the extension of these methodologies to other domains where complex reasoning and decision-making are critical. This includes formalizing additional refinements to the SR-MCTS process and further enhancing the PPRM's efficacy across diverse problem categories. Such endeavors hold promising potential for enhancing AI systems' reasoning capabilities in both academic and real-world applications.

Authors (12)
  1. Di Zhang
  2. Jianbo Wu
  3. Jingdi Lei
  4. Tong Che
  5. Jiatong Li
  6. Tong Xie
  7. Xiaoshui Huang
  8. Shufei Zhang
  9. Marco Pavone
  10. Yuqiang Li
  11. Wanli Ouyang
  12. Dongzhan Zhou