Building Math Agents with Multi-Turn Iterative Preference Learning
Overview
The paper "Building Math Agents with Multi-Turn Iterative Preference Learning" explores how to enhance large language models' (LLMs') ability to solve mathematical problems through multi-turn reasoning and the use of external tools, specifically code interpreters. It presents a multi-turn direct preference learning framework that adapts ideas from Reinforcement Learning from Human Feedback (RLHF) to optimize model performance on tool-integrated mathematical reasoning tasks.
Contributions
The paper makes several key contributions:
- Multi-Turn Direct Preference Learning Framework:
  - It introduces multi-turn direct preference optimization (M-DPO) and multi-turn Kahneman-Tversky optimization (M-KTO), methods designed to handle multi-turn reasoning with tool integration. These methods operate on trajectory-level preferences to refine the model's decision-making based on feedback; a minimal sketch of the M-DPO loss appears after this list.
- Empirical Validation:
  - The framework's efficacy is demonstrated through extensive experiments on multiple base models, showing significant improvements over the supervised fine-tuning (SFT) counterparts. Specifically, the paper reports improved accuracies on the GSM8K and MATH benchmarks: the Gemma-1.1-it-7B model's performance rose from 77.5% to 83.9% on GSM8K and from 46.1% to 51.2% on MATH, while the Gemma-2-it-9B model reached 86.3% on GSM8K and 54.5% on MATH.
Theoretical Foundations
The paper's theoretical foundation formulates the learning task as a Markov decision process (MDP), modifying standard RLHF algorithms to accommodate the complexities of multi-turn interactions and the integration of external tools. It leverages KL-regularized optimization and the resulting Gibbs-distribution form of the optimal policy to iteratively refine policies. This approach ensures that each step in the multi-turn process is optimized against trajectory-level feedback, thus enhancing the model's problem-solving accuracy.
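As a condensed illustration of this construction, the display below gives the per-step KL-regularized objective and its closed-form Gibbs-distribution solution. Here $s_h$ and $a_h$ denote the state and action at turn $h$, $Q^{*}_h$ the optimal KL-regularized value function, $\pi_{\mathrm{ref}}$ the reference (SFT) policy, and $\eta$ the KL-penalty coefficient; the notation is simplified from the paper and omits some details.

```latex
% Per-step KL-regularized objective (notation simplified)
\pi^{*}_h \;=\; \arg\max_{\pi_h}\;
\mathbb{E}_{a_h \sim \pi_h(\cdot \mid s_h)}\!\big[ Q^{*}_h(s_h, a_h) \big]
\;-\; \eta\, \mathrm{KL}\!\big(\pi_h(\cdot \mid s_h)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid s_h)\big)

% Closed-form solution: a Gibbs distribution tilting the reference policy
\pi^{*}_h(a_h \mid s_h) \;\propto\;
\pi_{\mathrm{ref}}(a_h \mid s_h)\,
\exp\!\Big(\tfrac{1}{\eta}\, Q^{*}_h(s_h, a_h)\Big)
```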
Practical Implications
The practical implications of this research are significant. By integrating external tools such as Python code interpreters, the enhanced LLMs can perform complex computations and logical reasoning steps that are typically challenging for models relying on natural-language reasoning alone. This tool usage is crucial in academic contexts and professional domains where accurate mathematical problem-solving is required.
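As a rough illustration of how such a tool-integrated loop can operate, the sketch below alternates between model generation and code execution until the model stops emitting code. The `model` and `executor` interfaces and the `<code>` tag format are hypothetical stand-ins for this sketch, not the paper's actual implementation.

```python
import re

# Hypothetical format: the model wraps executable code in <code> ... </code> tags.
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def solve_with_tool(model, executor, problem, max_turns=6):
    """Minimal multi-turn tool-integrated reasoning loop (illustrative only).

    model.generate(history) is assumed to return the next reasoning step as text;
    executor.run(code) is assumed to run Python in a sandbox and return its stdout.
    Neither interface comes from the paper.
    """
    history = [{"role": "user", "content": problem}]
    for _ in range(max_turns):
        step = model.generate(history)              # reasoning text, possibly containing code
        history.append({"role": "assistant", "content": step})
        match = CODE_RE.search(step)
        if match is None:                           # no code block: treat this step as the final answer
            return step
        observation = executor.run(match.group(1))  # execute the code with the interpreter tool
        history.append({"role": "tool", "content": observation})
    return history[-1]["content"]
```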
Future Developments
This research opens several avenues for future exploration:
- Integration of More Advanced Tools: Future work could explore the integration of other advanced computational tools to expand the models' capabilities further.
- Step-Wise Reward Signals: Incorporating AI feedback or process-supervised reward models could provide more granular reward signals, potentially improving the learning efficiency and performance.
- General AI Learning Beyond Mathematics: Extending the framework to handle more complex external environments, dynamic opponents, or stochastic transitions could broaden the applicability to general AI agent learning.
Conclusion
The paper offers a robust approach to enhancing mathematical reasoning in LLMs through multi-turn iterative preference learning. By combining trajectory-level preference learning with external tool use, the proposed methods achieve substantial improvements in model performance on benchmark datasets. The findings pave the way for more sophisticated applications of LLMs in domains requiring complex reasoning and interaction with external tools. Theoretical insights and practical results alike underscore the potential of this framework to drive future developments in AI.
Overall, "Building Math Agents with Multi-Turn Iterative Preference Learning" provides a comprehensive and insightful contribution to the field, demonstrating how RLHF can be adapted and extended to meet the evolving challenges of mathematical problem-solving using LLMs.