
Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning (2410.22304v1)

Published 29 Oct 2024 in cs.CL and cs.LG

Abstract: Mathematical reasoning is a crucial capability for LLMs, yet generating detailed and accurate reasoning traces remains a significant challenge. This paper introduces a novel approach to produce high-quality reasoning traces for LLM fine-tuning using online learning Flows. Our method employs an incremental output production Flow, where component LLMs collaboratively construct solutions through iterative communication. We train the Flow using online Direct Preference Optimization (DPO) learning with rollouts, generating DPO pairs for each training example and updating models in real-time. We directly compare the quality of reasoning traces generated by our method with those produced through direct model inference, demonstrating the effectiveness of our approach in improving LLM performance in mathematical reasoning tasks.

Summary

  • The paper introduces a multi-agent Flow architecture in which an Answer LLM and a Stop LLM collaborate to iteratively construct mathematical reasoning traces.
  • The Flow is trained with online Direct Preference Optimization (DPO) using rollouts, generating preference pairs at each decision point and updating the models in real time.
  • On the GSM8K and MetaMath benchmarks, Flow-DPO delivers consistent gains in progressive validation accuracy, and its reasoning traces outperform those produced by direct model inference when used for fine-tuning.

Improving Mathematical Reasoning in LLMs through Flow-DPO

The paper "Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning" presents an innovative methodology to enhance the mathematical reasoning capabilities of LLMs. This research is positioned within the context of addressing the challenges faced by LLMs, specifically in generating detailed and logically consistent reasoning traces that are pivotal for accurate problem-solving in mathematical tasks. Prior efforts have largely relied on direct model inferences and static inference strategies to extract reasoning traces, which often resulted in sub-optimal outcomes due to limitations inherent in static model evaluations and insufficiently detailed reasoning steps. This paper introduces a dynamic and collaborative approach through the creation of Flows—multi-agent systems that leverage interactive communication between component LLMs.

Methodology

The core innovation is a Flow composed of multiple LLM components, each of which contributes iteratively to solving a problem, much like a multi-agent system in which collaborative dialogue incrementally builds towards a solution. Within this Flow, two LLMs are designated: the Answer LLM, which incrementally generates candidate solution chunks, and the Stop LLM, which evaluates at each step whether the partial solution is complete. Learning proceeds via online Direct Preference Optimization (DPO) with rollouts: at each decision point, alternative rollouts are generated and compared against the ground-truth answer to form batches of DPO pairs, which are used to update the models in real time. This enables continuous learning and adaptation as the Flow processes more data, progressively refining its reasoning structure and decision-making.
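For concreteness, the preference updates can be read as the standard DPO objective of Rafailov et al., applied online to pairs harvested from rollouts (this is the usual formulation; the paper's exact instantiation may differ):

$$ \mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right] $$

where $y_w$ and $y_l$ are the preferred and rejected continuations at a decision point and $\pi_{\text{ref}}$ is a frozen reference policy. The sketch below illustrates how the Flow loop and rollout-based pair collection might fit together; names such as `answer_llm`, `stop_llm`, and `check_answer` are illustrative placeholders rather than the paper's API, and only pairs for the Answer LLM are shown.

```python
# Hedged sketch of the two-agent Flow and online DPO pair collection described
# above. `answer_llm` and `stop_llm` stand in for the two fine-tuned components;
# `check_answer` is a crude placeholder for answer verification.

from dataclasses import dataclass
from typing import List


@dataclass
class DPOPair:
    prompt: str    # question plus the partial trace at the decision point
    chosen: str    # candidate chunk whose rollout reached the correct answer
    rejected: str  # candidate chunk whose rollout did not


def check_answer(trace: str, gold_answer: str) -> bool:
    # Placeholder check: a real implementation would parse the final answer.
    return gold_answer in trace


def run_flow(question: str, answer_llm, stop_llm, trace: str = "", max_steps: int = 16) -> str:
    """Incrementally build a reasoning trace: the Answer LLM proposes the next
    chunk and the Stop LLM judges whether the partial solution is complete."""
    for _ in range(max_steps):
        if stop_llm.is_complete(question, trace):
            break
        trace += answer_llm.generate(question, trace)
    return trace


def collect_dpo_pairs(question: str, gold_answer: str, answer_llm, stop_llm,
                      n_rollouts: int = 4, max_steps: int = 16) -> List[DPOPair]:
    """At each decision point, sample alternative next chunks, roll each one out
    to a final answer, and pair a chunk from a correct rollout with one from an
    incorrect rollout to form a DPO training pair."""
    pairs: List[DPOPair] = []
    trace = ""
    for _ in range(max_steps):
        if stop_llm.is_complete(question, trace):
            break
        candidates = [answer_llm.generate(question, trace) for _ in range(n_rollouts)]
        good = [c for c in candidates
                if check_answer(run_flow(question, answer_llm, stop_llm, trace + c), gold_answer)]
        bad = [c for c in candidates if c not in good]
        if good and bad:
            pairs.append(DPOPair(prompt=question + "\n" + trace, chosen=good[0], rejected=bad[0]))
        trace += good[0] if good else candidates[0]  # continue along a (preferably correct) chunk
    return pairs
```

In the paper's setting the collected pairs are used to update the models in real time, so later training examples are processed by an already-improved Flow rather than a static one.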

Results

Empirical evaluations demonstrate the improved performance of the Flow-based approach over single-LLM inference. On the GSM8K and MetaMath benchmarks, Flow-DPO showed steadily increasing progressive validation accuracy during online training, with significant gains across multiple model architectures, including Llama-3-8B-Instruct and Phi-3-medium-128k-instruct. Importantly, when the Flow-generated reasoning traces were used for supervised fine-tuning (SFT), they led to marked improvements in LLM performance, outperforming traces generated through direct model inference.
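Progressive validation accuracy is the usual online-learning metric in which each incoming training example is first answered with the current model and only then used for an update, so the running average tracks generalization during training. A minimal sketch of this bookkeeping, assuming hypothetical `flow_predict` and `online_dpo_update` hooks rather than functions from the paper's codebase:

```python
# Minimal sketch of progressive (online) validation accuracy: score each example
# with the current model *before* it is used for an update, then update.

def progressive_validation_accuracy(stream, flow_predict, online_dpo_update):
    correct, seen = 0, 0
    for question, gold_answer in stream:
        prediction = flow_predict(question)        # evaluate with current weights
        correct += int(prediction == gold_answer)
        seen += 1
        online_dpo_update(question, gold_answer)   # then learn from the example
        if seen % 100 == 0:
            print(f"progressive accuracy after {seen} examples: {correct / seen:.3f}")
    return correct / max(seen, 1)
```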

Implications and Future Work

The implications of this research are multidimensional. Practically, the Flow-based methodology circumvents the limitations of rigid, single-pass model inference strategies, offering a flexible and adaptive framework more suited to the nuanced demands of mathematical reasoning tasks. Theoretically, the approach underscores the significance of interactive and collaborative learning models in enhancing cognitive capabilities in artificial intelligence systems. By allowing for iterative refinement and the potential for on-the-fly learning adjustments, the Flow-DPO approach aligns closely with human-like problem-solving methods, providing a promising path for future exploration.

Future research directions could involve examining the scalability of Flow-DPO across even larger LLM architectures and different problem domains. Furthermore, extending the concept of multi-agent interactions to encompass other types of reasoning tasks, such as logical or commonsense reasoning, could reveal broader applications of this methodology. Continued exploration into interactive learning frameworks that utilize detailed reasoning feedback could further advance the capabilities of LLMs, honing their ability to handle complex reasoning tasks with greater accuracy and depth.