
Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning (2410.22304v1)

Published 29 Oct 2024 in cs.CL and cs.LG

Abstract: Mathematical reasoning is a crucial capability for LLMs, yet generating detailed and accurate reasoning traces remains a significant challenge. This paper introduces a novel approach to produce high-quality reasoning traces for LLM fine-tuning using online learning Flows. Our method employs an incremental output production Flow, where component LLMs collaboratively construct solutions through iterative communication. We train the Flow using online Direct Preference Optimization (DPO) learning with rollouts, generating DPO pairs for each training example and updating models in real-time. We directly compare the quality of reasoning traces generated by our method with those produced through direct model inference, demonstrating the effectiveness of our approach in improving LLM performance in mathematical reasoning tasks.

Summary

  • The paper introduces a multi-agent Flow architecture in which an Answer LLM and a Stop LLM collaborate to iteratively construct mathematical reasoning traces.
  • The Flow is trained with online Direct Preference Optimization (DPO) using rollouts, generating preference pairs at each decision point and updating the models in real time.
  • On the GSM8K and MetaMath benchmarks, Flow-DPO delivers consistent gains in progressive validation accuracy, and its reasoning traces outperform those produced by direct model inference when used for fine-tuning.

Improving Mathematical Reasoning in LLMs through Flow-DPO

The paper "Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning" presents an innovative methodology to enhance the mathematical reasoning capabilities of LLMs. This research is positioned within the context of addressing the challenges faced by LLMs, specifically in generating detailed and logically consistent reasoning traces that are pivotal for accurate problem-solving in mathematical tasks. Prior efforts have largely relied on direct model inferences and static inference strategies to extract reasoning traces, which often resulted in sub-optimal outcomes due to limitations inherent in static model evaluations and insufficiently detailed reasoning steps. This paper introduces a dynamic and collaborative approach through the creation of Flows—multi-agent systems that leverage interactive communication between component LLMs.

Methodology

The core innovation is a Flow composed of multiple LLM components, each of which contributes iteratively to solving a problem, much like a multi-agent system in which collaborative dialogue incrementally builds towards a solution. Within this Flow, two LLMs are designated: the Answer LLM, which incrementally generates candidate solution chunks, and the Stop LLM, which evaluates at each step whether the partial solution is complete. Learning proceeds via online Direct Preference Optimization (DPO) with rollouts: at each decision point, alternative rollouts are generated and compared against the ground-truth answer to form batches of DPO pairs, which are used to update the models in real time. This enables continuous learning and adaptation as the Flow processes more data, progressively refining its reasoning structure and decision-making.
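For concreteness, the preference updates can be read as the standard DPO objective of Rafailov et al., applied online to pairs harvested from rollouts (this is the usual formulation; the paper's exact instantiation may differ):

$$ \mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right] $$

where $y_w$ and $y_l$ are the preferred and rejected continuations at a decision point and $\pi_{\text{ref}}$ is a frozen reference policy. The sketch below illustrates how the Flow loop and rollout-based pair collection might fit together; names such as `answer_llm`, `stop_llm`, and `check_answer` are illustrative placeholders rather than the paper's API, and only pairs for the Answer LLM are shown.

```python
# Hedged sketch of the two-agent Flow and online DPO pair collection described
# above. `answer_llm` and `stop_llm` stand in for the two fine-tuned components;
# `check_answer` is a crude placeholder for answer verification.

from dataclasses import dataclass
from typing import List


@dataclass
class DPOPair:
    prompt: str    # question plus the partial trace at the decision point
    chosen: str    # candidate chunk whose rollout reached the correct answer
    rejected: str  # candidate chunk whose rollout did not


def check_answer(trace: str, gold_answer: str) -> bool:
    # Placeholder check: a real implementation would parse the final answer.
    return gold_answer in trace


def run_flow(question: str, answer_llm, stop_llm, trace: str = "", max_steps: int = 16) -> str:
    """Incrementally build a reasoning trace: the Answer LLM proposes the next
    chunk and the Stop LLM judges whether the partial solution is complete."""
    for _ in range(max_steps):
        if stop_llm.is_complete(question, trace):
            break
        trace += answer_llm.generate(question, trace)
    return trace


def collect_dpo_pairs(question: str, gold_answer: str, answer_llm, stop_llm,
                      n_rollouts: int = 4, max_steps: int = 16) -> List[DPOPair]:
    """At each decision point, sample alternative next chunks, roll each one out
    to a final answer, and pair a chunk from a correct rollout with one from an
    incorrect rollout to form a DPO training pair."""
    pairs: List[DPOPair] = []
    trace = ""
    for _ in range(max_steps):
        if stop_llm.is_complete(question, trace):
            break
        candidates = [answer_llm.generate(question, trace) for _ in range(n_rollouts)]
        good = [c for c in candidates
                if check_answer(run_flow(question, answer_llm, stop_llm, trace + c), gold_answer)]
        bad = [c for c in candidates if c not in good]
        if good and bad:
            pairs.append(DPOPair(prompt=question + "\n" + trace, chosen=good[0], rejected=bad[0]))
        trace += good[0] if good else candidates[0]  # continue along a (preferably correct) chunk
    return pairs
```

In the paper's setting the collected pairs are used to update the models in real time, so later training examples are processed by an already-improved Flow rather than a static one.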

Results

Empirical evaluations demonstrate the improved performance of the Flow-based approach over single-LLM inference. On the GSM8K and MetaMath benchmarks, Flow-DPO showed steadily increasing progressive validation accuracy during online training, with significant gains across multiple model architectures, including Llama-3-8B-Instruct and Phi-3-medium-128k-instruct. Importantly, when the Flow-generated reasoning traces were used for supervised fine-tuning (SFT), they led to marked improvements in LLM performance, outperforming traces generated through direct model inference.
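Progressive validation accuracy is the usual online-learning metric in which each incoming training example is first answered with the current model and only then used for an update, so the running average tracks generalization during training. A minimal sketch of this bookkeeping, assuming hypothetical `flow_predict` and `online_dpo_update` hooks rather than functions from the paper's codebase:

```python
# Minimal sketch of progressive (online) validation accuracy: score each example
# with the current model *before* it is used for an update, then update.

def progressive_validation_accuracy(stream, flow_predict, online_dpo_update):
    correct, seen = 0, 0
    for question, gold_answer in stream:
        prediction = flow_predict(question)        # evaluate with current weights
        correct += int(prediction == gold_answer)
        seen += 1
        online_dpo_update(question, gold_answer)   # then learn from the example
        if seen % 100 == 0:
            print(f"progressive accuracy after {seen} examples: {correct / seen:.3f}")
    return correct / max(seen, 1)
```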

Implications and Future Work

The implications of this research are multidimensional. Practically, the Flow-based methodology circumvents the limitations of rigid, single-pass model inference strategies, offering a flexible and adaptive framework more suited to the nuanced demands of mathematical reasoning tasks. Theoretically, the approach underscores the significance of interactive and collaborative learning models in enhancing cognitive capabilities in artificial intelligence systems. By allowing for iterative refinement and the potential for on-the-fly learning adjustments, the Flow-DPO approach aligns closely with human-like problem-solving methods, providing a promising path for future exploration.

Future research directions could involve examining the scalability of Flow-DPO across even larger LLM architectures and different problem domains. Furthermore, extending the concept of multi-agent interactions to encompass other types of reasoning tasks, such as logical or commonsense reasoning, could reveal broader applications of this methodology. Continued exploration into interactive learning frameworks that utilize detailed reasoning feedback could further advance the capabilities of LLMs, honing their ability to handle complex reasoning tasks with greater accuracy and depth.