- The paper introduces Mars-PO, a multi-agent system framework extending DPO for enhancing LLM mathematical reasoning capabilities.
- Mars-PO achieves significant performance gains on mathematical reasoning benchmarks, notably raising Llama3.1-8B-Instruct accuracy on the MATH dataset from 50.38% to 57.82%.
- Mars-PO provides an efficient multi-agent preference optimization framework, offering a practical alternative to resource-intensive methods for improving LLM mathematical reasoning.
Mars-PO: Multi-Agent Reasoning System Preference Optimization
The paper introduces Mars-PO, a novel framework designed to advance mathematical reasoning capabilities in LLMs through a multi-agent system preference optimization strategy. The approach addresses challenges LLMs face in mathematical reasoning tasks, such as errors, hallucinations, and inconsistencies, which are particularly prevalent in multi-step problem solving. The principal contribution of Mars-PO lies in integrating diverse outputs from multiple agents, combining the strengths shared across agents while mitigating each agent's individual weaknesses.
Key Contributions
- Multi-Agent Framework: Mars-PO extends Direct Preference Optimization (DPO) to a multi-agent setting. By exploiting the collaborative potential of multiple agents, it forms a hybrid positive sample set from the most effective outputs across agents and pairs these positives with agent-specific negative samples, enabling tailored preference training (see the pairing sketch following this list).
- Improvement on Benchmarks: The framework shows substantial gains on mathematical reasoning benchmarks such as the MATH dataset, where it raises the accuracy of Llama3.1-8B-Instruct from 50.38% to 57.82%. This highlights its ability to handle challenging reasoning tasks more effectively than single-agent systems or baselines such as supervised fine-tuning and vanilla DPO.
- Operation Mechanics: The Mars-PO process consists of three crucial stages:
- Response Samples Generation: Multiple agents generate responses to prompts, which are then classified as positive or negative based on correctness.
- Preference Pairs Construction: A reward model is employed to extract high-quality positive samples, forming a hybrid set across all agents. These samples are paired with negative samples unique to each agent, addressing both mutual strengths and individual weaknesses.
- Hybrid Preference Optimization: Using these preference pairs, the LLM agents are trained iteratively, yielding robust improvements in reasoning accuracy (a sketch of the assumed training objective follows below).
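Taken together, stages 1 and 2 amount to a pairing procedure that can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the paper's implementation: the `Sample` container, the `reward_model.score(prompt, response)` interface, and the `top_k` cutoff are hypothetical names introduced here for clarity.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    agent: str      # which agent produced the response
    response: str   # chain-of-thought solution text
    correct: bool   # stage 1: correctness check against the reference answer

def build_preference_pairs(prompt, samples, reward_model, top_k=4):
    """Stage 2 sketch: hybrid positives shared across agents,
    negatives kept agent-specific (assumed interfaces)."""
    positives = [s for s in samples if s.correct]
    negatives = [s for s in samples if not s.correct]

    # Score correct responses with the reward model and keep the best ones
    # from *any* agent -- this forms the hybrid positive set.
    scored = sorted(positives,
                    key=lambda s: reward_model.score(prompt, s.response),
                    reverse=True)
    hybrid_positives = scored[:top_k]

    # Pair every hybrid positive with each agent's own incorrect responses,
    # so training targets that agent's individual weaknesses.
    pairs_per_agent = {}
    for neg in negatives:
        for pos in hybrid_positives:
            pairs_per_agent.setdefault(neg.agent, []).append(
                {"prompt": prompt, "chosen": pos.response, "rejected": neg.response}
            )
    return pairs_per_agent
```

The key design choice this illustrates is the asymmetry of the pairing: positives are pooled across agents, while negatives remain agent-specific, so each agent is optimized against its own failure modes.

For stage 3, the paper builds on DPO. Assuming Mars-PO keeps the standard DPO objective and simply feeds it the hybrid pairs, the per-agent loss would take the familiar form, where $\pi_\theta^{(k)}$ is agent $k$ being trained, $\pi_{\text{ref}}$ its frozen reference policy, $\beta$ the usual temperature, and $\mathcal{D}_k$ agent $k$'s set of hybrid preference pairs:

$$
\mathcal{L}_{\text{DPO}}^{(k)} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}_k}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta^{(k)}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta^{(k)}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

Here $y_w$ is drawn from the shared hybrid positive set and $y_l$ from agent $k$'s own negative samples; repeating sampling, pairing, and optimization with the updated agents gives the iterative loop described above.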
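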
Implications and Future Work
The implications of Mars-PO are significant for both practical applications and theoretical advancements in AI. Practically, it offers a more efficient pathway to improve the mathematical reasoning capabilities of LLMs without resorting to resource-intensive alternatives like Reinforcement Learning from Human Feedback. Theoretically, it provides insights into the value of multi-agent systems and collaborative optimization, suggesting avenues for further research in system alignment and preference-based learning.
Looking ahead, Mars-PO opens pathways for exploring more sophisticated multi-agent interactions and preference training methodologies. Iterative optimization and hybrid sample selection are promising directions for further work, with potential applicability beyond mathematical reasoning to other complex problem-solving domains. Refining how the outputs of different agents are integrated could yield further gains in accuracy and robustness.
In summary, Mars-PO represents a significant step in preference-based optimization, offering an efficient approach to enhance the capability of LLMs in domains that require complex mathematical reasoning. The results and methodologies outlined in this paper are likely to influence subsequent exploration and the development of advanced multi-agent reasoning systems.