- The paper introduces Mars-PO, a multi-agent system framework extending DPO for enhancing LLM mathematical reasoning capabilities.
- Mars-PO achieves significant performance gains on mathematical reasoning benchmarks, notably raising Llama3.1-8B-Instruct accuracy on the MATH dataset from 50.38% to 57.82%.
- Mars-PO provides an efficient multi-agent preference optimization framework, offering a practical alternative to resource-intensive methods for improving LLM mathematical reasoning.
Mars-PO: Multi-Agent Reasoning System Preference Optimization
The paper introduces Mars-PO, a novel framework designed to advance mathematical reasoning capabilities in LLMs through a multi-agent system preference optimization strategy. The approach addresses challenges LLMs face in mathematical reasoning tasks, such as errors, hallucinations, and inconsistencies, which are particularly prevalent in multi-step problem solving. The principal contribution of Mars-PO lies in integrating diverse outputs from multiple agents, combining the strengths shared across agents while mitigating each agent's individual weaknesses.
Key Contributions
- Multi-Agent Framework: Mars-PO extends Direct Preference Optimization (DPO) to a multi-agent setting. By exploiting the collaborative potential of multiple agents, it forms a hybrid positive sample set from the most effective outputs across agents and pairs these positives with agent-specific negative samples, enabling tailored preference training (see the pairing sketch following this list).
- Improvement on Benchmarks: The framework shows substantial gains on mathematical reasoning benchmarks such as the MATH dataset, where it raises the accuracy of Llama3.1-8B-Instruct from 50.38% to 57.82%. This highlights its ability to handle challenging reasoning tasks more effectively than single-agent systems or baselines such as supervised fine-tuning and vanilla DPO.
- Operation Mechanics: The Mars-PO process consists of three crucial stages:
- Response Samples Generation: Multiple agents generate responses to prompts, which are then classified as positive or negative based on correctness.
- Preference Pairs Construction: A reward model is employed to extract high-quality positive samples, forming a hybrid set across all agents. These samples are paired with negative samples unique to each agent, addressing both mutual strengths and individual weaknesses.
- Hybrid Preference Optimization: Using these preference pairs, the LLM agents are trained iteratively, yielding robust improvements in reasoning accuracy (a sketch of the assumed training objective follows below).
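Taken together, stages 1 and 2 amount to a pairing procedure that can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the paper's implementation: the `Sample` container, the `reward_model.score(prompt, response)` interface, and the `top_k` cutoff are hypothetical names introduced here for clarity.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    agent: str      # which agent produced the response
    response: str   # chain-of-thought solution text
    correct: bool   # stage 1: correctness check against the reference answer

def build_preference_pairs(prompt, samples, reward_model, top_k=4):
    """Stage 2 sketch: hybrid positives shared across agents,
    negatives kept agent-specific (assumed interfaces)."""
    positives = [s for s in samples if s.correct]
    negatives = [s for s in samples if not s.correct]

    # Score correct responses with the reward model and keep the best ones
    # from *any* agent -- this forms the hybrid positive set.
    scored = sorted(positives,
                    key=lambda s: reward_model.score(prompt, s.response),
                    reverse=True)
    hybrid_positives = scored[:top_k]

    # Pair every hybrid positive with each agent's own incorrect responses,
    # so training targets that agent's individual weaknesses.
    pairs_per_agent = {}
    for neg in negatives:
        for pos in hybrid_positives:
            pairs_per_agent.setdefault(neg.agent, []).append(
                {"prompt": prompt, "chosen": pos.response, "rejected": neg.response}
            )
    return pairs_per_agent
```

The key design choice this illustrates is the asymmetry of the pairing: positives are pooled across agents, while negatives remain agent-specific, so each agent is optimized against its own failure modes.

For stage 3, the paper builds on DPO. Assuming Mars-PO keeps the standard DPO objective and simply feeds it the hybrid pairs, the per-agent loss would take the familiar form, where $\pi_\theta^{(k)}$ is agent $k$ being trained, $\pi_{\text{ref}}$ its frozen reference policy, $\beta$ the usual temperature, and $\mathcal{D}_k$ agent $k$'s set of hybrid preference pairs:

$$
\mathcal{L}_{\text{DPO}}^{(k)} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}_k}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta^{(k)}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta^{(k)}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

Here $y_w$ is drawn from the shared hybrid positive set and $y_l$ from agent $k$'s own negative samples; repeating sampling, pairing, and optimization with the updated agents gives the iterative loop described above.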
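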
Implications and Future Work
The implications of Mars-PO are significant for both practical applications and theoretical advancements in AI. Practically, it offers a more efficient pathway to improve the mathematical reasoning capabilities of LLMs without resorting to resource-intensive alternatives like Reinforcement Learning from Human Feedback. Theoretically, it provides insights into the value of multi-agent systems and collaborative optimization, suggesting avenues for further research in system alignment and preference-based learning.
Looking ahead, Mars-PO opens pathways for exploring more sophisticated multi-agent interactions and preference training methodologies. Iterative optimization and hybrid sample selection are promising directions for further work, with potential applicability beyond mathematical reasoning to other complex problem-solving domains. Refining how the outputs of different agents are integrated could yield further gains in accuracy and robustness.
In summary, Mars-PO represents a significant step in preference-based optimization, offering an efficient approach to enhance the capability of LLMs in domains that require complex mathematical reasoning. The results and methodologies outlined in this paper are likely to influence subsequent exploration and the development of advanced multi-agent reasoning systems.