MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning
(2502.18439v1)
Published 25 Feb 2025 in cs.AI
Abstract: Leveraging multiple LLMs to build collaborative multi-agentic workflows has demonstrated significant potential. However, most previous studies focus on prompting the out-of-the-box LLMs, relying on their innate capability for collaboration, which may not improve LLMs' performance as shown recently. In this paper, we introduce a new post-training paradigm MAPoRL (Multi-Agent Post-co-training for collaborative LLMs with Reinforcement Learning), to explicitly elicit the collaborative behaviors and further unleash the power of multi-agentic LLM frameworks. In MAPoRL, multiple LLMs first generate their own responses independently and engage in a multi-turn discussion to collaboratively improve the final answer. In the end, a MAPoRL verifier evaluates both the answer and the discussion, by assigning a score that verifies the correctness of the answer, while adding incentives to encourage corrective and persuasive discussions. The score serves as the co-training reward, and is then maximized through multi-agent RL. Unlike existing LLM post-training paradigms, MAPoRL advocates the co-training of multiple LLMs together using RL for better generalization. Accompanied by analytical insights, our experiments demonstrate that training individual LLMs alone is insufficient to induce effective collaboration. In contrast, multi-agent co-training can boost the collaboration performance across benchmarks, with generalization to unseen domains.
MAPoRL (Multi-Agent Post-co-training for collaborative LLMs with Reinforcement Learning) introduces a post-training paradigm designed to explicitly cultivate collaborative capabilities among multiple LLMs. The core motivation stems from the observation that simply prompting pre-trained LLMs for collaboration often fails to yield substantial performance improvements, as these models lack inherent, fine-tuned collaborative skills. MAPoRL addresses this by employing a multi-agent reinforcement learning (MARL) approach to co-train several LLM agents, optimizing their ability to interact, discuss, and collectively refine solutions. The framework centers around a cycle of independent generation, multi-turn discussion, and verification-based reward assignment, driving the agents towards more effective collaborative problem-solving.
Methodology and Architecture
The MAPoRL process unfolds in several distinct stages:
Independent Response Generation: Given an input prompt or problem $x$, each agent $i$ in the set of $N$ agents independently generates an initial response $y_i^{(0)}$. This leverages the base capabilities of the individual LLMs:
$$y_i^{(0)} \sim \pi_{\theta_i}(\cdot \mid x),$$
where $\pi_{\theta_i}$ is the policy (LLM) of agent $i$, parameterized by $\theta_i$.
Multi-Turn Discussion: Following initial generation, the agents engage in a structured, multi-turn discussion. Let the state at turn $t$ be $s_t$, encompassing the problem $x$ and the dialogue history $h_t = (y_1^{(0)}, \ldots, y_N^{(0)}, \ldots, y_1^{(t-1)}, \ldots, y_N^{(t-1)})$. At each turn $t$, each agent $i$ generates a message or refined response $y_i^{(t)}$ based on the current state:
$$y_i^{(t)} \sim \pi_{\theta_i}(\cdot \mid s_t).$$
The discussion continues for a predetermined number of turns $T$. The final output, denoted $y_{\text{final}}$, is typically derived from the collective discussion, often synthesized from or selected among the agents' final contributions.
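The end-to-end rollout can be pictured with a short sketch. The code below is a minimal illustration under assumed interfaces: each agent is just a callable mapping a prompt to text (a stand-in for an LLM policy), and the majority vote used to form $y_{\text{final}}$ is one plausible aggregation rule, not necessarily the paper's.

```python
from collections import Counter
from typing import Callable, List

def maporl_rollout(
    agents: List[Callable[[str], str]],  # hypothetical LLM wrappers: prompt -> text
    problem: str,
    num_turns: int,
) -> dict:
    """Independent generation followed by a multi-turn discussion."""
    # Turn 0: each agent answers independently, y_i^(0) ~ pi_{theta_i}(. | x).
    history: List[List[str]] = [[agent(problem) for agent in agents]]

    # Turns 1..T: every agent sees the problem plus the dialogue history h_t
    # and produces a refined response y_i^(t).
    for _ in range(1, num_turns + 1):
        transcript = "\n".join(
            f"[Turn {turn}] Agent {i}: {msg}"
            for turn, msgs in enumerate(history)
            for i, msg in enumerate(msgs)
        )
        prompt = (
            f"Problem:\n{problem}\n\nDiscussion so far:\n{transcript}\n\n"
            "Critique the other agents' reasoning and give your refined answer."
        )
        history.append([agent(prompt) for agent in agents])

    # One simple way to form y_final: majority vote over the last turn's answers.
    y_final = Counter(history[-1]).most_common(1)[0][0]
    return {"history": history, "y_final": y_final}
```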
Verification and Reward Assignment: A crucial component is the MAPoRL verifier $V_\phi$, parameterized by $\phi$. This verifier assesses both the quality of the final answer $y_{\text{final}}$ and the effectiveness of the preceding discussion $h_T$, and outputs a composite score $R$:
$$R = V_\phi(x, y_{\text{final}}, h_T) = R_{\text{answer}} + \lambda R_{\text{discussion}}$$
* $R_{\text{answer}}$: Evaluates the correctness, accuracy, or quality of the final proposed solution $y_{\text{final}}$ with respect to the input $x$.
* $R_{\text{discussion}}$: Provides incentives for desirable collaborative behaviors during the discussion phase $h_T$. The paper suggests rewarding corrective interactions (identifying and fixing errors) and persuasive arguments (effectively communicating correct reasoning). The hyperparameter $\lambda$ balances the contributions of answer quality and discussion quality to the total reward (a small sketch of this composition follows the list).
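Concretely, the reward composition might look like the following sketch, where `Verifier`, `score_answer`, and `score_discussion` are hypothetical names for the verifier's two assessments and `lam` plays the role of $\lambda$:

```python
from dataclasses import dataclass
from typing import Protocol

class Verifier(Protocol):
    """Assumed verifier interface; the method names are illustrative."""
    def score_answer(self, problem: str, final_answer: str) -> float: ...
    def score_discussion(self, problem: str, transcript: str) -> float: ...

@dataclass
class CompositeReward:
    verifier: Verifier
    lam: float = 0.5  # lambda: weight on discussion quality (illustrative default)

    def __call__(self, problem: str, final_answer: str, transcript: str) -> float:
        # R = R_answer + lambda * R_discussion
        r_answer = self.verifier.score_answer(problem, final_answer)
        r_discussion = self.verifier.score_discussion(problem, transcript)
        return r_answer + self.lam * r_discussion
```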
Multi-Agent Reinforcement Learning: The composite score $R$ serves as the reward signal for updating the policies (parameters $\theta_i$) of all participating LLM agents. The objective is to maximize the expected reward across the agents:
$$J(\theta_1, \ldots, \theta_N) = \mathbb{E}_{x,\, \{y_i^{(t)}\}_{i,t}}\!\left[R\right].$$
This co-training is performed with MARL algorithms, such as multi-agent policy gradient methods (e.g., MAPPO, or independent PPO applied to each agent concurrently). The key distinction from single-agent RL fine-tuning (like RLHF) is the simultaneous update of multiple interacting policies based on a shared collaborative task reward.
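At a high level, the co-training step can be sketched as a shared-reward policy-gradient update applied to every agent. The toy example below uses a REINFORCE-style loss on small stand-in policies purely for illustration; a practical implementation would replace `ToyPolicy` with the LLM policies and the plain gradient step with PPO/MAPPO machinery, and all names here are placeholders rather than the paper's code.

```python
import torch
import torch.nn as nn

class ToyPolicy(nn.Module):
    """Stand-in for an LLM policy: scores a small vocabulary given a state embedding."""
    def __init__(self, state_dim: int = 8, vocab: int = 16):
        super().__init__()
        self.head = nn.Linear(state_dim, vocab)

    def log_prob(self, state: torch.Tensor, action: int) -> torch.Tensor:
        return torch.log_softmax(self.head(state), dim=-1)[action]

def cotrain_step(policies, optimizers, per_agent_rollouts, shared_reward: float):
    """One shared-reward REINFORCE-style update for all agents.

    per_agent_rollouts[i] holds (state, action) pairs for agent i's own messages
    in the multi-turn discussion; shared_reward is the verifier score R.
    """
    for policy, opt, rollout in zip(policies, optimizers, per_agent_rollouts):
        logp = torch.stack([policy.log_prob(s, a) for s, a in rollout]).sum()
        loss = -shared_reward * logp  # ascend E[R] via each agent's own log-probs
        opt.zero_grad()
        loss.backward()
        opt.step()

# Example: two toy agents, one co-training step on a fabricated rollout.
if __name__ == "__main__":
    torch.manual_seed(0)
    agents = [ToyPolicy(), ToyPolicy()]
    opts = [torch.optim.Adam(a.parameters(), lr=1e-3) for a in agents]
    rollouts = [[(torch.randn(8), 3), (torch.randn(8), 7)] for _ in agents]
    cotrain_step(agents, opts, rollouts, shared_reward=1.0)
```

The structural point is that the same scalar reward $R$ multiplies each agent's own log-probabilities, so all policies are pushed jointly toward trajectories the verifier scores highly.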
The architecture involves $N$ distinct LLM agents, potentially initialized from the same pre-trained model but diverging during co-training, and a separate verifier model. The verifier itself needs to be trained or defined to accurately assess both answer quality and nuanced discussion dynamics.
Co-Training vs. Individual Training
A central claim of the paper, supported by experimental results and analytical insights, is the insufficiency of training individual LLMs in isolation to foster effective collaboration. While individual training (e.g., fine-tuning each agent separately on a dataset of problems or even on simulated dialogues) might improve standalone performance, it fails to capture the interactive dynamics essential for genuine collaboration. Co-training, where agents learn and adapt concurrently based on the outcomes of their interactions, is presented as necessary. The shared MARL objective forces agents to develop complementary strategies, learn to interpret and respond to each other's outputs, and converge towards mutually beneficial interaction patterns that enhance collective performance. The paper provides theoretical arguments suggesting that the optimal collaborative policy profile may not be reachable through independent optimization pathways.
Experimental Validation and Findings
MAPoRL was evaluated on various benchmarks, including mathematical reasoning (GSM8K, MATH) and potentially other tasks requiring multi-step reasoning or diverse perspectives.
Performance: Experiments demonstrate that LLMs co-trained using MAPoRL significantly outperform baselines, including individually trained agents and prompted pre-trained LLMs, on collaborative tasks. The framework leads to higher accuracy in final answers.
Collaboration Quality: Analysis of the discussion phase indicates that MAPoRL encourages more meaningful interactions, characterized by agents correcting each other's mistakes and building upon valid points, aligning with the incentives provided by $R_{\text{discussion}}$.
Generalization: The collaborative skills learned via MAPoRL show generalization capabilities to unseen domains or task variations not encountered during the co-training phase, suggesting that the framework instills more fundamental collaboration principles rather than task-specific heuristics.
Ablation Studies: Ablations likely confirmed the importance of both the multi-turn discussion phase and the specific components of the reward function ($R_{\text{answer}}$ and $R_{\text{discussion}}$). Removing the discussion reward component ($\lambda = 0$) would likely diminish the quality of interaction, even if final answer accuracy remains comparable in some cases.
Implementation and Practical Considerations
Implementing MAPoRL involves several practical aspects:
MARL Algorithm Choice: Standard MARL algorithms like MAPPO or IPPO can be adapted. The choice depends on the desired trade-off between sample efficiency, stability, and computational overhead. Policy gradient methods are common for LLM fine-tuning.
Verifier Design and Training: The verifier $V_\phi$ is critical. It could be another LLM fine-tuned on human-annotated data (similar to reward models in RLHF) or based on programmatic checks (e.g., unit tests for code generation, solution checkers for math problems). Training a reliable verifier that captures subtle discussion qualities (correctiveness, persuasiveness) is challenging and may require significant annotation effort.
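For math benchmarks such as GSM8K, a purely programmatic verifier is a reasonable baseline: check the extracted numeric answer against the reference for $R_{\text{answer}}$, and use a crude corrective-behavior heuristic for $R_{\text{discussion}}$. The sketch below is such a heuristic stand-in, not the trained verifier described in the paper.

```python
import re
from typing import List, Optional

def extract_number(text: str) -> Optional[float]:
    """Pull the last number out of a free-form answer (GSM8K-style heuristic)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def answer_reward(final_answer: str, reference: str) -> float:
    """R_answer: 1 if the predicted number matches the reference, else 0."""
    pred, gold = extract_number(final_answer), extract_number(reference)
    return float(pred is not None and gold is not None and abs(pred - gold) < 1e-6)

def discussion_reward(turns: List[List[str]], reference: str) -> float:
    """Crude R_discussion proxy: reward agents that move from a wrong to a right answer.

    turns[t][i] is agent i's message at discussion turn t.
    """
    gold = extract_number(reference)
    num_agents = len(turns[0])
    corrected = 0
    for i in range(num_agents):
        answers = [extract_number(turn[i]) for turn in turns]
        correct = [a is not None and gold is not None and abs(a - gold) < 1e-6
                   for a in answers]
        corrected += int(not correct[0] and correct[-1])
    return corrected / num_agents
```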
Computational Resources: Co-training $N$ LLMs using MARL is computationally intensive. It requires managing multiple model instances, potentially large replay buffers (depending on the RL algorithm), and parallelized gradient computations and updates. The cost scales roughly linearly or super-linearly with the number of agents $N$.
Discussion Structure: The format and length ($T$) of the multi-turn discussion need careful design. Unstructured dialogues can be inefficient, while overly rigid structures might limit emergent collaborative strategies.
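A fixed per-turn template is one lightweight way to impose structure without over-constraining the dialogue; the schema below is an illustrative choice rather than one specified by the paper.

```python
DISCUSSION_TURN_TEMPLATE = """Problem:
{problem}

Previous discussion (turns 0..{prev_turn}):
{transcript}

You are Agent {agent_id}. In at most {max_sentences} sentences:
1. Point out any error you see in another agent's reasoning.
2. State whether you keep or revise your answer, and why.
3. End with: FINAL ANSWER: <your answer>
"""

def build_turn_prompt(problem: str, transcript: str, agent_id: int,
                      prev_turn: int, max_sentences: int = 5) -> str:
    """Fill the illustrative turn template for one agent at one discussion turn."""
    return DISCUSSION_TURN_TEMPLATE.format(
        problem=problem, transcript=transcript, agent_id=agent_id,
        prev_turn=prev_turn, max_sentences=max_sentences,
    )
```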
Scalability: Scaling MAPoRL to a large number of agents ($N \gg 2$) presents challenges in terms of computational cost, communication overhead during discussion, and potentially diminishing returns or increased complexity in coordinating strategies.
Deployment: Systems trained with MAPoRL could be deployed as ensembles where multiple specialized agents collaborate on complex tasks. This might involve orchestration layers to manage the interaction flow and synthesize the final output.
Conclusion
MAPoRL presents a structured approach for enhancing the collaborative abilities of LLMs through multi-agent post-co-training with reinforcement learning (Park et al., 25 Feb 2025). By combining independent generation, multi-turn discussion, and a carefully designed verification and reward mechanism, it guides multiple agents to learn effective interaction strategies. The emphasis on co-training, as opposed to individual fine-tuning, appears critical for developing genuine collaborative intelligence, leading to improved performance and generalization on complex tasks requiring collective problem-solving. While computationally demanding, MAPoRL offers a pathway towards building more capable and synergistic multi-agent LLM systems.