Multiagent Finetuning Strategies
- Multiagent finetuning is the process of jointly updating multiple agents using coordinated protocols and inter-agent communication for enhanced cooperation.
- Frameworks such as MARFT, MAGRPO, and online joint fine-tuning utilize Dec-POMDPs, goal-conditioned RL, and computation flows to optimize agent performance.
- Empirical results demonstrate significant improvements in sample efficiency, coordination, and output diversity across challenging RL tasks and LLM applications.
Multiagent finetuning refers to the joint or coordinated updating of the parameters of multiple agents—typically LLMs or reinforcement learning policies—using interaction data, environmental feedback, or structured collaborative protocols. This paradigm extends single-agent finetuning by leveraging inter-agent communication, coordination, or specialization to induce behaviors such as robust cooperation, reasoning-chain diversity, alignment, or sample-efficient transfer between tasks. Multiagent finetuning has become central to advancing the capabilities of multi-agent systems across domains including cooperative RL, LLM societies, tool use, and complex workflow orchestration.
1. Formal Underpinnings and Multiagent Optimization Objectives
Multiagent finetuning objectives are formulated over multi-agent Markov Decision Processes (MDPs), decentralized partially observable Markov decision processes (Dec-POMDPs), or computation graphs (Flows), depending on the agent and task structure. Let $N$ agents with policies $\pi_{\theta_1}, \dots, \pi_{\theta_N}$, parameterized by $\theta_1, \dots, \theta_N$, operate in a shared environment, potentially with only partial or local observations $o^i_t$ at time $t$, and let $r$ be a reward function that depends on (global or local) state–action tuples.
The central multi-agent optimization objective is to maximize the cumulative (discounted) return

$$J(\theta_1, \dots, \theta_N) = \mathbb{E}_{\pi_{\theta_1}, \dots, \pi_{\theta_N}}\left[ \sum_{t=0}^{T} \gamma^t \, r(s_t, a^1_t, \dots, a^N_t) \right].$$

In multiagent LLM systems, rewards are often composite metrics encoding structure, correctness, and cooperation quality (Liu et al., 6 Aug 2025), while in multiagent RL, both global and agent-specific rewards are used (Liao et al., 21 Apr 2025, Castagna, 26 Jan 2025).
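As a concrete illustration of this objective, the following minimal sketch computes the empirical discounted return of one joint rollout and an example composite reward; the helper names and weight values are illustrative assumptions, not quantities from the cited papers.

```python
from typing import List

def discounted_return(rewards: List[float], gamma: float = 0.99) -> float:
    """Empirical discounted return sum_t gamma^t * r_t for a single joint rollout."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def composite_reward(structure: float, correctness: float, cooperation: float,
                     weights=(0.2, 0.5, 0.3)) -> float:
    """Illustrative composite reward for LLM-based MAS (weights are assumed, not from the papers)."""
    w_struct, w_corr, w_coop = weights
    return w_struct * structure + w_corr * correctness + w_coop * cooperation
```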
Key formalisms used:
- Dec-POMDPs: agents act on histories of local observations; decentralized policies (Liu et al., 6 Aug 2025).
- Flex-POMDP (for LLM-based MAS): augments the Dec-POMDP with a dependency function $D$ indicating asynchronous, profile-aware dependencies among agent actions (Liao et al., 21 Apr 2025).
- Goal-conditioned RL: policies explicitly conditioned on a goal state $g$; supports both pretraining and multiagent transfer (Zeng et al., 3 Jun 2024).
- Flow graphs: directed (possibly cyclic) computation graphs of interacting agents; solution constructed by iterative node (agent) invocations (Mineiro, 6 Jun 2024).
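To make the flow-graph formalism concrete, the sketch below iteratively invokes agent nodes over a directed (possibly cyclic) graph; the interface and names are assumptions for illustration, not the API of the cited work.

```python
from typing import Callable, Dict, List, Optional, Tuple

Agent = Callable[[Dict[str, str]], str]   # maps upstream outputs to this node's output

def run_flow(nodes: Dict[str, Agent],
             edges: List[Tuple[str, str]],            # (src, dst) message-passing edges
             entry: str,
             max_steps: int = 16,
             is_final: Optional[Callable[[str], bool]] = None) -> Dict[str, str]:
    """Iteratively invoke agent nodes along a (possibly cyclic) computation flow.

    Each invocation reads the latest outputs of its upstream nodes; execution
    stops when `is_final` accepts an output or `max_steps` is exhausted.
    """
    outputs: Dict[str, str] = {}
    frontier = [entry]
    for _ in range(max_steps):
        if not frontier:
            break
        node = frontier.pop(0)
        upstream = {src: outputs[src] for (src, dst) in edges
                    if dst == node and src in outputs}
        outputs[node] = nodes[node](upstream)
        if is_final is not None and is_final(outputs[node]):
            break
        frontier.extend(dst for (src, dst) in edges if src == node)
    return outputs
```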
2. Core Multiagent Finetuning Paradigms and Algorithms
Several established frameworks instantiate multiagent finetuning, differing in data generation, update rules, and feedback protocols.
A. Multi-Agent Reinforcement Fine-Tuning (MARFT)
- Asynchronous, role-driven PPO-style trust-region updates, using LoRA adapters, with custom profile prompts for each agent; typically applied in LLM-based agentic tasks (Liao et al., 21 Apr 2025).
- GAE (Generalized Advantage Estimation) for per-agent advantage computation.
- Sequential/agent-by-agent updates respect inter-agent output dependencies via a dependency function in Flex-POMDP.
- Example pseudocode structure:
```python
for episode in episodes:
    for t in range(horizon):
        actions = {}
        for i in agents:
            o_i = build_obs(i)                 # agent-local (profile-aware) observation
            actions[i] = pi_theta[i](o_i)      # role-conditioned, LoRA-adapted policy
        r_t = env.step([actions[i] for i in agents])   # joint step with all agents' actions
    update_critics()      # fit per-agent critics / GAE advantage estimates
    update_policies()     # sequential, trust-region (PPO-style) policy updates
```
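The per-agent advantage computation referenced above is standard GAE applied to each agent's reward and value sequence; the function below is a generic implementation sketch, not code from the MARFT paper.

```python
from typing import List

def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation for one agent's trajectory.

    `values` must hold T+1 entries (a bootstrap value is appended at the end).
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```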
B. Multi-Agent Group Relative Policy Optimization (MAGRPO) for LLM Collaboration
- Each agent samples group rollouts; group-relative standardized advantages model peer-relative progress.
- PPO-style clipped update, no global critic; only decentralized policies and group returns (Liu et al., 6 Aug 2025).
- Well-suited for collaborative LLM composition, e.g. summarization or code writing.
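A minimal sketch of the group-relative standardization and critic-free clipped update described above, assuming each agent samples a group of rollouts per prompt; tensor shapes and function names are assumptions.

```python
import torch

def group_relative_advantages(returns: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize each rollout's return against its own group (shape: [num_groups, group_size])."""
    mean = returns.mean(dim=-1, keepdim=True)
    std = returns.std(dim=-1, keepdim=True)
    return (returns - mean) / (std + eps)

def clipped_policy_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                        advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate objective; no global value critic is used."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```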
C. Online Joint Fine-Tuning of Flows
- Flow agents jointly updated by reducing preferences over entire computation episodes to node-level pairwise preferences using “one-step deviation” rollouts with a simulator (Mineiro, 6 Jun 2024).
- Surrogate DPO-style (Direct Preference Optimization) loss at the node level, of the form $\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\big[\log \sigma\big(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\big)\big]$, where $y_w$ and $y_l$ are the preferred and dispreferred outputs for node input $x$ and $\pi_{\mathrm{ref}}$ is the frozen reference policy.
- Extensible to reward-free settings using external episode judge models.
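A minimal sketch of such a node-level preference loss, assuming summed token log-probabilities of the preferred and dispreferred node outputs under the trained and reference policies are already available; this follows the standard DPO formulation rather than the exact surrogate of the cited paper.

```python
import torch
import torch.nn.functional as F

def node_level_dpo_loss(logp_pref: torch.Tensor, logp_dispref: torch.Tensor,
                        ref_logp_pref: torch.Tensor, ref_logp_dispref: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """DPO-style loss over (preferred, dispreferred) node outputs.

    Each tensor holds the summed token log-probabilities of a node output
    under the trained policy (logp_*) or the frozen reference policy (ref_logp_*).
    """
    margin = (logp_pref - ref_logp_pref) - (logp_dispref - ref_logp_dispref)
    return -F.logsigmoid(beta * margin).mean()
```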
D. Multiagent Debate and Self-Improvement
- Multiagent societies (e.g., LLMs) generate synthetic enriched data by interacting (debate, critic roles); each agent is finetuned only on data it “knows” (matches consensus or demonstrates correction) (Subramaniam et al., 10 Jan 2025).
- Inter-agent KL penalties optionally encourage diversity.
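The per-agent data selection rule can be illustrated with a simple consensus filter that keeps an agent's (prompt, answer) pair only when its final answer matches the debate majority; the function is a hypothetical sketch, not the exact filtering procedure of the cited paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

def consensus_filter(prompt: str,
                     agent_answers: Dict[str, str]) -> Dict[str, List[Tuple[str, str]]]:
    """Build per-agent finetuning examples from one debate round.

    An agent keeps (prompt, answer) only if its answer matches the majority answer.
    """
    majority, _ = Counter(agent_answers.values()).most_common(1)[0]
    datasets: Dict[str, List[Tuple[str, str]]] = {agent: [] for agent in agent_answers}
    for agent, answer in agent_answers.items():
        if answer == majority:
            datasets[agent].append((prompt, answer))
    return datasets
```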
E. Transfer Learning with Temporal Contrastive and Goal-Conditioned Policies
- Goal-conditioned pretraining on source tasks + fine-tuning in the new (target) domain, with temporal contrastive learning for unsupervised discovery of subgoal bottlenecks (Zeng et al., 3 Jun 2024).
- Clustering embeddings yields a planning graph for agent path planning through abstract subgoals.
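A minimal InfoNCE-style temporal contrastive loss consistent with this setup, where embeddings of temporally nearby states form positive pairs and the other states in the batch serve as negatives; the encoder and pair construction are assumed to be handled elsewhere.

```python
import torch
import torch.nn.functional as F

def temporal_infonce(anchor_emb: torch.Tensor, positive_emb: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE over a batch: row i's positive is positive_emb[i]; all other rows act as negatives."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.t() / temperature          # [B, B] cosine-similarity logits
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)
```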
F. Correction-Focused Fine-Tuning for Large-Scale Pathfinding
- Selective expert labeling of “hard” trajectory segments (delta-data generation) to actively target failure modes (Andreychuk et al., 30 Jun 2025).
- Fine-tuning on mixtures of original and newly generated (delta) data to reduce catastrophic forgetting.
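A sketch of the delta-data idea: replay the learner on logged trajectories, flag states where its action diverges from the expert's, and extract short windows around those states for expert relabeling. The disagreement criterion and window size are assumptions.

```python
from typing import Callable, List, Sequence, Tuple

State, Action = object, object  # placeholders for task-specific types

def select_hard_segments(trajectory: Sequence[Tuple[State, Action]],
                         learner_action: Callable[[State], Action],
                         window: int = 8) -> List[Sequence[Tuple[State, Action]]]:
    """Return trajectory windows centered on states where the learner disagrees with the expert.

    `trajectory` is a list of (state, expert_action) pairs from a logged expert solution.
    """
    hard_steps = [t for t, (state, expert_a) in enumerate(trajectory)
                  if learner_action(state) != expert_a]
    segments = []
    for t in hard_steps:
        lo, hi = max(0, t - window), min(len(trajectory), t + window)
        segments.append(trajectory[lo:hi])
    return segments
```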
3. Architectural Principles, Implementation, and Hyperparameters
Agents are typically implemented as:
- LLMs enhanced with LoRA/PTuning adapters (parameter-efficient finetuning) (Liang et al., 2 Apr 2024, Mineiro, 6 Jun 2024, Liao et al., 21 Apr 2025).
- RL agents with policy/value networks (MLPs, transformers), UVFA critics for GCRL (Zeng et al., 3 Jun 2024, Castagna, 26 Jan 2025).
- Non-autoregressive transformer encoders for high-throughput MAPF solvers (Andreychuk et al., 30 Jun 2025).
Key hyperparameters include:
- Learning rates (task-dependent; lower for RL/LoRA updates, higher for SFT).
- PPO/actor–critic settings (clip ratio $\epsilon$, GAE $\lambda$, entropy coefficients, batch sizes 4–64).
- Contrastive and clustering hyperparameters (negatives per anchor, K for K-means, InfoNCE temperature).
- Role prompt designs (“system messages” per agent profile, crucial for agent distinction and performance in LLM-based MARFT and CMAT) (Liang et al., 2 Apr 2024, Liao et al., 21 Apr 2025).
Adaptation to the multi-agent setting requires:
- Construction of joint observation/action buffers.
- Explicit dependency tracking (Flex-POMDP, role-based computation graphs).
- Jointly optimized, but parameter-distinct, agent modules for specialization/diversity.
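The first two adaptation points can be captured in a minimal joint rollout buffer that stores synchronized per-agent observations and actions alongside explicit dependency edges; the field names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class JointRolloutBuffer:
    """Synchronized per-agent transitions plus explicit inter-agent dependency records."""
    observations: List[Dict[int, str]] = field(default_factory=list)   # step -> {agent_id: obs}
    actions: List[Dict[int, str]] = field(default_factory=list)        # step -> {agent_id: action}
    rewards: List[float] = field(default_factory=list)                 # shared/global reward per step
    dependencies: List[List[Tuple[int, int]]] = field(default_factory=list)  # (i depends on j) edges

    def add(self, obs: Dict[int, str], acts: Dict[int, str],
            reward: float, deps: List[Tuple[int, int]]) -> None:
        self.observations.append(obs)
        self.actions.append(acts)
        self.rewards.append(reward)
        self.dependencies.append(deps)
```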
4. Empirical Results and Benchmark Achievements
Multiagent finetuning has demonstrated:
- Dramatic gains in sample efficiency and convergence speed: e.g., temporal contrastive framework required only 21.7% as many samples as the best baseline for Overcooked multi-agent tasks (Zeng et al., 3 Jun 2024); MAPF-GPT-DDG achieved 20pp success-rate gain and 2x faster loss drop over DAgger (Andreychuk et al., 30 Jun 2025).
- Substantial improvement in coordination and final metrics: e.g., MOAT increased step success-rate in Mind2Web from 41.1% to 44.1% (held-in) and 49.6% to 52.3% (held-out) on Llama-2 (Zhu et al., 11 Sep 2025).
- Maintenance and even enhancement of output diversity and specialization: multiagent societies preserved diverse reasoning chains and improved accuracy over single-agent FT on MATH (46.8→60.6 for GPT-3.5) (Subramaniam et al., 10 Jan 2025).
- Scaling: MAPF-GPT-DDG reached unprecedented agent counts (up to 1M agents) while keeping per-agent decision times low, outperforming far larger models (Andreychuk et al., 30 Jun 2025).
- Near parity with, or outperformance of, proprietary large models: e.g., TinyAgent-7B + CMAT matched GPT-3.5's overall score of 28.2 on AgentBench (Liang et al., 2 Apr 2024).
Representative quantitative results:
| Method/Domain | Main Metric | SOTA Baseline | Multiagent FT | Improvement |
|---|---|---|---|---|
| Overcooked (2-agent GCRL) | Steps to convergence | 3–9M | 0.68–1.3M | 4.6× sample efficiency |
| CoopHumanEval (coding) | Return (%) | 63 | 88.1 | +25.1 pts |
| MATH (GPT-3.5) | Accuracy (%) | 46.8 | 60.6 | +13.8 pts |
| MAPF (Maze) | Success rate (%) | – | +5–15 pp | 2× validation acceleration |
| Mind2Web (MOAT) | Step success rate (%) | 41.1 | 44.1 | +3.0 pts |
5. Specialization, Diversity, and Interpretability in Multiagent Finetuning
Distinctive to multiagent finetuning is the capacity for agent specialization, diversity retention, and explicit division of labor.
- Specialization is induced by agent-specific data generation (e.g., only self-consistent debate pairs) and consensus filtering (Subramaniam et al., 10 Jan 2025).
- Diversity is measured by increased KL divergence among output distributions and lower consensus compared to single-agent FT (multiagent FT maintains ∼25% diversity in outputs) (Subramaniam et al., 10 Jan 2025).
- Interpretability is advanced by subgoal discovery: temporal contrastive learning yields clusters (graph nodes) with clear semantic roles (e.g., "fetch onion," "load oven") (Zeng et al., 3 Jun 2024).
- In collaborative frameworks, iterative alignment and alternation (e.g., MOAT planning-grounding loops) reduce capability gaps and harmonize agent outputs (Zhu et al., 11 Sep 2025).
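The diversity measurement mentioned above can be approximated by the mean pairwise KL divergence between agents' output distributions (e.g., answer frequencies on a probe set); the estimator below is a generic sketch, not the exact measurement protocol of the cited paper.

```python
import math
from typing import Dict, List

def kl_divergence(p: Dict[str, float], q: Dict[str, float], eps: float = 1e-9) -> float:
    """KL(p || q) over the shared discrete support of answer strings."""
    support = set(p) | set(q)
    return sum(p.get(x, 0.0) * math.log((p.get(x, 0.0) + eps) / (q.get(x, 0.0) + eps))
               for x in support if p.get(x, 0.0) > 0)

def mean_pairwise_kl(agent_dists: List[Dict[str, float]]) -> float:
    """Average KL over all ordered agent pairs; higher values indicate more diverse agents."""
    pairs = [(p, q) for i, p in enumerate(agent_dists)
             for j, q in enumerate(agent_dists) if i != j]
    return sum(kl_divergence(p, q) for p, q in pairs) / max(len(pairs), 1)
```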
6. Practical Guidelines, Scalability, and Open Research Directions
Best practices and limitations as established:
- Employ societies of at least three generator and three critic agents for LLM debate/critique; use role prompts and modular memory (Subramaniam et al., 10 Jan 2025, Liang et al., 2 Apr 2024).
- Interleave supervised (cross-entropy) and RL (actor-critic) updates; mix new with original data to mitigate forgetting (Liang et al., 2 Apr 2024, Zhu et al., 11 Sep 2025).
- Exploit delta-driven and online transfer strategies to actively correct hardest states (Andreychuk et al., 30 Jun 2025, Castagna, 26 Jan 2025).
- Tune clustering, subgoal extraction, and sampling hyperparameters to task class complexity (Zeng et al., 3 Jun 2024).
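A minimal sketch of the interleaving and data-mixing recommendations above: alternate a supervised step on a mixture of original and newly generated data with an RL step. The mixing ratio, batch size, and callback names are assumptions.

```python
import random
from typing import Callable, List, Sequence

def interleaved_finetuning(original_data: Sequence, new_data: Sequence,
                           sft_step: Callable[[List], None],
                           rl_step: Callable[[], None],
                           rounds: int = 100, batch_size: int = 16,
                           original_ratio: float = 0.5) -> None:
    """Alternate SFT on mixed old/new data with RL updates to mitigate forgetting."""
    for _ in range(rounds):
        n_orig = int(batch_size * original_ratio)
        batch = random.sample(list(original_data), min(n_orig, len(original_data)))
        batch += random.sample(list(new_data), min(batch_size - n_orig, len(new_data)))
        sft_step(batch)   # supervised (cross-entropy) update on the data mixture
        rl_step()         # actor-critic / preference-based update on fresh rollouts
```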
Scalability and extensibility:
- Role-based modularity (e.g., CMAT, MARFT) permits the addition of new agent roles (e.g., Planner, Verifier) and scaling to larger agent teams (Liang et al., 2 Apr 2024, Liao et al., 21 Apr 2025).
- Multiagent pathfinding and RL have been scaled to on the order of 1M agents and beyond (Andreychuk et al., 30 Jun 2025).
- Limitations include environment/sample-efficiency bottlenecks, memory costs that grow with the number of agents in inference/training passes, and the need for richer dynamic environments and unified communication standards (Liao et al., 21 Apr 2025).
Notable open research directions:
- Extending text-only frameworks to multimodal agents (vision, code, robotics) (Zhu et al., 11 Sep 2025).
- Incorporation of hierarchical, population-level, and leader-follower structures (Liu et al., 6 Aug 2025, Liao et al., 21 Apr 2025).
- Adaptive and dynamic communication protocols for large agent societies.
- Unified MARFT frameworks bridging RLHF, MARL, and PEFT methodologies (Liao et al., 21 Apr 2025).
7. Summary Table of Representative Multiagent Finetuning Frameworks
| Paper/Framework | Domain | Key Mechanism | Main Empirical Gains |
|---|---|---|---|
| MARFT (Liao et al., 21 Apr 2025) | LLM-based MAS | PPO RL, Flex-POMDP, LoRA | Monotonic accuracy/coordination improvement |
| MAGRPO (Liu et al., 6 Aug 2025) | LLM collaboration | Group-rel. PPO, joint reward | 200 tok/s, +40–50% returns, coop@k gains |
| Joint Alignment (MOAT) (Zhu et al., 11 Sep 2025) | Planning + Grounding | Alternating DPO + SFT | +3–4% on held-in/-out, reduces cap. gap |
| Temporal Contrastive RL (Zeng et al., 3 Jun 2024) | Multi-agent RL transfer | GCRL + InfoNCE subgoal discovery | 4.6× sample efficiency over best baseline |
| Pathfinding DDG (Andreychuk et al., 30 Jun 2025) | MAPF | Delta-error data gen, BC mixed | Outperforms far larger models; scales to 1M agents |
| Multiagent Debate (Subramaniam et al., 10 Jan 2025) | LLM reasoning | Per-agent FT on consensus | +10–14 pts on MATH, enhanced reasoning diversity |
| CMAT (Liang et al., 2 Apr 2024) | Small LLMs, agent tasks | Multi-role RL, SFT, memory | Parity with GPT-3.5 using far fewer parameters |
| EF-OnTL (Castagna, 26 Jan 2025) | RL online TL | Expert-free buffer transfer, uncertainty estimation | 3%+ global gain, faster convergence |
Each framework builds on the core principle that agent interaction (via debate, alignment, memory integration, or error-driven transfer) expands the frontiers of what finetuned multiagent systems can achieve relative to single-agent finetuning, independently updated agents, and classic staged tuning approaches.