
Multiagent Finetuning Strategies

Updated 1 December 2025
  • Multiagent finetuning is the process of jointly updating multiple agents using coordinated protocols and inter-agent communication for enhanced cooperation.
  • Frameworks such as MARFT, MAGRPO, and online joint fine-tuning utilize Dec-POMDPs, goal-conditioned RL, and computation flows to optimize agent performance.
  • Empirical results demonstrate significant improvements in sample efficiency, coordination, and output diversity across challenging RL tasks and LLM applications.

Multiagent finetuning is the joint or coordinated updating of the parameters of multiple agents—typically LLMs or reinforcement learning policies—using interaction data, environmental feedback, or structured collaborative protocols. This paradigm extends single-agent finetuning by leveraging inter-agent communication, coordination, or specialization to induce behaviors such as robust cooperation, reasoning-chain diversity, alignment, or sample-efficient transfer between tasks. Multiagent finetuning has become central to advancing the capabilities of multi-agent systems across domains including cooperative RL, LLM societies, tool use, and complex workflow orchestration.

1. Formal Underpinnings and Multiagent Optimization Objectives

Multiagent finetuning objectives are architected around multi-agent Markov Decision Processes (MDPs), decentralized partially observable MDPs (Dec-POMDPs), or computation graphs (Flows), depending on the agent and task structure. Let $N$ agents with policies $\{\pi^i_{\theta_i}\}_{i=1}^{N}$, parameterized by $\theta_i$, operate in a shared environment, potentially with partial or local observability $o^i_t$ at time $t$, and a reward function $R$ that depends on (global or local) state–action tuples.

The central multi-agent optimization objective is to maximize the cumulative (discounted) return:

$$J(\{\theta_i\}) = \mathbb{E}_{\{\pi^i_{\theta_i}\}}\Bigl[\sum_{t=0}^{T} \gamma^t R(o_t, a_t)\Bigr]$$

In multiagent LLM systems, rewards are often composite metrics encoding structure, correctness, and cooperation yields (Liu et al., 6 Aug 2025), while in multiagent RL, both global and agent-specific rewards are used (Liao et al., 21 Apr 2025, Castagna, 26 Jan 2025).
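
In practice this expectation is estimated from sampled joint rollouts. Below is a minimal Monte Carlo sketch; the `env`/`policies` interface is a hypothetical stand-in, not the API of any cited framework:

    # Illustrative Monte Carlo estimate of the joint objective J({theta_i}).
    import numpy as np

    def estimate_joint_return(env, policies, num_episodes=32, gamma=0.99, horizon=100):
        """Average discounted return over sampled joint rollouts."""
        returns = []
        for _ in range(num_episodes):
            obs = env.reset()                      # list of local observations o^i_0
            total, discount = 0.0, 1.0
            for _ in range(horizon):
                # Each agent acts on its own (possibly partial) observation.
                actions = [pi.act(o) for pi, o in zip(policies, obs)]
                obs, reward, done = env.step(actions)  # shared reward R(o_t, a_t)
                total += discount * reward
                discount *= gamma
                if done:
                    break
            returns.append(total)
        return float(np.mean(returns))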

Key formalisms used:

  • Dec-POMDPs: agents act on histories of local observations; decentralized policies (Liu et al., 6 Aug 2025).
  • Flex-POMDP (for LLM-based MAS): augments the Dec-POMDP with a dependency function $D$ indicating asynchronous, profile-aware dependencies among agent actions (Liao et al., 21 Apr 2025).
  • Goal-conditioned RL: policies $\pi_\theta(a \mid s, g)$ explicitly conditioned on a goal state $g$; supports both pretraining and multiagent transfer (Zeng et al., 3 Jun 2024).
  • Flow graphs: directed (possibly cyclic) computation graphs of interacting agents; solution constructed by iterative node (agent) invocations (Mineiro, 6 Jun 2024).
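
To make the Flex-POMDP dependency function concrete, here is a toy sketch; `AgentProfile` and the specific dependency rule are hypothetical illustrations, not the paper's definitions:

    # Toy sketch of a Flex-POMDP-style dependency function D between agents.
    from dataclasses import dataclass

    @dataclass
    class AgentProfile:
        name: str            # e.g. "Planner", "Coder", "Verifier"
        role_prompt: str     # system message defining the agent's role

    def depends_on(consumer: AgentProfile, producer: AgentProfile, step: int) -> bool:
        """Return True if `consumer` must observe `producer`'s output at this step."""
        # Example rule (assumed): a Verifier always waits for the Coder's output.
        return consumer.name == "Verifier" and producer.name == "Coder"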

2. Core Multiagent Finetuning Paradigms and Algorithms

Several established frameworks instantiate multiagent finetuning, differing in data generation, update rules, and feedback protocols.

A. Multi-Agent Reinforcement Fine-Tuning (MARFT)

  • Asynchronous, role-driven PPO-style trust-region updates, using LoRA adapters, with custom profile prompts for each agent; typically applied in LLM-based agentic tasks (Liao et al., 21 Apr 2025).
  • GAE (Generalized Advantage Estimation) for per-agent advantage computation.
  • Sequential/agent-by-agent updates respect inter-agent output dependencies via a dependency function in Flex-POMDP.
  • Example pseudocode structure:
    # Agent-by-agent rollout; later agents may condition on earlier agents' outputs.
    for episode in episodes:
        for t in range(horizon):
            actions = []
            for i in agents:
                o_i = build_obs(...)               # local obs, possibly incl. prior agents' outputs
                actions.append(pi_theta[i](o_i))   # role-specific (LoRA-adapted) policy
            r_t = env.step(actions)
        update_critics()     # GAE-based per-agent advantages
        update_policies()    # PPO-style trust-region updates
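
The per-agent GAE mentioned above can be sketched as follows; array shapes and the critic interface are illustrative assumptions, not MARFT's exact implementation:

    # Sketch of per-agent Generalized Advantage Estimation for one agent's trajectory.
    import numpy as np

    def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
        """rewards: [T] per-step rewards; values: [T+1] critic value estimates."""
        T = len(rewards)
        advantages = np.zeros(T)
        gae = 0.0
        for t in reversed(range(T)):
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            gae = delta + gamma * lam * gae
            advantages[t] = gae
        return advantages  # fed to the PPO-style per-agent update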

B. Multi-Agent Group Relative Policy Optimization (MAGRPO) for LLM Collaboration

  • Each agent samples $G$ group rollouts; group-relative standardized advantages model peer-relative progress (a minimal sketch follows this list).
  • PPO-style clipped update, no global critic; only decentralized policies and group returns (Liu et al., 6 Aug 2025).
  • Well-suited for collaborative LLM composition, e.g. summarization or code writing.
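
A hedged sketch of the group-relative standardized advantage and the clipped surrogate it feeds; tensor shapes and the 1e-8 stabilizer are assumptions rather than the paper's exact formulation:

    # Group-relative advantages and a PPO-style clipped loss, in the spirit of MAGRPO.
    import torch

    def group_relative_advantages(group_returns: torch.Tensor) -> torch.Tensor:
        """group_returns: [G] joint returns for G rollouts sampled by one agent."""
        return (group_returns - group_returns.mean()) / (group_returns.std() + 1e-8)

    def clipped_policy_loss(logp_new, logp_old, advantages, eps=0.2):
        """Clipped surrogate; no global critic, only group-relative advantages."""
        ratio = torch.exp(logp_new - logp_old)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
        return -torch.min(unclipped, clipped).mean()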

C. Online Joint Fine-Tuning of Flows

  • Flow agents jointly updated by reducing preferences over entire computation episodes to node-level pairwise preferences using “one-step deviation” rollouts with a simulator (Mineiro, 6 Jun 2024).
  • Surrogate DPO-style (Direct Preference Optimization) loss at the node level:

$$L_{\text{node}}(\theta_i) = -\log\sigma\bigl(\beta\bigl[\log\pi_{\theta_i}(o^{+}\mid s) - \log\pi_{\theta_i}(o^{-}\mid s)\bigr]\bigr)$$

  • Extensible to reward-free settings using external episode judge models.
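
A minimal code rendering of the node-level loss displayed above; the log-probabilities of the preferred ($o^+$) and rejected ($o^-$) outputs are assumed to be precomputed, and only the terms shown in the formula are included:

    # Node-level DPO-style surrogate loss; o+/o- come from one-step-deviation rollouts.
    import torch.nn.functional as F

    def node_dpo_loss(logp_preferred, logp_rejected, beta=0.1):
        """logp_*: log pi_theta_i(o|s) for the preferred (o+) and rejected (o-) outputs."""
        margin = beta * (logp_preferred - logp_rejected)
        return -F.logsigmoid(margin).mean()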

D. Multiagent Debate and Self-Improvement

  • Multiagent societies (e.g., LLMs) generate synthetic enriched data by interacting (debate, critic roles); each agent is finetuned only on data it “knows” (matches consensus or demonstrates correction) (Subramaniam et al., 10 Jan 2025).
  • Inter-agent KL penalties optionally encourage diversity.
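
A toy sketch of the per-agent consensus filtering idea; the data layout (`debate_rounds`, majority voting over final answers) is illustrative rather than the paper's exact pipeline:

    # Each agent keeps only examples where its own answer matches the majority answer.
    from collections import Counter

    def consensus_filter(debate_rounds):
        """debate_rounds: list of dicts {agent_id: answer} plus a 'question' key."""
        per_agent_data = {}
        for rnd in debate_rounds:
            answers = {k: v for k, v in rnd.items() if k != "question"}
            majority, _ = Counter(answers.values()).most_common(1)[0]
            for agent_id, answer in answers.items():
                if answer == majority:  # the agent "knows" this example
                    per_agent_data.setdefault(agent_id, []).append(
                        {"prompt": rnd["question"], "completion": answer})
        # The cited paper also retains data demonstrating corrections; omitted here.
        return per_agent_data  # each agent is finetuned only on its own subset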

E. Transfer Learning with Temporal Contrastive and Goal-Conditioned Policies

  • Goal-conditioned pretraining on source tasks + fine-tuning in the new (target) domain, with temporal contrastive learning for unsupervised discovery of subgoal bottlenecks (Zeng et al., 3 Jun 2024).
  • Clustering the embeddings yields a planning graph $G$ for agent path planning through abstract subgoals.
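
A minimal sketch of a temporal InfoNCE objective of the kind described above, assuming an external encoder and a sampling scheme in which temporally nearby states are positives and distant states are negatives:

    # Temporal InfoNCE contrastive loss over state embeddings.
    import torch
    import torch.nn.functional as F

    def temporal_infonce(anchor_emb, positive_emb, negative_embs, temperature=0.1):
        """anchor/positive: [B, D]; negatives: [B, K, D] state embeddings."""
        pos_logits = F.cosine_similarity(anchor_emb, positive_emb, dim=-1) / temperature  # [B]
        neg_logits = F.cosine_similarity(
            anchor_emb.unsqueeze(1).expand_as(negative_embs),
            negative_embs, dim=-1) / temperature                                          # [B, K]
        logits = torch.cat([pos_logits.unsqueeze(1), neg_logits], dim=1)                  # [B, 1+K]
        labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)      # positive at index 0
        return F.cross_entropy(logits, labels)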

F. Correction-Focused Fine-Tuning for Large-Scale Pathfinding

  • MAPF-GPT-DDG combines delta-error (correction-focused) data generation with behavior cloning, achieving a +20 pp success-rate gain and 2× faster loss reduction over DAgger while scaling to very large agent populations (Andreychuk et al., 30 Jun 2025).

3. Architectural Principles, Implementation, and Hyperparameters

Agents are typically implemented as LLMs equipped with parameter-efficient adapters (e.g., LoRA) and role-specific system prompts, or as actor–critic RL policies, depending on the domain.

Key hyperparameters include:

  • Learning rates ($10^{-6}$ to $10^{-4}$; lower for RL/LoRA, higher for SFT).
  • PPO/actor–critic settings (clip $\epsilon = 0.2$, GAE $\lambda = 0.95$, entropy coefficients, batch sizes 4–64).
  • Contrastive and clustering hyperparameters (negatives per anchor, $K$ for K-means, InfoNCE temperature).
  • Role prompt designs (“system messages” per agent profile, crucial for agent distinction and performance in LLM-based MARFT and CMAT) (Liang et al., 2 Apr 2024, Liao et al., 21 Apr 2025).
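
A hedged configuration sketch collecting the ranges above in one place; the field names, and the defaults marked as assumed, are illustrative rather than any framework's actual schema:

    # Illustrative hyperparameter bundle for multiagent finetuning runs.
    from dataclasses import dataclass

    @dataclass
    class MultiagentFinetuneConfig:
        lr: float = 1e-5                 # 1e-6 to 1e-4; lower for RL/LoRA, higher for SFT
        ppo_clip_eps: float = 0.2
        gae_lambda: float = 0.95
        entropy_coef: float = 0.01       # assumed typical value; papers vary
        batch_size: int = 16             # within the 4-64 range cited above
        infonce_temperature: float = 0.1  # assumed
        negatives_per_anchor: int = 16    # assumed; task-dependent
        kmeans_k: int = 8                 # assumed number of subgoal clusters
        role_prompt: str = "You are the Planner agent."  # per-agent system message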

Adaptation to the multi-agent setting requires:

  • Construction of joint observation/action buffers.
  • Explicit dependency tracking (Flex-POMDP, role-based computation graphs).
  • Jointly optimized, but parameter-distinct, agent modules for specialization/diversity.
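
A minimal sketch of a joint observation/action buffer of the kind described above; the field layout is an assumption for illustration, not a specific framework's buffer implementation:

    # Joint rollout buffer storing per-step local observations, actions, and shared reward.
    from dataclasses import dataclass, field
    from typing import Any, List

    @dataclass
    class JointRolloutBuffer:
        num_agents: int
        observations: List[List[Any]] = field(default_factory=list)  # [T][N] local obs
        actions: List[List[Any]] = field(default_factory=list)       # [T][N] actions
        rewards: List[float] = field(default_factory=list)           # shared reward per step

        def add(self, obs_per_agent, action_per_agent, reward):
            assert len(obs_per_agent) == len(action_per_agent) == self.num_agents
            self.observations.append(list(obs_per_agent))
            self.actions.append(list(action_per_agent))
            self.rewards.append(float(reward))

        def agent_trajectory(self, i):
            """Per-agent view used for agent-specific (specialized) updates."""
            return [o[i] for o in self.observations], [a[i] for a in self.actions]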

4. Empirical Results and Benchmark Achievements

Multiagent finetuning has demonstrated:

  • Dramatic gains in sample efficiency and convergence speed: e.g., the temporal contrastive framework required only 21.7% as many samples as the best baseline on Overcooked multi-agent tasks (Zeng et al., 3 Jun 2024); MAPF-GPT-DDG achieved a +20 pp success-rate gain and a 2× faster loss drop over DAgger (Andreychuk et al., 30 Jun 2025).
  • Substantial improvement in coordination and final metrics: e.g., MOAT increased step success-rate in Mind2Web from 41.1% to 44.1% (held-in) and 49.6% to 52.3% (held-out) on Llama-2 (Zhu et al., 11 Sep 2025).
  • Maintenance and even enhancement of output diversity and specialization: multiagent societies preserved diverse reasoning chains and improved accuracy over single-agent FT on MATH (46.8→60.6 for GPT-3.5) (Subramaniam et al., 10 Jan 2025).
  • Scaling: MAPF-GPT-DDG reached unprecedented agent counts ($10^6$) with decision times of ≈163 μs per agent, outperforming far larger models (Andreychuk et al., 30 Jun 2025).
  • Near parity or outperformance versus proprietary large models: e.g., TinyAgent-7B + CMAT scored 28.2 overall vs. GPT-3.5 at 28.2 on AgentBench (Liang et al., 2 Apr 2024).

Representative quantitative results:

| Method/Domain | Main Metric | SOTA Baseline | Multiagent FT | Improvement |
|---|---|---|---|---|
| Overcooked (2-agent GCRL) | Steps to convergence | 3–9M | 0.68–1.3M | 4.6× sample efficiency |
| CoopHumanEval (coding) | Return (%) | 63 | 88.1 | +25.1 pts |
| MATH (GPT-3.5) | Accuracy (%) | 46.8 | 60.6 | +13.8 pts |
| MAPF (Maze) | Success rate (%) | — | — | +5–15 pp, 2× validation acceleration |
| Mind2Web (MOAT) | Step success (%) | 41.1 | 44.1 | +3.0 pts |

5. Specialization, Diversity, and Interpretability in Multiagent Finetuning

Distinctive to multiagent finetuning is the capacity for agent specialization, diversity retention, and explicit division of labor.

  • Specialization is induced by agent-specific data generation (e.g., only self-consistent debate pairs) and consensus filtering (Subramaniam et al., 10 Jan 2025).
  • Diversity is tracked via KL divergence among agents' output distributions; multiagent FT shows higher divergence and lower consensus than single-agent FT, maintaining ∼25% diversity in outputs (Subramaniam et al., 10 Jan 2025).
  • Interpretability is advanced by subgoal discovery: temporal contrastive learning yields clusters (graph nodes) with clear semantic roles (e.g., "fetch onion," "load oven") (Zeng et al., 3 Jun 2024).
  • In collaborative frameworks, iterative alignment and alternation (e.g., MOAT planning–grounding loops) reduce capability gaps and harmonize agent outputs (Zhu et al., 11 Sep 2025).
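
A small sketch of the diversity measurement described above, assuming each agent's outputs have been summarized as a categorical distribution over a shared probe set; the aggregation choice (mean pairwise KL) is an assumption:

    # Mean pairwise KL divergence between agents' output distributions.
    import numpy as np

    def mean_pairwise_kl(agent_dists: np.ndarray, eps: float = 1e-12) -> float:
        """agent_dists: [N, V] per-agent probability distributions over V outputs."""
        p = agent_dists + eps
        p = p / p.sum(axis=1, keepdims=True)
        kls = []
        for i in range(len(p)):
            for j in range(len(p)):
                if i != j:
                    kls.append(np.sum(p[i] * np.log(p[i] / p[j])))
        return float(np.mean(kls))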

6. Practical Guidelines, Scalability, and Open Research Directions

Best practices and limitations as established:

Scalability and extensibility:

  • Role-based modularity (e.g., CMAT, MARFT) permits the addition of agents (e.g., Planner, Verifier) and scaling to $N > 2$ (Liang et al., 2 Apr 2024, Liao et al., 21 Apr 2025).
  • Multiagent pathfinding and RL can be scaled to $10^6$ agents and beyond (Andreychuk et al., 30 Jun 2025).
  • Limitations include environment/sample-efficiency bottlenecks, memory costs ($O(N)$ inference/training passes), and the need for richer dynamic environments and unified communication standards (Liao et al., 21 Apr 2025).

Notable open research directions:

7. Summary Table of Representative Multiagent Finetuning Frameworks

| Paper/Framework | Domain | Key Mechanism | Main Empirical Gains |
|---|---|---|---|
| MARFT (Liao et al., 21 Apr 2025) | LLM-based MAS | PPO RL, Flex-POMDP, LoRA | Monotonic accuracy/coordination improvement |
| MAGRPO (Liu et al., 6 Aug 2025) | LLM collaboration | Group-relative PPO, joint reward | ∼200 tok/s, +40–50% returns, coop@k gains |
| Joint Alignment (MOAT) (Zhu et al., 11 Sep 2025) | Planning + grounding | Alternating DPO + SFT | +3–4% held-in/held-out, reduced capability gap |
| Temporal Contrastive RL (Zeng et al., 3 Jun 2024) | Multi-agent RL transfer | GCRL + InfoNCE subgoal discovery | 4.6× sample efficiency over best baseline |
| Pathfinding DDG (Andreychuk et al., 30 Jun 2025) | MAPF | Delta-error data generation, mixed BC | Outperforms 50× larger models; 1M agents |
| Multiagent Debate (Subramaniam et al., 10 Jan 2025) | LLM reasoning | Per-agent FT on consensus | +10–14 pts on MATH, enhanced reasoning diversity |
| CMAT (Liang et al., 2 Apr 2024) | Small LLMs, agent tasks | Multi-role RL, SFT, memory | Parity with GPT-3.5 with $<10^9$ params |
| EF-OnTL (Castagna, 26 Jan 2025) | RL online transfer learning | Expert-free buffer transfer, UE | 3%+ global gain, faster convergence |

Each framework builds on the core principle that agent interaction—via debate, alignment, memory integration, or error-driven transfer—expands the frontiers of what finetuned multiagent systems can achieve relative to single-agent or independently updated agents, and classic staged tuning approaches.
