Multi-Agent Reinforcement Fine-Tuning (MARFT)
- MARFT is a framework that extends reinforcement learning fine-tuning to multi-agent systems operating in decentralized, partially observable environments.
- It integrates techniques such as model-based offline methods, cooperative PPO variants, and metric-oriented fine-tuning to support policy transfer, coordination, and profile-aware constraints.
- Its robust design facilitates efficient policy transfer and rapid adaptation in applications such as collaborative LLM systems, traffic control, and multi-agent robotics.
Multi-Agent Reinforcement Fine-Tuning (MARFT) comprises a family of methodologies for fine-tuning multi-agent systems—often composed of pretrained large models such as LLMs or deep RL agents—using reinforcement learning objectives that target cooperative or competitive tasks in decentralized environments. MARFT extends classical multi-agent reinforcement learning (MARL) and single-agent reinforcement fine-tuning (RFT) paradigms to settings requiring trial-and-error adaptation, explicit coordination, transfer, and robustness. It provides a general framework for optimizing multi-agent systems using task rewards, often subject to application-specific constraints such as profile-awareness, language fluency, transfer efficiency, and alignment with human preferences.
1. Formal Definitions and Core Problem Settings
MARFT operates within the formalism of partially observable stochastic games (POSGs) or decentralized partially observable Markov decision processes (Dec-POMDPs), with each agent $i$ parameterized by its own or a shared policy $\pi_{\theta_i}$. In the context of LLM-based Multi-Agent Systems (LaMAS), this is further generalized in the Flex-POMDP formalism as the tuple

$$\langle \mathcal{N}, \mathcal{S}, \mathcal{O}, \mathcal{A}, P, R, \gamma, D \rangle,$$

where
- $\mathcal{N} = \{1, \dots, n\}$ is the set of agents,
- $\mathcal{S}$ is the state space,
- $\mathcal{O} = \times_i \mathcal{O}_i$, $\mathcal{A} = \times_i \mathcal{A}_i$ are the joint observation/action spaces,
- $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition kernel,
- $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function,
- $\gamma \in [0, 1)$ is the discount factor, and
- $D$ encodes cross-agent dependencies or orchestration (Liao et al., 21 Apr 2025).
The MARFT objective is to maximize the expected global return:

$$\max_{\theta}\; J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\Big],$$

subject to constraints that preserve pretrained capabilities (e.g., KL-regularization against the reference policy for LLMs). Unlike conventional MARL, MARFT often incorporates asynchronous or profile-aware agents, complex dependency functions, and heterogeneity at both the model and action levels (Liao et al., 21 Apr 2025, Liu et al., 6 Aug 2025, Ma et al., 2024).
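The KL-constrained return above can be sketched numerically; a minimal sketch, assuming a simple additive per-agent KL penalty with an illustrative weight `beta` (the exact regularization form varies across MARFT instantiations):

```python
def marft_objective(rewards, kl_divs, gamma=0.99, beta=0.1):
    """Discounted global return minus a KL penalty that anchors each
    agent's fine-tuned policy to its pretrained reference policy."""
    ret = sum(gamma**t * r for t, r in enumerate(rewards))
    kl_penalty = beta * sum(kl_divs)  # one KL term per agent
    return ret - kl_penalty

# Two-step episode, two agents with small drift from their references.
J = marft_objective(rewards=[1.0, 0.5], kl_divs=[0.2, 0.1], gamma=0.9)
```

Raising `beta` trades task return for fidelity to the pretrained policies, which is the mechanism MARFT uses to avoid catastrophic loss of language fluency or pretrained capabilities.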
2. Algorithmic Frameworks and Paradigms
MARFT encompasses a diverse set of algorithmic instantiations reflecting both the transfer and fine-tuning aspects unique to multi-agent systems:
- Model-Based Offline MARFT: Synthetic experience generation using learned world models (e.g., MOMA-PPO) to solve offline coordination and fine-tuning, handling strategy agreement and adaptation under partial observability with penalized rollouts for epistemic uncertainty (Barde et al., 2023).
- Sequential and Cooperative PPO: Extensions of PPO (MAPPO, HAPPO, MAGRPO, CORY) to multi-agent domains, facilitating coordinated or role-structured updates. In CORY, two LLM agents coevolve as pioneer and observer, exchanging roles and utilizing collective rewards to mitigate collapse and improve Pareto-frontier trade-offs (Ma et al., 2024).
- Metric-Oriented Fine-Tuning (“SFT–RFT–SFT” pipelines): Alternating supervised and RL-based phases tailored for next-token prediction models in multi-agent simulators, with metric-oriented policy optimization (MPO) directly aligning agent behavior with human-centric evaluation criteria (Pei et al., 28 Sep 2025).
- Multi-Agent Policy Option Transfer: MAPTF models transfer among agents as an option-learning problem, leveraging successor representations (SRO) to stabilize policy transfer under partial observability (Yang et al., 2020).
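To make the group-relative idea behind MAGRPO-style updates concrete, here is a minimal sketch of critic-free advantage estimation: each sampled joint rollout's return is normalized against its group's statistics (the exact normalization in the paper may differ):

```python
def group_relative_advantages(group_returns):
    """GRPO-style advantage: score each sampled joint rollout's return
    relative to the mean/std of its group, avoiding a learned critic."""
    mean = sum(group_returns) / len(group_returns)
    var = sum((r - mean) ** 2 for r in group_returns) / len(group_returns)
    std = var ** 0.5 or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in group_returns]

# Four joint rollouts of the same prompt/task with different returns.
adv = group_relative_advantages([2.0, 0.0, 1.0, 1.0])
```

Because advantages are centered within each group, they sum to zero: above-average joint behaviors are reinforced and below-average ones suppressed, without a separate value network.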
The table below summarizes representative MARFT algorithmic directions:
| Approach | Policy Update Mechanism | Key Innovations | Primary Domain |
|---|---|---|---|
| MOMA-PPO (Barde et al., 2023) | Model-based MAPPO, uncertainty masking | Offline coordination/fine-tuning, synthetic rollouts | Robotics, MuJoCo, games |
| MAGRPO (Liu et al., 6 Aug 2025) | Group-rel. PPO, centralized adv. | Cooperative multi-turn LLM tasks | LLM collaboration, code/writing |
| CORY (Ma et al., 2024) | Dual-agent PPO, coevolution | Role exchange, joint reward, KL control | LLM policy fine-tuning |
| SMART-R1 (Pei et al., 28 Sep 2025) | MPO, iterative SFT/RFT phases | Closed-loop metric optimization, “SFT-RFT-SFT” | Traffic simulation, foundation models |
| MAPTF (Yang et al., 2020) | Option-based transfer, SRO | Adaptive intra-system transfer, partial obs. | Homogeneous multiagent RL |
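As a concrete illustration of CORY's role exchange, the sketch below alternates which LLM copy acts as pioneer versus observer and trains both on a shared reward; the exchange period and additive reward form are illustrative assumptions, not the paper's exact protocol:

```python
def cory_roles(step, period=100):
    """Alternate which agent acts as 'pioneer' (answers first) and which
    acts as 'observer' (answers conditioned on the pioneer's output)."""
    even_phase = (step // period) % 2 == 0
    return ("agent_a", "agent_b") if even_phase else ("agent_b", "agent_a")

def collective_reward(r_pioneer, r_observer):
    """Both agents are credited with the shared sum, coupling their
    co-evolution and discouraging one-sided collapse."""
    return r_pioneer + r_observer

pioneer, observer = cory_roles(step=150)  # roles swapped in the second period
```

Periodic role exchange ensures neither copy over-specializes to a single position in the pipeline, which is the mechanism CORY uses to mitigate policy collapse.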
3. Theoretical Analyses: Sample Efficiency and Task Decomposition
A crucial question tackled by recent works is: when does MARFT achieve statistical or computational advantages over single-agent RL fine-tuning (SARL/RFT)? The PAC learning analysis in (Su et al., 9 Feb 2026) formalizes this, showing:
- If a task decomposes into $K$ independent subtasks, MARFT achieves a sample-complexity reduction that scales with the hardest subtask's parameter dimension rather than the full joint dimension.
- When subtasks are dependent, error propagation and objective misalignment erode this advantage; MARFT remains preferable only while the decomposition gain outweighs the quantified misalignment penalty.
- Practical pipeline: segment tasks, empirically estimate the inter-subtask dependence, and prefer MARFT when subtasks are nearly independent and agent parameterization is efficiently structured (Su et al., 9 Feb 2026).
These theoretical guidelines are directly relevant for complex agentic LLM pipelines (e.g., planner–solver–verifier, multi-stage chain-of-thought), informing how to design architectures and rewards for MARFT success.
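This decision rule can be caricatured in a few lines; the additive dependence penalty below is an illustrative simplification of the analysis, not the paper's exact bound:

```python
def prefer_marft(subtask_dims, joint_dim, dependence_penalty=0.0):
    """Sketch of the PAC-style decision rule: multi-agent fine-tuning pays
    off when sample cost scales with the hardest subtask's parameter
    dimension (plus a penalty for inter-subtask dependence) rather than
    with the full joint parameter dimension."""
    marft_cost = max(subtask_dims) + dependence_penalty
    return marft_cost < joint_dim

# Three nearly independent subtasks vs. a monolithic 300-dim policy.
prefer_marft([120, 90, 100], joint_dim=300, dependence_penalty=40)
```

As the dependence penalty grows (strongly coupled subtasks), the rule flips back toward single-agent fine-tuning, matching the qualitative message of the analysis.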
4. Empirical Implementations and Applications
MARFT has been instantiated across disparate application classes:
- LLM-based Multi-Agent Systems (LaMAS): The MARFT framework (Liao et al., 21 Apr 2025) leverages Flex-POMDP for profile-aware, asynchronous agentic workflows (code generation, scientific reasoning, presentation slide preparation). Engineering considerations include distributed rollout, adapter isolation, and monotonic sequential updates via advantage decomposition.
- Traffic Simulation and Control: MARFT is used for efficient, robust tuning in hierarchical control of transportation networks and traffic simulators, where each RL agent tunes actuator gains for local control policies. Multi-agent fine-tuning achieves resilience to partial agent failures and matches centralized RL performance (Önür et al., 8 Dec 2025, Pei et al., 28 Sep 2025).
- StarCraft SMAC and Multi-Agent Robotics: Transferrable and scenario-independent policies are achieved via state abstraction (Influence Map encoding), curriculum transfer, and online fine-tuning mechanisms—e.g., OVMSE and scenario-independent A2C (Nipu et al., 2024, Zhong et al., 2024). These enable rapid adaptation to new environments and tasks with minimal per-scenario engineering.
- Collaborative LLM Writing and Coding: Algorithms such as MAGRPO (Liu et al., 6 Aug 2025) and CORY (Ma et al., 2024) demonstrate significant improvements over single-agent RL baselines in writing/coding tasks, yielding faster learning and higher-quality collaborative completions.
5. Robustness, Transfer, and Coordination Challenges
MARFT methods address key practical challenges:
- Robustness: Decentralized approaches (e.g., OVMSE (Zhong et al., 2024), decentralized controllers (Önür et al., 8 Dec 2025)) ensure continued operation even under agent/node failures or sensor corruption, outperforming monolithic RL in adverse conditions.
- Efficient Transfer: MAPTF (Yang et al., 2020) and scenario-independent encodings (Nipu et al., 2024) enable intra- and inter-agent policy reuse, accelerating convergence and stability, particularly in partially observable settings.
- Coordination Under Partial Observability: Model-based rollouts and joint reward structures provide a mechanism for agents to learn implicit conventions and cooperative equilibria without explicit communication channels (Barde et al., 2023, Liu et al., 6 Aug 2025).
- Alignment and Safe Adaptation: R1-style pipelines (SMART-R1 (Pei et al., 28 Sep 2025)) and Flex-POMDP-based MARFT tightly integrate RL fine-tuning with metric-based or profile-based constraints, maintaining domain-specific desiderata (e.g., task success, human-preference alignment, language fluency).
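The joint-reward mechanism behind implicit coordination can be sketched minimally; assuming a simple team-reward scheme where every agent is credited with the summed outcome (actual credit assignment in the cited methods is richer):

```python
def assign_joint_reward(individual_rewards, mode="team"):
    """Sketch of joint-reward shaping: under a shared ('team') reward,
    every agent is credited with the summed outcome, steering policies
    toward cooperative equilibria without explicit communication."""
    if mode == "team":
        total = sum(individual_rewards)
        return [total] * len(individual_rewards)
    return list(individual_rewards)  # independent credit, for comparison

assign_joint_reward([1.0, 0.0, 0.5])  # every agent sees the team outcome
```

Under the team scheme an agent is rewarded for enabling its partners' success, which is how implicit conventions can emerge even when observations are partial and no communication channel exists.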
6. Open Research Directions and Limitations
MARFT research highlights both promising potentials and key limitations:
- Dynamic and Heterogeneous Environments: Standardized, LLM-friendly multi-step agentic benchmarks are lacking; heterogeneity in architecture, reward, and agent roles adds complexity (Liao et al., 21 Apr 2025).
- Sample Efficiency: On-policy MARFT remains resource intensive; integrating off-policy methods or synthetic trajectory generation is an open avenue (Barde et al., 2023, Zhong et al., 2024).
- Reward Design and Multi-Objective Optimization: Most frameworks use single-metric or handcrafted reward structures. Extending MARFT to multi-criteria domains or learning reward models remains largely open (Pei et al., 28 Sep 2025).
- Stability and Scalability: As the agent count or horizon grows, reward sparsity and variance hinder scaling (Liu et al., 6 Aug 2025).
- Inter-agent Communication and Value Decomposition: Extension to value-decomposition methods (VDN, QMIX, QPLEX), learned communication, or auto-curricula is a natural progression (Önür et al., 8 Dec 2025, Liao et al., 21 Apr 2025).
Potential avenues include development of unified MARFT toolkits (integrating MARL and LLM RLHF infrastructures), hybridizing model-based and model-free paradigms, and deriving tight theoretical sample complexity and convergence guarantees for broader classes of multi-agent environments.
7. Representative Results
MARFT has demonstrated the following empirical gains in benchmark settings:
| Domain | Method | Performance/Metric | Key Finding |
|---|---|---|---|
| LLM Writing/Code | MAGRPO (Liu et al., 6 Aug 2025) | 94%–93% return in writing; 84%–88% coding | Outperforms single-agent/multi-turn baselines |
| Traffic Control | MARFT/Decentralized RL (Önür et al., 8 Dec 2025) | TTS 7544.1 veh·h (best), robust under noise | Centralized RL is less robust, fixed PI much worse |
| Offline MARL | MOMA-PPO (Barde et al., 2023) | >20 pts over baselines in Ant/Reacher | Only model-based approach solves both strategy agreement/fine-tuning |
| SMAC RL | OVMSE (Zhong et al., 2024) | +40–80% win-rate gains over MACQL, QMIX | Minimal drop at offline→online, fastest adaptation |
These results confirm that MARFT, when tailored to domain-specific communication, observation, and reward structures, consistently yields enhanced coordination, robustness, and sample efficiency over classical RL and MARL approaches.
MARFT unifies principles from MARL, RL fine-tuning, transfer, and collaborative LLM systems, offering a scalable and theoretically-grounded foundation for the next generation of multi-agent artificial intelligence (Liao et al., 21 Apr 2025, Barde et al., 2023, Liu et al., 6 Aug 2025, Ma et al., 2024, Önür et al., 8 Dec 2025, Zhong et al., 2024, Su et al., 9 Feb 2026, Pei et al., 28 Sep 2025, Yang et al., 2020, Nipu et al., 2024).