MALT: Multi-Agent LLM Training
- MALT is a training paradigm that coordinates multiple LLM agents using reinforcement learning, addressing heterogeneity and communication challenges.
- It leverages modular architectures and decentralized actor–critic methods to enhance efficiency, scalability, and policy specialization.
- Empirical results show significant speedups, improved hardware utilization, and enhanced performance across reasoning, coding, and embodied tasks.
Multi-Agent LLM Training (MALT) encompasses a family of methods and systems for optimizing LLMs to act as coordinated agents within multi-agent reinforcement learning (MARL) paradigms. Unlike single-agent LLM training, MALT addresses heterogeneity, coordination, specialization, and communication challenges by developing robust system architectures, specialized credit assignment, and scalable reinforcement learning algorithms. The following sections summarize the core methodologies, algorithmic and systems-level advances, implementation challenges, empirical findings, and future directions of MALT in line with state-of-the-art research.
1. System Architectures for Multi-Agent LLM Training
The infrastructure underpinning MALT is driven by the need for efficient, scalable, and coordinated training of multiple LLM agents, each potentially running distinct or overlapping policies. FlexMARL exemplifies this trend by introducing a three-tiered modular architecture optimized for high-throughput, on-policy MARL at scale (Jiang et al., 10 Feb 2026):
- Rollout Engine: Handles elastic, parallel rollout across agents with hierarchical load balancing (intra-agent min-heaps for queue management and inter-agent worker reallocation) to mitigate request skew.
- Training Engine: Implements agent-centric process groups instantiated only upon demand (i.e., when sufficient micro-batch samples are available), with state and optimizer checkpointing to and from host RAM or remote nodes using unified Set/Get APIs.
- Experience Store (ES): Central in-memory store holding per-agent data tables for prompt, response, policy versioning, and field-level status bits, supporting efficient data exchange between rollout and training and hiding synchronization barriers.
The architecture decouples rollout and training to fully utilize computational resources, achieves high hardware utilization (32.4% with FlexMARL vs. ≤16% for previous solutions), and handles heterogeneous agents (e.g., 14B and 32B models in the same deployment) (Jiang et al., 10 Feb 2026). Orchestration is further generalized to accommodate agent-wise scheduling, resource management, and dynamic agent–model assignment, supporting both homogeneous and heterogeneous LLM clusters.
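The Experience Store's role in decoupling rollout from training can be illustrated with a minimal single-process sketch. All class and method names below are illustrative, not FlexMARL's actual API: per-agent tables hold rollout fields guarded by status bits, and unified set/get calls let rollout write and training drain micro-batches without a global synchronization barrier.

```python
import threading
from dataclasses import dataclass

# Status bits marking which pipeline stage a sample has reached.
EMPTY, ROLLOUT_DONE, CONSUMED = 0, 1, 2

@dataclass
class Sample:
    prompt: str = ""
    response: str = ""
    policy_version: int = -1
    status: int = EMPTY

class ExperienceStore:
    """In-memory per-agent data tables with field-level status bits."""
    def __init__(self, agent_ids):
        self._tables = {aid: {} for aid in agent_ids}
        self._lock = threading.Lock()

    def set(self, agent_id, key, **fields):
        # Unified Set API: rollout writes fields and flips the status bit.
        with self._lock:
            row = self._tables[agent_id].setdefault(key, Sample())
            for name, value in fields.items():
                setattr(row, name, value)
            row.status = ROLLOUT_DONE

    def get(self, agent_id, min_count):
        # Unified Get API: training drains a micro-batch once enough
        # finished samples are available; otherwise returns None so the
        # trainer can keep overlapping with rollout instead of blocking.
        with self._lock:
            ready = [r for r in self._tables[agent_id].values()
                     if r.status == ROLLOUT_DONE]
            if len(ready) < min_count:
                return None
            batch = ready[:min_count]
            for row in batch:
                row.status = CONSUMED
            return batch

es = ExperienceStore(["planner", "worker"])
es.set("planner", 0, prompt="p0", response="r0", policy_version=3)
es.set("planner", 1, prompt="p1", response="r1", policy_version=3)
print(es.get("planner", min_count=2) is not None)  # micro-batch is ready
```

The store never blocks either side: rollout always writes, and training polls until a micro-batch threshold is met, which is what hides the synchronization barrier between the two engines.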
2. Algorithmic Foundations and Credit Assignment
A fundamental component of MALT is the adaptation of RL credit assignment to the multi-agent, multi-role context. Core approaches include:
- Group-Relative Policy Optimization (GRPO) and MAGRPO: GRPO computes group-centered, normalized advantages over a batch of rollouts per prompt. MAGRPO extends this to the MARL setting using group-based Monte Carlo returns for decentralized, partially observable environments. Each agent optimizes a PPO-style objective, while the group advantage ensures agents coordinate on shared rewards and avoid per-agent reward shaping (Liu et al., 6 Aug 2025).
- Hierarchical and Nested Credit (MATPO, M-GRPO): In hierarchical systems with a planner and sub-agents (tool users), hierarchical group-relative normalization is applied to both levels, ensuring that planner and worker agents receive aligned, trajectory-level credit. M-GRPO further tackles asynchronous sub-agent invocation by trajectory-alignment (batching sub-trajectories to fixed size via duplication or truncation) and decoupled training pipelines to avoid inter-server gradient dependencies (Hong et al., 17 Nov 2025, Mo et al., 6 Oct 2025).
- Agent-Wise Normalization and Stability (Dr. MAS): Instability in GRPO emerges when agents’ reward distributions diverge under global normalization. Dr. MAS remedies this by normalizing advantages per agent, bounding per-agent gradient norms and eliminating gradient spikes (Feng et al., 9 Feb 2026). Letting $\mathcal{T}_i$ denote the steps where agent $i$ acts, the per-agent mean $\mu_i$ and standard deviation $\sigma_i$ over those steps yield advantages $A_t = (R_t - \mu_i)/\sigma_i$ for $t \in \mathcal{T}_i$, ensuring uniform gradient conditioning across agents.
- Decentralized Actor–Critic Methods (CoLLM-CC/DC): Centralized critics accurately estimate value functions across joint agent histories (CoLLM-CC), yielding low-variance, stable learning—especially in long-horizon, sparse-reward tasks. Decentralized critics (CoLLM-DC) lower communication cost but degrade under high nonstationarity or delayed rewards (Liu et al., 29 Jan 2026).
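The contrast between global group-relative normalization (GRPO-style) and Dr. MAS-style agent-wise normalization can be sketched numerically. This is a simplified illustration of the normalization step only; the papers' full estimators and clipping terms are omitted.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style: center and scale rewards over the whole rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def agent_wise_advantages(rewards, agent_ids):
    """Dr. MAS-style: normalize each agent's rewards with its own mean
    and std, so agents with divergent reward scales still see
    comparably conditioned gradients."""
    r = np.asarray(rewards, dtype=float)
    ids = np.asarray(agent_ids)
    adv = np.empty_like(r)
    for aid in np.unique(ids):
        mask = ids == aid
        mu, sigma = r[mask].mean(), r[mask].std()
        adv[mask] = (r[mask] - mu) / (sigma + 1e-8)
    return adv

# Two agents with very different reward scales in one group.
rewards   = [0.0, 1.0, 10.0, 30.0]
agent_ids = ["a", "a", "b", "b"]

global_adv = group_relative_advantages(rewards)
per_agent  = agent_wise_advantages(rewards, agent_ids)
# Under global normalization, agent "a"'s advantages are crushed toward
# the group mean; per-agent normalization restores unit scale for both.
print(np.round(global_adv, 2), np.round(per_agent, 2))
```

With agent-wise normalization both agents receive advantages of unit scale, which is the mechanism behind Dr. MAS's bounded per-agent gradient norms.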
3. Asynchronous and Parallelized Training Pipelines
MALT methodologies systematically break classical RL synchronization barriers by leveraging asynchronous, micro-batched training and parallel sampling:
- Micro-Batch Driven Overlap: FlexMARL exploits the dominance of rollout latency over training latency, overlapping rollout and training to reduce net wall-clock time toward the rollout time alone, while formally preserving on-policy semantics by accumulating micro-batch gradients across the full batch before advancing the parameter version (Jiang et al., 10 Feb 2026).
- Elastic, Dependency-Driven Sampling: Rollout engines allocate inference workers adaptively, with pooling and migration triggered by instantaneous queue disparity, keeping both inter- and intra-agent latency bounded and maximizing throughput.
- On-Demand Process Groups: Training engine launches tightly-scoped process groups per agent only upon micro-batch readiness, implementing early termination and state checkpointing to optimize NPU and HBM memory utilization in large clusters.
This asynchronous orchestration is generalizable to multi-modal, multi-role agent settings and supports high agent count (≥15) on large clusters.
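The micro-batch overlap pattern above can be sketched in a single process (schematic only; FlexMARL's real engines distribute this across disaggregated clusters, and the gradient step here is a placeholder). The key invariant is that training consumes micro-batches as they arrive, but the parameter version advances only once a full on-policy batch has been accumulated.

```python
import queue
import threading

MICRO_BATCHES_PER_STEP = 4  # micro-batches per optimizer step

def rollout_worker(out_q, n_micro):
    # Stand-in for the rollout engine: emits micro-batches tagged with
    # the policy version they were sampled under.
    for i in range(n_micro):
        out_q.put({"micro_batch": i, "policy_version": 0})
    out_q.put(None)  # signal end of rollout

def trainer(in_q):
    version, grad_accum, pending = 0, 0.0, 0
    while True:
        mb = in_q.get()
        if mb is None:
            break
        # Train on the micro-batch as soon as it arrives (overlap),
        # but only *accumulate* the gradient...
        grad_accum += 1.0  # placeholder for a real backward pass
        pending += 1
        # ...and advance the parameter version only after a full batch,
        # preserving on-policy semantics.
        if pending == MICRO_BATCHES_PER_STEP:
            version += 1
            grad_accum, pending = 0.0, 0
    return version

q = queue.Queue()
t = threading.Thread(target=rollout_worker, args=(q, 8))
t.start()
final_version = trainer(q)
t.join()
print(final_version)  # 8 micro-batches / 4 per step = 2 optimizer steps
```

Because rollout and training run concurrently on the shared queue, the trainer's work hides behind rollout latency whenever rollout dominates, which is the regime the survey describes.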
4. Specialization, Adaptation, and Emergent Behaviors
MALT acts as a catalyst for agent specialization and team-level adaptation. Representative approaches include:
- Sequential Role Pipelines (MALT, MAPoRL): MALT implements a generate–verify–refine process employing independent, role-conditioned LLMs, with data and credit generated via synthetic search trees, value iteration, and off-policy supervised/DPO post-training (Motwani et al., 2024). MAPoRL adds a learned verifier for interaction-aware reward, explicitly shaping corrective and persuasive dialogue (Park et al., 25 Feb 2025).
- Single-LLM Multi-Role Optimization (MATPO): MATPO unifies planner and worker roles within a single set of LLM parameters via prompt-switching and a joint, nested RL surrogate objective. Role-specific credit is propagated through a group-relative normalization aggregated over planner and worker rollouts (Mo et al., 6 Oct 2025).
- Adaptation in Embodied Environments: The LIET framework combines per-agent utility function learning (local cost regression) and continual team-level evolution of communication tips, with both adaptation phases yielding improvements in multi-agent cooperation and efficient planning (Li et al., 8 Jun 2025).
- Communication and Language-Guided Coordination: MALT frameworks may employ centralized LLM Coordinators for subgoal generation, decentralized communicators for interpretive messaging, and episodic LLM-Memory for strategy retrieval, as shown by LLM-MARL (Li, 1 Jun 2025). Learned gating networks optimize LLM query frequency to balance coordination efficacy and API cost.
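The sequential generate–verify–refine pipeline can be sketched with role-conditioned calls to a single model interface. Here `call_llm` is a hypothetical stand-in for any chat-completion API, and the role prompts are illustrative templates, not MALT's actual ones; the point is the control flow that produces role-specific training data.

```python
ROLE_PROMPTS = {
    "generator": "Solve the problem step by step:\n{task}",
    "verifier":  "Check this solution for errors:\nProblem: {task}\nSolution: {draft}",
    "refiner":   ("Improve the solution using the critique:\nProblem: {task}\n"
                  "Solution: {draft}\nCritique: {critique}"),
}

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a chat-completion API; it simply echoes
    # so the pipeline structure is runnable end to end.
    return f"<response to: {prompt[:40]}...>"

def generate_verify_refine(task: str) -> str:
    """One pass of the sequential role pipeline: each role is the same
    base model conditioned on a different prompt template, so per-role
    traces can later feed supervised/DPO post-training."""
    draft = call_llm(ROLE_PROMPTS["generator"].format(task=task))
    critique = call_llm(ROLE_PROMPTS["verifier"].format(task=task, draft=draft))
    final = call_llm(ROLE_PROMPTS["refiner"].format(
        task=task, draft=draft, critique=critique))
    return final

print(generate_verify_refine("What is 17 * 23?"))
```

In MALT proper, each role's outputs are expanded into synthetic search trees and credited via value iteration before post-training; the sketch above shows only the forward role pipeline.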
5. Empirical Results and Benchmarks
MALT approaches have been validated across multiple domains, including abstract reasoning, coding, simulation-driven games, and embodied tasks:
- System-Level Benchmarks: FlexMARL achieves substantial end-to-end speedups and markedly higher hardware utilization relative to prior art (MAS-RL, MARTI, DistRL), while maintaining high throughput and robustness to agent heterogeneity and count (Jiang et al., 10 Feb 2026).
- Reasoning and Tool Use: MALT pipelines (generate–verify–refine) improve end-to-end accuracy by 7–16% (relative) on GSM8K, CSQA, MATH, and symbolic variants compared to baseline SFT (Motwani et al., 2024). MATPO yields 18.38% relative improvement over group-relative PPO in multi-turn, tool-integrated tasks (Mo et al., 6 Oct 2025).
- Coding and Collaboration: MAGRPO attains 83–90% returns and significant pass@10 gains versus single-agent or independent-policy baselines on code-generation tasks (HumanEval, CoopHumanEval), with decentralized execution delivering roughly 3× faster responses (Liu et al., 6 Aug 2025).
- Game Environments and Control: LLM-MARL outperforms MAPPO, QMIX, and attention-only baselines on Google Research Football, MAgent, and StarCraft II, reaching 81–84% win rates and high coordination scores (≥0.89). Zero-shot generalization and ablations confirm the criticality of subgoal generation and communication modules (Li, 1 Jun 2025).
- Stability and Efficiency: Dr. MAS achieves +5.6% to +15.8% improvements in avg@16 or pass@16, eliminating gradient spikes and scaling seamlessly to heterogeneous agent–model mixes (Feng et al., 9 Feb 2026).
6. Limitations, Open Problems, and Generalization
Despite accelerated progress, MALT methodologies exhibit the following limitations:
- Infrastructure Requirements: High throughput and elasticity rely on disaggregated, RDMA-enabled clusters; performance may be constrained in non-RDMA or single-node environments (Jiang et al., 10 Feb 2026).
- Batch and Synchronization Sensitivity: Hyperparameters such as micro-batch size, trajectory-alignment batch size, and batch scheduling must be carefully tuned to avoid pipeline stalls or communication overhead (Hong et al., 17 Nov 2025).
- Credit Assignment in Complex Hierarchies: On-policy group-relative methods require careful version tracking, especially for off-policy or actor-critic variants, and global normalization can destabilize optimization (Dr. MAS addresses this partially) (Feng et al., 9 Feb 2026).
- Sample Efficiency in Long-Horizon, Sparse-Reward Regimes: Centralized critics (CoLLM-CC) and actor–critic methods outperform Monte Carlo group-based methods, but incur higher computational cost (Liu et al., 29 Jan 2026).
- Generalizability: MALT methods generalize well across domains involving high-variance agent roles, asynchronous calls, and hierarchical pipelines, including codegen, debate, negotiation, multi-modal retrieval, and planning. However, robust generalization to open-ended, human-AI teaming and adversarial or fully decentralized regimes remains an active area of investigation.
7. Future Directions and Practitioner Recommendations
Emerging trends in MALT research include:
- Meta-Prompt Learning and On-Device Distillation: Leveraging meta-learning to adapt prompt templates and reduce reliance on external LLM calls (Li, 1 Jun 2025).
- Automated MAS Construction: Generative methods (e.g., MAS-GPT) that synthesize complete executable MAS (as Python code) from queries, achieving adaptive, low-latency agent deployment (Ye et al., 5 Mar 2025).
- Hierarchical and Heterogeneous Agent Training: M-GRPO and Dr. MAS approaches readily enable heterogeneous deployments, reducing inference cost and latency while preserving accuracy (Hong et al., 17 Nov 2025, Feng et al., 9 Feb 2026).
- Credit-Assignment Theoretic Advances: Continued refinement of advantage normalization, trajectory alignment, and decoupled optimization to maximize scalability and sample efficiency in real-world deployments.
- Benchmarks and Evaluation Suites: Adoption of diverse, open-domain benchmarks covering math, code, collaborative reasoning, and embodied environments, ensuring robustness and fair comparison across research efforts.
Practitioners are advised to select system architectures and RL algorithms based on environment regime (horizon length, reward sparsity), available cluster infrastructure, and agent heterogeneity. Emphasis should be placed on agent-wise normalization, micro-batch driven overlap, and asynchronous task orchestration to maximize system efficiency and stability. For tool and debate-heavy workflows, hierarchical multi-agent training with trajectory alignment substantially improves sample efficiency and policy specialization. For emerging domains, generative MAS construction and adaptation via meta-learning and gating networks offer promising directions for further generalization.
References:
- (Jiang et al., 10 Feb 2026) Rollout-Training Co-Design for Efficient LLM-Based Multi-Agent Reinforcement Learning
- (Liu et al., 6 Aug 2025) LLM Collaboration With Multi-Agent Reinforcement Learning
- (Feng et al., 9 Feb 2026) Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems
- (Li, 1 Jun 2025) Language-Guided Multi-Agent Learning in Simulations: A Unified Framework and Evaluation
- (Motwani et al., 2024) MALT: Improving Reasoning with Multi-Agent LLM Training
- (Hong et al., 17 Nov 2025) Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO
- (Li et al., 8 Jun 2025) Learn as Individuals, Evolve as a Team: Multi-agent LLMs Adaptation in Embodied Environments
- (Liu et al., 29 Jan 2026) Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic
- (Mo et al., 6 Oct 2025) Multi-Agent Tool-Integrated Policy Optimization
- (Ye et al., 5 Mar 2025) MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems
- (Park et al., 25 Feb 2025) MAPoRL: Multi-Agent Post-Co-Training for Collaborative LLMs with Reinforcement Learning