Offline Multi-Agent Reinforcement Learning
- Offline Multi-Agent Reinforcement Learning is a computational paradigm that trains multiple agents using pre-recorded trajectories to optimize joint policies without online exploration.
- It addresses challenges such as distributional shift, heterogeneous data sources, and multi-modal datasets by leveraging techniques such as conservative Q-learning, diffusion-based modeling, and Partial Action Replacement.
- Applications span urban traffic control, robotics, wireless networks, and multi-agent games, driving improvements in scalability, robustness, and real-world safety.
Offline Multi-Agent Reinforcement Learning (MARL) describes a computational paradigm through which multiple agents are trained to solve cooperative or competitive tasks using only a pre-existing dataset of environment interactions, without further sampling or exploration during learning. This setting circumvents the cost, risk, and impracticality of online data collection in complex domains—such as urban traffic signal control, robotics fleets, wireless networks, and multi-agent games—while introducing unique algorithmic and statistical challenges arising from the high-dimensionality of joint state-action spaces, multi-modal data distributions, and the non-stationarity intrinsic to multi-agent systems.
1. Problem Formulation and Dataset Characteristics
The mathematical foundation for offline MARL is the Decentralized Partially Observable Markov Decision Process (Dec-POMDP), in which $N$ agents, each with access to local observations $o_t^i$, select actions $a_t^i$ from discrete or continuous spaces, forming a joint action $\mathbf{a}_t = (a_t^1, \dots, a_t^N)$. The environment transitions according to $P(s_{t+1} \mid s_t, \mathbf{a}_t)$ and emits a shared reward $r_t$ or possibly individual rewards $r_t^i$. Training takes place exclusively on an offline dataset $\mathcal{D}$, where each trajectory is a sequence of tuples $(s_t, \mathbf{o}_t, \mathbf{a}_t, r_t)$, recorded under an unknown, often heterogeneous, behavior policy $\pi_\beta$.
The principal objective is to estimate a target joint policy $\pi$ that maximizes expected discounted returns, using only the support provided by $\mathcal{D}$ (a minimal data-structure sketch follows the list of difficulties below). Key difficulties include:
- Distributional shift: Learned policies may query joint action combinations absent from $\mathcal{D}$, provoking severe extrapolation errors in value estimation.
- Heterogeneous data: Real-world datasets mix trajectories from controllers of varying quality—expert, heuristic, random—complicating regularization and policy selection.
- Multi-modality and non-stationarity: Offline data may encode diverse behavioral regimes, temporal variations in strategies, or multiple coordination equilibria.
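To make the dataset structure in the Dec-POMDP formulation concrete, the following is a minimal sketch, assuming a simple in-memory representation; the field and class names are illustrative and not taken from any specific benchmark:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Transition:
    """One step of a Dec-POMDP trajectory as recorded by the behavior policy."""
    state: np.ndarray         # global state s_t (typically unavailable at execution time)
    obs: list[np.ndarray]     # per-agent local observations o_t^i, one entry per agent
    actions: np.ndarray       # joint action a_t = (a_t^1, ..., a_t^N)
    reward: float             # shared team reward r_t (or a vector of per-agent rewards)
    next_obs: list[np.ndarray]
    done: bool

@dataclass
class OfflineDataset:
    """A fixed collection of trajectories D; no further environment interaction is allowed."""
    trajectories: list[list[Transition]]

    def num_agents(self) -> int:
        return len(self.trajectories[0][0].obs)
```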
Benchmark repositories, such as OG-MARL (Formanek et al., 2023), provide standardized suites with detailed characterizations (mean, variance, coverage) and diagnostic tools to facilitate controlled experimentation and reproducibility.
2. Distributional Shift and Out-of-Distribution (OOD) Joint Actions
Distributional shift in offline MARL is fundamentally amplified by the exponential scaling of the joint action space with agent count, leading to a combinatorial explosion of joint state-action pairs rarely or never seen in the data. Conventional importance sampling weights,

$$w(\mathbf{a}_t \mid \mathbf{o}_t) = \prod_{i=1}^{N} \frac{\pi^i(a_t^i \mid o_t^i)}{\hat{\pi}_\beta^i(a_t^i \mid o_t^i)},$$

can be numerically unstable or undefined in OOD regions, particularly in multi-agent settings where the product runs over all agents. Methods such as OffLight (Bokade et al., 2024) mitigate this by estimating the behavior policy with variational generative models and averaging importance weights across agents rather than multiplying them, which keeps the weight on the scale of a single agent's ratio instead of a product that degenerates exponentially in the agent count.
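A minimal sketch contrasting the two weighting schemes, assuming the per-agent policy ratios have already been computed; the clipping threshold is illustrative, though the OffLight pseudocode later in this article does clamp weights:

```python
import numpy as np

def product_is_weight(ratios: np.ndarray) -> float:
    """Standard joint importance weight: prod_i pi^i(a^i|o^i) / pi_beta^i(a^i|o^i).
    Degenerates quickly as the number of agents grows."""
    return float(np.prod(ratios))

def averaged_is_weight(ratios: np.ndarray, clip: float = 10.0) -> float:
    """Agent-averaged importance weight: stays on the scale of a single
    agent's ratio regardless of agent count."""
    return float(np.clip(np.mean(ratios), 0.0, clip))

# Example: 8 agents whose individual ratios are each ~1.5
ratios = np.full(8, 1.5)
print(product_is_weight(ratios))   # ~25.6: explodes multiplicatively
print(averaged_is_weight(ratios))  # 1.5: stable
```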
Partial Action Replacement (PAR) strategies (Jin et al., 10 Nov 2025) address OOD backup by updating only a subset of agents' actions with the learned policy while the remaining agents are held to the behavior policy, demonstrating that the induced distribution shift scales linearly with the number of replaced agents under a factorized behavior policy $\pi_\beta = \prod_i \pi_\beta^i$, rather than exponentially with the total number of agents. Algorithms such as SPaCQL exploit uncertainty-adaptive mixtures of PAR-backed Q-targets to interpolate between stability and coordination.
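A minimal sketch of a PAR-style Q-target, assuming a centralized critic and a single replaced agent per backup; the interface, tensor layout, and sampling call are illustrative and not the exact SPaCQL formulation:

```python
import torch

def par_q_target(critic, next_obs, next_dataset_actions, policy,
                 replaced_agent: int, reward, done, gamma: float = 0.99):
    """One-step backup target in which only `replaced_agent` acts with the learned
    policy; every other agent keeps the action recorded in the dataset, so the
    joint action queried by the critic stays close to the data distribution."""
    next_actions = next_dataset_actions.clone()
    # `policy(...)` is assumed to return a torch.distributions object for that agent.
    next_actions[:, replaced_agent] = policy(next_obs[:, replaced_agent]).sample()
    with torch.no_grad():
        target = reward + gamma * (1.0 - done) * critic(next_obs, next_actions)
    return target
```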
Alternating best-response procedures (AlberDICE (Matsunaga et al., 2023)) optimize stationary (occupancy) distributions via convex programs, updating each agent's occupancy measure while the others are held fixed, with provable convergence to Nash equilibria under regularization.
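A high-level sketch of the alternating update pattern only; the inner convex occupancy-measure optimization is abstracted behind a placeholder `solve_best_response`, which is not part of any published API:

```python
def alternating_best_response(agents, dataset, num_rounds: int, solve_best_response):
    """Cycle through agents, re-optimizing one policy at a time while the others
    are frozen, so no update requires evaluating an unseen joint action chosen
    simultaneously by all agents."""
    policies = {i: agent.initial_policy() for i, agent in enumerate(agents)}
    for _ in range(num_rounds):
        for i in range(len(agents)):
            frozen = {j: p for j, p in policies.items() if j != i}
            # Placeholder: solves agent i's regularized occupancy-measure
            # problem against the frozen policies of the other agents.
            policies[i] = solve_best_response(i, frozen, dataset)
    return policies
```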
3. Heterogeneous and Multi-Modal Behavior Policies
Offline MARL must account for datasets composed of trajectories generated by policies of mixed quality, structure, and intent. Naive regularization may collapse multimodal empirical distributions, risking the loss of behavioral diversity that is critical for coordination.
Generative modeling approaches (OffLight (Bokade et al., 2024), OMSD (Qiao et al., 9 May 2025)) leverage Gaussian Mixture Variational Graph Autoencoders (GMM-VGAE) or diffusion-score decomposition to recover latent modes in policy space. These components can disentangle distinct policy "regimes" (rule-based, greedy, learned expert, or noisy random) and encode them for transition re-weighting, targeted sampling, and behavior regularization. OMSD introduces a sequential score-function decomposition based on the autoregressive factorization of the joint behavior policy,

$$\log \pi_\beta(\mathbf{a} \mid s) = \sum_{i=1}^{N} \log \pi_\beta\!\left(a^{i} \mid s, a^{1:i-1}\right),$$

with per-agent score regularizers (gradients of these sequential conditionals) extracted from a trained diffusion model.
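A much-simplified sketch of the mode-recovery idea, using a plain Gaussian mixture over per-trajectory feature vectors in place of the GMM-VGAE or diffusion-score machinery; the feature construction and mode count are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def recover_behavior_modes(trajectory_features: np.ndarray, n_modes: int = 4) -> np.ndarray:
    """Fit a Gaussian mixture over per-trajectory features (e.g. mean observation,
    action histogram, episode return) to recover latent behavior regimes.
    Soft mode assignments can then drive re-weighting or targeted sampling."""
    gmm = GaussianMixture(n_components=n_modes, covariance_type="full", random_state=0)
    gmm.fit(trajectory_features)
    return gmm.predict_proba(trajectory_features)  # [n_trajectories, n_modes]
```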
Reward decomposition frameworks (MACCA (Wang et al., 2023), SIT (Tian et al., 2022)) improve sample efficiency and robustness by partitioning global rewards into per-agent credits via causal Bayesian network inference or attention-based neural decomposition, facilitating prioritized experience replay and modular integration with existing RL methods.
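A minimal sketch of learned reward decomposition with a sum-consistency objective; the per-agent credit head and loss below are illustrative, not the MACCA or SIT architecture (in particular, the attention mechanism is omitted):

```python
import torch
import torch.nn as nn

class CreditDecomposer(nn.Module):
    """Predicts per-agent credits whose sum is trained to match the global reward."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, obs: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # obs: [batch, n_agents, obs_dim], actions: [batch, n_agents, act_dim]
        credits = self.head(torch.cat([obs, actions], dim=-1)).squeeze(-1)
        return credits  # [batch, n_agents]

def decomposition_loss(credits: torch.Tensor, global_reward: torch.Tensor) -> torch.Tensor:
    """Regression objective: per-agent credits should sum to the team reward."""
    return ((credits.sum(dim=-1) - global_reward) ** 2).mean()
```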
4. Core Algorithmic Frameworks
Canonical offline MARL algorithms instantiate the following principles:
- Conservative Q-Learning (CQL): Regularizes the critic by penalizing value estimates on actions unsupported by the empirical data (Eldeeb et al., 1 Jan 2026, Eldeeb et al., 22 Jan 2025, Eldeeb et al., 2024); a minimal penalty sketch follows this list.
- Behavior Cloning (BC), BCQ, TD3+BC: Encourage the target policies to match the empirical distribution except when counteracted by value improvement signals (Formanek et al., 2023).
- Sequential/Alternating Optimization: InSPO (Liu et al., 2024), AlberDICE (Matsunaga et al., 2023) avoid joint policy updates that risk OOD actions by optimizing over individual agents sequentially, with provable convergence to quantal response or Nash equilibria.
- Diffusion-Based Policy/Value Modeling: Methods like DOM2 (Li et al., 2023), EAQ (Oh et al., 2024), OMSD (Qiao et al., 9 May 2025) learn generative distributions over trajectories or policies—parameterized via score networks or denoising DPMs—to enhance expressiveness, sample diversity, and robustness to coverage gaps.
- Prioritized Sampling: OffLight (Bokade et al., 2024) and DOM2 (Li et al., 2023) boost learning from high-reward episodes using prioritized replay, either by return-based sampling or trajectory-level duplication.
- Meta-Learning: MAML-based “meta-offline MARL” (Eldeeb et al., 27 Jan 2025, Eldeeb et al., 1 Jan 2026) supports rapid adaptation to new tasks or objectives via initialization learned across task families, shown to accelerate fine-tuning in dynamic wireless or UAV scenarios.
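As referenced in the CQL bullet above, here is a minimal sketch of the conservative penalty on the critic for discrete per-agent actions; this is a simplified textbook form, and the implementations in the cited works may differ:

```python
import torch

def cql_penalty(q_values: torch.Tensor, dataset_actions: torch.Tensor) -> torch.Tensor:
    """Conservative term: push down Q on all actions (logsumexp) and push up Q
    on the actions actually present in the dataset.
    q_values: [batch, n_actions]; dataset_actions: [batch] integer indices."""
    pushed_down = torch.logsumexp(q_values, dim=-1)
    pushed_up = q_values.gather(-1, dataset_actions.unsqueeze(-1)).squeeze(-1)
    return (pushed_down - pushed_up).mean()

# Total critic loss, with alpha the conservatism coefficient (a hyperparameter):
# loss = td_error.pow(2).mean() + alpha * cql_penalty(q_values, dataset_actions)
```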
Example pseudocode for OffLight training is:
```
Train GMM-VGAE behavior model on D
For each episode k: compute return G_k and RBPS weight w_R^k
Store {π̂_b^i, w_R^k} for all transitions
repeat until convergence (at most T iterations):
    Sample minibatch with probability ∝ w_R^k
    Compute importance-sampling weight w_I averaged across agents
    w ← clamp(normalize(w_I · w_R^k))
    Compute the RL loss weighted by w
    Update θ, φ via gradient descent
```
5. Data-Centric Foundations and Benchmarking
Dataset curation, coverage diagnostics, and reproducibility are essential for meaningful algorithmic comparisons (Formanek et al., 2024, Formanek et al., 2023). Data-centric methodologies emphasize:
- Standardized dataset formats (Vault repositories), API access, metadata documentation (reward means, distributions, coverage metrics).
- Quantitative metrics such as Joint-SACo (unique state-action pair ratio; see the sketch after this list), episode-return histograms, and coverage plots.
- Analysis tools for benchmarking, subsampling, dataset mixing, and scenario standardization.
- Empirical analyses showing that policy performance is tightly coupled to dataset characteristics (mean, variance, return-distribution shape, and support), often dominating observed differences between algorithms.
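A minimal sketch of a joint state-action coverage metric in the spirit of Joint-SACo, reusing the `Transition` sketch above and assuming hashable state and joint-action arrays; the exact definition used in the benchmark papers may differ:

```python
def joint_saco(transitions) -> float:
    """Ratio of unique (state, joint-action) pairs to total transitions.
    Values near 1 indicate broad coverage; values near 0 indicate heavy repetition."""
    pairs = {(t.state.tobytes(), t.actions.tobytes()) for t in transitions}
    return len(pairs) / max(len(transitions), 1)
```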
Tables of benchmark environments, dataset types, and algorithmic features are documented in (Formanek et al., 2023, Formanek et al., 2024), facilitating best-practice guidelines for offline MARL research.
6. Scalability, Generalization, and Application Domains
Offline MARL frameworks, particularly those employing retention-based or autoregressive sequence models (Oryx (Formanek et al., 28 May 2025)) and value-decomposition architectures trained under centralized training with decentralized execution (CTDE), such as QMIX, are shown to scale gracefully with agent count (tested up to 50 agents), horizon length, and degree of multi-modality; a minimal mixing-network sketch follows the application list below. Scalability is enabled by linear-memory retention blocks and per-agent factorized updates, while generalization benefits stem from data-augmentation strategies (diffusion-guided episode generation, prioritized replay) and meta-learned initializations. Key applications verified experimentally include:
- Traffic signal control with hundreds of intersections (Bokade et al., 2024).
- UAV path planning and wireless radio resource management (Eldeeb et al., 27 Jan 2025, Eldeeb et al., 22 Jan 2025, Eldeeb et al., 1 Jan 2026).
- Benchmarks in multi-agent particle environments, StarCraft II, MuJoCo, and grid-world navigation (Formanek et al., 2023, Qiao et al., 9 May 2025, Formanek et al., 28 May 2025).
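A minimal sketch of QMIX-style monotonic value mixing, with state-conditioned hypernetwork weights constrained to be non-negative; the layer sizes and the single-layer bias heads are illustrative simplifications:

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Mixes per-agent utilities Q_i into a joint Q_tot that is monotone in each Q_i,
    so the argmax of Q_tot decomposes into per-agent argmaxes at execution time."""
    def __init__(self, n_agents: int, state_dim: int, embed: int = 32):
        super().__init__()
        self.w1 = nn.Linear(state_dim, n_agents * embed)  # hypernetwork for first-layer weights
        self.b1 = nn.Linear(state_dim, embed)
        self.w2 = nn.Linear(state_dim, embed)              # hypernetwork for second-layer weights
        self.b2 = nn.Linear(state_dim, 1)
        self.embed, self.n_agents = embed, n_agents

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: [batch, n_agents], state: [batch, state_dim]
        w1 = torch.abs(self.w1(state)).view(-1, self.n_agents, self.embed)  # non-negative
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + self.b1(state).unsqueeze(1))
        w2 = torch.abs(self.w2(state)).view(-1, self.embed, 1)              # non-negative
        q_tot = torch.bmm(hidden, w2) + self.b2(state).unsqueeze(1)
        return q_tot.squeeze(-1).squeeze(-1)  # [batch]
```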
Reported empirical results demonstrate substantial improvements over prior baselines in average travel time, queue length, convergence speed, and adaptation robustness (e.g., OffLight: 7.8% reduction in travel time and 11.2% decrease in queue length relative to TD3+BC and CQL baselines (Bokade et al., 2024)).
7. Open Challenges and Future Directions
Multiple avenues for future work remain prominent:
- Further mitigation of distributional shift, especially in the presence of correlated or adversarial behavior policies.
- Robustness to reward-poisoning attacks, dataset bias, and adversarial data modification (Wu et al., 2022).
- Integration of causal inference, temporal credit assignment, and counterfactual reasoning for interpretable reward decomposition (Wang et al., 2023, Tian et al., 2022).
- Expanding generative models to continuous control, very long horizons, and hybrid offline-online adaptation loops (Oh et al., 2024, Li et al., 2023).
- Meta-learning extensions for continual adaptation in non-stationary multi-agent environments with evolving objectives (Eldeeb et al., 27 Jan 2025, Eldeeb et al., 1 Jan 2026).
- Deployment in real-world domains where safety and resource constraints exclude online exploration, such as 6G wireless, smart cities, and large-scale robotics fleets.
Algorithmic advances are coupled with a call for standardized data protocols, explainable AI for multi-agent systems, and integration of deep generative and foundation models for semantic context and decision support (Eldeeb et al., 1 Jan 2026). This suggests the field is converging towards highly flexible, data-driven, and safety-conscious methodologies rooted in rigorous probabilistic modeling and scalable optimization.
References
- OffLight: An Offline Multi-Agent Reinforcement Learning Framework for Traffic Signal Control (Bokade et al., 2024)
- Diffusion-based Episodes Augmentation for Offline Multi-Agent Reinforcement Learning (Oh et al., 2024)
- Offline Multi-agent Reinforcement Learning via Score Decomposition (Qiao et al., 9 May 2025)
- Multi-Agent Meta-Offline Reinforcement Learning for Timely UAV Path Planning and Data Collection (Eldeeb et al., 27 Jan 2025)
- Off-the-Grid MARL: Datasets with Baselines for Offline Multi-Agent Reinforcement Learning (Formanek et al., 2023)
- A Model-Based Solution to the Offline Multi-Agent Reinforcement Learning Coordination Problem (Barde et al., 2023)
- MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment (Wang et al., 2023)
- Oryx: a Performant and Scalable Algorithm for Many-Agent Coordination in Offline MARL (Formanek et al., 28 May 2025)
- Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning (Formanek et al., 2024)
- Offline Multi-Agent Reinforcement Learning via In-Sample Sequential Policy Optimization (Liu et al., 2024)
- AlberDICE: Addressing Out-Of-Distribution Joint Actions in Offline Multi-Agent RL via Alternating Stationary Distribution Correction Estimation (Matsunaga et al., 2023)
- Reward Poisoning Attacks on Offline Multi-Agent Reinforcement Learning (Wu et al., 2022)
- Beyond Conservatism: Diffusion Policies in Offline Multi-agent Reinforcement Learning (Li et al., 2023)
- ComaDICE: Offline Cooperative Multi-Agent Reinforcement Learning with Stationary Distribution Shift Regularization (Bui et al., 2024)
- An Offline Multi-Agent Reinforcement Learning Framework for Radio Resource Management (Eldeeb et al., 22 Jan 2025)
- Conservative and Risk-Aware Offline Multi-Agent Reinforcement Learning (Eldeeb et al., 2024)
- Efficient Communication via Self-supervised Information Aggregation for Online and Offline Multi-agent Reinforcement Learning (Guan et al., 2023)
- Learning from Good Trajectories in Offline Multi-Agent Reinforcement Learning (Tian et al., 2022)
- Partial Action Replacement: Tackling Distribution Shift in Offline MARL (Jin et al., 10 Nov 2025)
- Offline Multi-Agent Reinforcement Learning for 6G Communications: Fundamentals, Applications and Future Directions (Eldeeb et al., 1 Jan 2026)