Quantum Multi-Agent Reinforcement Learning

Updated 9 March 2026

Quantum Multi-Agent Reinforcement Learning (QMARL) is a framework that integrates quantum computing with multi-agent reinforcement learning to improve parameter efficiency, scalability, and policy performance.
It employs variational quantum circuits and quantum-specific training algorithms to mitigate challenges like non-stationarity and the curse of dimensionality in complex environments.
Empirical results show that QMARL achieves state-of-the-art performance with reduced parameter counts and faster convergence in cooperative and decentralized tasks.

Quantum Multi-Agent Reinforcement Learning (QMARL) refers to the extension of reinforcement learning protocols to scenarios involving multiple agents, where quantum computing and quantum information processing techniques are leveraged to enhance parameter efficiency, policy expressivity, convergence speed, and scalability in cooperative and competitive decentralized systems. QMARL architectures typically instantiate agents with variational quantum circuits (VQCs) or quantum neural networks, employing quantum-specific circuit design, measurement, and training mechanisms. They address fundamental obstacles in classical multi-agent reinforcement learning (MARL), including non-stationarity, the curse of dimensionality, and resource limitations in the noisy intermediate-scale quantum (NISQ) regime (Yun et al., 2022).

1. Multi-Agent QRL Methodology and Frameworks

QMARL formalizes the multi-agent control problem as a Markov game or decentralized (partially observable) Markov decision process (Dec-MDP/Dec-POMDP). For $N$ agents, each indexed by $n=1,\dots,N$, the joint state $s_t$ is a composite of local observations $o^n_t$, and each agent selects an action $u_t^n \sim \pi_{\theta^n}(\cdot|o_t^n)$, forming the joint action $u_t = (u_t^1,\dots,u_t^N)$. A central reward signal $r(s_t, u_t)$ is provided, and the objective is to maximize the global expected discounted return:

$J(\pi) = E_\pi\left[\sum_{t=0}^T \gamma^t r(s_t, u_t)\right],\quad \gamma \in [0,1)$
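As a small illustration of the objective above, the discounted sum inside the expectation can be evaluated for a single episode of shared team rewards. This is a minimal sketch; the function name `discounted_return` and the sample rewards are assumptions made for illustration, not part of any cited framework.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for one episode of shared team rewards.

    rewards: sequence of central rewards r(s_t, u_t) for t = 0..T
    gamma:   discount factor in [0, 1)
    """
    total = 0.0
    # Accumulate backwards so each step applies one more factor of gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical three-step episode, used only to exercise the function.
episode_rewards = [1.0, 0.0, 2.0]
print(discounted_return(episode_rewards, gamma=0.5))
```

Averaging this quantity over many sampled episodes gives a Monte Carlo estimate of $J(\pi)$.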

A dominant paradigm is centralized training with decentralized execution (CTDE): decentralized quantum actors operate based on local observations, each parameterized by an agent-specific variational quantum circuit (VQC, parameters $\theta^n$), while a centralized quantum critic approximates the global state-value function via a separate VQC with parameters $\psi$ (Yun et al., 2022). During training, transitions $(s_t, u_t, r(s_t, u_t), s_{t+1})$ are collected, and temporal-difference errors are computed:

$\delta_t = r(s_t, u_t) + \gamma V_\psi(s_{t+1}) - V_\psi(s_t)$

where $V_\psi$ denotes the value estimate of the centralized critic. The actor and critic losses are:

\begin{align*}
L_{\text{actor}} &= - E_\pi \bigg[ \sum_{t=1}^T \sum_{n=1}^N \delta_t \cdot \log \pi_{\theta^n}(u_t^n \mid o_t^n) \bigg] \\
L_{\text{critic}} &= E \left[ \sum_{t=1}^T \delta_t^2 \right]
\end{align*}

Updates are propagated through both quantum and classical layers.
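The temporal-difference error and the two losses can be sketched numerically for one sampled trajectory. This sketch assumes the standard one-step TD error, takes the critic outputs and per-agent log-probabilities as plain arrays, and replaces the expectations by a single-trajectory estimate; the function names and sample numbers are hypothetical.

```python
import numpy as np


def td_errors(rewards, values, gamma):
    """One-step TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    rewards: shape (T,)   shared rewards along the trajectory
    values:  shape (T+1,) critic estimates for the T+1 visited states
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]


def actor_critic_losses(deltas, log_probs):
    """Single-trajectory estimates of the actor and critic losses.

    deltas:    shape (T,)   TD errors (treated as constants in the actor loss)
    log_probs: shape (T, N) log pi_{theta^n}(u_t^n | o_t^n) for each agent n
    """
    deltas = np.asarray(deltas, dtype=float)
    log_probs = np.asarray(log_probs, dtype=float)
    actor_loss = -np.sum(deltas[:, None] * log_probs)
    critic_loss = np.sum(deltas ** 2)
    return actor_loss, critic_loss


# Hypothetical two-step trajectory with two agents.
deltas = td_errors(rewards=[1.0, 0.0], values=[0.5, 0.2, 0.0], gamma=0.9)
log_probs = np.log(np.full((2, 2), 0.5))
actor_loss, critic_loss = actor_critic_losses(deltas, log_probs)
print(deltas, actor_loss, critic_loss)
```

In a full implementation the arrays would come from VQC outputs, and gradients of these losses would be taken with respect to $\theta^n$ and $\psi$.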

The CTDE methodology mitigates non-stationarity, stabilizes gradient flow, and allows the quantum critic to supply a global, informative training signal even under severe parameter constraints and on noisy intermediate-scale hardware (Yun et al., 2022, Yun et al., 2023).

2. Variational Quantum Circuit Architectures for Multi-Agent RL

The core computational module in QMARL is the VQC or QNN. Each agent's policy circuit consists of three stages:

State Encoding: A mapping of the classical input vector to single-qubit rotation angles, e.g., one input-dependent rotation per qubit.
Variational Layers: Alternating parameterized single-qubit rotations and controlled-NOT entanglers, arranged in repeated layers.
Measurement: Measurement of an observable (commonly the Pauli-$Z$ expectation value) on each qubit or their average; output vectors are mapped to action distributions via classical softmax heads.
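The three stages above can be sketched with a tiny state-vector simulation. Everything specific here is an assumption chosen for illustration: two qubits, $R_y$ rotations for both encoding and variational gates, a single CNOT entangler per layer, Pauli-$Z$ readout, and one action per qubit. It is not the circuit of any cited work.

```python
import numpy as np

I2 = np.eye(2)
Z = np.diag([1.0, -1.0])
# CNOT with qubit 0 as control and qubit 1 as target (basis index = 2*q0 + q1).
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)


def ry(angle):
    """Single-qubit rotation about the Y axis."""
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -s], [s, c]])


def vqc_policy(obs, params):
    """Map a 2-dim observation to probabilities over 2 actions.

    obs:    shape (2,)   classical inputs used as encoding angles
    params: shape (L, 2) trainable rotation angles, one row per layer
    """
    state = np.zeros(4)
    state[0] = 1.0  # start in |00>
    # 1) State encoding: one input-dependent rotation per qubit.
    state = np.kron(ry(obs[0]), ry(obs[1])) @ state
    # 2) Variational layers: trainable rotations followed by an entangler.
    for layer in params:
        state = np.kron(ry(layer[0]), ry(layer[1])) @ state
        state = CNOT @ state
    # 3) Measurement: <Z> on each qubit, then a classical softmax head.
    probs = np.abs(state) ** 2
    z0 = probs @ np.diag(np.kron(Z, I2))
    z1 = probs @ np.diag(np.kron(I2, Z))
    logits = np.array([z0, z1])
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


print(vqc_policy(np.array([0.3, 1.2]), np.zeros((2, 2))))
```

On hardware the expectation values would be estimated from repeated shots rather than read off the state vector.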

Typical configurations keep the qubit count and variational depth per circuit small (Yun et al., 2022). The centralized critic receives the joint state and processes it through an analogous (but parameter-distinct) circuit. Resource optimization is critical, with architectural choices aimed at encoding high-dimensional observation spaces compactly and enabling sublinear growth of qubit requirements with agent count (Kölle et al., 2023, Yun et al., 2022).

Evolutionary circuit search strategies (gate-based, layer-based, and prototype-based) have also been applied to automatically optimize VQC architectures for multi-agent tasks, with gate-based mutation-only approaches yielding the highest performance and most compact circuits (Kölle et al., 2024, Kölle et al., 2023).

3. Quantum-Specific Training Algorithms and Gradient Estimation

Variational quantum circuits employ specialized quantum-compatible optimization:

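Parameter-Shift Rule: Derivatives of a VQC expectation value $\langle O \rangle_\theta$ with respect to a circuit parameter $\theta_i$ (for gates generated by Pauli operators, with $e_i$ the $i$-th unit vector) are computed as:

$\frac{\partial \langle O \rangle_\theta}{\partial \theta_i} = \frac{1}{2}\left[\langle O \rangle_{\theta + \frac{\pi}{2} e_i} - \langle O \rangle_{\theta - \frac{\pi}{2} e_i}\right]$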

which enables sample-efficient, hardware-friendly quantum gradients (Yun et al., 2022, Yun et al., 2023).

Evolutionary Algorithms: In regions of parameter space affected by barren plateaus, where gradients vanish, gradient-free classical evolutionary strategies optimize VQC agent policies. Mutation-only evolutionary schemes, which periodically inject Gaussian noise into parameter vectors, are consistently effective at traversing flat landscapes in high-dimensional VQC optimization (Kölle et al., 2023, Kölle et al., 2024).
Meta-learning Approaches: Some QMARL systems introduce dual-parameter decompositions—separating circuit unitary "angle" and measurement "pole" parameters—and two-stage (meta and local) adaptation. This supports meta-generalization across tasks (angle domain) and rapid few-shot adaptation (pole domain), regularized via angle-to-pole noise injection (Yun et al., 2022).
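The parameter-shift rule from this section can be checked on a toy circuit. The sketch below assumes the standard form of the rule for Pauli-generated rotations (shift of $\pi/2$, prefactor $1/2$) and a hypothetical two-qubit circuit with one $R_y$ gate per qubit and a $Z \otimes Z$ observable, for which the expectation value is $\cos\theta_0 \cos\theta_1$.

```python
import numpy as np


def expectation_zz(theta):
    """<Z x Z> after RY(theta[0]) and RY(theta[1]) act on |00>."""
    q0 = np.array([np.cos(theta[0] / 2), np.sin(theta[0] / 2)])
    q1 = np.array([np.cos(theta[1] / 2), np.sin(theta[1] / 2)])
    state = np.kron(q0, q1)
    zz_diagonal = np.array([1.0, -1.0, -1.0, 1.0])
    return float(np.sum(np.abs(state) ** 2 * zz_diagonal))


def parameter_shift_grad(f, theta):
    """Gradient of f via two shifted circuit evaluations per parameter."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        shift = np.zeros_like(theta)
        shift[i] = np.pi / 2
        grad[i] = 0.5 * (f(theta + shift) - f(theta - shift))
    return grad


theta = np.array([0.3, 1.1])
shifted = parameter_shift_grad(expectation_zz, theta)
# Analytic gradient of cos(theta_0) * cos(theta_1), for comparison.
analytic = np.array([-np.sin(theta[0]) * np.cos(theta[1]),
                     -np.cos(theta[0]) * np.sin(theta[1])])
print(shifted, analytic)
```

Unlike a finite-difference estimate, the two shifted evaluations give the exact derivative for this gate family, which is why the rule suits hardware where only expectation values are accessible.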

4. Scalability, Parameter Efficiency, and the Curse of Dimensionality

Quantum parametrizations offer a logarithmic action-space encoding advantage via projective measurement:

An $n$-qubit VQC supports $2^n$ orthogonal actions; that is, the projection-valued measure (PVM) approach encodes an exponentially large action space of size $|\mathcal{A}|$ in only $\lceil \log_2 |\mathcal{A}| \rceil$ qubits, overcoming an intractable limitation for classical MARL function approximators (Park et al., 2023, Kim et al., 2024).
Empirical results: QMARL with just ~110 parameters matches or exceeds the performance of classical neural networks with considerably larger parameter counts in cooperative scheduling, queue management, and resource allocation under identical training budgets (Yun et al., 2022, Park et al., 2023, Kim et al., 2024).
As the size of the environment and the number of agents grow, well-designed QMARL architectures maintain sublinear resource scaling due to encoding and quantum superposition, provided that ansatz depth is chosen carefully and error-mitigation techniques are applied (Yun et al., 2022, Kölle et al., 2023).
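The logarithmic encoding described in this section can be illustrated with a short sketch: each computational-basis outcome of a projective measurement is treated as one discrete action, so the number of qubits grows with the logarithm of the action count. The helper names and the uniform example state are assumptions for illustration only.

```python
import numpy as np


def qubits_for_actions(num_actions):
    """Smallest number of qubits whose basis states cover num_actions actions."""
    if num_actions < 1:
        raise ValueError("num_actions must be positive")
    return max(1, (num_actions - 1).bit_length())


def sample_action(state, rng):
    """Sample a basis-state index (the action) with Born-rule probabilities."""
    probs = np.abs(np.asarray(state)) ** 2
    probs = probs / probs.sum()
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng(0)
n_qubits = qubits_for_actions(1000)
# Hypothetical output state: uniform superposition over all basis states.
uniform_state = np.full(2 ** n_qubits, 1.0 / np.sqrt(2 ** n_qubits))
print(n_qubits, sample_action(uniform_state, rng))
```

A classical softmax head over the same action set would need one output unit per action, which is the scaling the PVM encoding avoids.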

5. Coordination, Communication, and Entanglement-Enabled Strategies

Quantum information theory introduces coordination resources in multi-agent scenarios:

Entanglement-Enabled Coordination: Shared quantum entanglement as a correlating device enables classes of correlated policies strictly richer than those achievable using shared randomness; these can be exploited in Dec-POMDPs to attain quantum advantage in both nonlocal-game (single-round) and sequential settings (Gardiner et al., 9 Feb 2026). The architecture includes a quantum coordinator network, decentralized local actors, and differentiable parameterization of measurement (POVM) matrices. Quantum measurements conditioned on individual agent histories and shared entangled resources yield joint action distributions unattainable classically.
Distributed Entangled Critics: The eQMARL framework distributes the quantum critic across agents and leverages entangled input qubits shared over a quantum channel. Agents perform localized VQC transformations and return qubits for global measurement, removing the need for classical observation sharing and reducing communication overhead. This achieves faster convergence and higher scores compared to fully centralized or classical split-baselines (DeRieux et al., 2024).
Communication Protocols: Explicit message passing (MATE, MEDIATE, token-exchange protocols) can be efficiently implemented with quantum Q-learning agents, facilitating emergent cooperation even in classical social dilemmas. Communication via token-based (temporal-difference monotonic-improvement) schemes significantly outperforms variants based on ad-hoc gifting or unstructured discrete messages (Kölle et al., 26 Jan 2026).
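The claim that shared entanglement enables correlations beyond shared randomness can be illustrated with the textbook CHSH nonlocal game, used here as a stand-in for the single-round setting. This sketch is not the architecture of any cited work; the measurement angles are the standard optimal choice for the CHSH game, and the function names are hypothetical.

```python
import itertools
import numpy as np


def basis(angle):
    """Rows are the two outcome vectors of a real qubit measurement at `angle`."""
    return np.array([[np.cos(angle), np.sin(angle)],
                     [-np.sin(angle), np.cos(angle)]])


def quantum_win_probability():
    """CHSH win rate when two agents share the Bell state (|00> + |11>)/sqrt(2)."""
    bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
    alice_angles = {0: 0.0, 1: np.pi / 4}
    bob_angles = {0: np.pi / 8, 1: -np.pi / 8}
    win = 0.0
    for x, y in itertools.product((0, 1), repeat=2):  # uniform inputs
        A, B = basis(alice_angles[x]), basis(bob_angles[y])
        for a, b in itertools.product((0, 1), repeat=2):  # joint outcomes
            amplitude = np.kron(A[a], B[b]) @ bell
            if (a ^ b) == (x & y):  # winning condition of the game
                win += 0.25 * amplitude ** 2
    return win


def classical_win_probability():
    """Best win rate over all deterministic local strategies a=f(x), b=g(y)."""
    best = 0.0
    for f in itertools.product((0, 1), repeat=2):
        for g in itertools.product((0, 1), repeat=2):
            wins = sum((f[x] ^ g[y]) == (x & y)
                       for x in (0, 1) for y in (0, 1))
            best = max(best, wins / 4)
    return best


print(classical_win_probability(), quantum_win_probability())
```

Shared randomness only mixes deterministic strategies, so it cannot exceed the classical value computed here, while the entangled strategy does.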

6. Empirical Results and Benchmark Evaluations

QMARL architectures have been validated across several benchmark environments:

Single-Hop Edge-Cloud Offloading: The CTDE-QMARL framework surpasses classical MARL by 57.7 points in normalized return under identical parameter budgets, even matching the performance of overparameterized classical networks (Yun et al., 2022).
Coin Game Multi-Agent Task: VQCs (8 layers, 6 qubits) trained by EA reach scores on par with state-of-the-art neural networks using only 148 parameters (versus 6788 for large neural nets), a reduction of roughly 97.8% (Kölle et al., 2023, Kölle et al., 2024).
Aerial Ad-hoc Networks: Hybrid quantum-classical MAPPO with a quantum critic demonstrates faster and slightly superior convergence relative to classical PPO, along with improved sample efficiency (Drăgan et al., 2024).
Space-Air-Ground Integrated Networks: QMARL with PVM action encoding achieves higher normalized reward than classical policy-gradient and DQN baselines, which collapse as the problem scale grows, pointing to a practical quantum advantage in scenarios with exponential scheduling dimensions (Kim et al., 2024).
Quantum Architecture Search: Multi-agent RL accelerates quantum architecture search, reducing the number of episodes needed to reach a target reward, and yields circuits with fewer CNOTs than QAOA baselines (Sergeev et al., 27 Nov 2025).

7. Limitations, Open Problems, and Future Directions

While QMARL demonstrates clear parameter and sample efficiency, several critical challenges persist:

NISQ-Era Hardware Constraints: Noise and decoherence in physical qubits restrict expressivity, limiting circuit depth and thereby function approximation capacity (Yun et al., 2022, Yun et al., 2023).
Optimization Barriers: Barren plateaus remain a barrier for gradient-based QRL/MARL; population-based, gradient-free methods partially mitigate them but require additional computational resources (Kölle et al., 2023, Kölle et al., 2024).
Scalability: While encoding and architectural advances support sublinear system growth, experiments beyond moderate agent or action-space sizes are simulation-based; direct demonstration on NISQ devices at scale is an open research target (Kölle et al., 2023, Drăgan et al., 2024).
Entanglement Distribution: Realistic distributed entanglement channels and measurement non-idealities must be incorporated for deployment in physical multi-agent systems (DeRieux et al., 2024, Gardiner et al., 9 Feb 2026).
Unification with Classical Algorithms: Hybrid quantum-classical pipelines (quantum critics + classical actors or vice versa) show promise for improved expressivity and noise robustness, but the optimal partitioning and integration remain to be fully formalized and empirically characterized (Drăgan et al., 2024, Taghavi et al., 25 Nov 2025).

Ongoing development focuses on noise-robust, hardware-efficient ansätze, communication-efficient entangled protocols, hybrid optimization methods (meta-learning, evolutionary search), and the systematic study of quantum advantage in decentralized coordination (Yun et al., 2022, Gardiner et al., 9 Feb 2026, Taghavi et al., 25 Nov 2025).

Selected references: (Yun et al., 2022, Yun et al., 2022, Kölle et al., 2023, Park et al., 2023, Yun et al., 2023, Drăgan et al., 2024, Taghavi et al., 25 Nov 2025, Sergeev et al., 27 Nov 2025, Kölle et al., 2024, DeRieux et al., 2024, Gardiner et al., 9 Feb 2026, Kölle et al., 26 Jan 2026, Kim et al., 2024).