Multi-Agent Ensemble-Assisted DRL

Updated 21 December 2025
  • The paper introduces an architecture that combines localized deep reinforcement learning with a global ensemble mechanism to achieve coordinated decision-making.
  • It employs privacy-preserving federated gradient updates and imitation acceleration, ensuring secure and scalable training across distributed agents.
  • Empirical results demonstrate significantly faster convergence and improved resource scheduling compared to traditional isolated or classic multi-agent DRL approaches.

A multi-agent ensemble-assisted deep reinforcement learning (DRL) architecture combines distributed agents, each operating with localized observation and/or action spaces, with ensemble methods for global coordination. In such systems, multiple agents collaborate—leveraging ensemble mechanisms or shared global parameter backbones—to tackle complex, high-dimensional environments, typically with privacy, scalability, or efficiency constraints. Two prominent instantiations are the privacy-preserving global-local DQN framework for cooperative DRL (Shi, 2021) and the distributed resource scheduling framework for MEC systems with ensemble-assisted multi-agent DRL and imitation acceleration (Jiang et al., 2020). These architectures address the challenge of coordinating learning and decision-making across many agents in environments with state and reward heterogeneity, limited communication, and practical privacy/security requirements.

1. Architectural Foundations

Ensemble-assisted multi-agent DRL architectures can be structured according to how agents process information and interact:

  • In the privacy-preserving ensemble model (Shi, 2021), each agent maintains a local neural network tailored to its environment and a global network shared among all agents. The local net captures agent-specific features, while the global net encodes common structure across environments or tasks. Forward passes are serial: local feature extraction, shared global transformation, and local output head.
  • In the MEC resource scheduling ensemble (Jiang et al., 2020), one agent is deployed per edge server. Each observes only local channel state information (CSI), reducing the per-agent input dimension from $\mathcal{O}(NM)$ to $\mathcal{O}(N)$. Each agent's distinct DNN outputs per-device offloading probabilities, and a global ensemble mechanism aggregates those outputs to determine the system-wide resource allocation.

The table below summarizes key architectural distinctions:

| Architecture | Local Model Role | Global/Ensemble Mechanism |
|---|---|---|
| Privacy-preserving DQN (Shi, 2021) | Environment-specific specialization | Shared global DNN, federated/secure gradient aggregation |
| MEC scheduling (Jiang et al., 2020) | One agent per MEC node | Score-based ensemble voting |

2. Observation-to-Action Mapping and Ensemble Protocols

In serial ensemble models (Shi, 2021), given an agent's observation $s_t$:

  1. The local feature extractor $f_i^{(1)}(s_t; \theta_i^{(1)})$ produces agent-specific features.
  2. The shared global extractor $f_g(\cdot; \theta_g)$ applies the shared transformation.
  3. The local head $f_i^{(2)}(\cdot; \theta_i^{(2)})$ outputs Q-values or policy logits.

Formally,

$$h_i^{(1)} = f_i^{(1)}(s_t; \theta_i^{(1)}), \quad h_g = f_g(h_i^{(1)}; \theta_g), \quad q_t = f_i^{(2)}(h_g; \theta_i^{(2)}),$$

where $q_t \in \mathbb{R}^{|A_i|}$.
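
A minimal PyTorch sketch of this serial composition is shown below. The module names, layer sizes, and MLP blocks are illustrative assumptions rather than details from (Shi, 2021); each agent is given its own copy of the global module, which the federated updates of Section 3 keep synchronized.

```python
import copy

import torch
import torch.nn as nn

class LocalGlobalQNet(nn.Module):
    """Serial local -> global -> local composition (illustrative sizes)."""

    def __init__(self, obs_dim, n_actions, global_net, feat_dim=64):
        super().__init__()
        # Agent-specific feature extractor f_i^(1)(.; theta_i^(1))
        self.local_in = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        # Global extractor f_g(.; theta_g): each agent holds a copy whose
        # parameters are kept synchronized by federated gradient aggregation.
        self.global_net = global_net
        # Agent-specific output head f_i^(2)(.; theta_i^(2)) producing Q-values
        self.local_out = nn.Linear(feat_dim, n_actions)

    def forward(self, s_t):
        h1 = self.local_in(s_t)      # h_i^(1)
        h_g = self.global_net(h1)    # h_g
        return self.local_out(h_g)   # q_t in R^{|A_i|}

# One global template; every agent receives an identically initialized copy.
global_template = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
agents = [LocalGlobalQNet(obs_dim=8, n_actions=4,
                          global_net=copy.deepcopy(global_template))
          for _ in range(3)]
q_t = agents[0](torch.randn(1, 8))   # shape (1, 4)
```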

For the ensemble-MEC system (Jiang et al., 2020):

  • Each agent $j$ maps its local state $s_{j,t}$ to per-device offloading probabilities $q_{ij} = \pi_j(a_{ij} = 1 \mid s_{j,t})$.
  • The global offloading action for device $i$ uses a highest-vote operator:

$$a_i^e = \begin{cases} \operatorname{argmax}_{j} q_{ij} & \text{if } \max_j q_{ij} \geq 0.5 \\ 0 & \text{otherwise} \end{cases}$$
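
A small NumPy sketch of this voting rule is given below, under the assumption that agent/server indices start at 1 and index 0 denotes local execution (no offloading); the function name and matrix layout are illustrative.

```python
import numpy as np

def ensemble_offloading(q):
    """Highest-vote ensemble action per device.

    q: array of shape (num_devices, num_agents), where q[i, j] is agent j's
    probability that device i should offload to MEC server j (1-based index).
    Returns a_e with a_e[i] = 0 for local execution when no agent is
    confident (max probability < 0.5), else the index of the winning server.
    """
    best_server = np.argmax(q, axis=1) + 1   # convert to 1-based server index
    confident = np.max(q, axis=1) >= 0.5     # threshold from the voting rule
    return np.where(confident, best_server, 0)

# Example: 3 devices, 2 MEC servers/agents.
q = np.array([[0.9, 0.2],   # device 0 -> server 1
              [0.3, 0.4],   # device 1 -> local (no vote >= 0.5)
              [0.1, 0.7]])  # device 2 -> server 2
print(ensemble_offloading(q))   # [1 0 2]
```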

3. Federated Learning, Privacy, and Gradient Aggregation

Privacy and scalability are addressed by federated aggregation of updates (Shi, 2021):

  • Each agent computes gradients for the local ($\theta_i$) and global ($\theta_g$) components with respect to the loss $L_i(\theta_g, \theta_i)$:

$$g_i^i = \nabla_{\theta_i} L_i, \quad g_i^g = \nabla_{\theta_g} L_i$$

  • Only the global gradient $g_i^g$ is shared, in encrypted form, via an untrusted “Black Board”; no raw data or $\theta_i$ ever leaves the agent.
  • The Black Board aggregates the global gradients, $G^g = \sum_{i} g_i^g$, and broadcasts the aggregate for a synchronized update:

$$\theta_g \leftarrow \theta_g - \eta_g G^g$$

  • Local parameters $\theta_i$ are updated with local gradients only.

This federated scheme prevents cross-agent privacy leakage while allowing shared abstraction learning.
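
The sketch below mirrors one synchronized round of this scheme, reusing the hypothetical `LocalGlobalQNet` agents from the earlier sketch; the encryption and Black Board machinery are abstracted into a plain accumulator, so only the information flow (local gradients stay private, global gradients are summed and broadcast) is illustrated.

```python
import torch

def federated_round(agents, losses, eta_g=1e-3, eta_i=1e-3):
    """One synchronized update round (illustrative, no encryption).

    agents: list of LocalGlobalQNet, each holding its own copy of the global net.
    losses: per-agent DQN losses L_i computed from each agent's own batch.
    """
    # "Black Board" accumulator for the aggregated global gradient G^g.
    aggregate = [torch.zeros_like(p) for p in agents[0].global_net.parameters()]

    for agent, loss in zip(agents, losses):
        agent.zero_grad()
        loss.backward()
        with torch.no_grad():
            # Local update: theta_i <- theta_i - eta_i * g_i^i (never shared).
            local_params = list(agent.local_in.parameters()) + list(agent.local_out.parameters())
            for p in local_params:
                p -= eta_i * p.grad
            # Only the global gradient g_i^g is contributed to the aggregate
            # (encrypted before upload in the actual protocol).
            for acc, p in zip(aggregate, agent.global_net.parameters()):
                acc += p.grad

    # Broadcast G^g and apply theta_g <- theta_g - eta_g * G^g to every copy.
    with torch.no_grad():
        for agent in agents:
            for p, g in zip(agent.global_net.parameters(), aggregate):
                p -= eta_g * g
```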

4. Loss Functions, Regularization, and Learning Dynamics

Losses are decomposed to support agent specialization and collaboration:

  • In privacy-preserving ensemble DQN (Shi, 2021):
    • Per-agent loss: $L_i(\theta_g, \theta_i) = \text{DQN-MSE}(\theta_g, \theta_i) + \lambda_i \|\theta_i\|_2^2$
    • Global objective: $L_g(\theta_g) = \sum_i \alpha_i L_i(\theta_g, \theta_i) + \mu \|\theta_g\|_2^2$
    • No regularization term couples the $\theta_i$ across agents.
  • In ensemble-MEC DRL (Jiang et al., 2020):
    • Imitation pre-training: $L_1(\theta) = L_D(\theta) + \lambda_1 \|\theta\|_2^2$, with $L_D$ a cross-entropy loss on demonstration data
    • Joint DRL loss: $L_2(\theta) = L_1(\theta) + \lambda_2 L_A(\theta)$, with $L_A$ taking the same form as $L_D$ but computed on agent-experienced samples from the replay buffer
    • Prioritized sampling in the replay buffer uses recent loss change.

Optimization is by Adam; target networks are omitted in (Jiang et al., 2020) due to classification-based policy outputs.
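
As a concrete reading of the MEC-side definitions, the sketch below composes $L_1$ and $L_2$ from cross-entropy terms plus L2 weight decay; the function names, batch layout, and hyperparameter values are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def l2_penalty(model):
    """Sum of squared weights, ||theta||_2^2."""
    return sum((p ** 2).sum() for p in model.parameters())

def imitation_loss(policy, demo_states, demo_actions, lam1=1e-4):
    """L_1(theta) = L_D(theta) + lam1 * ||theta||_2^2, where L_D is a
    cross-entropy between policy logits and demonstrated offloading actions."""
    l_d = F.cross_entropy(policy(demo_states), demo_actions)
    return l_d + lam1 * l2_penalty(policy)

def joint_loss(policy, demo_batch, replay_batch, lam1=1e-4, lam2=1.0):
    """L_2(theta) = L_1(theta) + lam2 * L_A(theta); L_A has the same
    cross-entropy form as L_D but uses agent-experienced replay samples."""
    replay_states, replay_actions = replay_batch
    l_a = F.cross_entropy(policy(replay_states), replay_actions)
    return imitation_loss(policy, *demo_batch, lam1=lam1) + lam2 * l_a
```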

5. Exploration and Imitation Acceleration Mechanisms

Agents incorporate strategies for efficient exploration and fast convergence:

  • State-guided Lévy Flight Search (Jiang et al., 2020) is used during action refinement to sample diverse, long-range policy alternatives; step lengths follow a heavy-tailed Lévy distribution, and mutation/crossover operations generate candidate actions that are selected greedily against the objective (a minimal step-sampling sketch follows this list).
  • Imitation pre-training runs a heuristic solver offline (e.g., Lévy flight search with small $\beta$) to generate a dataset of state-action demonstrations. Each policy $\pi_j$ is pre-trained to minimize the imitation loss, and the pre-trained weights initialize the subsequent DRL stage. Demonstration data remains in the replay buffer for periodic supervised updates throughout DRL training.
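
One standard way to draw heavy-tailed step lengths for such a search is Mantegna's algorithm; the sketch below uses it to perturb a candidate offloading vector, which is an illustrative stand-in for the paper's state-guided mutation and crossover details.

```python
import numpy as np
from math import gamma, pi, sin

def levy_steps(size, beta=1.5, rng=None):
    """Draw Lévy-distributed step lengths via Mantegna's algorithm."""
    rng = np.random.default_rng() if rng is None else rng
    sigma_u = (gamma(1 + beta) * sin(pi * beta / 2) /
               (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma_u, size)
    v = rng.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1 / beta)

def levy_mutate(action, num_choices, beta=1.5, rng=None):
    """Perturb a discrete offloading vector with Lévy-sized jumps, then clip
    back into the valid range {0, ..., num_choices - 1}."""
    step = np.rint(levy_steps(action.shape, beta, rng)).astype(int)
    return np.clip(action + step, 0, num_choices - 1)

# Example: mutate a candidate assignment of 5 devices over {local=0, server 1, server 2}.
candidate = np.array([0, 1, 2, 1, 0])
print(levy_mutate(candidate, num_choices=3))
```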

Table: Key Auxiliary Mechanisms

| Mechanism | Implementation Details | Impact |
|---|---|---|
| Lévy flight search | Heavy-tailed random step search | Enhanced exploration, faster escape from local optima |
| Imitation acceleration | Pre-training on a demonstration set retained in the replay buffer | Higher initial accuracy, faster convergence |

6. Stepwise Training Procedure

The two ensemble-assisted frameworks follow distinct two-level training protocols; a two-phase training sketch for the MEC variant appears after the lists below.

In the privacy-preserving framework (Shi, 2021):

  1. Each agent collects experience and computes its local and global gradients.
  2. Local models are updated independently; encrypted global gradients are aggregated and applied synchronously.
  3. Target networks are synchronized periodically for stability.

In the ensemble-MEC framework (Jiang et al., 2020):

  1. Training is centralized with access to full information; ensemble aggregation plus Lévy-based action refinement yields training transitions for a global replay buffer.
  2. Each agent is pre-trained on demonstration data, then jointly trained using both new experiences and demonstration samples.
  3. Execution is decentralized: agents operate on local information with their individual policies.
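
The sketch below strings the pre-training and joint-training phases together for a single agent policy, reusing the hypothetical `imitation_loss` / `joint_loss` helpers from the Section 4 sketch; the synthetic data, network shape, and step counts are assumptions, and the centralized ensemble/replay machinery is omitted.

```python
import torch
import torch.nn as nn

def two_phase_training(policy, demo_states, demo_actions,
                       replay_states, replay_actions,
                       pretrain_steps=200, joint_steps=200):
    """Imitation pre-training followed by joint training that keeps the
    demonstration term alongside agent-experienced samples (illustrative)."""
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    # Phase 1: pre-train on demonstrations only (loss L_1).
    for _ in range(pretrain_steps):
        opt.zero_grad()
        imitation_loss(policy, demo_states, demo_actions).backward()
        opt.step()
    # Phase 2: joint loss L_2 mixes demonstration and replay batches.
    for _ in range(joint_steps):
        opt.zero_grad()
        joint_loss(policy, (demo_states, demo_actions),
                   (replay_states, replay_actions)).backward()
        opt.step()

# Toy run: one agent policy, 16-dim local state, 3 offloading choices.
policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 3))
demo_s, demo_a = torch.randn(256, 16), torch.randint(0, 3, (256,))
rep_s, rep_a = torch.randn(256, 16), torch.randint(0, 3, (256,))
two_phase_training(policy, demo_s, demo_a, rep_s, rep_a)
```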

7. Empirical Performance, Theoretical Benefits, and Scalability

Multi-agent ensemble-assisted DRL architectures empirically show superior convergence and sample efficiency over solo or classic multi-agent DRL (Shi, 2021, Jiang et al., 2020):

  • Collaboration via shared global layers or ensemble voting captures universal structure, allowing local models to specialize efficiently and reducing redundant learning efforts.
  • In privacy-preserving DQN, collaborating agents in identical environments converge to high average return in about 30 epochs, versus more than 200 for isolated agents; the benefits persist as environmental heterogeneity increases, though they diminish as less structure is shared (Shi, 2021).
  • In MEC scheduling, ensemble DRL with imitation achieves faster convergence, higher accuracy, and lower resource allocation cost compared to actor-critic, DDPG, random, local, or greedy baselines (Jiang et al., 2020).
  • Exploration and imitation mechanisms further improve results: imitation boosts initial performance, and Lévy search enables robust exploration of large, combinatorial action spaces.

A plausible implication is that in large-scale, distributed or privacy-constrained environments, ensemble-assisted multi-agent DRL architectures offer a scalable path toward efficient, collaborative learning without sacrificing agent heterogeneity or privacy.


References:

  • "A Privacy-preserving Distributed Training Framework for Cooperative Multi-agent Deep Reinforcement Learning" (Shi, 2021)
  • "Distributed Resource Scheduling for Large-Scale MEC Systems: A Multi-Agent Ensemble Deep Reinforcement Learning with Imitation Acceleration" (Jiang et al., 2020)
