
Federated Q-Learning

Updated 26 November 2025
  • Federated Q-Learning is a decentralized reinforcement learning method where agents update local Q-values and collaborate through periodic aggregation without sharing raw data.
  • It leverages local updates and synchronized aggregation to balance sample efficiency and communication constraints, even under heterogeneous conditions.
  • Empirical results in applications like 5G load balancing and IoT networks demonstrate significant performance gains and robustness in real-world deployments.

Federated Q-Learning (FedRL) is a paradigm within Federated Reinforcement Learning in which multiple agents, distributed across computational nodes or physical locations, collaboratively learn or refine action-value functions (Q-functions). These agents do so without sharing raw experience data, but instead coordinate via aggregation and synchronization of Q-function parameters or value estimates under the orchestration of a central server. FedRL generalizes classic Q-learning to decentralized settings where privacy, communication constraints, or structural heterogeneity preclude centralized data collection.

1. Formulation and Algorithmic Foundations

FedRL assumes a set of $N$ (or $K$) agents, each interacting with its own instance of an environment formulated as a Markov Decision Process (MDP) or with a local sampler. The canonical objective is estimation of the optimal Q-function $Q^*$ that satisfies the Bellman optimality condition

$$Q^*(s,a) = r(s,a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot|s,a)} \left[ \max_{a'} Q^*(s',a') \right],$$

where $r(s,a)$ is the reward, $P$ is the transition kernel, and $\gamma < 1$ is the discount factor.

A typical synchronous protocol proceeds as follows (a minimal code sketch follows below):

  1. Local Q-update: Each agent $k$ executes local Q-learning on its own data or samples, performing

$$Q^{k}_{t+\frac{1}{2}}(s,a) = (1-\eta)\, Q^k_t(s,a) + \eta \left[ r(s,a) + \gamma \max_{a'} Q^k_t(s',a') \right],$$

where $s' \sim P^k(\cdot|s,a)$ is agent-specific.

  2. Periodic aggregation: Every $E$ local steps, the server aggregates local Q-estimates, commonly via uniform averaging:

$$Q^{k}_{t+1}(s,a) = \frac{1}{N} \sum_{j=1}^N Q^j_{t+\frac{1}{2}}(s,a).$$

Agents then replace or combine their local $Q$ with the aggregated model.

  3. Termination: The process continues until the desired accuracy is achieved or resource budgets are exhausted.

Variants include asynchronous protocols, intermittent communication schemes, and importance-weighted aggregation (Woo et al., 2023, Salgia et al., 30 Aug 2024).
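
A minimal tabular sketch of the synchronous protocol above is given below. The per-agent environment interface (`reset()` and `step(s, a) -> (r, s_next)`), the uniform behavior policy, and the constant step size are illustrative assumptions, not a specific published implementation.

```python
import numpy as np

def federated_q_learning(envs, n_states, n_actions, gamma=0.99,
                         eta=0.1, E=10, T=10_000):
    """Sketch of synchronous federated Q-learning with uniform averaging.

    Assumptions (not from the source): each element of `envs` exposes
    reset() -> state and step(state, action) -> (reward, next_state),
    with integer state/action indices; a constant step size eta is used.
    """
    N = len(envs)
    Q = [np.zeros((n_states, n_actions)) for _ in range(N)]  # local Q-tables
    states = [env.reset() for env in envs]

    for t in range(T):
        # Step 1: local Q-update on each agent's own transitions.
        for k, env in enumerate(envs):
            s = states[k]
            a = np.random.randint(n_actions)           # simple uniform behavior policy
            r, s_next = env.step(s, a)
            target = r + gamma * Q[k][s_next].max()    # Bellman backup with a local sample
            Q[k][s, a] = (1 - eta) * Q[k][s, a] + eta * target
            states[k] = s_next

        # Step 2: periodic aggregation every E local steps (uniform averaging).
        if (t + 1) % E == 0:
            Q_avg = np.mean(Q, axis=0)
            Q = [Q_avg.copy() for _ in range(N)]       # agents adopt the averaged model

    # Step 3: return the final aggregated estimate.
    return np.mean(Q, axis=0)
```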

2. Theoretical Guarantees and Communication-Sample Complexity

Early analysis focused on the trade-off between communication complexity (frequency and cost of aggregation) and per-agent sample complexity. For homogeneous agents (identical dynamics and reward), linear speedup ($1/N$ scaling) in sample complexity is achievable by distributing the global sample budget across agents (Woo et al., 2023). The following sample complexity bounds are representative:

$$T = \widetilde{O}\left( \frac{|S||A|}{N (1-\gamma)^5 \varepsilon^2} \right)$$

for achieving $\| Q - Q^* \|_\infty < \varepsilon$ (Woo et al., 2023, Salgia et al., 30 Aug 2024).

Critical lower bounds establish that any protocol obtaining this speedup must incur at least $\Omega(1/(1-\gamma))$ communication rounds (up to logarithmic factors), and a total communication volume proportional to the size of the state-action space (Salgia et al., 30 Aug 2024). Algorithms such as Fed-DVR-Q are order-optimal in both sample and communication complexity:

  • Sample: $O\left( \frac{|S||A|}{N (1-\gamma)^3 \varepsilon^2} \right)$,
  • Rounds: $O\left( \frac{1}{1-\gamma} \log \frac{1}{(1-\gamma)\varepsilon} \right)$,
  • Bits: $O\left(\frac{|S||A|}{1-\gamma} \log \frac{1}{(1-\gamma)\varepsilon}\right)$ (Salgia et al., 30 Aug 2024).

Achieving optimality requires careful design (variance reduction, large batch local updates, quantized aggregation) and is fundamentally limited by the contraction properties of the Bellman operator and the non-i.i.d. nature of reinforcement learning data.
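
To make the scaling concrete, the short sketch below evaluates the order-level Fed-DVR-Q quantities listed above for illustrative parameter values; constants and logarithmic factors hidden by the $O(\cdot)$ notation are dropped, so the outputs indicate scaling behavior only.

```python
import math

def fed_dvr_q_orders(S, A, N, gamma, eps):
    """Order-level (constant-free) evaluation of the Fed-DVR-Q bounds.

    Constants and log factors are suppressed, so the returned numbers are
    only meaningful for comparing how the bounds scale with S, A, N, gamma, eps.
    """
    samples = S * A / (N * (1 - gamma) ** 3 * eps ** 2)                    # sample complexity order
    rounds = (1 / (1 - gamma)) * math.log(1 / ((1 - gamma) * eps))         # communication rounds order
    bits = S * A * (1 / (1 - gamma)) * math.log(1 / ((1 - gamma) * eps))   # communicated bits order
    return samples, rounds, bits

# Illustrative values: 100 states, 10 actions, 20 agents, gamma = 0.99, eps = 0.1.
print(fed_dvr_q_orders(S=100, A=10, N=20, gamma=0.99, eps=0.1))
```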

3. Heterogeneous Agents and Black-Box Architectures

In practical scenarios, participating agents are often heterogeneous with respect to their MDPs, policy parameterizations, data distributions, or computational budgets. The "FedRL-HALE" setting (Federated RL with Heterogeneous And bLack-box agEnts) captures cases where each agent $n$ maintains a private Q-network $Q(\cdot, \cdot; \theta_n)$ (possibly with a distinct architecture, optimizer, and data buffer), and only exposes value estimates for prescribed queries (Fan et al., 2023).

The Federated Heterogeneous Q-Learning (FedHQL) framework decouples agent architectures and permits "black-box" participation. The protocol features:

  • Federated UCB exploration: The server computes an action $a_t$ via UCB over the agents' $Q_n(s_t, a)$ vectors.
  • Federated temporal-difference (TD) update: Aggregated $Q$-values are updated via a server-side Bellman backup and broadcast to the agents.
  • Local knowledge distillation: Agents locally align their $Q_n$ to the federated $Q$ via loss minimization. (A schematic sketch of these steps appears at the end of this section.)

This architecture:

  • Enables collaboration among disparate agents;
  • Preserves data/model privacy (no gradients or raw samples shared);
  • Empirically yields $2$–$3\times$ sample-efficiency gains, robust consensus learning, and improved performance for weak agents (Fan et al., 2023).

FedHQL and analogous approaches handle both parametric heterogeneity and dynamics heterogeneity but pose open challenges for global convergence analysis and theoretical optimality beyond value-bounded stability.
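
A schematic sketch of the three protocol components is given below; the specific UCB bonus, the server-side backup form, and the distillation target are simplified assumptions rather than the exact FedHQL specification (Fan et al., 2023).

```python
import numpy as np

def federated_ucb_action(q_values, visit_counts, t, c=1.0):
    """Server-side UCB action selection over black-box agents' value queries.

    q_values:     array (N_agents, n_actions) holding each agent's Q_n(s_t, .)
    visit_counts: array (n_actions,) of server-side visit counts for s_t
    The bonus below is a generic count-based UCB term, assumed for illustration.
    """
    mean_q = q_values.mean(axis=0)                         # consensus value estimate
    bonus = c * np.sqrt(np.log(t + 2) / (visit_counts + 1))
    return int(np.argmax(mean_q + bonus))

def federated_td_target(reward, q_next_values, gamma=0.99):
    """Server-side Bellman backup on aggregated value queries for the next state."""
    return reward + gamma * q_next_values.mean(axis=0).max()

def distillation_target(q_values):
    """Target toward which each agent regresses its local Q_n(s_t, .),
    e.g. by minimizing ||Q_n(s_t, .) - target||^2 locally (sketch only)."""
    return q_values.mean(axis=0)
```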

4. Impact of Heterogeneity and Adaptive Aggregation

In the presence of heterogeneous dynamics (distinct $P^k$ across agents), convergence behavior can be fundamentally altered. In synchronous averaging schemes with interval $E>1$, heterogeneity induces a bias that does not vanish even with infinite samples:

$$\| Q_T - Q^* \|_\infty \geq \Omega\left(\frac{E}{T}\right),$$

where $E$ is the local update interval and $T$ the total number of updates. The bias scales with the degree of transition-kernel mismatch $\kappa = \sup_{k,s,a} \| \bar P(\cdot|s,a) - P^k(\cdot|s,a) \|_1$ (Wang et al., 5 Sep 2024). Hence,

  • Frequent aggregation ($E=1$) is necessary in highly heterogeneous settings;
  • Multi-phase, adaptive step-size schedules can mitigate the bias: aggressive exploration early, with a reduced learning rate after the phase transition;
  • Estimating $\kappa$ in practice guides the choice of $E$ and step size (Wang et al., 5 Sep 2024).

An important positive result is the "blessing of heterogeneity": even when individual agents’ trajectories do not cover the entire state-action space, learning is possible as long as the union of their behaviors ensures collective coverage. Robust linear speedup is possible with correct aggregation weights (visit-count-based importance averaging) over randomly and heterogeneously sampled data (Woo et al., 2023).
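
A minimal sketch of visit-count-based importance averaging follows, under the assumption that agents report per-$(s,a)$ visit counts alongside their Q-tables; the exact weighting in Woo et al. (2023) may differ in detail.

```python
import numpy as np

def importance_weighted_aggregation(Q_local, visit_counts):
    """Aggregate local Q-tables with visit-count-based importance weights.

    Q_local:      array-like of shape (N, S, A) with local Q estimates.
    visit_counts: array-like of shape (N, S, A) counting visits to each (s, a).
    Entries visited by no agent fall back to a uniform average.
    """
    Q_local = np.asarray(Q_local, dtype=float)
    visit_counts = np.asarray(visit_counts, dtype=float)
    total = visit_counts.sum(axis=0, keepdims=True)                  # (1, S, A)
    weights = np.where(total > 0,
                       visit_counts / np.maximum(total, 1.0),        # weight by coverage
                       1.0 / Q_local.shape[0])                       # uniform fallback
    return (weights * Q_local).sum(axis=0)                           # (S, A)
```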

5. Advances in Communication-Efficient and Gap-Dependent Protocols

Modern FedRL research aims at protocols that simultaneously minimize communication cost and achieve optimal regret/sample rates:

  • Event-triggered communication: Synchronization and aggregation are invoked only upon certain data-driven events (e.g., visit-count thresholds), leading to $O(\log T)$ communication rounds while maintaining minimax regret rates for episodic MDPs (Zheng et al., 2023, Zheng et al., 29 May 2024).
  • Gap-dependent analysis: In favorable MDPs where suboptimal actions have positive gaps $\Delta_{\min}>0$, FedRL can achieve $\log T$-type regret instead of $\sqrt{T}$, and further reduce the communication cost, removing dependence on the number of agents, states, and actions inside logarithmic factors (Zhang et al., 5 Feb 2025).
  • Variance reduction and reference-advantage decomposition: Using fixed reference values and advantage estimates for Q-updates reduces variance and lowers the leading constant in regret and communication bounds (Zheng et al., 29 May 2024).
  • Compression: Incorporating unbiased and error-feedback-based compression operators (e.g., Top-$K$, Sparsified-$K$) at the per-round aggregation step can reduce bandwidth by up to $90\%$ relative to standard uncompressed aggregation, without statistically significant loss in solution quality (Beikmohammadi et al., 26 Mar 2024), as sketched below.
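
The compression pattern referenced above can be sketched as follows; the operator and the error-feedback bookkeeping are generic, and the exact operators in the cited work may differ.

```python
import numpy as np

def top_k_compress(update, k):
    """Keep only the k largest-magnitude entries of a model update (Top-K)."""
    flat = update.ravel()
    if k >= flat.size:
        return update.copy()
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest magnitudes
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(update.shape)

def compress_with_error_feedback(update, residual, k):
    """Error-feedback compression: fold in the residual carried from the previous
    round, compress, and store what was dropped for the next round (sketch only)."""
    corrected = update + residual
    compressed = top_k_compress(corrected, k)
    new_residual = corrected - compressed
    return compressed, new_residual
```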

These techniques allow scaling FedRL to high-dimensional settings and stringent communication constraints.

6. Practical Applications and Empirical Evaluations

FedRL has been deployed for real-world networked and multi-agent control systems:

  • 5G and Open-RAN load balancing: Devices locally train DQNs for handover decisions and aggregate models at the near-RT controller via FedAvg (sketched after this list). This achieves $12\%$ higher throughput and $30\%$ lower load variance than MAX-SINR baselines (Lin et al., 10 Feb 2024).
  • IoT and edge networks: Federated DDQN and DQN schemes for offloading and energy-delay minimization achieve up to $3\times$ faster convergence compared to non-federated baselines, with the observed acceleration attributable to knowledge transfer via model aggregation (Zarandi et al., 2021).
  • Teleoperated and vehicular networks: Tabular and neural federated Q-learning outperform both centralized and stateless RL for predictive QoS, balancing end-to-end delay and compression fidelity in strict real-time environments (Bragato et al., 3 Oct 2024).
  • Edge cloud task allocation/routing: Federated Deep Q-Learning enables decentralized task routing with reduced fronthaul delay and improved global reward (Ndikumana et al., 2023).
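
For the model-aggregation pattern in the 5G example above, a minimal FedAvg-style averaging of per-device Q-network weights might look like the following; the layer-list representation and the optional sample-count weighting are illustrative assumptions.

```python
import numpy as np

def fedavg_q_network(local_weights, sample_counts=None):
    """Average per-device Q-network parameters, FedAvg-style.

    local_weights: list (one entry per device) of lists of np.ndarray layer
                   parameters; all devices are assumed to share one architecture,
                   which does not hold in heterogeneous (black-box) settings.
    sample_counts: optional per-device sample counts for weighted averaging.
    """
    n_devices = len(local_weights)
    if sample_counts is None:
        w = np.ones(n_devices) / n_devices
    else:
        w = np.asarray(sample_counts, dtype=float)
        w = w / w.sum()
    n_layers = len(local_weights[0])
    return [sum(w[i] * local_weights[i][j] for i in range(n_devices))
            for j in range(n_layers)]
```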

Empirically, federated protocols enable privacy-preserving, distributed RL with lowered computation and communication burdens on central controllers and enhanced robustness to agent failures or stragglers.

7. Limitations, Open Problems, and Future Directions

Despite the significant advances, FedRL remains subject to several open challenges:

  • Theoretical characterization of convergence and optimality for deep, heterogeneous, and non-tabular Q-networks remains incomplete (Fan et al., 2023).
  • Handling severe dynamics heterogeneity, non-stationary environments, and partial coverage without loss of statistical or communication efficiency is an area of active research (Wang et al., 5 Sep 2024).
  • Practical security and privacy constraints (e.g., differential privacy, secure aggregation) for Q-models and gradients are not fully addressed in most protocols (Lin et al., 10 Feb 2024).
  • Extending federated methodologies to actor-critic, policy-gradient, or value-prediction settings, and developing communication-adaptive and asynchrony-robust designs, represent current frontiers (Fan et al., 2023, Jin et al., 7 May 2024).

In summary, federated Q-learning provides rigorous, scalable mechanisms for distributed RL under privacy, heterogeneity, and communication constraints. A growing body of theory quantifies the conditions under which linear speedup, minimal communication, and robustness to agent and data non-uniformity can be achieved. Ongoing research seeks to generalize these guarantees to broader classes of RL algorithms and to more severe heterogeneity and limited communication scenarios.
