FedSARSA: Distributed RL with Federated Learning
- Federated SARSA is a distributed reinforcement learning framework that integrates on-policy SARSA with federated protocols to collaboratively learn policies from heterogeneous environments while preserving data locality.
- It synchronizes local Q-tables or linear approximators via centralized averaging or decentralized mobile-agent aggregation, achieving provable linear speed-up in sample complexity.
- The approach balances bias and variance through local update frequencies, demonstrating practical success in multi-robot and teleoperated driving scenarios with finite-time convergence guarantees.
Federated SARSA (FedSARSA) is a distributed reinforcement learning scheme that combines the on-policy SARSA algorithm with a federated learning protocol across heterogeneous agents, leveraging either parameter averaging via a central server or decentralized model aggregation via mobile agents. The FedSARSA framework enables multiple agents, each interacting with distinct (possibly non-identical) Markov decision processes (MDPs), to collaboratively learn near-optimal policies while preserving data locality, addressing privacy, bandwidth, and robustness requirements. Key variants include both tabular/Q-table-based aggregation (as in decentralized robot scenarios) and linear function approximation for high-dimensional or continuous state-action spaces. FedSARSA achieves provable linear speed-up in sample complexity with respect to the number of agents and is robust to moderate heterogeneity across agent environments, as analytically established in recent work with finite-time guarantees (Zhang et al., 27 Jan 2024, Mangold et al., 19 Dec 2025, Nair et al., 2022, Bragato et al., 3 Oct 2024).
1. Federated SARSA Formulations and Problem Setup
FedSARSA operates over a population of $N$ agents, each interacting with its own MDP $\mathcal{M}_i = (\mathcal{S}, \mathcal{A}, P_i, R_i, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P_i$ the environment-specific transition kernel, $R_i$ the reward function, and $\gamma \in (0,1)$ the discount factor. Environments may differ across agents, introducing heterogeneity quantified by
$$\epsilon_p := \max_{i,j}\ \sup_{s \in \mathcal{S},\, a \in \mathcal{A}} \big\| P_i(\cdot \mid s,a) - P_j(\cdot \mid s,a) \big\|_{\mathrm{TV}},$$
where $\|\cdot\|_{\mathrm{TV}}$ denotes total variation distance.
Each agent $i$ maintains a local policy $\pi_{\theta_i}$ parameterized by weights $\theta_i \in \mathbb{R}^d$. In practical deployments, the action-value function is typically approximated linearly as $Q_{\theta_i}(s,a) = \phi(s,a)^\top \theta_i$, with feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ and a softmax or $\epsilon$-greedy behavior policy derived from $Q_{\theta_i}$.
The FedSARSA protocol periodically aggregates local model parameters:
- In server-based regimes, local parameter vectors are averaged using FedAvg: $\bar{\theta} = \frac{1}{N}\sum_{i=1}^{N} \theta_i$.
- In fully decentralized settings, a migrating agent aggregates Q-tables or parameters via weighted averaging or alternative rules (Nair et al., 2022). This design enables collaborative policy improvement without centralized data pooling.
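As a concrete illustration of the server-based variant, the following minimal sketch shows linear action-value evaluation and FedAvg-style parameter averaging. All names (the feature map `phi`, dimensions, and agent count) are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def q_value(theta, phi, s, a):
    """Linear action-value estimate Q_theta(s, a) = phi(s, a)^T theta."""
    return phi(s, a) @ theta

def fedavg(local_thetas):
    """Server-side FedAvg: uniform average of the agents' weight vectors."""
    return np.mean(np.stack(local_thetas, axis=0), axis=0)

# Illustrative usage: 4 agents, 8-dimensional features, placeholder feature map.
rng = np.random.default_rng(0)
phi = lambda s, a: np.eye(8)[hash((s, a)) % 8]
thetas = [rng.standard_normal(8) for _ in range(4)]
theta_bar = fedavg(thetas)                 # broadcast back to all agents
q_est = q_value(theta_bar, phi, s=0, a=1)  # Q estimate under the global model
```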
2. Local SARSA Update and Federated Synchronization
Each agent executes standard on-policy SARSA updates locally:
- At time $t$, observe $s_t$ and select $a_t \sim \pi_{\theta_i}(\cdot \mid s_t)$.
- Execute $a_t$ in $\mathcal{M}_i$, observe $r_t$ and $s_{t+1}$; select $a_{t+1} \sim \pi_{\theta_i}(\cdot \mid s_{t+1})$.
- Compute the TD error $\delta_t = r_t + \gamma\,\phi(s_{t+1},a_{t+1})^\top \theta_i - \phi(s_t,a_t)^\top \theta_i$ and update the weights: $\theta_i \leftarrow \theta_i + \alpha_t\,\delta_t\,\phi(s_t,a_t)$.
- Optionally, apply per-action visitation tracking for tabular Q-table cases.
Every $K$ steps, synchronization is triggered: each agent transmits its current weights to a central aggregator (or a mobile agent sequentially collects tables). Aggregation typically uses weighted or uniform averaging, after which the global parameter is redistributed to all participants, resetting local weights. In the tabular decentralized case, aggregation may instead average Q-table values per-visit or by a “max” rule (Nair et al., 2022).
The prototypical FedSARSA pseudocode is as follows (Mangold et al., 19 Dec 2025, Zhang et al., 27 Jan 2024):
```
for t = 0 to T-1:                     # local SARSA steps
    θ_i ← SARSA_update(θ_i)
    if t % K == 0:                    # every K steps: federated synchronization
        send θ_i to server
        receive θ̄ = aggregate({θ_j})
        θ_i ← θ̄
```
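A minimal executable sketch of the loop above, assuming linear features and a synchronous server. The environment interface (`reset()` returning a state, `step(a)` returning `(next_state, reward)`), the feature map, and the ε-greedy policy are illustrative stand-ins, not the implementations used in the cited papers.

```python
import numpy as np

def epsilon_greedy(theta, phi, state, actions, eps, rng):
    """ε-greedy action selection from the linear Q estimate."""
    if rng.random() < eps:
        return rng.choice(actions)
    q = [phi(state, a) @ theta for a in actions]
    return actions[int(np.argmax(q))]

def sarsa_update(theta, phi, s, a, r, s_next, a_next, alpha, gamma):
    """One on-policy SARSA step with linear function approximation."""
    td_error = r + gamma * phi(s_next, a_next) @ theta - phi(s, a) @ theta
    return theta + alpha * td_error * phi(s, a)

def fedsarsa(envs, phi, dim, actions, T=10_000, K=50,
             alpha=0.05, gamma=0.95, eps=0.1, seed=0):
    """Synchronous FedSARSA: local SARSA steps, FedAvg sync every K steps."""
    rng = np.random.default_rng(seed)
    thetas = [np.zeros(dim) for _ in envs]
    states = [env.reset() for env in envs]
    acts = [epsilon_greedy(th, phi, s, actions, eps, rng)
            for th, s in zip(thetas, states)]
    for t in range(T):
        for i, env in enumerate(envs):          # one local step per agent
            s_next, r = env.step(acts[i])
            a_next = epsilon_greedy(thetas[i], phi, s_next, actions, eps, rng)
            thetas[i] = sarsa_update(thetas[i], phi, states[i], acts[i],
                                     r, s_next, a_next, alpha, gamma)
            states[i], acts[i] = s_next, a_next
        if (t + 1) % K == 0:                    # federated synchronization
            theta_bar = np.mean(thetas, axis=0)
            thetas = [theta_bar.copy() for _ in envs]
    return np.mean(thetas, axis=0)
```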
3. Finite-Time Convergence Analysis and Sample Complexity
Modern theoretical analyses of FedSARSA provide precise non-asymptotic convergence guarantees in the presence of agent heterogeneity, concurrent local updates, and Markovian sampling.
On-policy FedSARSA with linear value function approximators converges to a universal fixed point $\theta^*$ associated with the average of all agent environments. Formally, the average parameter
$$\bar{\theta}_t = \frac{1}{N}\sum_{i=1}^{N} \theta_t^i$$
obeys a recursion of the form
$$\mathbb{E}\big\|\bar{\theta}_{t+1} - \theta^*\big\|^2 \le (1 - \alpha_t \lambda)\,\mathbb{E}\big\|\bar{\theta}_t - \theta^*\big\|^2 + O\!\left(\frac{\alpha_t^2}{N}\right) + O\!\left(\alpha_t\,\epsilon_p^2\right),$$
where $\lambda > 0$ is a problem-specific contraction constant. Concrete rates (Zhang et al., 27 Jan 2024, Mangold et al., 19 Dec 2025):
- With a constant step size $\alpha$, the iterates converge geometrically to a neighborhood of $\theta^*$ whose squared radius scales as $O\!\left(\tfrac{\alpha}{\lambda N} + \epsilon_p^2\right)$.
- With a linearly decaying step size, the sample complexity is $\tilde{O}\!\left(\tfrac{1}{N\epsilon}\right)$ per agent to achieve mean-squared error $\epsilon$, up to the heterogeneity-induced bias. When environments are nearly homogeneous ($\epsilon_p \approx 0$), FedSARSA achieves the optimal $1/N$ scaling (Zhang et al., 27 Jan 2024, Mangold et al., 19 Dec 2025).
Bias due to heterogeneity scales with the heterogeneity level $\epsilon_p$ and grows with the number of local updates $K$ performed per aggregation round, requiring $K$ to remain moderate to keep the bias small (Mangold et al., 19 Dec 2025). Empirically, the variance component of the error is reduced by a factor of $N$, consistent with the linear speed-up (Mangold et al., 19 Dec 2025).
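To make the bias-variance structure explicit, unrolling the recursion above with a constant step size gives the following bound. This is a sketch under the stated assumptions (with problem-dependent constants $C_1, C_2$), not a statement copied from the cited papers:

```latex
\begin{align*}
\mathbb{E}\big\|\bar{\theta}_t - \theta^*\big\|^2
  &\le (1-\alpha\lambda)^t \big\|\bar{\theta}_0 - \theta^*\big\|^2
     + \sum_{k=0}^{t-1} (1-\alpha\lambda)^{t-1-k}
       \Big( \frac{C_1\,\alpha^2}{N} + C_2\,\alpha\,\epsilon_p^2 \Big) \\
  &\le (1-\alpha\lambda)^t \big\|\bar{\theta}_0 - \theta^*\big\|^2
     + \frac{C_1\,\alpha}{\lambda N}
     + \frac{C_2\,\epsilon_p^2}{\lambda}.
\end{align*}
```

The first term decays geometrically, the second exhibits the $1/N$ linear speed-up, and the third is the heterogeneity-induced steady-state bias that step-size tuning alone cannot remove. With $K > 1$ local steps between synchronizations, an additional client-drift term growing with $K$ and $\epsilon_p$ enters the bound, which is the bias amplification discussed above.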
4. Decentralized and Serverless FedSARSA: Mobile-Agent Aggregation
In fully decentralized multi-robot architectures, Federated SARSA is implemented via a mobile agent (platform: Tartarus) that physically migrates between networked robotic hosts to retrieve and aggregate local Q-tables (Nair et al., 2022). The aggregation rule, for each state-action pair $(s,a)$, is
$$Q(s,a) \leftarrow \frac{\sum_{i=1}^{N} n_i(s,a)\, Q_i(s,a)}{\sum_{i=1}^{N} n_i(s,a)},$$
where $n_i(s,a)$ denotes agent $i$'s visit counts, and the mobile agent repeats forward and backward passes to collect and synchronize all agents' Q-tables. Communication complexity per sync is $O(N\,|\mathcal{S}|\,|\mathcal{A}|)$ for $N$ agents, $|\mathcal{S}|$ states, and $|\mathcal{A}|$ actions; e.g., for 10 robots, 20 state bins, and 3 actions this amounts to only a few KB per round (Nair et al., 2022).
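A minimal sketch of this aggregation step, assuming Q-tables and visit counts stored as NumPy arrays of shape (|S|, |A|). The code is illustrative, not the Tartarus implementation, and covers both the visit-count-weighted average and the "max" rule mentioned below.

```python
import numpy as np

def aggregate_q_tables(q_tables, visit_counts, rule="average"):
    """Combine per-robot Q-tables of shape (|S|, |A|).

    rule="average": visit-count-weighted average per (s, a) entry.
    rule="max":     element-wise maximum across robots.
    """
    q = np.stack(q_tables)       # shape (N, |S|, |A|)
    n = np.stack(visit_counts)   # shape (N, |S|, |A|)
    if rule == "max":
        return q.max(axis=0)
    # Unvisited (s, a) pairs (zero total count) default to 0 in this sketch.
    weights = n / np.maximum(n.sum(axis=0, keepdims=True), 1)
    return (weights * q).sum(axis=0)

# Illustrative usage: 10 robots, 20 discretized states, 3 actions.
rng = np.random.default_rng(1)
qs = [rng.standard_normal((20, 3)) for _ in range(10)]
ns = [rng.integers(0, 50, size=(20, 3)) for _ in range(10)]
q_global = aggregate_q_tables(qs, ns)   # redistributed to all robots
```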
The protocol ensures robustness: if an agent is unavailable, it is simply skipped; all model synchronization is mediated by the mobile agent, eliminating any server bottleneck.
Empirical findings demonstrate that decentralized FedSARSA accelerates policy learning in heterogeneous robot arenas, outperforming standalone SARSA in generalization to novel environments; convergence is observed for both average and "max" aggregation rules (Nair et al., 2022).
5. Practical Applications and Experimental Outcomes
FedSARSA has been deployed and benchmarked in multiple domains:
- Multi-Robot Obstacle Avoidance: In Webots-based scenarios, robots running independent instances synchronize tabular Q-tables via a mobile agent (Nair et al., 2022). Aggregated policies demonstrate faster adaptation and improved generalization to unseen layouts.
- Teleoperated Driving Networks: In the context of predictive Quality of Service (PQoS) optimization for end-to-end latency and compression in 6G teleoperation, FedSARSA agents (one per vehicle) periodically synchronize linear function weights via a central server (Bragato et al., 3 Oct 2024). The table below summarizes agent configurations and empirical results:
| Method | Avg Reward | Regret | #Params | Update Time |
|---|---|---|---|---|
| SARSA | 0.533 | 3.63 | 171 | 0.39 ms |
| DSARSA | 0.532 | 10.90 | 19,529 | 4.4 ms |
| Q-Learning | 0.535 | 3.63 | 171 | 0.52 ms |
| DDQN | 0.535 | 9.01 | 19,529 | 5.1 ms |
Linear FedSARSA provides low-latency, high-fidelity compression with negligible communication and sub-millisecond per-step computation. In congested networks, delay targets are met in a larger fraction of frames than with static or centralized alternatives. Privacy is preserved, as only weight vectors are transmitted; raw observations remain local.
- Heterogeneous Garnet Benchmarks: FedSARSA converges to a federated fixed-point parameter with MSE decreasing as $1/N$ as the number of agents $N$ increases; bias grows with environment heterogeneity and local update period, confirming theoretical predictions (Mangold et al., 19 Dec 2025).
6. Key Design Tradeoffs and Comparisons
FedSARSA offers several advantages:
- Linear speed-up: With $N$ agents, the sample complexity improves from $\tilde{O}(1/\epsilon)$ for single-agent SARSA to $\tilde{O}(1/(N\epsilon))$ in homogeneous populations.
- Robustness: Decentralized protocols can bypass failed agents with no loss of correctness (Nair et al., 2022).
- Privacy: Only parameter vectors or Q-tables are shared, protecting local data (Bragato et al., 3 Oct 2024).
- Bias-variance tradeoff: Heterogeneity introduces a steady-state bias proportional to $\epsilon_p$; large local-update periods $K$ amplify this bias.
- Comparison to off-policy FL: While both Q-learning and SARSA can be federated, FedSARSA delivers the first finite-time error guarantees for on-policy federation under Markovian sampling and time-varying behavior policies (Zhang et al., 27 Jan 2024). Simulation results indicate that off-policy methods may achieve slightly higher rewards but require careful exploration tuning.
Recommended settings keep $K$ (local steps per round) moderate: large enough to reduce communication, small enough to bound heterogeneity amplification. Learning rates should decay over time, and policy improvement operators should be Lipschitz-continuous to guarantee stability (Zhang et al., 27 Jan 2024, Mangold et al., 19 Dec 2025).
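As an illustration of these recommendations, the sketch below pairs a decaying step-size schedule with a softmax policy improvement operator, which is Lipschitz in the Q-values for a fixed temperature. The schedule constants, temperature, and placeholder feature map are illustrative assumptions, not values prescribed by the cited papers.

```python
import numpy as np

def step_size(t, alpha0=0.5, decay=1e-3):
    """Decaying learning rate alpha_t = alpha0 / (1 + decay * t), i.e. ~1/t decay."""
    return alpha0 / (1.0 + decay * t)

def softmax_policy(theta, phi, state, actions, temperature=1.0):
    """Softmax policy improvement operator: Lipschitz in the Q-values for a
    fixed, bounded temperature, which supports stability of on-policy updates."""
    q = np.array([phi(state, a) @ theta for a in actions])
    logits = q / temperature
    logits -= logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Illustrative usage with a placeholder feature map.
phi = lambda s, a: np.eye(6)[hash((s, a)) % 6]
theta = np.zeros(6)
probs = softmax_policy(theta, phi, state=0, actions=[0, 1, 2])
alpha_t = step_size(t=1_000)
```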
7. Future Directions and Open Challenges
Recent advances have closed key gaps in the theoretical understanding of federated on-policy RL with heterogeneity and Markovian dynamics. However, several open questions remain:
- Nonlinear Function Approximation: Most analyses to date address linear SARSA; neural variants (“FedDSARSA”, “FedDDQN”) scale to high-dimensional input but lack comparable performance guarantees (Bragato et al., 3 Oct 2024).
- Partial Participation and Non-IID Sampling: Trade-offs in asynchronous regimes and under missing agents require further study.
- Heterogeneity Mitigation: Adaptive synchronization and personalized aggregation rules are promising for large-scale, highly diverse deployments.
Continued empirical validation in real or large-scale simulated environments, as well as extensions to actor-critic and policy gradient variants, will further clarify the capabilities and limitations of the FedSARSA framework.
References:
- (Nair et al., 2022) On Decentralizing Federated Reinforcement Learning in Multi-Robot Scenarios
- (Zhang et al., 27 Jan 2024) Finite-Time Analysis of On-Policy Heterogeneous Federated Reinforcement Learning
- (Mangold et al., 19 Dec 2025) Convergence Guarantees for Federated SARSA with Local Training and Heterogeneous Agents
- (Bragato et al., 3 Oct 2024) Federated Reinforcement Learning to Optimize Teleoperated Driving Networks