PubSwap: Public-Data Off-Policy Coordination for Federated RLVR

Published 14 Apr 2026 in cs.LG | (2604.12160v1)

Abstract: Reasoning post-training with reinforcement learning from verifiable rewards (RLVR) is typically studied in centralized settings, yet many realistic applications involve decentralized private data distributed across organizations. Federated training is a natural solution, but scaling RLVR in this regime is challenging: full-model synchronization is expensive, and performing many local steps can cause severe client drift under heterogeneous data. We propose a federated RLVR framework that combines LoRA-based local adaptation with public-data-based off-policy steps to improve both communication efficiency and cross-client coordination. In particular, a small shared public dataset is used to periodically exchange and reuse response-level training signals across organizations, providing a lightweight anchor toward a more globally aligned objective without exposing private data. Our method selectively replaces locally incorrect responses with globally correct ones during public-data steps, thereby keeping training closer to the local policy while still benefiting from cross-client coordination. Across mathematical and medical reasoning benchmarks and models, our method consistently improves over standard baselines. Our results highlight a simple and effective recipe for federated reasoning post-training: combining low-rank communication with limited public-data coordination.


Authors (5)

Summary

The paper introduces PubSwap, a framework combining LoRA-based local adaptation with public-data coordination to reduce communication costs and improve accuracy in federated RLVR.
Its Balanced aggregation method replaces a client's incorrect responses on public prompts with correct responses from other clients, aligning heterogeneous client models without exchanging private data.
Empirical results on math and medical reasoning benchmarks show accuracy gains over the FedAvg-GRPO and FedProx-GRPO baselines, with larger gains under stronger heterogeneity and more local steps.


Overview

The PubSwap framework targets two obstacles to federated reinforcement learning from verifiable rewards (RLVR) for LLMs applied to mathematical and medical reasoning: the cost of synchronizing full models and the client drift caused by many local steps on heterogeneous data. It combines local adaptation through Low-Rank Adaptation (LoRA) with periodic off-policy coordination on shared public data, which keeps federated post-training communication-efficient while preserving alignment across heterogeneous client distributions and keeping private data local.

Motivation and Problem Context

Federated RLVR is motivated by scenarios where training data—especially for domains like medicine or finance—cannot be centralized due to privacy and regulatory constraints. Existing solutions to federated training, including full-model synchronization and communication-efficient variants such as quantization and sparsification, are insufficient for RLVR: full fine-tuning incurs prohibitive communication costs, and increased local steps between synchronizations lead to substantial client drift, especially in heterogeneous data regimes. Moreover, available public data is small and underutilized in conventional federated RL methods.

RLVR post-training, particularly via algorithms such as Group Relative Policy Optimization (GRPO), substantially enhances LLM reasoning ability by leveraging verifiable rewards. However, most previous RLVR work assumes centralized or fully public data. This leaves a clear need for RLVR approaches that scale to practical, decentralized, privacy-sensitive applications.

Methodology

PubSwap is built on two mechanisms:

1. LoRA-based Local Adaptation

Each federated client trains only low-rank adaptation (LoRA) parameters for each model layer and keeps the pretrained backbone frozen. Clients send only LoRA updates to the server, which reduces per-layer communication from $\mathcal{O}(md)$ for an $m \times d$ weight matrix to $\mathcal{O}(r(m + d))$ for rank-$r$ factors, with $r \ll m, d$. Aggregation follows the FedIT methodology of averaging LoRA updates across clients, which keeps fine-tuning memory- and communication-efficient even for very large models.
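To make the communication saving concrete, the following minimal sketch counts the parameters sent per layer in both regimes and averages one LoRA factor across clients. The layer sizes, the rank, and the helper names are illustrative assumptions, not values from the paper, and the unweighted mean is only one possible reading of "averaging LoRA updates".

```python
def full_update_params(m: int, d: int) -> int:
    """Parameters sent when a dense m x d weight update is communicated."""
    return m * d


def lora_update_params(m: int, d: int, r: int) -> int:
    """Parameters sent for rank-r LoRA factors of shapes (m x r) and (r x d)."""
    return r * (m + d)


def average_factor(client_factors):
    """Unweighted element-wise mean of one LoRA factor across clients.

    Each factor is a list of rows (list of lists of floats). Assumption:
    plain averaging; a data-size-weighted mean would be a small change.
    """
    n = len(client_factors)
    rows, cols = len(client_factors[0]), len(client_factors[0][0])
    return [
        [sum(f[i][j] for f in client_factors) / n for j in range(cols)]
        for i in range(rows)
    ]


# Illustrative sizes only (hypothetical layer, not taken from the paper).
m, d, r = 4096, 4096, 16
full = full_update_params(m, d)     # 16_777_216 parameters
lora = lora_update_params(m, d, r)  # 131_072 parameters
ratio = full / lora                 # 128.0, i.e. 128x fewer parameters sent

# Two clients, one tiny 1 x 2 factor each.
averaged = average_factor([[[1.0, 2.0]], [[3.0, 4.0]]])  # [[2.0, 3.0]]
```

For a square layer the ratio simplifies to $m / (2r)$, so the saving grows as the rank shrinks relative to the layer width.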

2. Public-Data Off-Policy Coordination via "PubSwap"

Training alternates between local GRPO updates on private data and periodic steps where all clients jointly act on a small set of shared public prompts:

Public Step: All clients generate $K$ responses per public prompt and submit these to the server.
Response Aggregation:
- Random: For each prompt, $K$ responses are randomly sampled from the full pool of client generations, and shared with all clients for off-policy GRPO updates.
- Balanced: For each client, if the number of correct local responses falls below $K/2$, up to $K/2 - C_n$ incorrect responses (where $C_n$ is the number of locally correct responses) are replaced by correct responses from the global pool, excluding self-generated samples.
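The two aggregation rules above can be sketched for a single public prompt as follows. This is a hedged illustration, not the authors' implementation: the `Response` type and function names are hypothetical, $K/2$ is taken as integer division, donors are sampled uniformly, and the first incorrect responses in order are the ones replaced. The source does not specify any of these details.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    client_id: int  # which client generated the response
    text: str
    correct: bool   # outcome of the verifiable reward check


def random_aggregate(pool, k, rng):
    """Random: draw K responses uniformly from all client generations."""
    return rng.sample(pool, k)


def balanced_aggregate(client_id, local, pool, k, rng):
    """Balanced: top up a client's correct responses toward K/2.

    Replaces at most K/2 - C_n locally incorrect responses with correct
    responses generated by other clients.
    """
    target = k // 2                              # assumption: floor of K/2
    n_correct = sum(r.correct for r in local)    # C_n
    if n_correct >= target:
        return list(local)                       # nothing to swap
    donors = [r for r in pool if r.correct and r.client_id != client_id]
    n_swap = min(target - n_correct, len(donors))
    replacements = rng.sample(donors, n_swap)    # assumption: uniform choice
    group = list(local)
    wrong_positions = [i for i, r in enumerate(group) if not r.correct]
    for pos, donor in zip(wrong_positions, replacements):
        group[pos] = donor
    return group


rng = random.Random(0)
K = 4
client0 = [Response(0, f"c0-{i}", False) for i in range(K)]  # 0 correct
client1 = [Response(1, f"c1-{i}", i < 3) for i in range(K)]  # 3 correct
pool = client0 + client1

group0 = balanced_aggregate(0, client0, pool, K, rng)  # 2 swapped in
group1 = balanced_aggregate(1, client1, pool, K, rng)  # unchanged
shared = random_aggregate(pool, K, rng)                # same group for all
```

In this toy case client 0 ends up with two of its own incorrect responses and two correct responses from client 1, while client 1 keeps its own on-policy group untouched.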

By leveraging this swap of locally incorrect answers with globally correct ones, PubSwap enforces a coordination anchor while minimizing privacy risks—public data and model outputs are the only items exchanged, and no raw private data is ever communicated.
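One way to see why swapped-in correct responses matter is to look at group-relative advantages under binary rewards. The sketch below assumes the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation); the population standard deviation and the `eps` constant are simplifying assumptions, and the paper's exact variant is not stated in this summary.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for one prompt's K responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std for simplicity
    return [(r - mu) / (sigma + eps) for r in rewards]


# All K = 4 local responses are wrong: every advantage is zero,
# so the group contributes no policy-gradient signal.
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])

# After a Balanced swap brings in two correct responses, correct ones
# get positive advantages and incorrect ones get negative advantages.
after_swap = group_advantages([1.0, 1.0, 0.0, 0.0])
```

Under these assumptions, a client that fails a public prompt entirely learns nothing from it on its own, whereas the swapped group provides a usable contrast between correct and incorrect responses.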

The frequency of these public-data steps (the PubSwap period, $T_{\text{swap}}$) is a tunable hyperparameter that trades off communication cost against cross-client alignment.
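The role of the swap period can be illustrated with a toy schedule. The sketch assumes that every $T_{\text{swap}}$-th training step is a public step and all others are private steps; how public steps interleave with LoRA synchronization rounds is not specified in this summary, and `step_schedule` is a hypothetical helper.

```python
def step_schedule(total_steps: int, t_swap: int):
    """Label each training step as 'private' or 'public'.

    Assumption: a public-data step occurs on every t_swap-th step.
    """
    if t_swap < 1:
        raise ValueError("t_swap must be a positive integer")
    return [
        "public" if (step + 1) % t_swap == 0 else "private"
        for step in range(total_steps)
    ]


schedule = step_schedule(total_steps=6, t_swap=3)
# ['private', 'private', 'public', 'private', 'private', 'public']
num_public = schedule.count("public")  # 2
```

A smaller period means more response exchanges (more communication, tighter coordination); a larger period means fewer exchanges and more room for clients to drift.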

Empirical Results

PubSwap is evaluated on MATH and DeepMath for mathematical reasoning and on MedQA and MedMCQA for medical reasoning, using the Qwen2.5-MATH-1.5B, Qwen3-1.7B, Qwen3-4B-Instruct, and Llama3.2-3B-Instruct models. The evaluation reports pass@1 accuracy, with ablations over the number of local steps, the swap period, and the degree of data heterogeneity.

Key Findings

Significant Accuracy Gains: PubSwap (with Balanced aggregation) consistently outperforms both FedAvg-GRPO and FedProx-GRPO across all models, datasets, and heterogeneity regimes. For instance, on DeepMath with Qwen3-1.7B, PubSwap attains up to 55.8% pass@1 versus 50.7% with FedAvg at large local-step counts, a gap of 5.1 percentage points.
Robust to Heterogeneity: The relative advantage of PubSwap increases as data distributions across clients become more heterogeneous (Dirichlet concentration $\alpha$ decreasing from 0.3 to 0.1).
Efficiency at Scale: Gains are most pronounced with many local steps between synchronizations, the setting that matters for practical deployments where frequent synchronization is infeasible.
Comparison of Aggregation Methods: The Balanced strategy is superior to Random at large local-step counts due to reduced off-policy instability. With fewer local steps, Random can temporarily outperform Balanced because it shares globally correct responses more aggressively.
Nontrivial Behavior vs. FedProx: Notably, FedProx often yields no improvement (sometimes even degrades accuracy) relative to vanilla FedAvg-GRPO in federated RLVR, unlike in supervised FL, highlighting the difficulty of controlling drift in policy space.
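The Dirichlet $\alpha$ in the heterogeneity finding above controls how unevenly data is spread across clients: smaller $\alpha$ gives more skewed client distributions. The sketch below shows a common way to build such a split in federated learning benchmarks. It is an assumption about the general technique, not the paper's exact protocol; in particular, the summary does not say which attribute is partitioned, so `labels` stands for a hypothetical category such as problem topic.

```python
import random


def dirichlet_proportions(num_clients, alpha, rng):
    """Sample client proportions from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
    total = sum(draws)
    if total == 0.0:  # guard against numerical underflow for tiny alpha
        return [1.0 / num_clients] * num_clients
    return [x / total for x in draws]


def partition_by_label(labels, num_clients, alpha, seed=0):
    """Assign example indices to clients with per-label Dirichlet skew."""
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)
    for label in sorted(by_label):
        # One proportion vector per label: small alpha concentrates
        # most examples of this label on a few clients.
        props = dirichlet_proportions(num_clients, alpha, rng)
        for idx in by_label[label]:
            client = rng.choices(range(num_clients), weights=props)[0]
            clients[client].append(idx)
    return clients


# Hypothetical data: 60 examples over 3 topics, split across 4 clients.
labels = [i % 3 for i in range(60)]
skewed = partition_by_label(labels, num_clients=4, alpha=0.1)
milder = partition_by_label(labels, num_clients=4, alpha=0.3)
```

Each example lands on exactly one client; comparing the per-client label counts of `skewed` and `milder` over several seeds shows the stronger concentration at the smaller $\alpha$.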

Theoretical Analysis

The appendix provides a drift analysis that quantifies how private and public-data steps each contribute to divergence between clients. The key insight is that:

Private steps induce drift proportional to the maximal heterogeneity of private gradients, scaled by the smoothness constant of the private-data objective.
Balanced steps, by anchoring to shared public data and selectively swapping incorrect responses, induce substantially less drift, bounded by the corresponding public-data smoothness constant and the fraction of responses that must be replaced.

This formalization justifies PubSwap’s empirical stability and alignment even under substantial heterogeneity.

Implications and Future Directions

PubSwap demonstrates a practical mechanism for federated RLVR that makes collaborative reasoning post-training efficient, scalable, and privacy-preserving. The approach does not depend on a particular downstream task, as the gains on both mathematical and medical reasoning suggest. The selective, reward-based swapping in Balanced aggregation offers an interpretable control knob over the bias-variance tradeoff induced by off-policy data.

Potential extensions include:

Adaptive Coordination: Dynamically tuning the PubSwap period, leveraging curriculum-based sampling of public prompts, or learning informativeness-based prompt assignments to optimize cross-client synergy.
Stronger Off-Policy Corrections: Integrating advanced off-policy correction (e.g., importance weighting or specialized reward shaping) to further reduce bias and improve sample efficiency.
Generalization to Other RL Tasks: Application beyond language reasoning, such as federated RL with privacy constraints in robotics, finance, or healthcare.

Conclusion

PubSwap establishes a principled and empirically validated framework for federated RLVR with LLMs operating on sensitive, distributed datasets. By decomposing adaptation into LoRA-based local steps and lightweight, public-data-coordinated off-policy steps, it achieves accuracy gains, communication efficiency, and robustness to heterogeneity without direct data sharing. This framework opens up new opportunities for collaborative AI model training in privacy-sensitive, data-siloed domains.
