- The paper introduces SAPO, a decentralized RL post-training algorithm that achieves up to a 94% cumulative reward improvement by balancing local and swarm-shared (external) rollouts.
- It relies on asynchronous, multi-agent experience sharing to enrich policy-gradient updates across heterogeneous networks without requiring synchronization or identical models.
- Empirical results show that a balanced local/external mix both accelerates learning and outperforms isolated RL fine-tuning, making the approach attractive for resource-constrained environments.
Efficient LM Post-Training via Decentralized RL Experience Sharing: The SAPO Algorithm
Introduction
The paper presents Swarm sAmpling Policy Optimization (SAPO), a fully decentralized and asynchronous RL post-training algorithm for LMs. SAPO is designed to operate in heterogeneous, distributed networks where each node manages its own policy and shares decoded rollouts with others, enabling collaborative learning without synchronization or homogeneity constraints. The approach is motivated by the need to scale RL post-training for LMs efficiently while circumventing the latency, memory, and reliability bottlenecks of conventional distributed RL systems. Empirically, SAPO achieves up to a 94% cumulative reward improvement over standard RL fine-tuning in controlled experiments, and the paper also reports insights from a large-scale open-source deployment.
Methodology: Swarm Sampling Policy Optimization (SAPO)
SAPO formalizes RL post-training in a decentralized multi-agent setting. Each node in the swarm maintains its own policy $\pi_n$, dataset $D_n$, and reward model $\rho_n$. Training proceeds in rounds, where each node:
- Samples a batch of tasks/questions.
- Generates multiple completions (rollouts) per question.
- Shares a subset of its rollouts, along with metadata and ground-truth answers, with the swarm.
- Constructs its training set by subsampling $I_n$ local and $J_n$ external rollouts (from the swarm).
- Computes rewards and updates its policy using a policy-gradient algorithm (e.g., PPO, GRPO).
This process is fully asynchronous and decentralized; nodes can filter, subsample, and re-encode external rollouts as needed. The algorithm is agnostic to model architecture, learning algorithm, and hardware, supporting both homogeneous and heterogeneous swarms.
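To make the round structure concrete, here is a minimal Python sketch of one SAPO round from a single node's perspective. It is not the authors' implementation: the `node` interface (`sample_tasks`, `generate`, `reward`, `update`), the `swarm_pool` list, and the dictionary fields are all assumptions standing in for the node's policy $\pi_n$, dataset $D_n$, reward model $\rho_n$, and whatever transport the deployment uses to exchange decoded rollouts.

```python
import random

def sapo_round(node, swarm_pool, num_local=4, num_external=4, completions_per_task=8):
    """One SAPO round for a single node (illustrative sketch, not the reference code).

    `node` is assumed to expose sample_tasks(), generate(), reward(), and update(),
    which stand in for the node's own policy, dataset, and reward model.
    """
    # 1. Sample a batch of tasks/questions from the node's local dataset D_n.
    tasks = node.sample_tasks()

    # 2. Generate multiple completions (rollouts) per question with the local policy pi_n.
    local_rollouts = [
        {"task": t, "completion": c, "answer": t.get("answer")}
        for t in tasks
        for c in node.generate(t, n=completions_per_task)
    ]

    # 3. Share a subset of decoded rollouts (plus metadata and ground truth) with the swarm.
    shared = random.sample(local_rollouts, k=min(len(local_rollouts), completions_per_task))
    swarm_pool.extend(shared)

    # 4. Build the training set from I_n local and J_n external rollouts.
    external = [r for r in swarm_pool if r not in local_rollouts]
    train_set = (
        random.sample(local_rollouts, k=min(num_local, len(local_rollouts)))
        + random.sample(external, k=min(num_external, len(external)))
    )

    # 5. Score every rollout with the node's own reward model rho_n and apply a
    #    policy-gradient update (e.g. PPO or GRPO).
    rewards = [node.reward(r) for r in train_set]
    node.update(train_set, rewards)
```

Because nodes run these rounds asynchronously, `swarm_pool` here is only a placeholder for the shared rollout pool; each node remains free to filter, subsample, and re-encode whatever it pulls from it.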
Figure 1: Baseline case, i.e., 8 local / 0 external rollouts.
Controlled Experiments: Setup and Results
Experiments were conducted using a swarm of eight Qwen2.5 models (0.5B parameters each), orchestrated via Docker and PyTorch distributed. The ReasoningGYM dataset provided procedurally generated, verifiable tasks across symbolic, numeric, and abstract reasoning domains. Each agent generated 8 completions per question, and policies were updated using GRPO with asymmetric clipping and no KL penalty.
Four configurations were evaluated: 8 local/0 external (baseline), 6 local/2 external, 4 local/4 external, and 2 local/6 external rollouts per agent per round. The total number of training samples was held constant.
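The setup can be summarized as a small configuration sketch. The field names below are illustrative rather than taken from the authors' code, and the GRPO clipping values are placeholders, since the exact hyperparameters are not restated in this summary; only the swarm size, completions per question, and the four local/external splits come from the text above.

```python
from dataclasses import dataclass

@dataclass
class SwarmExperimentConfig:
    # Swarm composition: 8 identical Qwen2.5 0.5B agents.
    num_agents: int = 8
    model_name: str = "Qwen2.5-0.5B"

    # Rollout generation: 8 completions per ReasoningGYM question.
    completions_per_question: int = 8

    # Local/external split per agent per round; the total is held at 8.
    num_local: int = 4
    num_external: int = 4

    # GRPO settings: asymmetric clipping, no KL penalty.
    # (Clip values here are placeholders, not the paper's numbers.)
    clip_low: float = 0.2
    clip_high: float = 0.28
    kl_coeff: float = 0.0

# The four evaluated configurations.
configs = [
    SwarmExperimentConfig(num_local=l, num_external=e)
    for l, e in [(8, 0), (6, 2), (4, 4), (2, 6)]
]
```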
Figure 2: Average agent rewards for each configuration across training, smoothed with a moving average (window size 100). The 4 local / 4 external configuration consistently outperforms the baseline and, in nearly all rounds, also exceeds the 6 local / 2 external configuration in expected average reward. The 4 local / 4 external configuration also surpasses the 2 local / 6 external setup for most rounds, though the difference is smaller compared to the other cases.
Key findings:
- Balanced experience sharing (4 local / 4 external) yields the highest cumulative reward ($1093.31$), a 94% improvement over the baseline ($561.79$); the arithmetic is spelled out after this list.
- Excessive reliance on external rollouts (2 local/6 external) leads to oscillatory learning and forgetting, attributed to network effects and diminished pool quality.
- The moving average of agent rewards confirms that balanced sharing accelerates learning and generalization, while too much external sampling destabilizes training.
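For reference, the 94% figure follows directly from the two cumulative rewards reported above:

$$\frac{1093.31 - 561.79}{561.79} \approx 0.946 \;(\approx 94\%).$$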
Large-Scale Swarm Training: Open-Source Demo Insights
A large-scale open-source demo involved thousands of nodes contributed by the Gensyn community, running diverse models and hardware. Each node participated in judge-based evaluation rounds using ReasoningGYM tasks. Analysis of peer-identified results revealed:
- For Qwen2.5 (0.5B), swarm-trained models significantly outperformed isolated counterparts after ~175 normalized rounds, as confirmed by statistical testing.
- For higher-capacity models (Qwen3, 0.6B), the performance gap between swarm and isolated training was negligible, suggesting SAPO's benefits are most pronounced for mid-capacity models.
- Uniform random sampling of external rollouts led to overrepresentation of low-reward samples; more sophisticated filtering could further enhance performance (a reward-weighted sampling sketch follows Figure 3).
Figure 3: Shown in red are the regions where the adjusted p-value is greater than 0.05. After a certain number of rounds, in this case approximately 175, the performance per round of the models in the swarm significantly exceeds that of the model trained in isolation.
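One natural refinement of the uniform-sampling issue noted above is to bias selection of external rollouts toward higher-reward entries. The sketch below is not from the paper; it simply illustrates softmax reward-weighted sampling from a shared pool, assuming each pooled rollout has already been scored by the node's local reward model, and the `temperature` parameter is an assumption of this sketch.

```python
import math
import random

def sample_external(pool, rewards, k, temperature=1.0):
    """Sample k external rollouts with probability proportional to
    softmax(reward / temperature), instead of uniformly at random.

    pool     -- list of rollouts received from the swarm
    rewards  -- reward assigned to each pooled rollout by the local reward model
    k        -- number of external rollouts to draw this round (J_n)
    """
    if not pool:
        return []
    # Softmax weighting; subtract the max reward for numerical stability.
    m = max(rewards)
    weights = [math.exp((r - m) / temperature) for r in rewards]
    k = min(k, len(pool))
    # random.choices samples with replacement, which is acceptable for a sketch;
    # a production version would likely sample without replacement.
    return random.choices(pool, weights=weights, k=k)
```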
Practical and Theoretical Implications
SAPO demonstrates that decentralized experience sharing can dramatically improve sample efficiency and task performance in RL post-training for LMs, especially in resource-constrained or heterogeneous environments. The approach sidesteps the synchronization, communication, and infrastructure overheads of centralized distributed RL, making it suitable for edge devices and open collaborative networks.
Theoretical implications include:
- SAPO bridges single-agent RL fine-tuning and structured multi-agent frameworks, capturing collaborative benefits without explicit specialization or orchestration.
- Experience sharing enables rapid propagation of "Aha moments," bootstrapping learning across the swarm.
- The trade-off between local and external rollouts is critical; excessive external sampling can induce instability and forgetting.
Practical considerations:
- SAPO is agnostic to data modality and model architecture, supporting multi-modal and heterogeneous swarms.
- Communication and computational overhead from sharing and re-encoding rollouts is outweighed by collective learning gains.
- Adaptive sampling and filtering strategies are necessary to maintain pool quality and stability in large swarms; a minimal adaptive-mixing heuristic is sketched after this list.
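As a hedged illustration of what such an adaptive strategy could look like (this heuristic is not proposed in the paper), a node might shrink its external share $J_n$ whenever its recent rewards trend downward and restore it as training stabilizes, keeping the total rollout budget fixed.

```python
def adapt_external_share(recent_rewards, num_external, total=8,
                         min_external=0, max_external=6):
    """Heuristic adjustment of the external-rollout count J_n (illustrative only).

    recent_rewards -- the node's average rewards over the last few rounds
    num_external   -- current number of external rollouts per round (J_n)
    total          -- fixed rollout budget per round (I_n + J_n)
    """
    if len(recent_rewards) < 2:
        return num_external
    # If rewards are trending down, lean more heavily on local rollouts;
    # if they are improving, allow more external experience back in.
    if recent_rewards[-1] < recent_rewards[-2]:
        num_external = max(min_external, num_external - 1)
    else:
        num_external = min(max_external, num_external + 1)
    return num_external  # I_n is then total - num_external
```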
Future Directions
The paper identifies several avenues for further research:
- Systematic evaluation of SAPO under greater heterogeneity (specialized tasks, diverse base models, human-in-the-loop policies).
- Development of meta-strategies for adaptive balancing and filtering of local vs. shared rollouts.
- Integration with reward-guided sharing, RLHF, or generative verifiers to enhance stability and robustness.
- Exploration of multi-modal swarms and self-organizing feedback loops in collaborative learning.
Conclusion
SAPO offers a scalable, decentralized framework for RL post-training of LMs, leveraging collective experience sharing to accelerate learning and improve generalization. Controlled and large-scale experiments validate its efficacy, with balanced sharing nearly doubling performance over standard RL fine-tuning. The approach is particularly well-suited for heterogeneous, open networks and resource-constrained environments. Future work should focus on adaptive strategies, heterogeneity, and multi-modal extensions to further realize the potential of collaborative RL in LLM post-training.