- The paper presents a novel decentralized reinforcement learning framework that trains a 32B-parameter model using a globally distributed, heterogeneous compute network.
- It introduces key open-source components—prime-rl, shardcast, and toploc—that decouple inference, training, and validation for efficient, asynchronous operation.
- Experimental results demonstrate improved task rewards on mathematics and coding benchmarks, underscoring the approach's potential for scaling inference-heavy RL training of reasoning models on decentralized compute.
This paper introduces INTELLECT-2 (2505.07291), the first effort to train a 32-billion-parameter LLM using globally decentralized reinforcement learning (RL). Unlike traditional centralized training requiring large, co-located GPU clusters, INTELLECT-2 utilizes a permissionless, globally distributed network of heterogeneous compute contributors. This approach is presented as a paradigm shift, highlighting RL's inherent suitability for asynchronous, decentralized infrastructure, particularly for enabling test-time compute scaling in LLMs.
The core contribution lies in the development and integration of several novel, open-source infrastructure components:
- prime-rl: A new framework specifically designed for distributed asynchronous reinforcement learning. It achieves efficiency by decoupling rollout generation (inference), model training, and weight broadcasting into distinct, asynchronously communicating components. This separation allows trusted centralized nodes to handle training while untrusted decentralized nodes perform inference rollouts, hiding latency associated with data transfers and eliminating the need for centralized orchestrators like Ray [moritz2018raydistributedframeworkemerging]. The framework implements the GRPO algorithm with auxiliary KL and entropy losses (a minimal async-loop sketch appears after this list).
- shardcast: A library for efficiently distributing large model checkpoints to decentralized inference workers. It uses an HTTP-based tree-topology network with relay servers, similar to a CDN. Key features include sharding and pipelining checkpoint files to enable early downloads, rate limiting and dynamic firewall rules for security, probabilistic load balancing based on estimated throughput to maximize client download speed, and SHA-256 checksum verification to ensure the integrity of assembled model weights on inference nodes (a checksum-verification sketch appears after this list).
- toploc: A verifiable inference mechanism that ensures untrusted inference workers perform computations correctly without requiring a trusted environment. It uses a locality-sensitive hashing scheme to generate cryptographic commitments for final hidden states during decoding. Trusted validator nodes can reconstruct these activations using prefill and compare them to submitted commitments significantly faster than the original inference time (up to 100x speedup). toploc [toploc] is robust to GPU non-determinism and different tensor parallel configurations and can detect the use of incorrect or quantized models (a simplified commit-and-verify sketch follows this list).
- Prime Intellect Protocol: A decentralized orchestration layer built in Rust to coordinate the permissionless compute nodes. It manages node registration, health checks via heartbeats, task scheduling (pull-based), and integrates with toploc for inference validation. Information about training runs, ownership, and worker contributions is stored on a decentralized ledger. The system architecture involves a discovery service (Redis-based) and an orchestrator (Kubernetes-hosted), though the authors note these components are currently centralized and plan to move towards a fully peer-to-peer DHT architecture in the future.
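To make the prime-rl decoupling concrete, below is a minimal, hypothetical sketch of the asynchronous loop: an inference worker keeps generating rollouts from whatever (possibly slightly stale) policy version it last received, while the trainer consumes rollouts and periodically publishes new weights. All names here (the queue, the dict standing in for a broadcast checkpoint, the placeholder generation) are illustrative and not prime-rl APIs.

```python
import queue
import threading
import time

rollout_queue = queue.Queue(maxsize=64)   # rollouts flow from inference to training
latest_weights = {"version": 0}           # stand-in for a shardcast-broadcast checkpoint
stop = threading.Event()

def inference_worker():
    """Untrusted inference node: generate rollouts with the latest weights it has."""
    while not stop.is_set():
        version = latest_weights["version"]            # may lag the trainer by a few steps
        rollout = f"rollout-from-policy-v{version}"    # placeholder for actual generation
        rollout_queue.put((version, rollout))
        time.sleep(0.01)

def trainer(num_steps: int = 50):
    """Trusted training node: consume rollouts, update the policy, publish weights."""
    for step in range(1, num_steps + 1):
        version, rollout = rollout_queue.get()         # rollout may come from an older policy
        # ... compute the GRPO loss on `rollout` and update the policy here ...
        latest_weights["version"] = step               # stand-in for broadcasting new weights
    stop.set()

threading.Thread(target=inference_worker, daemon=True).start()
trainer()
```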
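The integrity check on inference nodes can be illustrated with the following sketch, which assembles downloaded shards in order and verifies a SHA-256 checksum before the checkpoint is used. The `model.shard.*` naming scheme and the source of `expected_sha256` are assumptions; the real shardcast additionally handles pipelined downloads, rate limiting, and load balancing.

```python
import hashlib
from pathlib import Path

def assemble_and_verify(shard_dir: str, out_path: str, expected_sha256: str) -> bool:
    """Concatenate downloaded shards in order and verify the assembled checkpoint.

    Hypothetical sketch: the shard naming and the checksum source are assumptions,
    not the actual shardcast on-disk format.
    """
    shards = sorted(Path(shard_dir).glob("model.shard.*"))
    digest = hashlib.sha256()
    with open(out_path, "wb") as out:
        for shard in shards:
            data = shard.read_bytes()
            digest.update(data)                       # hash while assembling
            out.write(data)
    if digest.hexdigest() != expected_sha256:
        Path(out_path).unlink(missing_ok=True)        # discard a corrupted checkpoint
        return False
    return True
```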
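The commit-and-verify flow behind toploc can be caricatured as follows. This is a deliberately simplified stand-in: it hashes rounded final hidden states exactly, whereas toploc uses a locality-sensitive scheme precisely so that GPU non-determinism and different tensor-parallel layouts do not break verification; the model call returning final hidden states is also an assumption.

```python
import hashlib
import torch

def commit(final_hidden_states: torch.Tensor, decimals: int = 1) -> str:
    """Worker side: commit to the final hidden states produced during decoding.
    (Exact hashing of rounded values; toploc's real scheme is locality-sensitive.)"""
    rounded = torch.round(final_hidden_states.float(), decimals=decimals)
    return hashlib.sha256(rounded.cpu().numpy().tobytes()).hexdigest()

def validate(model, token_ids: torch.Tensor, submitted_commitment: str) -> bool:
    """Validator side: a single teacher-forced prefill pass over the submitted tokens
    is far cheaper than re-running autoregressive decoding token by token."""
    with torch.no_grad():
        hidden = model(token_ids)   # assumed to return final hidden states
    return commit(hidden) == submitted_commitment
```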
The training recipe for INTELLECT-2 builds upon QwQ-32B [qwq32b] and the GRPO algorithm, adapting it for the decentralized setting. Key aspects of the training recipe include:
- Training Data & Rewards: Uses a curated dataset of 285k mathematics and coding tasks from sources like NuminaMath-1.5 [numina_math_datasets], Deepscaler [deepscaler2025], and SYNTHETIC-1 [synthetic1release2025]. A dual objective is used: binary task rewards (1 for correct, 0 for incorrect) and length rewards to teach the model to adhere to a specified thinking budget [aggarwal2025l1controllinglongreasoning]. The total reward is calculated as rtotal​(y,ltarget​)=rtask​(y)−α∗∣ltarget​−ly​∣. Target lengths are sampled from a discrete set (e.g., {1000, 2000, 3000, 4000} or {2000, 4000, 6000, 8000, 10000}).
- Asynchronous RL: Rollouts are collected using policy weights from steps prior to the current training step (two-step asynchrony in the main run). Ablation studies showed that asynchronous training with up to four steps of delay did not significantly hurt performance compared to synchronous baselines [huang2023cleanbareproducibleefficientdistributed, noukhovitch2025asynchronousrlhffasterefficient].
- Offline & Online Data Filtering: The training dataset is pre-filtered based on the base model's performance to remove tasks that are too easy or too difficult, significantly improving training signal. Online filtering ensures that training batches contain samples with non-zero advantages by continuing to sample rollouts until such a batch is formed.
- Two-Sided GRPO Clipping: To mitigate training instabilities, especially in larger models, the standard one-sided GRPO clipping is modified. An upper bound δ is introduced for the token probability ratio $\frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$ even in the case of negative advantages, preventing excessively large gradient updates when moving away from bad rollouts (a loss-function sketch follows this list).
- Mitigating Instability at Scale: The paper identifies and discusses several sources of instability in large-scale RL, including escalating gradient norms, increasing token probability clip ratios, and a characteristic entropy loss pattern (initial decrease followed by increase). Aggressive gradient clipping (thresholds of 0.05-0.1) was found effective in mitigating gradient norm escalation and delaying instability [wortsman2023smallscaleproxieslargescaletransformer, chowdhery2022palmscalinglanguagemodeling, molybog2023theoryadaminstabilitylargescale, cohen2024adaptivegradientmethodsedge]; a minimal clipping sketch follows this list. The use of `torch.compile` was also found to cause instabilities and was disabled.
- Sequence Packing: To improve efficiency with long sequence lengths (32K max), sequence packing was implemented by adapting the attention mask and collating samples so that multiple rollouts share one training sequence without attending to each other (illustrated after this list). This is crucial for RL, where complete samples are required, unlike pretraining, where documents can be truncated.
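To make the dual reward objective concrete, the small sketch below combines the binary task reward with the length penalty; the α value and the numbers in the example are illustrative, not the paper's hyperparameters.

```python
def total_reward(task_correct: bool, target_length: int, actual_length: int,
                 alpha: float = 1e-3) -> float:
    """r_total(y, l_target) = r_task(y) - alpha * |l_target - l_y|.
    `alpha` here is an illustrative value, not the paper's coefficient."""
    r_task = 1.0 if task_correct else 0.0
    return r_task - alpha * abs(target_length - actual_length)

# A correct answer that overshoots a 2000-token thinking budget by 500 tokens:
print(total_reward(task_correct=True, target_length=2000, actual_length=2500))  # 0.5
```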
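The offline and online filtering can be summarized as below; the pass-rate thresholds and helper names are assumptions for illustration, and a group of rollouts has non-zero GRPO advantages exactly when its rewards are not all identical.

```python
import statistics
from typing import Callable, Iterable, Optional

def offline_filter(tasks: Iterable[dict], pass_rate: Callable[[dict], float],
                   low: float = 0.1, high: float = 0.9) -> list:
    """Keep tasks the base model solves sometimes but not always (thresholds illustrative)."""
    return [task for task in tasks if low <= pass_rate(task) <= high]

def sample_group_with_signal(sample_rollout_rewards: Callable[[], list],
                             max_tries: int = 10) -> Optional[list]:
    """Online filter: resample rollout groups until the rewards are not all identical,
    so the group-relative advantages (reward minus group mean) are non-zero."""
    for _ in range(max_tries):
        rewards = sample_rollout_rewards()
        if statistics.pstdev(rewards) > 0:
            return rewards
    return None
```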
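The two-sided clipping can be sketched as a small modification of the usual clipped surrogate: cap the probability ratio at δ so that tokens with negative advantage cannot drive arbitrarily large updates. `eps` and `delta` below are illustrative values, not the paper's settings; the cap only changes behavior for negative advantages as long as δ > 1 + eps.

```python
import torch

def two_sided_grpo_token_loss(logprobs: torch.Tensor, old_logprobs: torch.Tensor,
                              advantages: torch.Tensor,
                              eps: float = 0.2, delta: float = 4.0) -> torch.Tensor:
    """Clipped surrogate with an extra upper bound `delta` on the probability ratio."""
    ratio = torch.exp(logprobs - old_logprobs)
    # The additional two-sided clip: capping the ratio at `delta` bounds the update for
    # negative-advantage tokens; for positive advantages the standard clip at 1 + eps
    # already binds first (assuming delta > 1 + eps).
    unclipped = torch.clamp(ratio, max=delta) * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```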
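The aggressive gradient clipping amounts to a standard norm clip with an unusually small threshold; the toy model, optimizer, and dummy loss below are placeholders.

```python
import torch

model = torch.nn.Linear(16, 16)        # placeholder for the policy model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def training_step(loss: torch.Tensor, max_grad_norm: float = 0.1) -> float:
    """One optimizer step with a small clipping threshold (0.05-0.1 in the paper)."""
    optimizer.zero_grad()
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return float(grad_norm)            # worth logging: escalating norms signal instability

loss = model(torch.randn(4, 16)).pow(2).mean()   # dummy loss for illustration
training_step(loss)
```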
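Sequence packing can be illustrated with a block-diagonal causal mask that keeps packed samples from attending to one another. This dense-mask version is purely illustrative; efficient implementations typically pass cumulative sequence lengths to variable-length attention kernels instead.

```python
import torch

def packed_attention_mask(sample_lengths: list) -> torch.Tensor:
    """Block-diagonal causal mask for several full samples packed into one sequence."""
    total = sum(sample_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for length in sample_lengths:
        end = start + length
        # causal attention restricted to positions within the same sample
        mask[start:end, start:end] = torch.tril(torch.ones(length, length)).bool()
        start = end
    return mask

# Two complete RL samples of lengths 3 and 2 packed into one training sequence:
print(packed_attention_mask([3, 2]).int())
```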
Experimental results demonstrate the feasibility and performance of the decentralized training setup. The infrastructure components successfully overlapped communication and computation. In the experiments, the shardcast broadcast of 62GB weights averaged 14 minutes (590 Mb/s), toploc validation typically completed within 1 minute, and a batch of verified samples was available within 22-29 minutes after broadcast, ensuring minimal GPU idle time. The ratio of training to inference FLOPs averaged 1:4.5, confirming inference is the dominant compute cost.
INTELLECT-2 showed significant improvement in task rewards on mathematics and coding problems during training. While length penalties also decreased, the model did not fully learn to adhere to the precise thinking budget within the experimental timeframe. Benchmarking against QwQ-32B and other models [deepseekai2025deepseekr1incentivizingreasoningcapability] showed that INTELLECT-2 improved performance on math and coding benchmarks (AIME24, AIME25, LiveCodeBench) while experiencing a slight drop on IFEval, likely due to focused training on math and coding tasks.
The authors discuss the implications of this decentralized approach for the emerging test-time compute scaling paradigm, arguing that asynchronous RL's ability to hide communication and the shift towards inference-heavy workloads make it particularly well-suited for global, heterogeneous compute. Future work directions include exploring RL recipes that increase the inference-to-training compute ratio (e.g., VinePPO [kazemnejad2024vineppounlockingrlpotential]), integrating tool calls, crowdsourcing RL tasks and environments, and applying model merging techniques like DiLoCo [diloco] to scale training further. The trained model, tasks, verifier environments, and the prime-rl framework are open-sourced to foster further research in decentralized AI training.