INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning (2505.07291v1)

Published 12 May 2025 in cs.LG and cs.DC

Abstract: We introduce INTELLECT-2, the first globally distributed reinforcement learning (RL) training run of a 32 billion parameter LLM. Unlike traditional centralized training efforts, INTELLECT-2 trains a reasoning model using fully asynchronous RL across a dynamic, heterogeneous swarm of permissionless compute contributors. To enable a training run with this unique infrastructure, we built various components from scratch: we introduce PRIME-RL, our training framework purpose-built for distributed asynchronous reinforcement learning, based on top of novel components such as TOPLOC, which verifies rollouts from untrusted inference workers, and SHARDCAST, which efficiently broadcasts policy weights from training nodes to inference workers. Beyond infrastructure components, we propose modifications to the standard GRPO training recipe and data filtering techniques that were crucial to achieve training stability and ensure that our model successfully learned its training objective, thus improving upon QwQ-32B, the state of the art reasoning model in the 32B parameter range. We open-source INTELLECT-2 along with all of our code and data, hoping to encourage and enable more open research in the field of decentralized training.

Summary

  • The paper presents a novel decentralized reinforcement learning framework that trains a 32B-parameter model using a globally distributed, heterogeneous compute network.
  • It introduces key open-source components—prime-rl, shardcast, and toploc—that decouple inference, training, and validation for efficient, asynchronous operation.
  • Experimental results demonstrate improved task rewards on mathematics and coding benchmarks, underscoring the approach's potential for scalable, inference-heavy models.

This paper introduces INTELLECT-2 (2505.07291), the first effort to train a 32-billion-parameter LLM using globally decentralized reinforcement learning (RL). Unlike traditional centralized training requiring large, co-located GPU clusters, INTELLECT-2 utilizes a permissionless, globally distributed network of heterogeneous compute contributors. This approach is presented as a paradigm shift, highlighting RL's inherent suitability for asynchronous, decentralized infrastructure, particularly for enabling test-time compute scaling in LLMs.

The core contribution lies in the development and integration of several novel, open-source infrastructure components:

  1. prime-rl: A new framework specifically designed for distributed asynchronous reinforcement learning. It achieves efficiency by decoupling rollout generation (inference), model training, and weight broadcasting into distinct, asynchronously communicating components. This separation allows trusted centralized nodes to handle training while untrusted decentralized nodes perform inference rollouts, hiding latency associated with data transfers and eliminating the need for centralized orchestrators like Ray [moritz2018raydistributedframeworkemerging]. The framework implements the GRPO algorithm with auxiliary KL and entropy losses. A toy sketch of this decoupled, asynchronous flow is given after this list.
  2. shardcast: A library for efficiently distributing large model checkpoints to decentralized inference workers. It uses an HTTP-based tree-topology network with relay servers, similar to a CDN. Key features include sharding and pipelining checkpoint files to enable early downloads, rate limiting and dynamic firewall rules for security, probabilistic load balancing based on estimated throughput to maximize client download speed, and SHA-256 checksum verification to ensure the integrity of the assembled model weights on inference nodes; a minimal sketch of this integrity check also follows the list.
  3. toploc: A verifiable inference mechanism that ensures untrusted inference workers perform computations correctly without requiring a trusted environment. It uses a locality-sensitive hashing scheme to generate cryptographic commitments for final hidden states during decoding. Trusted validator nodes can reconstruct these activations using prefill and compare them to submitted commitments significantly faster than the original inference time (up to 100x speedup). toploc [toploc] is robust to GPU non-determinism and different tensor parallel configurations and can detect the use of incorrect or quantized models.
  4. Prime Intellect Protocol: A decentralized orchestration layer built in Rust to coordinate the permissionless compute nodes. It manages node registration, health checks via heartbeats, task scheduling (pull-based), and integrates with toploc for inference validation. Information about training runs, ownership, and worker contributions is stored on a decentralized ledger. The system architecture involves a discovery service (Redis-based) and an orchestrator (Kubernetes-hosted), though the authors note these components are currently centralized and plan to move towards a fully peer-to-peer DHT architecture in the future.
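As a concrete (if heavily simplified) illustration of the decoupled flow described for prime-rl above, the toy Python loop below shows two-step asynchrony: at training step t, the trainer consumes rollouts generated with the policy from step t − 2. All function names here are placeholders for illustration, not the actual prime-rl API.

```python
# Toy illustration of two-step asynchronous RL (placeholder names, not the prime-rl API).
# At training step t, the trainer updates on rollouts that inference workers generated
# with the policy weights broadcast at step t - ASYNC_DELAY.

ASYNC_DELAY = 2  # the main INTELLECT-2 run used two-step asynchrony


def generate_rollouts(policy_version: int) -> dict:
    # stand-in for decentralized inference workers (toploc-verified in the real system)
    return {"policy_version": policy_version, "num_samples": 8}


def train_step(step: int, batch: dict) -> None:
    # stand-in for the GRPO update performed on trusted training nodes
    print(f"train step {step}: rollouts from policy v{batch['policy_version']}")


for step in range(6):
    # inference workers always use the newest weights they have received via the
    # broadcast layer, which at step t are the weights from step t - ASYNC_DELAY
    batch = generate_rollouts(max(step - ASYNC_DELAY, 0))
    train_step(step, batch)
```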

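On the receiving side, the integrity check described for shardcast reduces to reassembling the downloaded shards and comparing their SHA-256 digest against the published one. A minimal sketch using only the Python standard library; the file layout and function names are assumptions, not the actual shardcast interface.

```python
import hashlib
from pathlib import Path


def assemble_and_verify(shard_paths, expected_sha256, out_path="checkpoint.bin"):
    """Reassemble downloaded checkpoint shards in order and verify their SHA-256 digest.

    shard_paths: paths of the shard files in their correct order (the real system
    pipelines these downloads so they can start before the full checkpoint exists).
    """
    digest = hashlib.sha256()
    with open(out_path, "wb") as out:
        for shard in shard_paths:
            data = Path(shard).read_bytes()
            digest.update(data)
            out.write(data)
    if digest.hexdigest() != expected_sha256:
        raise ValueError("assembled checkpoint failed its integrity check")
    return out_path
```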
The training recipe for INTELLECT-2 builds upon QwQ-32B [qwq32b] and the GRPO algorithm, adapting it for the decentralized setting. Key aspects of the training recipe include:

  • Training Data & Rewards: Uses a curated dataset of 285k mathematics and coding tasks from sources like NuminaMath-1.5 [numina_math_datasets], Deepscaler [deepscaler2025], and SYNTHETIC-1 [synthetic1release2025]. A dual objective is used: binary task rewards (1 for correct, 0 for incorrect) and length rewards that teach the model to adhere to a specified thinking budget [aggarwal2025l1controllinglongreasoning]. The total reward is calculated as $r_{\text{total}}(y, l_{\text{target}}) = r_{\text{task}}(y) - \alpha \cdot |l_{\text{target}} - l_y|$ (see the sketch after this list). Target lengths are sampled from a discrete set (e.g., {1000, 2000, 3000, 4000} or {2000, 4000, 6000, 8000, 10000}).
  • Asynchronous RL: Rollouts are collected using policy weights from steps prior to the current training step (two-step asynchrony in the main run). Ablation studies showed that asynchronous training with up to four steps of delay did not significantly hurt performance compared to synchronous baselines [huang2023cleanbareproducibleefficientdistributed, noukhovitch2025asynchronousrlhffasterefficient].
  • Offline & Online Data Filtering: The training dataset is pre-filtered based on the base model's performance to remove tasks that are too easy or too difficult, significantly improving training signal. Online filtering ensures that training batches contain samples with non-zero advantages by continuing to sample rollouts until such a batch is formed.
  • Two-Sided GRPO Clipping: To mitigate training instabilities, especially in larger models, the standard one-sided GRPO clipping is modified. An upper bound $\delta$ is introduced for the token probability ratio $\frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$ even in the case of negative advantages, preventing excessively large gradient updates when moving away from bad rollouts (see the sketch after this list).
  • Mitigating Instability at Scale: The paper identifies and discusses several sources of instability in large-scale RL, including escalating gradient norms, increasing token probability clip ratios, and a characteristic entropy loss pattern (initial decrease followed by increase). Aggressive gradient clipping (thresholds 0.05-0.1) was found effective in mitigating gradient norm escalation and delaying instability [wortsman2023smallscaleproxieslargescaletransformer, chowdhery2022palmscalinglanguagemodeling, molybog2023theoryadaminstabilitylargescale, cohen2024adaptivegradientmethodsedge]. The use of torch.compile was also found to cause instabilities and was disabled.
  • Sequence Packing: To improve efficiency with long sequence lengths (32K max), sequence packing was implemented by adapting the attention mask and collating samples. This is crucial for RL, where complete samples are required, unlike in pretraining; a minimal packing sketch is also included after this list.
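To make the length-penalized reward and the two-sided clipping above concrete, the sketch below combines them into a GRPO-style objective. It follows the dual-clip PPO construction, which fits the description of bounding the ratio for negative advantages, though the paper's exact formulation may differ; the hyperparameter values (alpha, eps, delta) are illustrative, not taken from the paper, and per-token quantities are simplified to per-sequence scalars.

```python
import torch


def total_reward(task_reward, target_len, actual_len, alpha=1e-3):
    # r_total(y, l_target) = r_task(y) - alpha * |l_target - l_y|
    # alpha here is an illustrative value, not the one used in the paper
    return task_reward - alpha * abs(target_len - actual_len)


def grpo_loss_two_sided(logp_new, logp_old, advantages, eps=0.2, delta=4.0):
    """Clipped surrogate with an extra upper bound on the ratio for negative advantages.

    Mirrors the dual-clip PPO idea, matching the description of two-sided clipping above.
    """
    ratio = torch.exp(logp_new - logp_old)
    standard = torch.minimum(ratio * advantages,
                             torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
    # for A < 0, bound the objective below by delta * A so very large ratios
    # cannot produce excessively large updates when moving away from bad rollouts
    bounded = torch.maximum(standard, delta * advantages)
    surrogate = torch.where(advantages < 0, bounded, standard)
    return -surrogate.mean()


# toy usage: one group of four rollouts for a single prompt, target length 4000
rewards = torch.tensor([total_reward(r, 4000, l)
                        for r, l in [(1.0, 4100), (0.0, 3900), (1.0, 5200), (0.0, 4600)]])
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # group-relative advantages
logp_old = torch.zeros(4)
logp_new = 0.1 * torch.randn(4)
print(grpo_loss_two_sided(logp_new, logp_old, advantages))
```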

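For the sequence-packing point above, a generic way to pack complete samples into a single row is to concatenate them, build a block-diagonal causal attention mask, and reset position ids per sample; prime-rl's actual implementation (e.g., variable-length attention kernels) may differ. A minimal PyTorch sketch:

```python
import torch


def pack_sequences(samples):
    """Pack complete rollouts into a single row with a block-diagonal causal mask.

    samples: list of 1-D LongTensors of token ids; each is a full sample, since
    RL training needs whole rollouts rather than arbitrary chunks.
    """
    input_ids = torch.cat(samples)                      # [total_len]
    total_len = input_ids.numel()

    # segment id per token, e.g. [0, 0, 0, 1, 1, 2, 2, 2]
    seg_ids = torch.cat([torch.full((len(s),), i) for i, s in enumerate(samples)])

    causal = torch.tril(torch.ones(total_len, total_len)).bool()
    same_segment = seg_ids.unsqueeze(0) == seg_ids.unsqueeze(1)
    attention_mask = causal & same_segment              # no attention across samples

    # positions restart at zero for every packed sample
    position_ids = torch.cat([torch.arange(len(s)) for s in samples])
    return input_ids, attention_mask, position_ids


ids, mask, pos = pack_sequences([torch.tensor([5, 6, 7]), torch.tensor([8, 9])])
```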
Experimental results demonstrate the feasibility and performance of the decentralized training setup. The infrastructure components successfully overlapped communication and computation. In the experiments, the shardcast broadcast of 62GB weights averaged 14 minutes (590 Mb/s), toploc validation typically completed within 1 minute, and a batch of verified samples was available within 22-29 minutes after broadcast, ensuring minimal GPU idle time. The ratio of training to inference FLOPs averaged 1:4.5, confirming inference is the dominant compute cost.
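As a quick sanity check on the reported broadcast rate (assuming decimal gigabytes and the stated 14-minute average):

```python
checkpoint_gb = 62   # broadcast checkpoint size (decimal GB assumed)
minutes = 14         # average shardcast broadcast time
print(f"{checkpoint_gb * 8 * 1000 / (minutes * 60):.0f} Mb/s")  # ≈ 590 Mb/s, matching the figure above
```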

INTELLECT-2 showed significant improvement in task rewards on mathematics and coding problems during training. While length penalties also decreased, the model did not fully learn to adhere to the precise thinking budget within the experimental timeframe. Benchmarking against QwQ-32B and other models [deepseekai2025deepseekr1incentivizingreasoningcapability] showed that INTELLECT-2 improved performance on math and coding benchmarks (AIME24, AIME25, LiveCodeBench) while experiencing a slight drop on IFEval, likely due to focused training on math and coding tasks.

The authors discuss the implications of this decentralized approach for the emerging test-time compute scaling paradigm, arguing that asynchronous RL's ability to hide communication and the shift towards inference-heavy workloads make it particularly well-suited for global, heterogeneous compute. Future work directions include exploring RL recipes that increase the inference-to-training compute ratio (e.g., VinePPO [kazemnejad2024vineppounlockingrlpotential]), integrating tool calls, crowdsourcing RL tasks and environments, and applying model merging techniques like DiLoCo [diloco] to scale training further. The trained model, tasks, verifier environments, and the prime-rl framework are open-sourced to foster further research in decentralized AI training.
