DARS-B: Depth-Adaptive & Breadth Scaling in RL
- DARS-B is a reinforcement learning methodology that combines difficulty-adaptive sampling with large-breadth training to improve reasoning in LLMs.
- The method allocates rollouts based on problem difficulty via Equal-Treatment (ET) and Hardness-Weighted (HW) schemes, yielding consistent gains on both Pass@K and Pass@1.
- Full-batch ("large-breadth") gradient updates in DARS-B provide training stability and sustain token-level entropy, while depth-adaptive sampling corrects the cumulative-advantage bias of standard RLVR.
DARS-B is a reinforcement learning methodology developed for the RLVR (Reinforcement Learning with Verifiable Reward) paradigm, specifically targeting the expansion of reasoning capacity in LLMs. Building on the Difficulty Adaptive Rollout Sampling (DARS) technique, DARS-B integrates both depth-adaptive exploration (effective allocation of rollouts to hard problems) and large-breadth training (simultaneously optimizing over many problem instances per iteration), resulting in simultaneous gains on Pass@K and Pass@1 metrics. This section provides a comprehensive technical exposition of the principles, algorithms, empirical outcomes, and implications of DARS-B within the context of advanced LLM training (Yang et al., 19 Aug 2025).
1. Foundation: RLVR and the Cumulative Advantage Bias
RLVR is structured around the verifiable reward paradigm, allowing direct feedback to be incorporated into reinforcement learning for tasks requiring multi-step reasoning. Classical RLVR variants such as GRPO suffer from a training bias caused by their cumulative-advantage mechanism: rollouts are disproportionately weighted toward medium-difficulty samples, under-emphasizing low-accuracy (hard) instances that are most significant for challenging reasoning. This problem restricts the ability of LLMs to demonstrate substantial improvements on the hardest queries encountered during training.
The cumulative advantage for a query is defined as the aggregate of positive rollouts, and its allocation across the training batch directly impacts how gradient magnitude and exploration are distributed. DARS-B is designed to remove this depth-neglect and promote a more balanced gradient landscape.
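To make the biased quantity concrete, the sketch below computes a GRPO-style group-normalized advantage for one query's rollouts and sums its positive part. The function name, the standard-deviation normalization, and the toy reward vectors are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def cumulative_positive_advantage(rewards: np.ndarray, eps: float = 1e-6) -> float:
    """Sum of positive group-normalized advantages over one query's rollouts.

    `rewards` holds the verifiable 0/1 outcomes for a single query. Advantages
    are group-normalized (GRPO-style); the positive part is then aggregated.
    This is an illustrative stand-in, not the paper's exact formula.
    """
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    return float(np.clip(advantages, 0.0, None).sum())

# Medium-difficulty queries (mixed outcomes) accumulate the largest positive
# advantage mass; very hard queries (few successes) contribute far less.
print(cumulative_positive_advantage(np.array([1, 1, 0, 0, 1, 0, 1, 0], dtype=float)))  # ~4.0
print(cumulative_positive_advantage(np.array([1, 0, 0, 0, 0, 0, 0, 0], dtype=float)))  # ~2.6
```

The example illustrates why cumulative advantage concentrates on medium-difficulty samples: with 0/1 rewards, the aggregated positive advantage peaks near 50% accuracy and shrinks toward the hardest queries.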
2. Difficulty-Adaptive Rollout Sampling (DARS)
DARS is the precursor to DARS-B; its key innovation is targeted multi-stage sampling:
- For each training query $q_i$, an initial light pass of $n$ rollouts is performed, estimating the empirical accuracy $\hat{p}_i = \frac{1}{n}\sum_{j=1}^{n} r_{i,j}$, where $r_{i,j} \in \{0,1\}$ marks success/failure for the $j$-th rollout on $q_i$.
- Difficulty is encoded as $d_i = 1 - \hat{p}_i$, so hard problems (low accuracy) yield high $d_i$.
- Rollout allocation is then adaptively rebalanced in one of two ways:
- Equal-Treatment (ET): assigns each problem the cumulative advantage of a median-accuracy sample and computes, for each query $q_i$, the extra rollouts needed to reach that target, using a scaling function $f(\cdot)$.
- Hardness-Weighted (HW): allocates extra rollouts in proportion to the difficulty $d_i$, with analogous allocation formulas.
- The scaling function $f$ takes one of two forms, for variance- or non-variance-based advantage (see eqn. 4 in (Yang et al., 19 Aug 2025)). A simplified allocation sketch follows this list.
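As referenced above, here is a minimal sketch of the two-stage allocation logic. The function name, the median-shortfall rule for ET, and the difficulty-proportional split for HW are simplified assumptions; the paper's exact allocation formulas and scaling function are not reproduced.

```python
import numpy as np

def allocate_extra_rollouts(acc: np.ndarray, budget: int, scheme: str = "ET") -> np.ndarray:
    """Second-stage rollout allocation given first-stage empirical accuracies.

    acc    : empirical accuracy p_hat_i of each query from the light first stage
    budget : total number of extra rollouts to distribute
    scheme : "ET" tops queries up toward a median-accuracy target; "HW" splits
             the budget in proportion to difficulty d_i = 1 - p_hat_i.
             Both rules are simplified stand-ins for the paper's formulas.
    """
    difficulty = 1.0 - acc
    if scheme == "HW":
        weights = difficulty
    else:  # ET: weight by how far each query falls short of the median sample
        weights = np.clip(np.median(acc) - acc, 0.0, None)
    if weights.sum() == 0:
        return np.zeros_like(acc, dtype=int)
    shares = weights / weights.sum()
    return np.floor(shares * budget).astype(int)

# Hard queries (low accuracy) receive most of the extra rollouts.
acc = np.array([0.9, 0.5, 0.1, 0.0])
print(allocate_extra_rollouts(acc, budget=64, scheme="ET"))  # [ 0  0 25 38]
print(allocate_extra_rollouts(acc, budget=64, scheme="HW"))  # [ 2 12 23 25]
```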
DARS ensures increased positive rollouts for hard problems, empirically lifting the Pass@K metric (the probability that at least one of $K$ rollouts is correct) without extra inference cost at convergence. Naive increases to rollout size accelerate convergence but can degrade overall Pass@K by flattening exploration.
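For reference, the conventional unbiased Pass@K estimator (popularized in code-generation evaluation) computes, from n sampled rollouts with c correct, the probability that a random size-K subset contains at least one success; whether the paper uses exactly this estimator is an assumption here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate: probability that at least one of K rollouts
    drawn without replacement from n samples (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # every size-K subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=128, c=3, k=32))  # ~0.58 even though single-shot accuracy is ~2.3%
```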
3. Breadth Scaling and the DARS-B Algorithm
DARS-B augments DARS with aggressive breadth scaling:
- Instead of standard PPO mini-batch optimization, DARS-B switches to full-batch gradient descent over multiple epochs, leveraging large-batch training advantages.
- Batch size is maximized per iteration, increasing the number of queries and rollouts included in each update, which further regularizes gradients and maintains token-level entropy during learning.
- Algorithmically, DARS-B executes two-phase adaptive sampling (as described in DARS) for depth and switches the optimizer to process the entire generated batch at once for breadth.
- Full-batch updates drive training stability, support high-entropy exploration, and mitigate gradient noise.
This dual adaptation yields simultaneous improvement in first-pass accuracy (Pass@1) and multi-shot accuracy (Pass@K); a minimal sketch of the full-batch update follows.
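The sketch below isolates the breadth component: a PPO-style clipped surrogate is applied once per epoch to the entire generated batch rather than to mini-batches. The loss helper, the linear stand-in for the policy's per-token log-prob head, and all tensor shapes are illustrative assumptions, not the paper's implementation.

```python
import torch

def clipped_pg_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO-style clipped policy-gradient surrogate on per-token log-probs."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()

def full_batch_update(policy, optimizer, feats, logp_old, advantages, epochs: int = 2):
    """Breadth-scaled update: one gradient step per epoch over the *entire*
    generated batch (all queries and rollouts), instead of PPO mini-batches."""
    for _ in range(epochs):
        optimizer.zero_grad()
        logp_new = policy(feats)  # per-token log-probs (stand-in)
        clipped_pg_loss(logp_new, logp_old, advantages).backward()
        optimizer.step()

# Toy usage: a linear layer stands in for the policy's per-token log-prob head.
torch.manual_seed(0)
feats = torch.randn(512, 16)          # flattened (rollout, token) features
head = torch.nn.Linear(16, 1)
policy = lambda x: head(x).squeeze(-1)
logp_old = policy(feats).detach()
advantages = torch.randn(512)
optimizer = torch.optim.SGD(head.parameters(), lr=1e-2)
full_batch_update(policy, optimizer, feats, logp_old, advantages)
```

A mini-batch variant would wrap the inner step in a loop over splits of `feats`; the full-batch form trades that extra stochasticity for lower-noise gradients over many queries at once.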
4. Empirical Performance
Experiments using Qwen2.5-Math (1.5B and 7B) quantitatively demonstrate the efficacy of DARS-B:
- Pass@K gains are persistent and do not incur additional inference cost at convergence. While naive approaches to rollout or batch size scaling independently offer limited improvements (either in Pass@K or Pass@1 but not both), DARS-B delivers consistent upgrades in both metrics.
- The summarized metrics highlight strong average accuracy gains (Avg@128) and stability in Pass@128.
- Training dynamics reveal that large-breadth training maintains high entropy regularization (measured at the token-level), preempting premature convergence and supporting robust exploration even as depth-adaptive sampling boosts the capacity to solve hard problems.
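The entropy referred to above is the average per-token entropy of the policy's next-token distribution. The helper below shows one standard way to compute it from logits; the function name, masking convention, and toy tensors are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average per-token entropy of the next-token distribution.

    logits : (batch, seq_len, vocab) raw scores from the policy
    mask   : (batch, seq_len) 1 for generated response tokens, 0 for prompt/padding
    """
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)  # (batch, seq_len)
    return (entropy * mask).sum() / mask.sum().clamp(min=1)

# Toy check: flatter (higher-temperature) logits yield higher token entropy.
logits = torch.randn(2, 5, 100)
mask = torch.ones(2, 5)
print(mean_token_entropy(logits, mask).item())
print(mean_token_entropy(logits / 4.0, mask).item())  # flatter -> larger entropy
```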
5. Depth-Breadth Orthogonality and Exploratory Analysis
Findings validate that depth (ability to train on high-difficulty samples) and breadth (processing many instances per update to regularize training and increase ensemble diversity) operate orthogonally—each dimension introduces distinct improvements:
- Depth-adaptive sampling is necessary to overcome the bias toward medium-difficulty samples and is critical for maximizing Pass@K.
- Breadth scaling sustains global exploration, higher entropy, and improved Pass@1 without undermining the depth adaptation.
- DARS-B achieves a synergy between these dimensions; neither naive depth nor naive breadth scaling alone can realize optimal results.
6. Algorithmic Details and Formulae
The method is instantiated via:
- Pre-rollout difficulty assessment and targeted allocation of additional rollouts per question.
- Full-batch updates aggregated over all queries and trajectories per PPO epoch.
- Dynamic schedule selection (ET or HW) for multi-stage rebalancing, with the difficulty $d_i$ computed per question.
Relevant formulae include the empirical accuracy, difficulty score, scaling function, and cumulative advantage-based rollout reallocation.
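Putting the pieces together, the outline below sketches a single DARS-B iteration end to end: a light pre-rollout pass for difficulty estimation, adaptive reallocation (reusing the `allocate_extra_rollouts` sketch from Section 2), and one breadth-scaled update over everything generated. All names, the budget, and the toy success-rate "policy" are hypothetical placeholders rather than the paper's components.

```python
from typing import Callable, Sequence
import numpy as np

def dars_b_iteration(
    queries: Sequence[str],
    rollout_fn: Callable[[str, int], np.ndarray],  # 0/1 outcomes for n rollouts of a query
    update_fn: Callable[[list], None],             # breadth-scaled (full-batch) policy update
    n_init: int = 8,
    extra_budget: int = 256,
    scheme: str = "ET",
) -> None:
    """One DARS-B iteration: light pre-rollout pass, difficulty-adaptive
    reallocation, then a single full-batch update over everything generated."""
    outcomes = [rollout_fn(q, n_init) for q in queries]         # stage 1: difficulty probe
    acc = np.array([o.mean() for o in outcomes])                # empirical accuracy per query
    extra = allocate_extra_rollouts(acc, extra_budget, scheme)  # stage 2: reallocation (earlier sketch)
    for i, k in enumerate(extra):
        if k > 0:
            outcomes[i] = np.concatenate([outcomes[i], rollout_fn(queries[i], int(k))])
    update_fn(outcomes)                                         # breadth: consume the whole batch at once

# Toy usage with a random "policy" whose success rate encodes query difficulty.
rng = np.random.default_rng(0)
rates = {"easy": 0.8, "medium": 0.5, "hard": 0.05}
dars_b_iteration(
    list(rates),
    rollout_fn=lambda q, n: (rng.random(n) < rates[q]).astype(float),
    update_fn=lambda batch: print([len(b) for b in batch]),  # stand-in for the full-batch update
)
```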
7. Implications and Future Directions
DARS-B provides a foundation for further advances in reinforcement learning for LLMs:
- By correcting for cumulative advantage bias and integrating entropy-regularizing breadth scaling, models can further extend their reasoning capabilities to tougher problem distributions.
- Future works could explore application to larger-scale models, more diverse domains (including program synthesis, scientific reasoning, and real-world planning), and dynamic allocation schedules that continue improving efficiency without additional inference cost.
- The full-batch training regime's effectiveness as an implicit entropy regularizer may inspire new regularization strategies in RL, broadening the paradigm for self-improving LLM loops.
In summary, DARS-B synthesizes depth-adaptive sampling with large-breadth gradient updates, providing robust and scalable training for RLVR-based LLM systems and delivering gains in both reasoning depth (Pass@K) and reliability (Pass@1), substantiated by explicit mathematical foundations and strong empirical results (Yang et al., 19 Aug 2025).