
DARS-B: Depth-Adaptive & Breadth Scaling in RL

Updated 20 August 2025
  • DARS-B is a reinforcement learning methodology that combines difficulty-adaptive sampling with large-breadth training to improve reasoning in LLMs.
  • The method allocates rollouts based on problem difficulty using Equal-Treatment (ET) and Hardness-Weighted (HW) schemes, yielding consistent gains on both Pass@K and Pass@1 metrics.
  • Large-breadth, full-batch gradient updates in DARS-B provide training stability and sustained token-level entropy, complementing the depth-adaptive correction of cumulative-advantage bias.

DARS-B is a reinforcement learning methodology developed for the RLVR (Reinforcement Learning with Verifiable Reward) paradigm, specifically targeting the expansion of reasoning capacity in LLMs. Building on the Difficulty Adaptive Rollout Sampling (DARS) technique, DARS-B integrates both depth-adaptive exploration (effective allocation of rollouts to hard problems) and large-breadth training (simultaneously optimizing over many problem instances per iteration), resulting in simultaneous gains on Pass@K and Pass@1 metrics. This section provides a comprehensive technical exposition of the principles, algorithms, empirical outcomes, and implications of DARS-B within the context of advanced LLM training (Yang et al., 19 Aug 2025).

1. Foundation: RLVR and the Cumulative Advantage Bias

RLVR is structured around the verifiable reward paradigm, allowing direct feedback to be incorporated into reinforcement learning for tasks requiring multi-step reasoning. Classical RLVR variants such as GRPO suffer from a training bias caused by their cumulative-advantage mechanism: rollouts are disproportionately weighted toward medium-difficulty samples, under-emphasizing low-accuracy (hard) instances that are most significant for challenging reasoning. This problem restricts the ability of LLMs to demonstrate substantial improvements on the hardest queries encountered during training.

The cumulative advantage for a query $q_j$ is defined as the aggregate of its positive rollouts, and its allocation across the training batch directly impacts how gradient magnitude and exploration are distributed. DARS-B is designed to remove this depth neglect and promote a more balanced gradient landscape.
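To make the bias concrete, the following sketch (an illustration under the assumption of a GRPO-style normalized advantage over binary rewards, not code from the paper) computes the cumulative advantage a query contributes as a function of its empirical accuracy; it peaks at accuracy 0.5 and vanishes at the extremes, which is exactly the depth neglect DARS-B targets.

```python
import math

def cumulative_positive_advantage(acc: float, n_rollouts: int = 16) -> float:
    """Cumulative advantage contributed by a query's correct rollouts, assuming a
    GRPO-style normalized advantage over binary rewards (illustrative, not the
    paper's exact definition).

    Each correct rollout gets advantage (1 - acc) / std with std = sqrt(acc * (1 - acc)),
    and there are n_rollouts * acc correct rollouts on average, so the total is
    n_rollouts * sqrt(acc * (1 - acc)).
    """
    if acc <= 0.0 or acc >= 1.0:
        return 0.0  # all-wrong or all-correct groups carry no gradient signal
    std = math.sqrt(acc * (1.0 - acc))
    return n_rollouts * acc * (1.0 - acc) / std

for acc in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(f"accuracy={acc:.2f}  cumulative advantage={cumulative_positive_advantage(acc):.2f}")
# Peaks at accuracy 0.5: medium-difficulty queries dominate the gradient,
# while the hardest (low-accuracy) queries are under-weighted.
```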

2. Difficulty-Adaptive Rollout Sampling (DARS)

DARS is the precursor to DARS-B; its key innovation is targeted multi-stage sampling:

  • For each training query, an initial light rollout is performed, estimating empirical accuracy:

$$\hat{a}_j = \frac{1}{k_0} \sum_{i=1}^{k_0} r^{(i)}_j$$

where $r^{(i)}_j$ marks success or failure of the $i$-th rollout on $q_j$.

  • Difficulty is encoded as $x_j = 1 - \hat{a}_j$, so hard problems (low accuracy) yield high $x_j$.
  • Rollout allocation is then adaptively rebalanced in one of two ways:

    • Equal-Treatment (ET): Assigns each problem the cumulative advantage of a median sample ($\hat{a} = 0.5$), computing extra rollouts as

    $$\Delta n_j^{ET} = \min\left(\left\lceil \frac{ \mathcal{A}_\text{group}^N(0.5) - \mathcal{A}_\text{group}^N(\hat{a}_j) }{ \mathcal{S}(\hat{a}_j) } \right\rceil,\; N^\text{max} \right)$$

    for each $j$, with $\mathcal{S}$ a scaling function.

    • Hardness-Weighted (HW): Allocates rollouts proportional to $x_j$, using

    $$\mathcal{A}_\text{group}^{HW}(q_j) = 2(1 - x_j)\,\mathcal{A}_\text{group}^N(0.5)$$

    and analogous allocation formulas.

  • The scaling function $\mathcal{S}(\hat{a})$ is set as either $2\sqrt{\hat{a}(1-\hat{a})}$ or $2\hat{a}(1-\hat{a})$ (for the variance-based or non-variance-based advantage, respectively; see Eq. 4 in (Yang et al., 19 Aug 2025)). A minimal sketch of the allocation scheme appears after this list.
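The following sketch restates the depth-adaptive allocation in code. It is a minimal illustration: the constants `K0` and `N_MAX`, the closed form used for $\mathcal{A}_\text{group}^N$, and the handling of all-correct/all-wrong groups are assumptions, and only the ET schedule is implemented (HW follows analogously with the target advantage $\mathcal{A}_\text{group}^{HW}$ above).

```python
import math

K0 = 8       # size of the initial light rollout (assumed value)
N_MAX = 64   # cap on extra rollouts per query (assumed value)

def estimate_accuracy(rewards):
    """Empirical accuracy a_hat_j from the verifiable 0/1 rewards of the light rollout."""
    return sum(rewards) / len(rewards)

def scaling(a_hat, variance_based=True):
    """Scaling function S(a_hat): 2*sqrt(a(1-a)) (variance-based) or 2*a*(1-a)."""
    return 2 * math.sqrt(a_hat * (1 - a_hat)) if variance_based else 2 * a_hat * (1 - a_hat)

def group_advantage(a_hat, n):
    """Cumulative advantage A^N_group(a_hat) of a query over n rollouts.
    The closed form n * sqrt(a_hat * (1 - a_hat)) is an assumption consistent with the
    GRPO-style advantage discussed in Section 1; the paper's exact expression may differ."""
    return n * math.sqrt(a_hat * (1 - a_hat))

def extra_rollouts_et(a_hat, n):
    """Equal-Treatment (ET) reallocation Delta n_j^ET: top each query up to the
    cumulative advantage of a median (a_hat = 0.5) query, capped at N_MAX."""
    if a_hat <= 0.0 or a_hat >= 1.0:
        return 0  # degenerate (all-wrong / all-correct) groups: assumed to be skipped
    gap = group_advantage(0.5, n) - group_advantage(a_hat, n)
    return min(math.ceil(max(gap, 0.0) / scaling(a_hat)), N_MAX)

# Example: a hard query solved once in the light rollout (a_hat = 0.125, x_j = 0.875)
a_hat = estimate_accuracy([1, 0, 0, 0, 0, 0, 0, 0])
print(extra_rollouts_et(a_hat, n=K0))  # hard query receives extra targeted rollouts
```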

DARS ensures increased positive rollouts for hard problems, empirically lifting the Pass@K metric (the probability that at least one of $K$ rollouts is correct) without extra inference cost at convergence. Naive increases to rollout size accelerate convergence but can degrade overall Pass@K by flattening exploration.
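For reference, Pass@K is commonly computed with the standard unbiased estimator from $n$ sampled rollouts of which $c$ are correct; this is a generic sketch, not code specific to DARS-B.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@K estimate from n sampled rollouts of which c are correct:
    1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct rollout
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=128, c=3, k=32))  # chance that a draw of 32 rollouts contains a correct one
```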

3. Breadth Scaling and the DARS-B Algorithm

DARS-B augments DARS with aggressive breadth scaling:

  • Instead of standard PPO mini-batch optimization, DARS-B switches to full-batch gradient descent over multiple epochs, leveraging large-batch training advantages.
  • Batch size is maximized per iteration, increasing the number of queries and rollouts included in each update, which further regularizes gradients and maintains token-level entropy during learning.
  • Algorithmically, DARS-B executes two-phase adaptive sampling (as described in DARS) for depth and switches the optimizer to process the entire generated batch at once for breadth.
  • Full-batch updates drive training stability, support high-entropy exploration, and mitigate gradient noise.

This dual adaptation yields simultaneous improvement in first-pass accuracy (Pass@1) and multi-shot accuracy (Pass@K).
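A minimal sketch of the breadth-scaled update follows (PyTorch-style; the clipped surrogate loss, the `policy.log_probs` helper, and the batch layout are assumptions rather than the paper's implementation). The key point is that each epoch takes its gradient step over the entire generated batch rather than over PPO mini-batches.

```python
import torch

def breadth_scaled_update(policy, optimizer, batch, num_epochs=2, clip_eps=0.2):
    """Breadth-scaled policy update: each epoch computes the clipped surrogate loss over
    the *entire* generated batch of rollouts at once, instead of iterating over PPO
    mini-batches.

    `batch` is assumed to hold, for every rollout of this iteration:
      input_ids, actions (generated tokens), old_logp (behavior log-probs), advantages.
    `policy.log_probs` is an assumed helper returning per-token log-probs of `actions`.
    """
    for _ in range(num_epochs):
        optimizer.zero_grad()
        new_logp = policy.log_probs(batch["input_ids"], batch["actions"])
        ratio = torch.exp(new_logp - batch["old_logp"])
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        # Average over all tokens of all rollouts in the iteration (full batch)
        loss = -torch.mean(torch.minimum(ratio * batch["advantages"],
                                         clipped * batch["advantages"]))
        loss.backward()
        optimizer.step()
```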

4. Empirical Performance

Experiments using Qwen2.5-Math (1.5B and 7B) quantitatively demonstrate the efficacy of DARS-B:

  • Pass@K gains are persistent and do not incur additional inference cost at convergence. While naive approaches to rollout or batch size scaling independently offer limited improvements (either in Pass@K or Pass@1 but not both), DARS-B delivers consistent upgrades in both metrics.
  • The summarized metrics highlight strong average accuracy gains (Avg@128) and stability in Pass@128.
  • Training dynamics reveal that large-breadth training maintains high token-level entropy (see the sketch below), preempting premature convergence and supporting robust exploration even as depth-adaptive sampling boosts the capacity to solve hard problems.
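Token-level entropy can be monitored with a small helper such as the following sketch (assumed instrumentation for the quantity referenced above, not the paper's code):

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits, mask):
    """Average per-token entropy of the policy over generated tokens.

    logits: (batch, seq_len, vocab) raw model outputs for the sampled rollouts.
    mask:   (batch, seq_len) with 1 for generated (response) tokens, 0 for prompt/padding.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq_len)
    return (token_entropy * mask).sum() / mask.sum()            # masked mean over tokens
```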

5. Depth-Breadth Orthogonality and Exploratory Analysis

Findings validate that depth (ability to train on high-difficulty samples) and breadth (processing many instances per update to regularize training and increase ensemble diversity) operate orthogonally—each dimension introduces distinct improvements:

  • Depth-adaptive sampling is necessary to overcome the bias toward medium-difficulty samples and is critical for maximizing Pass@K.
  • Breadth scaling sustains global exploration, higher entropy, and improved Pass@1 without undermining the depth adaptation.
  • DARS-B achieves a synergy between these dimensions; neither naive depth nor naive breadth scaling alone can realize optimal results.

6. Algorithmic Details and Formulae

The method is instantiated via:

  • Pre-rollout difficulty assessment and targeted allocation of additional rollouts per question.
  • Full-batch updates aggregated over all queries and trajectories per PPO epoch.
  • Dynamic schedule selection (ET or HW) for multi-stage rebalancing, with Δnj\Delta n_j computed per question.

Relevant formulae include the empirical accuracy, difficulty score, scaling function, and cumulative advantage-based rollout reallocation.
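Putting these pieces together, one DARS-B iteration can be outlined as follows. This is high-level pseudocode: the callables `rollout`, `verify`, `extra_rollouts`, and `full_batch_update` are placeholders for the stages described in this section, not the paper's API.

```python
from typing import Callable, List, Tuple

def dars_b_iteration(
    queries: List[str],
    rollout: Callable[[str], str],                # sample one response from the current policy
    verify: Callable[[str, str], int],            # verifiable 0/1 reward for a response
    extra_rollouts: Callable[[float, int], int],  # ET or HW schedule, e.g. extra_rollouts_et above
    full_batch_update: Callable[[List[Tuple[str, str, int]]], None],  # breadth-scaled step
    k0: int = 8,
    ppo_epochs: int = 2,
) -> None:
    """One DARS-B iteration: depth-adaptive sampling followed by full-batch updates.
    All callables are illustrative placeholders for the stages described in this section."""
    batch: List[Tuple[str, str, int]] = []

    # Phase 1: light rollout to estimate per-query difficulty (empirical accuracy a_hat)
    accuracy = {}
    for q in queries:
        rewards = []
        for _ in range(k0):
            response = rollout(q)
            r = verify(q, response)
            rewards.append(r)
            batch.append((q, response, r))
        accuracy[q] = sum(rewards) / k0

    # Phase 2: depth -- reallocate extra rollouts toward hard queries
    for q, a_hat in accuracy.items():
        for _ in range(extra_rollouts(a_hat, k0)):
            response = rollout(q)
            batch.append((q, response, verify(q, response)))

    # Breadth -- optimize over the entire generated batch for several epochs
    for _ in range(ppo_epochs):
        full_batch_update(batch)
```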

7. Implications and Future Directions

DARS-B provides a foundation for further advances in reinforcement learning for LLMs:

  • By correcting for cumulative advantage bias and integrating entropy-regularizing breadth scaling, models can further extend their reasoning capabilities to tougher problem distributions.
  • Future works could explore application to larger-scale models, more diverse domains (including program synthesis, scientific reasoning, and real-world planning), and dynamic allocation schedules that continue improving efficiency without additional inference cost.
  • The full-batch training regime's effectiveness as an implicit entropy regularizer may inspire new regularization strategies in RL, broadening the paradigm for self-improving LLM loops.

In summary, DARS-B synthesizes depth-adaptive sampling with high-breadth gradient updates, providing robust and scalable training for RLVR-empowered LLM systems and improving both the sophistication and reliability of their reasoning, substantiated by explicit mathematical foundations and strong empirical confirmation (Yang et al., 19 Aug 2025).
