DARS-B: Depth-Adaptive & Breadth Scaling in RL
- DARS-B is a reinforcement learning methodology that combines difficulty-adaptive sampling with large-breadth training to improve reasoning in LLMs.
- The method allocates rollouts based on problem difficulty via Equal-Treatment (ET) and Hardness-Weighted (HW) schemes, yielding consistent gains on both Pass@K and Pass@1.
- Full-batch ("large-breadth") gradient updates in DARS-B provide training stability and sustain token-level entropy, while depth-adaptive sampling corrects the cumulative-advantage bias of standard RLVR.
DARS-B is a reinforcement learning methodology developed for the RLVR (Reinforcement Learning with Verifiable Reward) paradigm, specifically targeting the expansion of reasoning capacity in LLMs. Building on the Difficulty Adaptive Rollout Sampling (DARS) technique, DARS-B integrates both depth-adaptive exploration (effective allocation of rollouts to hard problems) and large-breadth training (simultaneously optimizing over many problem instances per iteration), resulting in simultaneous gains on Pass@K and Pass@1 metrics. This section provides a comprehensive technical exposition of the principles, algorithms, empirical outcomes, and implications of DARS-B within the context of advanced LLM training (Yang et al., 19 Aug 2025).
1. Foundation: RLVR and the Cumulative Advantage Bias
RLVR is structured around the verifiable reward paradigm, allowing direct feedback to be incorporated into reinforcement learning for tasks requiring multi-step reasoning. Classical RLVR variants such as GRPO suffer from a training bias caused by their cumulative-advantage mechanism: rollouts are disproportionately weighted toward medium-difficulty samples, under-emphasizing low-accuracy (hard) instances that are most significant for challenging reasoning. This problem restricts the ability of LLMs to demonstrate substantial improvements on the hardest queries encountered during training.
The cumulative advantage for a query is defined as the aggregate of positive rollouts, and its allocation across the training batch directly impacts how gradient magnitude and exploration are distributed. DARS-B is designed to remove this depth-neglect and promote a more balanced gradient landscape.
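To make the biased quantity concrete, the sketch below computes a GRPO-style group-normalized advantage for one query's rollouts and sums its positive part. The function name, the standard-deviation normalization, and the toy reward vectors are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def cumulative_positive_advantage(rewards: np.ndarray, eps: float = 1e-6) -> float:
    """Sum of positive group-normalized advantages over one query's rollouts.

    `rewards` holds the verifiable 0/1 outcomes for a single query. Advantages
    are group-normalized (GRPO-style); the positive part is then aggregated.
    This is an illustrative stand-in, not the paper's exact formula.
    """
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    return float(np.clip(advantages, 0.0, None).sum())

# Medium-difficulty queries (mixed outcomes) accumulate the largest positive
# advantage mass; very hard queries (few successes) contribute far less.
print(cumulative_positive_advantage(np.array([1, 1, 0, 0, 1, 0, 1, 0], dtype=float)))  # ~4.0
print(cumulative_positive_advantage(np.array([1, 0, 0, 0, 0, 0, 0, 0], dtype=float)))  # ~2.6
```

The example illustrates why cumulative advantage concentrates on medium-difficulty samples: with 0/1 rewards, the aggregated positive advantage peaks near 50% accuracy and shrinks toward the hardest queries.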
2. Difficulty-Adaptive Rollout Sampling (DARS)
DARS is the precursor to DARS-B; its key innovation is targeted multi-stage sampling:
- For each training query $q_i$, an initial light pass of $n$ rollouts is performed, estimating the empirical accuracy $\hat{p}_i = \frac{1}{n}\sum_{j=1}^{n} r_{i,j}$, where $r_{i,j} \in \{0,1\}$ marks success/failure for the $j$-th rollout on $q_i$.
- Difficulty is encoded as $d_i = 1 - \hat{p}_i$, so hard problems (low accuracy) yield high $d_i$.
- Rollout allocation is then adaptively rebalanced in one of two ways:
- Equal-Treatment (ET): assigns each problem the cumulative advantage of a median-accuracy sample and computes, for each query $q_i$, the extra rollouts needed to reach that target, using a scaling function $f(\cdot)$.
- Hardness-Weighted (HW): allocates extra rollouts in proportion to the difficulty $d_i$, with analogous allocation formulas.
- The scaling function $f$ takes one of two forms, for variance- or non-variance-based advantage (see eqn. 4 in (Yang et al., 19 Aug 2025)). A simplified allocation sketch follows this list.
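As referenced above, here is a minimal sketch of the two-stage allocation logic. The function name, the median-shortfall rule for ET, and the difficulty-proportional split for HW are simplified assumptions; the paper's exact allocation formulas and scaling function are not reproduced.

```python
import numpy as np

def allocate_extra_rollouts(acc: np.ndarray, budget: int, scheme: str = "ET") -> np.ndarray:
    """Second-stage rollout allocation given first-stage empirical accuracies.

    acc    : empirical accuracy p_hat_i of each query from the light first stage
    budget : total number of extra rollouts to distribute
    scheme : "ET" tops queries up toward a median-accuracy target; "HW" splits
             the budget in proportion to difficulty d_i = 1 - p_hat_i.
             Both rules are simplified stand-ins for the paper's formulas.
    """
    difficulty = 1.0 - acc
    if scheme == "HW":
        weights = difficulty
    else:  # ET: weight by how far each query falls short of the median sample
        weights = np.clip(np.median(acc) - acc, 0.0, None)
    if weights.sum() == 0:
        return np.zeros_like(acc, dtype=int)
    shares = weights / weights.sum()
    return np.floor(shares * budget).astype(int)

# Hard queries (low accuracy) receive most of the extra rollouts.
acc = np.array([0.9, 0.5, 0.1, 0.0])
print(allocate_extra_rollouts(acc, budget=64, scheme="ET"))  # [ 0  0 25 38]
print(allocate_extra_rollouts(acc, budget=64, scheme="HW"))  # [ 2 12 23 25]
```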
DARS ensures increased positive rollouts for hard problems, empirically lifting the Pass@K metric (the probability that at least one of $K$ rollouts is correct) without extra inference cost at convergence. Naive increases to rollout size accelerate convergence but can degrade overall Pass@K by flattening exploration.
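For reference, the conventional unbiased Pass@K estimator (popularized in code-generation evaluation) computes, from n sampled rollouts with c correct, the probability that a random size-K subset contains at least one success; whether the paper uses exactly this estimator is an assumption here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate: probability that at least one of K rollouts
    drawn without replacement from n samples (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # every size-K subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=128, c=3, k=32))  # ~0.58 even though single-shot accuracy is ~2.3%
```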
3. Breadth Scaling and the DARS-B Algorithm
DARS-B augments DARS with aggressive breadth scaling:
- Instead of standard PPO mini-batch optimization, DARS-B switches to full-batch gradient descent over multiple epochs, leveraging large-batch training advantages.
- Batch size is maximized per iteration, increasing the number of queries and rollouts included in each update, which further regularizes gradients and maintains token-level entropy during learning.
- Algorithmically, DARS-B executes two-phase adaptive sampling (as described in DARS) for depth and switches the optimizer to process the entire generated batch at once for breadth.
- Full-batch updates drive training stability, support high-entropy exploration, and mitigate gradient noise.
This dual adaptation yields simultaneous improvement in first-pass accuracy (Pass@1) and multi-shot accuracy (Pass@K); a minimal sketch of the full-batch update follows.
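The sketch below isolates the breadth component: a PPO-style clipped surrogate is applied once per epoch to the entire generated batch rather than to mini-batches. The loss helper, the linear stand-in for the policy's per-token log-prob head, and all tensor shapes are illustrative assumptions, not the paper's implementation.

```python
import torch

def clipped_pg_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO-style clipped policy-gradient surrogate on per-token log-probs."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()

def full_batch_update(policy, optimizer, feats, logp_old, advantages, epochs: int = 2):
    """Breadth-scaled update: one gradient step per epoch over the *entire*
    generated batch (all queries and rollouts), instead of PPO mini-batches."""
    for _ in range(epochs):
        optimizer.zero_grad()
        logp_new = policy(feats)  # per-token log-probs (stand-in)
        clipped_pg_loss(logp_new, logp_old, advantages).backward()
        optimizer.step()

# Toy usage: a linear layer stands in for the policy's per-token log-prob head.
torch.manual_seed(0)
feats = torch.randn(512, 16)          # flattened (rollout, token) features
head = torch.nn.Linear(16, 1)
policy = lambda x: head(x).squeeze(-1)
logp_old = policy(feats).detach()
advantages = torch.randn(512)
optimizer = torch.optim.SGD(head.parameters(), lr=1e-2)
full_batch_update(policy, optimizer, feats, logp_old, advantages)
```

A mini-batch variant would wrap the inner step in a loop over splits of `feats`; the full-batch form trades that extra stochasticity for lower-noise gradients over many queries at once.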
4. Empirical Performance
Experiments using Qwen2.5-Math (1.5B and 7B) quantitatively demonstrate the efficacy of DARS-B:
- Pass@K gains are persistent and do not incur additional inference cost at convergence. While naive approaches to rollout or batch size scaling independently offer limited improvements (either in Pass@K or Pass@1 but not both), DARS-B delivers consistent upgrades in both metrics.
- The summarized metrics highlight strong average accuracy gains (Avg@128) and stability in Pass@128.
- Training dynamics reveal that large-breadth training maintains high entropy regularization (measured at the token-level), preempting premature convergence and supporting robust exploration even as depth-adaptive sampling boosts the capacity to solve hard problems.
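The entropy referred to above is the average per-token entropy of the policy's next-token distribution. The helper below shows one standard way to compute it from logits; the function name, masking convention, and toy tensors are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average per-token entropy of the next-token distribution.

    logits : (batch, seq_len, vocab) raw scores from the policy
    mask   : (batch, seq_len) 1 for generated response tokens, 0 for prompt/padding
    """
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)  # (batch, seq_len)
    return (entropy * mask).sum() / mask.sum().clamp(min=1)

# Toy check: flatter (higher-temperature) logits yield higher token entropy.
logits = torch.randn(2, 5, 100)
mask = torch.ones(2, 5)
print(mean_token_entropy(logits, mask).item())
print(mean_token_entropy(logits / 4.0, mask).item())  # flatter -> larger entropy
```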
5. Depth-Breadth Orthogonality and Exploratory Analysis
Findings validate that depth (ability to train on high-difficulty samples) and breadth (processing many instances per update to regularize training and increase ensemble diversity) operate orthogonally—each dimension introduces distinct improvements:
- Depth-adaptive sampling is necessary to overcome the bias toward medium-difficulty samples and is critical for maximizing Pass@K.
- Breadth scaling sustains global exploration, higher entropy, and improved Pass@1 without undermining the depth adaptation.
- DARS-B achieves a synergy between these dimensions; neither naive depth nor naive breadth scaling alone can realize optimal results.
6. Algorithmic Details and Formulae
The method is instantiated via:
- Pre-rollout difficulty assessment and targeted allocation of additional rollouts per question.
- Full-batch updates aggregated over all queries and trajectories per PPO epoch.
- Dynamic schedule selection (ET or HW) for multi-stage rebalancing, with the difficulty $d_i$ computed per question.
Relevant formulae include the empirical accuracy, difficulty score, scaling function, and cumulative advantage-based rollout reallocation.
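Putting the pieces together, the outline below sketches a single DARS-B iteration end to end: a light pre-rollout pass for difficulty estimation, adaptive reallocation (reusing the `allocate_extra_rollouts` sketch from Section 2), and one breadth-scaled update over everything generated. All names, the budget, and the toy success-rate "policy" are hypothetical placeholders rather than the paper's components.

```python
from typing import Callable, Sequence
import numpy as np

def dars_b_iteration(
    queries: Sequence[str],
    rollout_fn: Callable[[str, int], np.ndarray],  # 0/1 outcomes for n rollouts of a query
    update_fn: Callable[[list], None],             # breadth-scaled (full-batch) policy update
    n_init: int = 8,
    extra_budget: int = 256,
    scheme: str = "ET",
) -> None:
    """One DARS-B iteration: light pre-rollout pass, difficulty-adaptive
    reallocation, then a single full-batch update over everything generated."""
    outcomes = [rollout_fn(q, n_init) for q in queries]         # stage 1: difficulty probe
    acc = np.array([o.mean() for o in outcomes])                # empirical accuracy per query
    extra = allocate_extra_rollouts(acc, extra_budget, scheme)  # stage 2: reallocation (earlier sketch)
    for i, k in enumerate(extra):
        if k > 0:
            outcomes[i] = np.concatenate([outcomes[i], rollout_fn(queries[i], int(k))])
    update_fn(outcomes)                                         # breadth: consume the whole batch at once

# Toy usage with a random "policy" whose success rate encodes query difficulty.
rng = np.random.default_rng(0)
rates = {"easy": 0.8, "medium": 0.5, "hard": 0.05}
dars_b_iteration(
    list(rates),
    rollout_fn=lambda q, n: (rng.random(n) < rates[q]).astype(float),
    update_fn=lambda batch: print([len(b) for b in batch]),  # stand-in for the full-batch update
)
```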
7. Implications and Future Directions
DARS-B provides a foundation for further advances in reinforcement learning for LLMs:
- By correcting for cumulative advantage bias and integrating entropy-regularizing breadth scaling, models can further extend their reasoning capabilities to tougher problem distributions.
- Future works could explore application to larger-scale models, more diverse domains (including program synthesis, scientific reasoning, and real-world planning), and dynamic allocation schedules that continue improving efficiency without additional inference cost.
- The full-batch training regime's effectiveness as an implicit entropy regularizer may inspire new regularization strategies in RL, broadening the paradigm for self-improving LLM loops.
In summary, DARS-B synthesizes depth-adaptive sampling with large-breadth gradient updates, providing robust and scalable training for RLVR-based LLM systems and delivering gains in both reasoning depth (Pass@K) and reliability (Pass@1), substantiated by explicit mathematical foundations and strong empirical results (Yang et al., 19 Aug 2025).