
Depth-Breadth Synergy in RLVR

Updated 24 August 2025
  • Depth-breadth synergy in RLVR is a framework that balances exploiting known solutions (depth) with exploring diverse reasoning paths (breadth) to enhance model performance.
  • The approach utilizes Difficulty Adaptive Rollout Sampling (DARS) and large-batch training to optimize key metrics like Pass@k and token entropy.
  • These methods overcome cumulative advantage biases in conventional RLVR, enabling robust and scalable reasoning improvements in complex tasks.

Depth-breadth synergy in RLVR describes the mechanism by which reinforcement learning with verifiable rewards (RLVR) unlocks reasoning gains by balancing the efficient exploitation of known solution paths ("depth") and broad, entropy-sustaining exploration of difficult or diverse reasoning tasks ("breadth"). This paradigm is highly relevant for LLMs and agentic systems trained to solve complex reasoning, programming, and simulation tasks. Recent studies systematically dissect and optimize these dimensions, revealing algorithmic biases, proposing adaptive exploration strategies, and quantifying improvements across Pass@k metrics and entropy measures.

1. Definitions and Core Principles

RLVR refers to reinforcement learning schemes for generative models in which externally verifiable criteria (e.g., programmatic correctness, logical validation, or task-specific metrics) define the reward signals. "Depth" in this context corresponds to the hardest (highest-difficulty) problems or longest reasoning chains the model is able to solve, often evaluated by the upper bound of the model's sampling capacity (highest k in Pass@k). "Breadth" is captured by the size of the training instance set, the diversity of solution paths, maintenance of generation entropy, and the range of correct answers achieved across tasks.
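
Because Pass@k is the depth metric used throughout, a minimal Python sketch of the standard unbiased combinatorial Pass@k estimator is given below; the function name, rollout counts, and example values are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of Pass@k given n sampled completions,
    of which c are verified correct (standard combinatorial form)."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers: 64 rollouts per problem, 3 verified correct
print(pass_at_k(n=64, c=3, k=1))   # breadth-style first-try success
print(pass_at_k(n=64, c=3, k=64))  # depth-style sampling upper bound
```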

Synergy between depth and breadth is not merely the product of their independent effects: adaptive strategies must overcome systematic RLVR biases, which otherwise neglect hard problems and reduce diversity in favor of easy or frequent examples. Depth-breadth synergy thus refers to algorithmic processes and training regimens that explicitly optimize for both dimensions, leading to sustainable and generalizable gains in reasoning capacity.

2. Systematic Bias in Standard RLVR Algorithms

The GRPO (Group Relative Policy Optimization) algorithm is widely used in RLVR settings to compute normalized advantage per token based on group statistics:

$$A_{i,t} = \frac{r_i - \text{mean}(\{r\})}{\text{std}(\{r\})}$$

However, studies reveal systematic cumulative-advantage bias: medium-accuracy samples are consistently up-weighted, while hard problems (low-accuracy instances) are down-weighted (Yang et al., 19 Aug 2025). As a result, naïvely increasing rollout size primarily accelerates convergence on easy/mid tasks but fails to push the depth boundary (Pass@k for high k). The lack of positive rollouts for hard problems impairs the model's ability to learn from rare but important solution paths, reducing overall reasoning diversity.
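
As a concrete illustration of the missing signal on hard problems, the following minimal Python sketch applies the group-normalized advantage above to three hypothetical rollout groups; the group size and binary reward vectors are invented for illustration only.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantage: z-score rewards within one prompt's rollout
    group. Every token of rollout i receives the same scalar A_i."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical groups of 8 binary (verifiable) rewards per prompt
easy   = np.array([1, 1, 1, 1, 1, 1, 0, 1], dtype=float)  # high accuracy
medium = np.array([1, 0, 1, 0, 1, 0, 0, 1], dtype=float)  # ~0.5 accuracy
hard   = np.array([0, 0, 0, 0, 0, 0, 0, 0], dtype=float)  # no positive rollout

for name, r in [("easy", easy), ("medium", medium), ("hard", hard)]:
    print(name, grpo_advantages(r))
# The all-zero "hard" group yields zero advantage everywhere: with no
# positive rollouts, the policy receives no gradient signal on the
# hardest problems.
```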

3. Depth Optimization: Difficulty Adaptive Rollout Sampling (DARS)

To rectify depth neglect, Difficulty Adaptive Rollout Sampling (DARS) is introduced. DARS operates via multi-stage rollouts:

  • Pre-rollout difficulty estimation: For each problem $q_j$, a small rollout yields empirical accuracy $\hat{a}_j$, with difficulty score $x_j = 1 - \hat{a}_j$.
  • Rebalancing schedule: Hard problems (high $x_j$) are allocated additional rollouts $\Delta n_j$ according to either equal-treatment (ET) or hardness-weighted (HW) schedules:

$$\Delta n_j^{(\text{ET/HW})} = \min\left(\left\lceil \frac{\mathcal{A}_{\text{group}}^{N}(0.5) - \mathcal{A}_{\text{group}}^{N}(\hat{a}_j)}{S(\hat{a}_j)} \right\rceil,\; N^{\max}\right)$$

where the scaling function $S(\hat{a}_j)$ is defined as $2\sqrt{\hat{a}_j(1-\hat{a}_j)}$ for std-based advantages and $2\hat{a}_j(1-\hat{a}_j)$ otherwise.

The effect is a dynamic compute allocation, prioritizing additional positive rollouts for the hardest problems and directly de-biasing cumulative advantage. Empirically, DARS consistently enhances Pass@k performance for large k without additional inference cost at convergence (Yang et al., 19 Aug 2025).
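
A minimal Python sketch of the rebalancing schedule above is shown below. It assumes the cumulative group-advantage function $\mathcal{A}_{\text{group}}^{N}(\cdot)$ is supplied as a callable, and the zero-division guard is an illustrative choice, not a detail specified in the source.

```python
import math
from typing import Callable

def dars_extra_rollouts(
    a_hat: float,                         # pre-rollout empirical accuracy of problem q_j
    group_adv: Callable[[float], float],  # cumulative group advantage A_group^N(.), assumed given
    n_max: int,                           # cap N^max on additional rollouts
    std_based: bool = True,
) -> int:
    """Allocate extra rollouts to a problem based on its estimated difficulty,
    following the DARS rebalancing formula (sketch with placeholder components)."""
    s = 2 * math.sqrt(a_hat * (1 - a_hat)) if std_based else 2 * a_hat * (1 - a_hat)
    s = max(s, 1e-3)  # illustrative guard against division by zero at a_hat in {0, 1}
    delta = (group_adv(0.5) - group_adv(a_hat)) / s
    return min(math.ceil(max(delta, 0.0)), n_max)
```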

4. Breadth Optimization: Large-Batch Training and Token Entropy

Breadth in RLVR is primarily governed by the number of instances sampled per training iteration (batch size) and the degree of diversity maintained in token generation (token-level entropy). Aggressively scaling the batch size (e.g., up to 3072) and applying full-batch PPO updates over multiple epochs have two effects:

  • Gradient quality: Large batches average over more instances, reducing gradient noise and yielding a clearer optimization direction.
  • Entropy regularization: Full-batch updates act as implicit entropy regularizers, sustaining high token-level entropy (exploration) and counteracting premature convergence or overfitting to frequent solutions.

Empirical token entropy curves and Pass@1 metrics validate that increasing breadth improves single-shot accuracy and maintains exploration, making the model robust to training data expansion (Yang et al., 19 Aug 2025).
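
For monitoring this breadth dimension in practice, the following minimal PyTorch sketch computes mean token-level entropy over generated tokens; the tensor shapes, masking convention, and vocabulary size are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean per-token entropy of the policy's next-token distribution,
    a common proxy for generation diversity during RLVR training.
    logits: (batch, seq_len, vocab); mask: (batch, seq_len), 1 on generated tokens."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq_len)
    return (entropy * mask).sum() / mask.sum().clamp(min=1)

# Illustrative usage with random logits
logits = torch.randn(2, 16, 32000)
mask = torch.ones(2, 16)
print(mean_token_entropy(logits, mask))
```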

5. Combined Depth-Breadth Synergy: Orthogonal and Complementary Gains

Experiments with the DARS-B framework (DARS with large-batch breadth) illustrate that depth and breadth operate as partially orthogonal axes. DARS improves Pass@k (deep reasoning on hard problems), while breadth scaling boosts Pass@1 (first-try success rate). Their combination yields simultaneous and complementary improvements in both metrics. Pass@k curves show that the combined method achieves high density of correct solutions across multiple samples, pushing the frontier of LLM reasoning (Yang et al., 19 Aug 2025).

Additional empirical results indicate:

  • Naive breadth scaling primarily improves performance on low-difficulty problems until convergence.
  • DARS maintains or elevates Pass@k at any fixed Pass@1 value.
  • The best trade-off between exploration and exploitation is achieved when both adaptations are employed concurrently.

6. Theoretical and Practical Implications

Depth-breadth synergy elucidates key limitations in current RLVR frameworks. The findings highlight:

  • The necessity of adaptively allocating computational and learning resources to hard samples.
  • The corrective role of exploration strategies in entropy maintenance, addressing model collapse to high-confidence solutions and maintaining diversity.
  • The importance of balancing exploitation and exploration in large action spaces, particularly for LLMs with strong pretrained priors.
  • The possibility that standard RLVR algorithms' neglect of depth and breadth underlies the gap between efficient reward-based convergence and the broader expansion of reasoning capacity in LLMs.

By formalizing difficulty-based rollouts and large-batch training, researchers establish new practical standards for robust RLVR optimization. The approach is extensible to broader domains and tasks, providing a modular methodology for scalable reasoning improvements in generative models.

7. Limitations and Open Research Directions

While the depth-breadth synergy advanced by DARS and large-batch breadth scaling is empirically robust for Qwen2.5-Math 1.5B/7B models, several considerations remain:

  • Future work should extend adaptive rollout and batch strategies to additional model scales and architectures.
  • The calibration of difficulty estimation (e.g., the initial rollout size $k_0$) could be further optimized for efficiency and generalization.
  • Balancing memory and compute constraints of full-batch gradient descent requires further engineering research.
  • Extending adaptive sampling and entropy maintenance strategies to continuous and multi-modal RLVR tasks is a promising avenue.

The systematic analysis and method design in (Yang et al., 19 Aug 2025) provide a precise and reproducible blueprint for unlocking reasoning gains in RLVR by combining adaptive depth and breadth exploration in a principled manner.
