
Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration (2508.13755v1)

Published 19 Aug 2025 in cs.LG and cs.AI

Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in LLMs, yet its full potential is hindered by two under-explored dimensions: Depth-the hardest problem a model can sample; Breadth-the number of instances consumed in a single iteration. We dissect the popular GRPO algorithm and reveal a systematic bias: the cumulative-advantage disproportionately weights samples with medium accuracy, while down-weighting the low-accuracy instances that are crucial for pushing reasoning boundaries. To rectify the depth neglect, we introduce Difficulty Adaptive Rollout Sampling (DARS), which re-weights hard problems through targeted multi-stage rollouts, thereby increasing the number of positive rollouts for hard problems. Empirically, naively enlarging rollout size only accelerates convergence and even hurts Pass@K. Our DARS, in contrast, delivers consistent Pass@K gains without extra inference cost at convergence. Just as we adaptively expanded the depth of exploration, we now ask whether aggressively scaling the breadth of training data can further amplify reasoning gains. To this end, we intensely scale batch size and replace PPO's mini-batch iterations with full-batch updates over multiple epochs. Increasing breadth significantly enhances Pass@1 performance. Large-breadth training sustains high token-level entropy, indicating continued exploration and reduced gradient noise. We further present DARS-B, which augments DARS with large breadth, and demonstrate simultaneous gains in Pass@K and Pass@1. The results confirm that breadth and adaptive exploration across depth operate as orthogonal dimensions in RLVR, which are key to unleashing the reasoning power of RLVR.



Summary

  • The paper introduces Difficulty Adaptive Rollout Sampling (DARS) to counteract depth bias in GRPO, reallocating compute to hard problems for improved Pass@K.
  • It demonstrates that scaling training breadth enhances Pass@1 and sustains exploration via implicit regularization and dynamic batch-size adjustments.
  • The DARS-B framework synergizes depth and breadth optimization, achieving simultaneous gains in Pass@1 and Pass@K through full-batch gradient descent.

Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

Introduction

This paper presents a systematic analysis and novel methodology for enhancing the reasoning capabilities of LLMs via Reinforcement Learning with Verifiable Reward (RLVR). The authors identify two critical, under-explored dimensions in RLVR optimization: Depth (the hardest problem a model can learn to solve) and Breadth (the number of instances processed per iteration). The work reveals a fundamental bias in the widely adopted GRPO algorithm, which disproportionately weights medium-difficulty samples and neglects high-difficulty ones, thereby capping Pass@K performance. To address this, the paper introduces Difficulty Adaptive Rollout Sampling (DARS) and its breadth-augmented variant DARS-B, demonstrating that depth and breadth are orthogonal and synergistic axes for RLVR optimization.

Figure 1: Training dynamics of Pass@1 and Pass@K performance. DARS significantly improves Pass@K and is complementary to large breadth scaling for Pass@1.

Depth and Breadth in RLVR: Empirical Analysis

Depth: Hardest Problem Sampled

Depth is defined as the hardest problem that can be correctly answered during RLVR training. The GRPO algorithm and its variants estimate sample advantage by normalizing binary rewards, but their cumulative advantage calculation is biased toward medium-difficulty problems. This is formalized as:

$$\mathcal{A}_{\text{group}}^{\text{std}} = 2N\sqrt{u(1-u)}, \qquad \mathcal{A}_{\text{group}}^{\text{nostd}} = 2Nu(1-u)$$

where $u$ is the group accuracy and $N$ is the rollout size. The functional curves (Figure 2) show that both methods underestimate high-difficulty problems, resulting in vanishing gradients for groups in which all rollouts are incorrect.

Figure 2: Cumulative advantage underestimates high-difficulty problems, limiting Pass@K improvement.
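
A minimal numerical sketch (not the authors' code; the rollout size N below is an arbitrary illustrative value) makes the bias concrete: evaluating both expressions across the accuracy range shows the cumulative advantage peaking at u = 0.5 and collapsing toward zero as u approaches 0.

```python
import numpy as np

N = 8                                    # rollout size (illustrative value, not from the paper)
u = np.linspace(0.0, 1.0, 11)            # group accuracy over the N rollouts

adv_std = 2 * N * np.sqrt(u * (1 - u))   # A_group^std   = 2N * sqrt(u(1-u))
adv_nostd = 2 * N * u * (1 - u)          # A_group^nostd = 2N * u(1-u)

for ui, a_std, a_nostd in zip(u, adv_std, adv_nostd):
    print(f"u={ui:.1f}  A_std={a_std:5.2f}  A_nostd={a_nostd:5.2f}")
# Both curves peak at u = 0.5 and vanish at u = 0, so groups of hard problems
# with few or no correct rollouts contribute almost no gradient signal.
```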

Naively increasing rollout size accelerates convergence but does not consistently improve Pass@K and can even degrade it for smaller models (Figure 3).

Figure 3: Pass@1 and Pass@K dynamics for Qwen2.5-Math-1.5b/7b with different rollout sizes.

Breadth: Instance Quantity per Iteration

Breadth refers to the number of instances processed per RLVR iteration. Increasing batch size (e.g., from 128 to 3072) improves Pass@1 performance across models but may harm Pass@K for smaller models (Figure 4). Large breadth also sustains higher token entropy, acting as implicit regularization and preventing premature convergence (Figure 5).

Figure 4: Pass@1 and Pass@K dynamics for Qwen2.5-Math-1.5b/7b with different batch sizes.


Figure 5: Pass@1 performance and token entropy for Qwen2.5-Math-1.5b/7b. Large breadth sustains exploration.
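
The token-level entropy tracked in Figure 5 can be computed directly from the policy's per-token output distribution. A minimal sketch, assuming a batch of logits and a response mask (tensor names and shapes are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """logits: [batch, seq_len, vocab]; mask: 1.0 on generated response tokens, 0.0 elsewhere."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # [batch, seq_len]
    return (token_entropy * mask).sum() / mask.sum().clamp(min=1)
```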

Methodology: DARS and DARS-B

Difficulty Adaptive Rollout Sampling (DARS)

DARS operates in two phases (Figure 6):

  1. Pre-Rollout Difficulty Estimation: For each question, a lightweight rollout estimates the empirical accuracy $\hat{a}_j$, giving a difficulty score $x_j = 1 - \hat{a}_j$.
  2. Multi-Stage Rollout Re-Balancing: Additional trajectories $\Delta n_j$ are allocated to low-accuracy (hard) problems, rebalancing the cumulative advantage to up-weight hard samples.

Two schedules are proposed:

  • Equal-Treatment (ET): Raises cumulative advantage of all difficult problems to the medium-difficulty level.
  • Hardness-Weighted (HW): Allocates more rollouts to lower-accuracy problems, making the cumulative advantage a monotonically increasing function of difficulty (a sketch of both schedules follows the figure caption below).

Figure 6: DARS training framework: pre-rollout difficulty estimation and re-balancing rollout stage. Breadth scaling via full-batch PPO epochs.
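
To make the allocation concrete, here is a hedged sketch of the two schedules. The function name, budget handling, and exact re-weighting formulas are assumptions chosen to match the descriptions above, not the paper's implementation.

```python
import math

def extra_rollouts(acc_hat: list[float], base_n: int, budget: int, schedule: str = "HW") -> list[int]:
    """acc_hat: pre-rollout accuracy estimates; base_n: base rollout size N;
    budget: total extra rollouts to distribute (HW only); returns delta_n per question."""
    difficulty = [1.0 - a for a in acc_hat]          # x_j = 1 - a_hat_j
    delta = [0] * len(acc_hat)
    if schedule == "ET":
        # Equal-Treatment: raise each hard question's rollout count so its cumulative
        # advantage 2*n*sqrt(u(1-u)) reaches the medium-difficulty (u = 0.5) level.
        for j, a in enumerate(acc_hat):
            if 0.0 < a < 0.5:
                target_n = base_n * 0.5 / math.sqrt(a * (1.0 - a))
                delta[j] = max(0, math.ceil(target_n) - base_n)
    else:
        # Hardness-Weighted: split the extra budget in proportion to difficulty, so the
        # lowest-accuracy questions receive the most additional rollouts.
        hard_mass = sum(x for x in difficulty if x > 0.5) or 1.0
        for j, x in enumerate(difficulty):
            if x > 0.5:
                delta[j] = round(budget * x / hard_mass)
    return delta
```

For example, with base_n = 8 and estimated accuracies [0.9, 0.4, 0.1], both schedules leave the easy question untouched and give the hardest question the largest share of extra rollouts.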

Depth-Breadth Synergy: DARS-B

DARS-B integrates DARS with large-breadth training by replacing PPO mini-batch updates with full-batch gradient descent across multiple epochs. This enables dynamic batch-size adjustments and maximizes effective training breadth. Full-batch training reduces gradient noise and sustains token-level exploration, acting as a form of regularization. DARS-B achieves simultaneous gains in Pass@1 and Pass@K, confirming the orthogonality and synergy of depth and breadth.
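
As a rough illustration of this difference (the loss and data abstractions are placeholders, not the paper's training code), the breadth-scaled update takes a single gradient step per epoch over all collected rollouts, where a standard PPO inner loop would take many noisier steps over shuffled mini-batches:

```python
import torch

def full_batch_updates(params, loss_fn, data, epochs: int = 2, lr: float = 1e-6):
    """Breadth-scaled update: one gradient step per epoch over the entire collected batch."""
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(data).backward()                        # loss over ALL rollouts at once
        opt.step()

def mini_batch_updates(params, loss_fn, data, epochs: int = 2, mb_size: int = 256, lr: float = 1e-6):
    """Standard PPO-style inner loop: many small, noisier steps over shuffled mini-batches."""
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        perm = torch.randperm(len(data))
        for i in range(0, len(data), mb_size):
            opt.zero_grad()
            loss_fn(data[perm[i:i + mb_size]]).backward()
            opt.step()
```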

Experimental Results

Experiments are conducted on Qwen2.5-Math-1.5b and 7b models using five mathematical reasoning benchmarks. The evaluation protocol samples 128 candidate responses per question, reporting Pass@1 and Pass@128 metrics.
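
Pass@k is presumably computed with the standard unbiased estimator over the n = 128 sampled responses; a minimal sketch of that estimator (the paper's exact evaluation code is not specified here):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: responses sampled per question; c: responses judged correct; k: evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 128 samples with 3 correct answers -> Pass@1 ≈ 0.023, Pass@128 = 1.0
print(pass_at_k(128, 3, 1), pass_at_k(128, 3, 128))
```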

  • Breadth scaling (Breadth-Naive) consistently improves Pass@1.
  • DARS improves Pass@K by reallocating compute to hard problems.
  • DARS-B achieves the highest Pass@1 and matches top Pass@K scores, demonstrating complementary strengths.

Training dynamics show that Pass@128 peaks quickly and then declines with over-training, while DARS achieves the highest peak and training efficiency (Figure 7, Figure 8).

Figure 7: Pass@128 dynamics with different training steps for Qwen2.5-Math-1.5b/7b.


Figure 8: Pass@32/Pass@128 and Pass@1 dynamics for Qwen2.5-Math-1.5b/7b.

Depth and breadth are shown to be complementary; DARS-B lies on the outermost envelope in Pass@1–Pass@K curves (Figure 9).

Figure 9: Depth and Breadth synergy: DARS-B achieves simultaneous improvement in Pass@1 and Pass@K.

Implications and Future Directions

The findings have significant implications for RLVR-based LLM optimization. The identification of depth bias in GRPO and the introduction of DARS provide a principled approach to overcoming the Pass@K bottleneck. Breadth scaling is shown to be a powerful, orthogonal axis for improving Pass@1 and sustaining exploration. The synergy of depth and breadth in DARS-B suggests that future RLVR pipelines should jointly optimize both dimensions.

Potential future directions include:

  • Extending DARS-B to larger LLMs and other reasoning domains (e.g., code synthesis, scientific QA).
  • Investigating adaptive schedules for rollout allocation and batch size scaling.
  • Exploring the interaction of DARS-B with other RLVR variants and reward formulations.
  • Analyzing the generalization of depth-breadth synergy to non-verifiable reward settings.

Conclusion

This work systematically analyzes and addresses the limitations of current RLVR algorithms by introducing depth and breadth as orthogonal, synergistic dimensions for LLM reasoning optimization. The proposed DARS and DARS-B frameworks reallocate compute to hard problems and scale training breadth, achieving simultaneous improvements in Pass@1 and Pass@K. These results provide a foundation for more effective RLVR pipelines and open avenues for further research in adaptive exploration and scalable LLM training.
