- The paper shows that RL improves initial pass@1 performance but narrows the reasoning boundary at higher pass@k values.
- It employs unbiased pass@k evaluations across math, code, and visual benchmarks to show that RL-trained models mainly resample correct reasoning paths that already exist in the base model.
- The study contrasts RL with distillation, indicating that distillation more effectively introduces novel reasoning abilities beyond the base model.
The paper "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?" (arXiv:2504.13837) critically re-examines the common assumption that Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capabilities of LLMs by enabling them to acquire novel reasoning abilities beyond their base models. The authors investigate this question using the pass@k metric with large values of k to measure a model's reasoning capacity boundary across various domains (mathematics, code generation, visual reasoning), model families, and RL algorithms.
The core idea is that conventional metrics such as pass@1, or scores averaged over nucleus-sampled outputs, capture only average-case performance and may underestimate a model's true reasoning potential, which can surface given more sampling trials. The pass@k metric, which counts a problem as solved if at least one of k samples is correct, better measures the reasoning boundary, i.e., the proportion of problems a model can solve given enough attempts. The paper uses an unbiased estimator of pass@k to keep the evaluation robust.
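For reference, here is a minimal sketch of the standard unbiased pass@k estimator (the combinatorial form introduced with the HumanEval evaluation, which is presumably the estimator the paper refers to); the per-problem tallies in the example are illustrative, not numbers from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations of which c are correct, is correct.
    Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem tallies (n generations, c correct) for a tiny benchmark;
# averaging the per-problem estimates and sweeping k traces out a pass@k curve.
results = [(256, 3), (256, 0), (256, 120)]
for k in (1, 8, 64, 256):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"pass@{k}: {score:.3f}")
```

Curves of this kind, computed for a base model and its RLVR-trained counterpart, underlie the comparisons below.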
The paper conducts extensive experiments comparing base models with their RLVR-trained counterparts. Key findings are presented across different domains:
- Mathematics: Experiments on benchmarks like GSM8K, MATH500, Minerva, Olympiad, AIME24, and AMC23 using models like Qwen-2.5 and LLaMA-3.1 show a consistent trend. While RL-trained models outperform base models at small k (like pass@1), base models catch up and eventually surpass RL-trained models at large k values (tens or hundreds). This suggests that base models have a broader coverage of solvable problems than their RL-enhanced versions, even if they are less efficient at sampling the correct answer initially. Manual analysis of Chain-of-Thought (CoT) confirms that successful solutions at large k in base models often result from valid reasoning paths, not just lucky guesses.
- Code Generation: On LiveCodeBench, HumanEval+, and MBPP+, with Code-R1 trained from Qwen-2.5-7B-Instruct, similar trends are observed: RLVR improves pass@1, but the original instruction-tuned model shows a higher pass@k at large k, indicating a larger potential reasoning scope.
- Visual Reasoning: With EasyR1 used to train Qwen-2.5-VL-7B on visual math problems from filtered MathVista and MathVision, the pattern holds: RLVR boosts initial performance but reduces the coverage of solvable problems relative to the base visual LLM.
Deep Analysis and Implications:
To understand these findings, the paper performs deeper analysis:
- Reasoning Patterns in Base Models: Comparing the sets of solvable problems at large k, the authors find that the problems solvable by the RLVR model are often a subset of those solvable by the base model. Perplexity analysis shows that reasoning paths generated by RLVR models have high likelihood under the base model's distribution (see the code sketch after this list). This indicates that RLVR primarily biases the model towards sampling correct paths that already exist within the base model's capabilities, rather than creating new reasoning patterns.
- Sampling Efficiency vs. Reasoning Boundary: RLVR improves sampling efficiency (higher pass@1) by increasing the probability of desirable outputs, but it shrinks the reasoning boundary, i.e., the coverage of solvable problems at large k. This trade-off is attributed to RL's tendency to reduce output entropy, which limits exploration (an entropy measurement is included in the code sketch after this list).
- Comparison with Distillation: In contrast to RLVR, distillation from a stronger teacher model (like DeepSeek-R1 distilled into Qwen-2.5-Math-7B) is shown to genuinely introduce new knowledge and expand the reasoning boundary beyond that of the base model.
- Effect of RL Algorithms and Training Steps: Different RL algorithms show only minor variations in improving sampling efficiency, and none significantly closes the Sampling Efficiency Gap, i.e., the difference between the RL model's pass@1 and the base model's pass@k at large k (formalized just below). Moreover, longer RL training, while increasing pass@1 on the training data, can decrease pass@k at large k, suggesting overfitting and a further reduction in exploration.
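Stated compactly, with notation of my own following the text above (the paper fixes some large k for the base model's side of the comparison):

```latex
\Delta_{\mathrm{SE}}
  \;=\;
  \underbrace{\text{pass@}k\big(\pi_{\text{base}}\big)}_{\text{reasoning boundary},\ k \gg 1}
  \;-\;
  \underbrace{\text{pass@}1\big(\pi_{\text{RL}}\big)}_{\text{sampling efficiency}}
```

A persistently positive gap across RL algorithms is what the paper reads as evidence that RLVR mainly improves how efficiently existing solutions are sampled rather than expanding what can be solved.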
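As an illustration of the measurements referenced in the first two bullets above (likelihood of RL-generated reasoning traces under the base model, and output entropy), here is a minimal sketch using Hugging Face transformers; the checkpoint name and the exact scoring protocol are assumptions for illustration, not the paper's precise setup:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-7B"  # hypothetical choice of base checkpoint
tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16).eval()

@torch.no_grad()
def perplexity_under_base(prompt: str, response: str) -> float:
    """Perplexity of `response` (e.g., a reasoning trace sampled from the RLVR
    model), conditioned on `prompt`, measured under the base model."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + response, return_tensors="pt").input_ids
    labels = ids.clone()
    labels[:, :prompt_len] = -100        # score only the response tokens
    nll = base(ids, labels=labels).loss  # mean negative log-likelihood per token
    return torch.exp(nll).item()

@torch.no_grad()
def mean_token_entropy(model, input_ids: torch.Tensor) -> float:
    """Average next-token predictive entropy (in nats), a rough proxy for how
    concentrated a policy's output distribution has become after RL training."""
    log_p = F.log_softmax(model(input_ids).logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean().item()
```

Low perplexity of RL-model traces under the base model, together with lower entropy for the RL model than for the base model, would match the picture the paper paints: sharpened sampling within an unchanged distribution of reasoning paths.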
Practical Implementation Considerations:
- Evaluation: The paper highlights the importance of using metrics like pass@k with sufficiently large k to truly assess the potential reasoning capacity rather than just average performance. For math tasks, validating CoT correctness is crucial to avoid attributing lucky guesses to reasoning ability.
- Training Objectives: Current RLVR methods, primarily policy-gradient methods that maximize a verifiable reward (a generic form is sketched after this list), appear effective at increasing the probability of already high-likelihood correct responses within the base model's distribution. They do not appear to incentivize the discovery of entirely new reasoning strategies.
- Limitations of Current RLVR: The vast action space of LLMs and the strong prior from pretraining are discussed as potential reasons why RL struggles to explore effectively beyond the base model's distribution. Samples significantly deviating from the prior are likely to be invalid, leading to negative rewards and discouraging exploration of novel paths.
- Alternative Approaches: Distillation from stronger models appears to be a more effective strategy for introducing genuinely new reasoning capabilities and expanding the model's boundary compared to pure RLVR on a base model. Future work may need to explore new RL or training paradigms that can facilitate exploration beyond the constraints of the pretrained prior.
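For concreteness, the objective family referred to under "Training Objectives" can be written as a plain policy gradient on a verifiable reward; the notation is mine and omits algorithm-specific details such as KL penalties, clipping, or group-normalized advantages:

```latex
J(\theta)
  \;=\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \big[\, R(x, y) \,\big],
\qquad
\nabla_\theta J(\theta)
  \;=\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \big[\, R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \,\big]
```

Here R(x, y) is the verifiable reward (e.g., an answer checker for math or unit tests for code). Because the gradient is estimated only from responses the policy itself samples, learning stays anchored to the pretrained prior, which is consistent with the exploration limits described in the bullets above.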
In conclusion, the paper provides strong evidence that current RLVR methods for LLMs primarily act as sampling-efficiency boosters for capabilities already present in the base model, rather than enabling the acquisition of novel reasoning skills or an expansion of the fundamental reasoning boundary. This challenges the view of RLVR as a pathway to continuously self-improving reasoning that transcends the base model, and it suggests that alternative training paradigms will be needed to push the frontier of LLM reasoning.