Causal factors behind performance gains in RL-based reasoning models

Determine whether the performance enhancements observed in reinforcement learning–trained Large Reasoning Models (LRMs) are primarily caused by (i) increased exposure to established mathematical benchmark data during training, (ii) the greater inference-time compute allocated to thinking tokens, or (iii) genuine reasoning capabilities developed through reinforcement learning. Resolving the question requires isolating and quantifying the contribution of each factor under controlled experimental conditions.

Background

The paper compares thinking models (e.g., Claude-3.7-Sonnet-Thinking, DeepSeek-R1) with their non-thinking counterparts under matched inference compute and observes comparable pass@k on MATH500 but widening gaps on AIME24 and AIME25. These observations make it difficult to determine whether the differences stem from task complexity, data contamination, or inference-compute allocation.
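For reference, pass@k in such comparisons is commonly computed with the unbiased estimator of Chen et al. (2021): draw n samples per problem, count the c correct ones, and estimate the probability that at least one of k samples succeeds. Below is a minimal sketch; the function name and example numbers are illustrative assumptions, not taken from the paper:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): 1 - C(n-c, k) / C(n, k).

    n: total samples generated per problem
    c: number of those samples that are correct
    k: evaluation budget
    """
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist, so a success is guaranteed
    prob_all_fail = 1.0
    for i in range(k):
        # Running-product form of C(n-c, k) / C(n, k); avoids large factorials
        prob_all_fail *= (n - c - i) / (n - i)
    return 1.0 - prob_all_fail

# e.g., 16 samples with 4 correct: pass@1 = 0.25, pass@8 ≈ 0.96
```

In the paper's matched-compute setup, the non-thinking model is granted additional samples (a larger k) so that its total generated tokens are comparable to the thinking model's longer traces.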

To address the confounds present in established math benchmarks, the authors employ controllable puzzle environments that allow systematic manipulation of complexity and inspection of intermediate reasoning traces. Nevertheless, identifying the causal drivers of the observed performance improvements in LRMs remains an explicit, unresolved question.
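As a concrete illustration of what such an environment affords, here is a minimal sketch of a move-sequence verifier for Tower of Hanoi, one of the paper's puzzle domains; the function name and move encoding are illustrative assumptions. Difficulty is governed by a single parameter (the number of disks), and correctness is checked by simulation rather than by matching memorized answers:

```python
def verify_hanoi(n_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Check whether a move sequence solves an n-disk Tower of Hanoi.

    moves: (src_peg, dst_peg) pairs with pegs indexed 0..2.
    Every intermediate state is available for reasoning-trace inspection.
    """
    # Peg 0 starts with all disks, largest (n_disks) at the bottom of the list
    pegs = [list(range(n_disks, 0, -1)), [], []]
    for src, dst in moves:
        if not pegs[src]:
            return False              # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False              # illegal: larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    # Solved when all disks have been transferred to peg 2 in order
    return pegs[2] == list(range(n_disks, 0, -1))
```

Because the puzzle's logical structure is fixed while only depth varies (the optimal solution requires 2^n - 1 moves), such environments disentangle complexity effects from data contamination.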

References

Currently, it is not clear whether the performance enhancements observed in recent RL-based reasoning (thinking) models are attributable to increased exposure to established mathematical benchmark data, to the significantly greater inference compute allocated to thinking tokens, or to reasoning capabilities developed by RL-based training.

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity (Shojaee et al., arXiv:2506.06941, 7 Jun 2025), Section 3 (Math and Puzzle Environments), opening paragraph