- The paper introduces e3, a recipe that trains LLMs to chain asymmetric skills for effective in-context exploration and compute extrapolation.
- The methodology leverages negative gradients in RL to encourage longer, more diverse reasoning traces, improving performance at the training budget and extrapolation to larger test-time budgets.
- A coupled curriculum aligns training token budgets with task difficulty; with a Qwen3-1.7B model, the full recipe achieves state-of-the-art results among sub-2B models on AIME'25 and HMMT'25.
This paper introduces e3 (Explore Enables Extrapolation), a recipe designed to improve the ability of LLMs to extrapolate their performance when given more computational resources (test-time compute) beyond what they were trained on. The authors observe that current LLMs often fail to significantly improve on hard problems even when allowed to "think" for longer at inference time, a capability crucial for realizing the full potential of test-time scaling.
The core idea behind e3 is to train LLMs to perform in-context exploration, meaning the model learns to effectively use its test-time budget by trying multiple reasoning paths, chaining different operations (like generation, verification, refinement), or testing several hypotheses before settling on an answer.
The e3 recipe consists of three key ingredients:
- Chaining Asymmetric Skills:
- Concept: LLMs learn to explore effectively when they can chain skills where their competence is asymmetric. A key example is the Verification-Generation (VG) gap, where a model is better at verifying the correctness of an answer (or a step) than generating a correct one from scratch.
- Mechanism: When such asymmetries exist, Reinforcement Learning (RL) can learn to chain these skills (e.g., generate a hypothesis, then verify it, then refine it). This structured search is more beneficial than simply generating longer, unguided responses.
- Formalization: The paper introduces a didactic "pk model" in which an LLM makes up to k sequential guesses (each failing with probability p) and uses perfect self-verification to accept the first correct one. In this model, the asymmetry (perfect verification vs. imperfect generation) is what allows performance to keep improving as the budget, and hence k, grows (a numerical sketch of this model appears after this list).
- Evidence: Experiments on tasks like Countdown (Cdown), which has a natural VG gap, show that models chain more verification-generation steps and extrapolate better. In contrast, on tasks like n-digit multiplication (Mult) where the base model has limited verification ability, performance and extrapolation are poor. Fine-tuning the multiplication model to explicitly perform more verification (Mult-V) restores the benefits.
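To make the role of asymmetry concrete, here is a minimal numerical sketch of the didactic pk model described above. It assumes that with perfect self-verification the model stops at the first correct guess, so success within k guesses is 1 - p^k; the tokens-per-guess mapping is an illustrative assumption, not a value from the paper.

```python
# Minimal numerical sketch of the didactic "pk model": the policy makes up to k
# sequential guesses, each failing independently with probability p, and a
# perfect verifier accepts the first correct guess. The tokens-per-guess
# mapping below is an illustrative assumption, not a value from the paper.

def pk_success(p: float, k: int) -> float:
    """P(at least one of k independently sampled guesses is correct), given perfect verification."""
    return 1.0 - p ** k


def no_verification_success(p: float) -> float:
    """Without self-verification the model commits to a single guess."""
    return 1.0 - p


if __name__ == "__main__":
    p = 0.7                  # per-guess failure probability
    tokens_per_guess = 512   # illustrative assumption
    for budget in (1024, 2048, 4096, 8192):
        k = budget // tokens_per_guess
        print(f"budget={budget:5d}  k={k:2d}  "
              f"with verification={pk_success(p, k):.3f}  "
              f"without={no_verification_success(p):.3f}")
```

With verification, accuracy keeps climbing as the budget (and hence k) grows; without it, accuracy stays flat, which is why the asymmetry is what makes extrapolation possible in this toy model.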
- Leveraging Negative Gradients in RL:
- Concept: The "negative gradient" in RL, which arises from penalizing incorrect traces, plays a crucial role in promoting in-context exploration, especially when asymmetries are present.
- Mechanism: Negative gradients push probability mass away from short, incorrect reasoning paths; that mass is redistributed partly onto longer traces that chain more asymmetric skills (e.g., adding another verification step), which lengthens the search traces. SFT, which lacks negative gradients, mainly reinforces existing correct traces and does not incentivize such exploration or length extension (a minimal sketch contrasting the two update signals appears after this list).
- Evidence: Experiments comparing standard RL (GRPO) with a variant masking negative gradients (GRPOMask) on Cdown and DMath (a math reasoning dataset) show that negative gradients lead to:
- Increased response length and more chained asymmetries (e.g., verification attempts).
- Higher token entropy and more diverse responses, reducing repetitive outputs.
- Better performance on the training budget and significantly better extrapolation to larger budgets.
- pk Model Analysis: In the pk model, negative gradients push down the probability of early stopping p(stop), increasing the number of attempts k and thus improving performance. They also increase next-action entropy until the correct action a∗ is found.
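The contrast between the two update signals can be illustrated with the group-normalized advantages used in GRPO-style training. The sketch below is illustrative rather than the paper's implementation: rewards are 0/1 outcomes for a group of rollouts on one prompt, advantages are reward z-scores within the group, and the GRPOMask-style ablation simply zeroes negative advantages so that incorrect traces contribute no gradient.

```python
import numpy as np

# Sketch of the per-trace gradient weights in a GRPO-like update and in the
# "mask negative gradients" ablation (GRPOMask). Advantages are assumed to be
# group-normalized rewards; the policy network itself is omitted.

def group_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize rewards within a group of rollouts for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def policy_gradient_weights(rewards: np.ndarray, mask_negative: bool) -> np.ndarray:
    """Per-trace weights that multiply grad log pi(trace) in the update.

    With mask_negative=True, incorrect traces (negative advantage) contribute
    no gradient, so probability mass on short wrong traces is never pushed down.
    """
    adv = group_advantages(rewards)
    if mask_negative:
        adv = np.where(adv > 0, adv, 0.0)  # GRPOMask-style ablation
    return adv


if __name__ == "__main__":
    rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])  # 0/1 outcome rewards
    print("GRPO weights    :", policy_gradient_weights(rewards, mask_negative=False))
    print("GRPOMask weights:", policy_gradient_weights(rewards, mask_negative=True))
```

In the full update these weights multiply the gradient of each trace's log-probability; with masking, short incorrect traces are never pushed down, so no probability mass is freed to flow onto longer exploratory traces.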
- Coupled Curriculum for Structured Exploration:
- Concept: While negative gradients drive exploration, training RL models with very long token budgets can be unstable, and training on hard problems with short budgets can stifle exploration. To address this, e3 proposes a curriculum that couples task difficulty with the training token budget.
- Mechanism: The curriculum progresses from easier tasks with shorter budgets to harder tasks with longer budgets. The key insight for selecting the budget Btr,i for a dataset Di at stage i is to choose the smallest "RL optimization friendly" budget: one large enough that the model can complete most responses within it, while any further gain from chaining more asymmetries with extra compute (e.g., up to 2⋅Btr,i) stays within a small factor. This is formalized by an optimization rule (a budget-selection sketch based on it appears after this list):
$B_{\mathrm{tr}}^{\star}(D_i) = \arg\min_{B \ge B_0} B$ s.t. $J(\pi_i; D_i, 2 \cdot B) \le \kappa \cdot J(\pi_i; D_i, B)$, where $J$ is performance, $\pi_i$ is the model at stage $i$, and $\kappa > 1$ (e.g., 1.2) is a small factor.
- Evidence:
- Training on easy DMath problems at a very short budget (4k tokens) achieved good performance at 4k but poor extrapolation, as it penalized longer exploratory traces.
- Training at a very long budget (16k tokens from the start) suffered from optimization issues. An intermediate budget (8k) provided the best balance for extrapolation on easy problems.
- Training only on easy problems led to better OOD extrapolation on AIME'25 than training on a mixed (easy+medium+hard) dataset, suggesting that forcing exploration on hard problems with insufficient budget is detrimental.
- The coupled curriculum (e.g., easy DMath at 8k tokens, then medium/hard DMath at 16k tokens) outperformed single-stage training or curricula that only varied data or budget independently.
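Below is a minimal sketch of the stage-budget rule above, assuming access to an evaluate callback that measures J(πi; Di, B) (e.g., average reward of the current policy with decoding truncated at B tokens). The candidate budgets and the saturating toy curve in the example are illustrative; κ = 1.2 follows the value quoted above.

```python
from typing import Callable, Sequence

# Sketch of the stage-budget rule: pick the smallest candidate budget B (>= B0)
# at which doubling the budget improves performance by at most a factor kappa.
# `evaluate` is a placeholder for measuring J(pi_i; D_i, B) on the stage's data.

def select_stage_budget(
    evaluate: Callable[[int], float],   # B -> J(pi_i; D_i, B)
    candidate_budgets: Sequence[int],   # candidates, all >= B0
    kappa: float = 1.2,
) -> int:
    for budget in sorted(candidate_budgets):
        if evaluate(2 * budget) <= kappa * evaluate(budget):
            return budget
    return max(candidate_budgets)       # fall back to the largest candidate


if __name__ == "__main__":
    # Made-up saturating performance curve, for illustration only.
    fake_J = lambda b: 0.6 * (1.0 - 2.0 ** (-b / 4096))
    print(select_stage_budget(fake_J, [4096, 8192, 16384]))  # -> 8192
```

On this toy curve the rule returns 8192, mirroring the intermediate budget that worked best for the easy stage in the experiments above.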
Key Results and Contributions:
- State-of-the-art 1.7B Model: The e3 recipe, applied to fine-tune a Qwen3-1.7B model on the DeepScaleR dataset (up to a 16k token training budget), achieved the best-known performance for models under 2B parameters on AIME'25 and HMMT'25 benchmarks.
- Strong Extrapolation: The e3-1.7B model demonstrated consistent performance improvement when extrapolating test-time compute up to 32k tokens (2x its maximum training budget), outperforming even some larger 7B/32B models in this extrapolation regime.
- Improved Pass@k: e3 improves not only pass@1 but also pass@k (e.g., pass@32) over the base model, indicating that it discovers new solutions through exploration rather than merely "sharpening" the distribution around solutions the base model already finds (a standard pass@k estimator is sketched after this list).
- Principled Recipe: The paper provides a clear, three-part recipe and analyzes each component's role in enabling in-context exploration and test-time compute extrapolation.
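For reference, a common way to estimate pass@k (from Chen et al., 2021) is the unbiased estimator below: sample n completions per problem, count the c correct ones, and compute the probability that a random size-k subset contains at least one correct completion. The paper's exact evaluation harness may differ; this is only meant to pin down what "improved pass@32" measures.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # e.g., 5 correct out of 64 samples for one problem
    print(f"pass@32 = {pass_at_k(n=64, c=5, k=32):.3f}")
```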
Implementation Considerations:
- Base Model Choice: The base model should ideally possess or be amenable to developing skill asymmetries (like a VG gap).
- RL Algorithm: The paper uses GRPO, but the principles regarding negative gradients apply to other policy-gradient methods as well. Attention to hyperparameters such as the PPO clipping thresholds used for off-policy updates is important (a generic clipped-surrogate sketch appears after this list).
- Curriculum Design: The coupled curriculum requires careful staging of data difficulty and training budgets. The provided formula offers a heuristic for budget selection at each stage.
- Computational Cost: Training with RL, especially at long sequence lengths and with curricula, can be computationally intensive, requiring multi-GPU setups. The paper notes using H100 GPUs and TPUs.
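Because the clipping thresholds are called out above as an implementation detail, here is a generic sketch of where they enter a PPO/GRPO-style clipped surrogate on off-policy (stale) samples. The token-level log-probabilities and advantages are placeholders, and the eps_low/eps_high values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Generic clipped-surrogate sketch: the importance ratio between the current and
# the behavior policy is clipped to [1 - eps_low, 1 + eps_high] before weighting
# advantages, which bounds how far a single off-policy update can move the policy.

def clipped_surrogate(logp_new: np.ndarray,
                      logp_old: np.ndarray,
                      advantages: np.ndarray,
                      eps_low: float = 0.2,
                      eps_high: float = 0.2) -> float:
    ratio = np.exp(logp_new - logp_old)                    # per-token importance ratio
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic (min) objective; a larger eps_high lets positive-advantage
    # tokens move further before clipping kicks in.
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logp_old = rng.normal(-1.5, 0.3, size=128)
    logp_new = logp_old + rng.normal(0.0, 0.1, size=128)
    adv = rng.normal(0.0, 1.0, size=128)
    print(clipped_surrogate(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.28))
```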
Practical Implications:
- Developing More Capable LLMs: e3 offers a path to train LLMs that can more effectively use increased inference-time compute to solve harder problems.
- Beyond Simple Scaling: It highlights that simply increasing training data or model size might not be enough; specific training strategies are needed to unlock extrapolative reasoning.
- Understanding RL in LLMs: The work provides insights into the mechanisms of RL fine-tuning, particularly the role of negative gradients and how they interact with model characteristics (asymmetries) and training setup (curriculum).
The paper concludes by discussing broader implications, such as the distinction between simple "sharpening" of model distributions versus genuine in-context exploration, the connection of curricula to dense rewards, and the potential for discovering and imbuing new types of asymmetries in LLMs.