- The paper introduces Meta Reinforcement Fine-Tuning (MRT), a meta-RL method with a progress-based dense reward for optimizing how large language models use test-time compute.
- MRT significantly outperforms outcome-reward RL on math reasoning benchmarks, showing 2-3x relative performance gains and roughly 1.5x better token efficiency.
- The method trains models to make steady progress at test time, balancing exploration and exploitation in the token stream so they can solve difficult, previously unseen problems.
The paper "Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning" introduces Meta Reinforcement Fine-Tuning (MRT), a new fine-tuning paradigm for optimizing test-time compute in LLMs. The method addresses the challenges of efficiently using test-time compute and scaling to harder problems, which are not adequately handled by current methods like fine-tuning on search traces or outcome-reward Reinforcement Learning (RL).
The authors formalize optimizing test-time compute as a meta-RL problem. They segment the LLM's output stream into multiple episodes and use cumulative regret over output tokens to measure the efficacy of test-time compute. Cumulative regret quantifies the difference between the likelihoods of success of the LLM and an oracle comparator over the output token budget. The goal is to train an LLM that balances exploration and exploitation in the token stream, minimizing cumulative regret on every query, and effectively using test-time compute, regardless of the training token budget.
The key contributions of the paper include:
- Meta-RL Formulation: Formalizing the problem of optimizing test-time compute as a meta-RL problem.
- Cumulative Regret Metric: Introducing cumulative regret as a measure of the efficacy of test-time compute.
$\Delta_k^\mu(x; \pi) = \mathbb{E}_{z \sim \pi(\cdot \mid x)}\left[\sum_{j=0}^{k-1} J_r(x; \pi_j^*) - J_r\big(x; \mu(\cdot \mid x, z_{0:j})\big)\right]$
- $\Delta_k^\mu(x; \pi)$ is the cumulative regret.
- $x$ is the problem instance.
- $\pi$ is the LLM policy.
- $z$ is the output stream of tokens.
- $k$ is the number of episodes.
- $J_r$ is the expected 0/1 outcome reward.
- $\pi_j^*$ is the optimal comparator policy given a $j$-episode budget.
- $\mu$ is a meta-prover policy.
- $z_{0:j}$ are the first $j$ episodes.
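To make this concrete, here is a minimal Python sketch of a single-trace estimate of the cumulative regret. The callables `j_oracle` and `j_mu_from_prefix` are hypothetical stand-ins for estimates of $J_r(x; \pi_j^*)$ and $J_r(x; \mu(\cdot \mid x, z_{0:j}))$, e.g., obtained by sampling rollouts and averaging 0/1 outcome rewards; this is a sketch of the definition, not the authors' implementation.

```python
from typing import Callable, List

def cumulative_regret(
    episodes: List[str],                # z_0, ..., z_{k-1}: one segmented output stream for query x
    j_oracle: Callable[[int], float],   # estimate of J_r(x; pi_j*): oracle success with a j-episode budget
    j_mu_from_prefix: Callable[[List[str]], float],  # estimate of J_r(x; mu(.|x, z_{0:j})): success rate
                                        # of the meta-prover rolled out from the first j episodes
) -> float:
    """Estimate Delta_k^mu(x; pi) for a single sampled trace z ~ pi(.|x).

    In practice the outer expectation over z is approximated by averaging
    this quantity over several sampled traces per problem x.
    """
    k = len(episodes)
    regret = 0.0
    for j in range(k):
        prefix = episodes[:j]           # first j episodes, z_{0:j}
        regret += j_oracle(j) - j_mu_from_prefix(prefix)
    return regret
```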
- Meta Reinforcement Fine-Tuning (MRT): Developing a new class of fine-tuning methods for optimizing test-time compute by minimizing cumulative regret.
- Progress Reward: Introducing a dense reward bonus based on the "progress" made by each subsequent block in the output stream, quantified by the change in the likelihood of eventual success.
$r_\mathrm{prg}^\mu(z_j; c) := J_r\big(\mu(\cdot \mid z_j, c)\big) - J_r\big(\mu(\cdot \mid c)\big)$
- $r_\mathrm{prg}^\mu(z_j; c)$ is the progress reward.
- $c$ is the prior context.
- $z_j$ is the $j$-th episode.
- $J_r$ is the expected 0/1 outcome reward.
- $\mu$ is a meta-prover policy.
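A minimal sketch of this reward, assuming a hypothetical `success_rate_fn` that estimates $J_r$ by rolling out the meta-prover $\mu$ from a given context and averaging 0/1 outcome rewards:

```python
from typing import Callable

def progress_reward(
    context: str,                       # prior context c (prompt plus earlier episodes)
    episode: str,                       # candidate episode z_j
    success_rate_fn: Callable[[str], float],  # hypothetical estimator of J_r(mu(.|context)):
                                        # roll out mu from the context, average 0/1 rewards
) -> float:
    """r_prg^mu(z_j; c) = J_r(mu(.|z_j, c)) - J_r(mu(.|c)).

    A positive value means appending z_j increased the estimated
    probability that mu eventually reaches the correct answer.
    """
    return success_rate_fn(context + episode) - success_rate_fn(context)
```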
The paper analyzes state-of-the-art LLMs, specifically derivatives of DeepSeek-R1, and finds that they do not effectively optimize regret, often failing to improve their chances of discovering the right answer as episodes accumulate. The authors find that these models do not make steady "progress," which they argue is critical for discovering solutions to hard, unseen problems.
To address these issues, the authors propose a surrogate objective for minimizing regret. They define progress as the advantage of an episode under a meta-prover LLM $\mu$ and prescribe a dense reward bonus for RL training based on this progress, yielding the following objective, $\ell_\mathrm{MRT}$:
$\ell_\mathrm{MRT}(\pi; \pi_\mathrm{old}) := \ell_\mathrm{FT}(\pi) + \alpha \cdot \mathbb{E}_{x \sim \mathcal{D}_\mathrm{train}}\left[\sum_{j=0}^{k-1}\mathbb{E}_{c_{j-1} \sim \pi_\mathrm{old}(\cdot \mid x),\; z_j \sim \pi(\cdot \mid c_{j-1})}\big[r_\mathrm{prg}^\mu(z_j; c_{j-1})\big]\right]$
- $\ell_\mathrm{MRT}$ is the Meta Reinforcement Fine-Tuning objective.
- $\ell_\mathrm{FT}$ is the standard fine-tuning loss based on the expected final reward.
- $\alpha$ is a weighting coefficient.
- $x$ is a problem instance drawn from the training dataset $\mathcal{D}_\mathrm{train}$.
- $\pi$ is the LLM policy being trained.
- $\pi_\mathrm{old}$ is the previous LLM checkpoint.
- $k$ is the number of episodes.
- $c_{j-1}$ is the context consisting of prefixes produced by $\pi_\mathrm{old}$.
- $z_j$ is the $j$-th episode generated by $\pi$.
- $r_\mathrm{prg}^\mu$ is the progress reward.
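The sketch below illustrates how such an objective could be estimated for a single training problem. The callables, the `alpha` value, and the sampling details are illustrative assumptions rather than the authors' implementation; it simply adds the episode-wise progress bonus to the outcome-reward fine-tuning term.

```python
from typing import Callable, Sequence

def mrt_objective_estimate(
    ft_loss: float,                         # l_FT(pi): outcome-reward fine-tuning term for problem x
    prefixes: Sequence[str],                # contexts c_{j-1} sampled from pi_old, for j = 0, ..., k-1
    sample_episode: Callable[[str], str],   # draws z_j ~ pi(.|c_{j-1}) from the current policy
    progress_reward: Callable[[str, str], float],  # r_prg^mu(z_j; c_{j-1}), e.g., as sketched above
    alpha: float = 0.1,                     # weighting coefficient alpha (value here is illustrative)
) -> float:
    """Single-sample estimate of l_MRT(pi; pi_old) for one training problem x.

    Adds the dense progress bonus, summed over the k episodes, to the
    standard outcome-reward fine-tuning term with weight alpha.
    """
    bonus = 0.0
    for c in prefixes:
        z = sample_episode(c)               # episode generated by the policy being trained
        bonus += progress_reward(z, c)      # progress made by z_j given context c_{j-1}
    return ft_loss + alpha * bonus
```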
The framework is instantiated in two settings: an "open-ended parameterization," where episodes are logical thought blocks delimited by explicit markers in the model's reasoning trace, and a "backtracking search" parameterization, where the model alternates between complete solution attempts and backtracking. The authors develop STaR and RL variants of MRT to train LLMs to use test-time compute effectively and efficiently.
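As a rough illustration of the open-ended parameterization, the sketch below segments a reasoning trace into episodes at marker strings. The specific markers are assumptions chosen for illustration; the actual episode boundaries depend on the model's output format.

```python
import re
from typing import List

# Illustrative delimiters only; real episode boundaries in MRT depend on the
# model's output format (logical thought blocks or backtracking markers).
EPISODE_MARKERS = [r"\n\nWait,", r"\n\nAlternatively,", r"\n\nLet me try again"]

def split_into_episodes(output_stream: str) -> List[str]:
    """Split a reasoning trace into episodes z_0, ..., z_{k-1}.

    Each episode is a contiguous block of reasoning; the backtracking
    parameterization would instead alternate complete solution attempts
    with backtracking segments.
    """
    pattern = "(" + "|".join(EPISODE_MARKERS) + ")"
    parts = re.split(pattern, output_stream)
    episodes, current = [], parts[0]
    for piece in parts[1:]:
        if re.fullmatch(pattern, piece):    # a marker starts a new episode
            episodes.append(current)
            current = piece
        else:                               # continuation of the current episode
            current += piece
    return [e for e in episodes + [current] if e.strip()]
```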
Empirical evaluations demonstrate that MRT outperforms outcome-reward RL on math reasoning tasks, achieving state-of-the-art results at the 1.5B parameter scale across multiple benchmarks (AIME 2024, AIME 2025, AMC 2023, MinervaMATH, MATH500). The results show a 2-3x relative gain in performance and roughly a 1.5x gain in token efficiency compared to outcome-reward RL. In a linearized evaluation with sliding windows, MRT also improves token efficiency by 1.6-1.7x over self-correction approaches and outcome-reward training.
Ablation studies show that MRT reduces cumulative regret and makes steadier progress than outcome-reward RL. Tracking output length during training with MRT and with outcome-reward RL reveals that length oscillates around similar values, with MRT slightly reducing length. Additionally, the benefits of increasing the output token budget during training are explained by implicitly improving progress: a curriculum over the output token budget optimizes progress.
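For intuition, such a budget curriculum could be as simple as a schedule that grows the maximum output token budget over training. The linear schedule and the specific budgets below are illustrative assumptions, not the paper's exact recipe.

```python
def token_budget_curriculum(step: int, total_steps: int,
                            start_budget: int = 4096,
                            end_budget: int = 16384) -> int:
    """Linearly grow the per-query output token budget over training.

    The budgets and the linear schedule are assumptions; the point is that
    training under a growing budget implicitly rewards making progress
    within whatever budget is currently available.
    """
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return int(start_budget + frac * (end_budget - start_budget))
```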