
Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning (2503.07572v1)

Published 10 Mar 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Training models to effectively use test-time compute is crucial for improving the reasoning performance of LLMs. Current methods mostly do so via fine-tuning on search traces or running RL with 0/1 outcome reward, but do these approaches efficiently utilize test-time compute? Would these approaches continue to scale as the budget improves? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta-reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute. This perspective enables us to view the long output stream from the LLM as consisting of several episodes run at test time and leads us to use a notion of cumulative regret over output tokens as a way to measure the efficacy of test-time compute. Akin to how RL algorithms can best tradeoff exploration and exploitation over training, minimizing cumulative regret would also provide the best balance between exploration and exploitation in the token stream. While we show that state-of-the-art models do not minimize regret, one can do so by maximizing a dense reward bonus in conjunction with the outcome 0/1 reward RL. This bonus is the ''progress'' made by each subsequent block in the output stream, quantified by the change in the likelihood of eventual success. Using these insights, we develop Meta Reinforcement Fine-Tuning, or MRT, a new class of fine-tuning methods for optimizing test-time compute. MRT leads to a 2-3x relative gain in performance and roughly a 1.5x gain in token efficiency for math reasoning compared to outcome-reward RL.

Summary

  • The paper introduces Meta Reinforcement Fine-Tuning (MRT), a novel meta-RL method with a progress-based reward to optimize large language model compute during testing.
  • MRT significantly outperforms outcome-reward RL on math reasoning benchmarks, showing up to 3x relative performance gains and roughly 1.5x token efficiency improvements.
  • The method trains models to make steady progress during test-time, effectively balancing exploration and exploitation to solve difficult, previously unseen problems.

The paper "Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning" introduces Meta Reinforcement Fine-Tuning (MRT), a new fine-tuning paradigm for optimizing test-time compute in LLMs. The method addresses the challenges of efficiently using test-time compute and scaling to harder problems, which are not adequately handled by current methods like fine-tuning on search traces or outcome-reward Reinforcement Learning (RL).

The authors formalize optimizing test-time compute as a meta-RL problem. They segment the LLM's output stream into multiple episodes and use cumulative regret over output tokens to measure the efficacy of test-time compute. Cumulative regret quantifies the difference between the likelihoods of success of the LLM and an oracle comparator over the output token budget. The goal is to train an LLM that balances exploration and exploitation in the token stream, minimizing cumulative regret on every query, and effectively using test-time compute, regardless of the training token budget.

The key contributions of the paper include:

  • Meta-RL Formulation: Formalizing the problem of optimizing test-time compute as a meta-RL problem.
  • Cumulative Regret Metric: Introducing cumulative regret as a measure of the efficacy of test-time compute.

    $\Delta^\mu_k(x; \pi) = \mathbb{E}_{z \sim \pi(\cdot|x)} \left[\sum_{j=0}^{k-1} J_r(x; \pi^*_j) - J_r(x; \mu(\cdot \mid x, z_{0:j})) \right]$

    • $\Delta^\mu_k(x; \pi)$ is the cumulative regret.
    • $x$ is the problem instance.
    • $\pi$ is the LLM policy.
    • $z$ is the output stream of tokens.
    • $k$ is the number of episodes.
    • $J_r$ is the expected 0/1 outcome reward.
    • $\pi^*_j$ is the optimal comparator policy given a $j$-episode budget.
    • $\mu$ is a meta-prover policy.
    • $z_{0:j}$ are the first $j$ episodes.
  • Meta Reinforcement Fine-Tuning (MRT): Developing a new class of fine-tuning methods for optimizing test-time compute by minimizing cumulative regret.
  • Progress Reward: Introducing a dense reward bonus based on the "progress" made by each subsequent block in the output stream, quantified by the change in the likelihood of eventual success (both this quantity and the cumulative regret are illustrated in the sketch after this list).

    $r_\mathrm{prg}^\mu(z_j; \mathbf{c}) := J_r(\mu(\cdot \mid z_j, \mathbf{c})) - J_r(\mu(\cdot \mid \mathbf{c}))$

    • $r_\mathrm{prg}^\mu(z_j; \mathbf{c})$ is the progress reward.
    • $\mathbf{c}$ is the prior context.
    • $z_j$ is the $j$-th episode.
    • $J_r$ is the expected 0/1 outcome reward.
    • $\mu$ is a meta-prover policy.
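
To make these two definitions concrete, here is a minimal Python sketch (not from the paper) that computes per-episode progress and cumulative regret from estimated success probabilities. The probability arrays are hypothetical stand-ins for the $J_r$ values, which in practice would be estimated by rolling out the meta-prover policy $\mu$ and the oracle comparator.

```python
import numpy as np

def progress_rewards(success_prob):
    """Per-episode progress bonus: the change in the estimated likelihood of
    eventual success after conditioning on one more episode.

    success_prob[j] plays the role of J_r(mu(. | x, z_{0:j})): the meta-prover's
    estimated chance of eventually solving x given the first j episodes
    (success_prob[0] is the estimate from the question alone).
    """
    p = np.asarray(success_prob, dtype=float)
    return p[1:] - p[:-1]  # r_prg for episodes z_1, ..., z_k

def cumulative_regret(policy_prob, oracle_prob):
    """Cumulative regret over a k-episode budget: the summed gap between the
    oracle comparator's success likelihood and the policy's, per the
    definition above.

    oracle_prob[j] plays the role of J_r(x; pi*_j), the best achievable
    success likelihood under a j-episode budget.
    """
    p = np.asarray(policy_prob, dtype=float)
    q = np.asarray(oracle_prob, dtype=float)
    return float(np.sum(q - p))

# Toy (made-up) numbers: the policy's success estimate rises slowly while the
# oracle's rises quickly, so regret accumulates across episodes.
policy = [0.10, 0.15, 0.25, 0.40]
oracle = [0.30, 0.60, 0.85, 0.95]
print(progress_rewards(policy))           # per-episode progress bonuses
print(cumulative_regret(policy, oracle))  # total regret over the budget
```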

The paper analyzes state-of-the-art LLMs, specifically derivatives of DeepSeek-R1, and finds that they do not effectively minimize regret: they often fail to improve their chances of discovering the right answer as more episodes are used. The authors argue that these models do not make steady "progress", which is critical for discovering solutions to hard, unseen problems.

To address these issues, the authors propose a surrogate objective for minimizing regret. They define progress as the advantage of an episode under a meta-prover LLM $\mu$ and prescribe a dense reward bonus for RL training based on this progress. MRT optimizes the following objective, $\ell_\mathrm{MRT}$ (a minimal reward-shaping sketch follows the symbol definitions below):

$\ell_\mathrm{MRT}(\pi; \pi_\mathrm{old}) := \ell_\mathrm{FT}(\pi) + \alpha \cdot \mathbb{E}_{x \sim \mathcal{D}_\mathrm{train}} \left[\sum_{j=0}^{k-1}\mathbb{E}_{c_{j-1} \sim \pi_\mathrm{old}(\cdot|x),~ z_j \sim \pi(\cdot|c_{j-1})}\left[ r_\mathrm{prg}^\mu(z_j; c_{j-1})\right] \right]$

  • $\ell_\mathrm{MRT}$ is the Meta Reinforcement Fine-Tuning objective.
  • $\ell_\mathrm{FT}$ is the standard fine-tuning loss based on the expected final reward.
  • $\alpha$ is a weighting coefficient.
  • $x$ is a problem instance drawn from the training dataset $\mathcal{D}_\mathrm{train}$.
  • $\pi$ is the LLM policy being trained.
  • $\pi_\mathrm{old}$ is the previous LLM checkpoint.
  • $k$ is the number of episodes.
  • $c_{j-1}$ is the context consisting of prefixes produced by $\pi_\mathrm{old}$.
  • $z_j$ is the $j$-th episode generated by $\pi$.
  • $r_\mathrm{prg}^\mu$ is the progress reward.
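
As a rough illustration (an interpretation of the objective above, not the authors' implementation), the sketch below shows how the dense progress bonus could be combined with the sparse 0/1 outcome reward into per-episode training rewards. The function name, the value of $\alpha$, and the probability estimates are all hypothetical.

```python
import numpy as np

def mrt_shaped_rewards(outcome_reward, success_prob_by_prefix, alpha=0.2):
    """Minimal sketch of MRT-style reward shaping.

    outcome_reward: the standard 0/1 correctness reward for the final answer,
        i.e. the signal behind the ell_FT term.
    success_prob_by_prefix: length k+1 sequence; entry j estimates the
        meta-prover's likelihood of eventual success given the first j
        episodes of the rollout (entry 0 conditions on the question alone).
        In practice these would be estimated by sampling continuations from
        the previous checkpoint pi_old.
    alpha: weight on the dense progress bonus.

    Returns one shaped reward per episode: alpha times that episode's
    progress, plus the sparse outcome reward credited to the final episode.
    """
    p = np.asarray(success_prob_by_prefix, dtype=float)
    progress = p[1:] - p[:-1]             # r_prg for episodes 1..k
    rewards = alpha * progress
    rewards[-1] += float(outcome_reward)  # 0/1 outcome reward at the end
    return rewards

# Hypothetical rollout with three episodes and a correct final answer.
print(mrt_shaped_rewards(1.0, [0.2, 0.35, 0.55, 0.9], alpha=0.2))
```

These shaped rewards would then feed an ordinary policy-gradient or STaR-style update in place of the outcome reward alone.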

The framework is instantiated in two settings: an "open-ended parameterization", where episodes are logical thought blocks delimited by explicit markers, and a "backtracking search" parameterization, where the model alternates between complete solution attempts and backtracking. The authors develop STaR and RL variants of MRT to train LLMs to use test-time compute effectively and efficiently.
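
For intuition about the open-ended parameterization, the sketch below splits a reasoning trace into episodes at an explicit delimiter. The `"\n\nWait,"` marker and the example trace are hypothetical placeholders; the paper's actual episode delimiter may differ.

```python
def split_into_episodes(output_stream: str, marker: str = "\n\nWait,") -> list[str]:
    """Split a long reasoning trace into episodes at an explicit delimiter.

    The default marker is a hypothetical stand-in for whatever token the
    open-ended parameterization uses to separate logical thought blocks.
    """
    parts = output_stream.split(marker)
    # Re-attach the marker to every episode after the first so that
    # concatenating the episodes reproduces the original stream.
    return [parts[0]] + [marker + p for p in parts[1:]]

trace = (
    "Try factoring the quadratic directly..."
    "\n\nWait, the discriminant is negative, so there are no real roots..."
    "\n\nWait, the problem asks for complex roots, so apply the formula..."
)
for j, episode in enumerate(split_into_episodes(trace)):
    print(j, episode.strip()[:50])
```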

Empirical evaluations demonstrate that MRT outperforms outcome-reward RL in math reasoning tasks, achieving state-of-the-art results at the 1.5B parameter scale across multiple benchmarks (AIME 2024, AIME 2025, AMC 2023, MinervaMATH, MATH500). The results show a 2-3x relative gain in performance and roughly a 1.5x gain in token efficiency compared to outcome-reward RL. In a linearized evaluation with sliding windows, MRT also improves token efficiency by 1.6-1.7x over self-correction approaches and outcome-reward training.

Ablation studies show that MRT reduces cumulative regret and makes steadier progress than outcome-reward RL. Tracking output length during training with MRT and outcome-reward RL reveals that length oscillates around similar values, with MRT slightly reducing it. Additionally, the benefit of increasing the output token budget during training is explained by implicitly improving progress: a curriculum over the output token budget optimizes progress.
