Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling (2502.06703v1)

Published 10 Feb 2025 in cs.CL

Abstract: Test-Time Scaling (TTS) is an important method for improving the performance of LLMs by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller LLMs outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and challenging AIME24 tasks, we have the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, while with higher inference efficiency. These findings show the significance of adapting TTS strategies to the specific characteristics of each task and model and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.

This paper, "Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling," investigates how to effectively use Test-Time Scaling (TTS) to improve the performance of LLMs during inference. TTS involves allocating additional computational resources during the inference (testing) phase, as opposed to the training phase, to boost performance. The authors focus on external TTS, where a fixed LLM's reasoning is improved via sampling or search, rather than internal TTS which involves training the LLM to think more deliberately (e.g., longer Chain-of-Thought). A core challenge is determining the optimal amount of computation to allocate for a given problem.

Here's a breakdown of the key components and findings:

1. Problem and Motivation:

  • Existing research on TTS doesn't systematically analyze how different factors influence its effectiveness. These factors include the choice of the "policy model" (the LLM generating solutions), the "Process Reward Model" (PRM, also called a "verifier", which evaluates the quality of intermediate reasoning steps), and the difficulty of the problem.
  • This lack of analysis limits the understanding and practical application of TTS.
  • The paper aims to answer:
    • What's the optimal way to scale test-time computation across different policy models, PRMs, and problem difficulties?
    • Can smaller LLMs, boosted by TTS, outperform much larger LLMs on complex tasks?

2. Problem Formulation (MDP):

  • The reasoning process is formalized as a Markov Decision Process (MDP).
  • A prompt (x) is the initial state (s<sub>1</sub>).
  • The policy model (π<sub>θ</sub>) generates an action (a<sub>t</sub>), which is a token or sequence of tokens.
  • The state transitions by appending the action to the current state: s<sub>t+1</sub> = [s<sub>t</sub>; a<sub>t</sub>].
  • A reward (r<sub>t</sub>) is given based on the state and action.
  • The process continues until termination (maximum steps or an <EOS> token); a minimal sketch of this loop is given below.
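
To make the formulation concrete, here is a minimal Python sketch of this loop; `policy_step` and `prm_score` are hypothetical placeholders standing in for the policy model and the PRM, not the paper's implementation.

```python
# Minimal sketch of the reasoning process as an MDP. The state is the prompt
# plus all reasoning steps generated so far; each action appends one step.
# `policy_step` and `prm_score` are hypothetical placeholders for the policy
# model and the process reward model.

def rollout(prompt, policy_step, prm_score, max_steps=40, eos="<EOS>"):
    state = prompt                         # s_1 = x
    trajectory = []                        # list of (action, reward) pairs
    for _ in range(max_steps):
        action = policy_step(state)        # a_t ~ pi_theta(. | s_t)
        reward = prm_score(state, action)  # r_t = R(s_t, a_t)
        trajectory.append((action, reward))
        state = state + action             # s_{t+1} = [s_t; a_t]
        if eos in action:                  # stop on <EOS> or after max_steps
            break
    return state, trajectory
```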

3. Test-Time Scaling Methods:

The paper considers three main TTS approaches (hedged sketches of the first two follow the list):

  • Best-of-N (BoN): The policy model generates N responses. Scoring (using a PRM) and voting methods are used to select the final answer.
  • Beam Search: Starts by generating N steps. The verifier (PRM) selects the top N/M steps. For each selected step, the model samples M more steps. This continues until a maximum depth or <EOS> is reached.
  • Diverse Verifier Tree Search (DVTS): An extension of beam search designed to increase diversity. The search is divided into N/M independent subtrees, each explored using beam search.
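
Below are hedged Python sketches of Best-of-N and a simplified beam search, assuming placeholder functions `sample_response`, `sample_step`, `score_response`, and `score_prefix` for the policy model and PRM; DVTS would run the beam-search routine independently on N/M subtrees.

```python
# Hypothetical placeholders:
#   sample_response(prompt)       -> one complete solution string
#   sample_step(prefix)           -> one additional reasoning step
#   score_response / score_prefix -> PRM-based scores (higher is better)

def best_of_n(prompt, sample_response, score_response, n=64):
    """Best-of-N sketch: sample n full solutions, score each with the PRM,
    and return the highest-scoring one (voting variants are covered later)."""
    candidates = [sample_response(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score_response(prompt, c))

def beam_search(prompt, sample_step, score_prefix, n=8, m=4, max_depth=40, eos="<EOS>"):
    """Simplified beam-search sketch: keep the n/m best partial solutions,
    expand each unfinished one with m sampled next steps, and repeat."""
    beams = [prompt + sample_step(prompt) for _ in range(n)]  # n initial steps
    for _ in range(max_depth):
        beams.sort(key=score_prefix, reverse=True)
        keep = beams[: n // m]                 # verifier keeps the top n/m
        if all(eos in b for b in keep):
            break
        new_beams = []
        for b in keep:
            if eos in b:
                new_beams.append(b)            # finished beams carry over
            else:
                new_beams.extend(b + sample_step(b) for _ in range(m))
        beams = new_beams
    return max(beams, key=score_prefix)
```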

4. Rethinking Compute-Optimal TTS:

  • Reward-Awareness: The authors argue that the compute-optimal TTS strategy should be reward-aware. Previous work often uses a single PRM. However, if the PRM is trained on data from a different policy model than the one used for TTS (an "offline PRM"), it can produce inaccurate rewards due to out-of-distribution (OOD) issues. Since training a PRM for every policy model is expensive, the paper investigates the more general setting where the PRM and policy model might differ. The core idea is that the reward function (from the PRM) significantly influences the selection and search process, so it must be considered when determining the optimal TTS strategy (Equation 4; a paraphrased form is sketched after this list).
  • Problem Difficulty: The authors propose a new way to categorize problem difficulty. Instead of using quantiles of Pass@1 accuracy (as in prior work), they use absolute thresholds of Pass@1 accuracy: easy (50-100%), medium (10-50%), and hard (0-10%). This is because different LLMs have different inherent capabilities, making quantiles less meaningful.
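
The following is a paraphrase of the reward-aware objective (Equation 4), written under the assumption that it follows the standard compute-optimal formulation with the PRM's reward function folded into the output distribution; the paper's exact notation may differ.

```latex
% Paraphrase of the reward-aware compute-optimal TTS objective (Equation 4).
% Notation is assumed, not quoted: x is the prompt, y^*(x) its correct answer,
% N the compute budget, \theta the TTS hyperparameters (method, width, etc.),
% and \mathcal{R} the reward function supplied by the PRM, which now enters
% the output distribution Target.
\theta^{*}_{x,\,y^{*}(x)}(N, \mathcal{R})
  = \arg\max_{\theta}\;
    \mathbb{E}_{y \sim \mathrm{Target}(\theta, N, x, \mathcal{R})}
    \bigl[\, \mathbb{1}\{\, y = y^{*}(x) \,\} \,\bigr]
```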

5. Experimental Setup:

  • Datasets: MATH-500 (a subset of the MATH dataset) and AIME24 (a more challenging dataset of math competition problems).
  • Policy Models: LLMs from the Llama 3 and Qwen2.5 families, with varying sizes (0.5B to 72B parameters). Instruct versions are used.
  • Process Reward Models (PRMs): A range of open-source PRMs are evaluated, including Math-Shepherd, RLHFlow series (Mistral and Deepseek based), Skywork series, and Qwen2.5-Math series. These PRMs vary in size (1.5B to 72B) and the data/models they were trained on.
  • Scoring and Voting Methods (illustrative implementations are sketched after this list):
    • Scoring: PRM-Min (minimum reward across all steps), PRM-Last (reward of the last step), PRM-Avg (average reward).
    • Voting: Majority Vote, PRM-Max (select answer with highest score), PRM-Vote (accumulate scores for identical answers, then select the highest).
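
A small Python sketch of these scoring and voting rules; per-step rewards and extracted final answers are assumed to be given, and the function names are illustrative rather than the paper's code.

```python
from collections import defaultdict

# Each candidate solution is assumed to come with the per-step rewards
# assigned by the PRM and an extracted final answer.

def prm_min(step_rewards):                      # PRM-Min
    return min(step_rewards)

def prm_last(step_rewards):                     # PRM-Last
    return step_rewards[-1]

def prm_avg(step_rewards):                      # PRM-Avg
    return sum(step_rewards) / len(step_rewards)

def majority_vote(answers):
    counts = defaultdict(int)
    for a in answers:
        counts[a] += 1
    return max(counts, key=counts.get)

def prm_max(answers, scores):
    # Pick the answer of the single highest-scoring candidate.
    best = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best]

def prm_vote(answers, scores):
    # Accumulate scores over identical answers, then pick the best total.
    totals = defaultdict(float)
    for a, s in zip(answers, scores):
        totals[a] += s
    return max(totals, key=totals.get)
```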

6. Key Findings and Experiments (Section 4):

  • Q1: TTS Improvement with Different Models and PRMs:
    • PRM Generalization is Difficult: PRMs don't generalize well across different policy models and tasks, especially on more complex tasks like AIME24. The performance of search-based methods is highly dependent on the PRM used.
    • Optimal TTS Method Depends on PRM: BoN often works better with Math-Shepherd and RLHFlow PRMs, while search-based methods are better with Skywork and Qwen2.5-Math PRMs. This highlights the importance of the reward-aware approach. The authors find a positive correlation between a PRM's process supervision ability (measured on ProcessBench) and TTS performance.
    • Optimal TTS Method Varies with Policy Model: For smaller policy models (<7B), search-based methods are better. For larger models, BoN is more effective. Larger models need less step-by-step guidance from a verifier.
  • Q2: TTS Improvement with Different Difficulty Levels:
    • Optimal Method Varies with Difficulty: For smaller models, BoN is better for easy problems, and beam search is better for harder problems. For medium-sized models (7B-32B), DVTS is good for easy/medium problems, and beam search for hard problems. For 72B models, BoN is best across all difficulties.
  • Q3: PRM Biases:
    • Length Bias: Some PRMs are biased towards the length of the steps, influenced by the statistics of their training data. For example, RLHFlow-PRM-Deepseek-8B tends to produce longer outputs than RLHFlow-PRM-Mistral-8B.
    • Voting Method Sensitivity: Some PRMs are more sensitive to the choice of voting method than others. Qwen2.5-Math PRMs are less sensitive due to their training data being processed with LLM-as-a-judge, which helps remove incorrect steps labeled as positive.

7. Key Findings and Experiments (Section 5):

  • Q4: Can Smaller Models Outperform Larger Models?
    • Yes, significantly. A Llama-3.2-3B model with compute-optimal TTS outperforms a Llama-3.1-405B model (135x larger) on both MATH-500 and AIME24. A 1B model can even beat the 405B model on MATH-500 with a larger compute budget (N=512).
    • Smaller models (0.5B, 1.5B, 3B) can surpass GPT-4o's performance.
    • Reasoning-enhanced small models (DeepSeek-R1-Distill-Qwen-1.5B and 7B) can outperform o1 and DeepSeek-R1.
    • FLOPS Comparison: Smaller policy models achieve better results with significantly fewer inference FLOPS, reducing total FLOPS by 100x-1000x (a back-of-the-envelope estimate is sketched after this list).
  • Q5: Comparison with CoT and Majority Voting:
    • Compute-optimal TTS is much more efficient than majority voting (up to 256x) and significantly improves performance over standard CoT (up to 154.6%).
    • The improvement from TTS is larger for models with weaker reasoning abilities.
  • Q6: Comparison with Long-CoT Methods:
    • TTS with Qwen2.5-7B-Instruct outperforms several long-CoT methods (rStar-Math, Eurus-2, SimpleRL, Satori) on both datasets.
    • However, TTS is less effective than distilling from strong reasoning models (like DeepSeek-R1-Distill-Qwen-7B), especially on the more complex AIME24.
    • TTS shows a larger improvement over long-CoT methods trained with direct RL or with SFT on MCTS-generated data than over models distilled from strong reasoning models.
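
As a back-of-the-envelope illustration of the FLOPS comparison above, the common approximations are roughly 6 x parameters x training tokens for pre-training and 2 x parameters x generated tokens for inference; all token counts and the sample budget below are illustrative assumptions, not the paper's measurements.

```python
# Rough FLOPs estimates with the standard approximations:
#   pre-training ~ 6 * params * training_tokens
#   inference    ~ 2 * params * generated_tokens (per sampled solution)
# Token counts and the sample budget are illustrative assumptions only.

def pretrain_flops(params, train_tokens):
    return 6 * params * train_tokens

def inference_flops(params, gen_tokens, num_samples=1):
    return 2 * params * gen_tokens * num_samples

TRAIN_TOKENS = 15e12   # placeholder corpus size, assumed equal for both models
GEN_TOKENS = 1024      # placeholder length of one generated solution

small_total = pretrain_flops(3e9, TRAIN_TOKENS) + inference_flops(3e9, GEN_TOKENS, num_samples=128)
large_total = pretrain_flops(405e9, TRAIN_TOKENS) + inference_flops(405e9, GEN_TOKENS)

print(f"3B policy + TTS (N=128): {small_total:.2e} total FLOPs")
print(f"405B single pass:        {large_total:.2e} total FLOPs")
print(f"ratio: {large_total / small_total:.0f}x")  # ~135x with these placeholders
```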

8. Conclusion and Discussion:

  • The compute-optimal TTS strategy is highly dependent on the policy model, PRM, and problem difficulty.
  • Smaller LLMs can outperform much larger LLMs (and even state-of-the-art models) with appropriate TTS.
  • A 7B PRM can effectively supervise a 72B policy model, suggesting the potential for "weak-to-strong" supervision.
  • Future work should focus on developing more adaptable and universal supervision mechanisms.

9. Appendix:

  • Provides prompt templates used for Llama 3 and Qwen2.5 models.
  • Includes full experimental results.
  • Presents detailed case studies illustrating common problems with PRMs, including over-criticism, error neglect, error localization bias, and scoring bias. These cases demonstrate issues even on in-distribution data.

In essence, the paper demonstrates that intelligently allocating computational resources during inference can be a remarkably powerful way to boost the performance of LLMs, even allowing relatively small models to compete with, and sometimes surpass, much larger and more expensive models. The key is to carefully consider the interplay between the policy model, the reward function (PRM), and the problem's difficulty.

Authors (8)
  1. Runze Liu
  2. Junqi Gao
  3. Jian Zhao
  4. Kaiyan Zhang
  5. Xiu Li
  6. Biqing Qi
  7. Wanli Ouyang
  8. Bowen Zhou