The paper "When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning" (Singhi et al., 1 Apr 2025 ) investigates the optimal allocation of a fixed inference compute budget for enhancing LLM reasoning capabilities. It specifically contrasts two prominent test-time scaling strategies: Self-Consistency (SC), which involves generating multiple solutions () and selecting the most frequent answer, and methods employing Generative Reward Models (GenRM), which scale along both solution generation () and the generation of verification chains-of-thought () for each solution. The central research question revolves around whether compute is better spent generating more solutions (scaling via SC) or generating fewer solutions but verifying each more thoroughly (scaling and via GenRM).
Compute-Matched Evaluation Framework
To facilitate a fair comparison under budget constraints, the authors propose a compute-matched evaluation framework. Prior work often compared SC and GenRM at a fixed number of solutions (S), neglecting the compute cost of generating verifications for each solution in GenRM. This paper introduces a FLOPs-based cost estimate to enable comparisons at equivalent total compute expenditure.
Assuming the same LLM with P parameters generates solutions of average length T_S tokens and verifications of average length T_V tokens, the total inference FLOPs are approximated as proportional to the total number of generated tokens multiplied by the parameter count:
FLOPs(S, V) ∝ 2P(T_S · S + T_V · S · V)
Defining the relative length of a verification to a solution as λ = T_V / T_S, the compute cost (measured in solution-equivalent tokens) simplifies to C(S, V) ∝ S(1 + λV).
For SC, where V = 0, the cost reduces to C ∝ S. For GenRM, both S and V contribute to the cost. The analysis proceeds by evaluating the success rate (SR) of SC (by varying S) and GenRM (by varying combinations of S and V) across a spectrum of fixed compute budgets C. The parameter λ is empirically estimated or set based on typical generation lengths.
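To make the cost model concrete, here is a minimal sketch in Python, assuming costs are expressed in solution-equivalent tokens so that the common factor 2·P·T_S cancels when comparing configurations at the same budget; the function names and the default λ value are illustrative assumptions, not code from the paper:

```python
# Relative compute cost of a (solutions, verifications) configuration,
# in solution-equivalent tokens: C(S, V) = S * (1 + lambda * V).
def compute_cost(S: int, V: int, lam: float = 1.0) -> float:
    """Cost of generating S solutions plus V verifications per solution."""
    return S * (1 + lam * V)

def sc_budget_matched_solutions(S: int, V: int, lam: float = 1.0) -> int:
    """Number of SC solutions (V = 0) affordable at the same budget as GenRM(S, V)."""
    return int(compute_cost(S, V, lam))

# Example: GenRM with 8 solutions and 4 verifications each (lambda = 1)
# costs the same as Self-Consistency with 40 solutions.
print(compute_cost(8, 4))                 # 40.0
print(sc_budget_matched_solutions(8, 4))  # 40
```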
Performance Comparison: Self-Consistency vs. Generative Reward Models
The compute-matched analysis yields several key findings regarding the relative effectiveness of SC and GenRM:
- Dominance of SC at Lower Budgets: Across diverse experimental setups (Llama/Qwen models, 8B/70B sizes, MATH/AIME/GPQA datasets), SC consistently demonstrates superior compute efficiency at lower to moderate inference budgets. Generating additional solutions via SC provides a more significant performance uplift per unit of compute compared to allocating that compute towards verification via GenRM. The paper highlights that GenRM often requires substantially more compute, ranging from 4x to 8x, merely to match the SR achieved by SC with a smaller budget.
- Advantage of GenRM at Higher Budgets: GenRM begins to outperform SC only when operating at significantly higher compute budgets. This typically occurs in regimes where the performance gains from SC start to saturate (i.e., adding more solutions yields diminishing returns). In such high-budget scenarios, GenRM's ability to leverage verification chains-of-thought to discern correct answers among generated solutions, potentially identifying correct minority solutions or overcoming biases in majority voting, allows it to surpass the peak performance of SC. However, the compute cost associated with achieving these marginal gains over SC can be substantial, potentially requiring 64x to 512x more compute for relatively modest SR improvements (e.g., 1.7% to 5.4%).
- Robustness Across Conditions: The observed trend—SC superiority at low budgets and GenRM advantage at high budgets—proved robust across different model families, sizes, instruction-tuned vs. RL-tuned models ("Thinking Models"), reasoning domains (math, science), and problem difficulty levels. While GenRM showed larger relative gains over SC on harder problems (e.g., MATH Level 5), the fundamental compute budget crossover point persisted.
- Impact of Verifier Quality: The efficiency of GenRM is heavily influenced by the quality of the verifier. Using a base model as a zero-shot verifier (GenRM-Base) is significantly less compute-efficient than employing a model fine-tuned specifically for the verification task (GenRM-FT). GenRM-FT achieved comparable performance to GenRM-Base with up to 16x less compute, underscoring the value of specialized verifiers if pursuing GenRM strategies.
Inference Scaling Laws for GenRM
Given that GenRM can outperform SC at high compute budgets, the paper investigates the optimal allocation strategy within the GenRM paradigm. Specifically, it derives inference scaling laws to determine how the number of solutions (S) and verifications (V) should scale as the total compute budget C increases.
Adapting the methodology used for deriving training scaling laws (like Chinchilla), the authors performed the following steps:
- Executed GenRM across numerous (S, V) configurations.
- Mapped the resulting SR to the corresponding compute cost C(S, V).
- For various fixed compute budgets C, identified the optimal configuration (S_opt, V_opt) that maximized SR.
- Fitted power laws to model the relationship between the optimal parameters and the budget: S_opt ∝ C^a and V_opt ∝ C^b.
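A minimal sketch of this fitting step, assuming one already has a table of compute-optimal (budget, S_opt, V_opt) points; the exponents fall out of an ordinary least-squares fit in log-log space (the numbers below are illustrative placeholders, not the paper's measurements):

```python
import numpy as np

# Hypothetical compute-optimal configurations: budget C, S_opt, V_opt.
budgets = np.array([64, 128, 256, 512, 1024], dtype=float)
S_opt   = np.array([10, 18, 30, 52, 90], dtype=float)
V_opt   = np.array([5, 7, 8, 10, 11], dtype=float)

def fit_power_law(C, y):
    """Fit y ≈ k * C**exp via linear regression in log-log space."""
    exp, log_k = np.polyfit(np.log(C), np.log(y), deg=1)
    return np.exp(log_k), exp

k_s, a = fit_power_law(budgets, S_opt)   # S_opt ≈ k_s * C**a
k_v, b = fit_power_law(budgets, V_opt)   # V_opt ≈ k_v * C**b
print(f"a = {a:.2f}, b = {b:.2f}")       # expect a noticeably larger than b
```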
The consistent finding across different models and datasets was that the exponent for scaling solutions (a) is significantly larger than the exponent for scaling verifications (b). Typically, a was found to be 1.5 to 2 times larger than b. Specific examples include:
- Llama-3.1-8B on MATH: a = …, b = …
- Qwen-7B on MATH: a = …, b = …
- Llama-70B on MATH: a = …, b = …
This implies that for compute-optimal inference using GenRM, as the budget grows, resources should be allocated to increase the number of solutions S more aggressively than the number of verifications V. While both should increase, prioritizing solution coverage over verification depth per solution yields better performance for a given compute cost.
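To make the asymmetry concrete, a small sketch of how the compute-optimal S and V grow as the budget doubles; the exponents and unit proportionality constants here are hypothetical placeholders, not the paper's fitted values:

```python
# Hypothetical scaling-law exponents with a > b (placeholders, not fitted values).
a, b = 0.8, 0.4

def optimal_allocation(budget: float) -> tuple[int, int]:
    """Compute-optimal GenRM allocation: S_opt ~ C**a, V_opt ~ C**b (unit constants)."""
    return max(1, round(budget ** a)), max(1, round(budget ** b))

# Each doubling of the budget multiplies S by ~2**a (~1.74x) but V by only
# ~2**b (~1.32x), so solution count scales more aggressively than verification depth.
for C in (128, 256, 512, 1024):
    S, V = optimal_allocation(C)
    print(f"C={C:5d}  S_opt={S:4d}  V_opt={V:3d}")
```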
Practical Implementation Guidance
The paper provides clear, actionable guidance for practitioners seeking to optimize LLM reasoning performance under compute constraints:
- Budget Dictates Strategy: The primary factor determining whether to use SC or GenRM is the available inference compute budget. For low-to-moderate budgets, SC is the more compute-efficient choice and likely yields superior performance. GenRM should only be considered for high-budget scenarios where maximizing peak performance is critical and the saturation point of SC has been approached or surpassed.
- Optimal GenRM Scaling: When employing GenRM (in high-budget regimes), avoid naive scaling strategies like fixing S and only increasing V. Instead, scale both parameters concurrently according to the derived scaling laws: increase the number of solutions (S) roughly in proportion to C^a and the number of verifications (V) in proportion to C^b, with a > b (see the sketch after this list).
- Fine-Tuned Verifiers: If implementing GenRM, investing in a fine-tuned verification model (GenRM-FT) offers substantial compute efficiency gains compared to using a base model off-the-shelf. This can significantly lower the compute threshold at which GenRM becomes competitive with or superior to SC.
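Putting this guidance together, a minimal decision sketch, assuming the budget at which SC saturates and the scaling-law exponents have already been estimated for the target model and task; all names and numeric defaults here are illustrative assumptions, not values from the paper:

```python
def choose_test_time_strategy(budget: float,
                              sc_saturation_budget: float = 256.0,
                              a: float = 0.8, b: float = 0.4) -> dict:
    """Pick SC below the SC saturation budget; switch to scaling-law GenRM above it."""
    if budget < sc_saturation_budget:
        # Low/moderate budget: spend everything on solutions and majority-vote.
        return {"strategy": "SC", "S": int(budget), "V": 0}
    # High budget: allocate solutions and verifications per S ~ C**a, V ~ C**b.
    return {"strategy": "GenRM",
            "S": max(1, round(budget ** a)),
            "V": max(1, round(budget ** b))}

print(choose_test_time_strategy(64))    # -> SC with 64 solutions
print(choose_test_time_strategy(1024))  # -> GenRM with S=256, V=16 under these placeholders
```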
In conclusion, this work provides a rigorous, compute-aware analysis comparing SC and GenRM, demonstrating SC's efficiency at lower budgets and deriving optimal scaling laws for GenRM at higher budgets. The findings emphasize prioritizing solution generation (S) over verification depth (V) when allocating compute, offering valuable practical guidance for deploying LLM reasoning systems efficiently.