The paper "When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning" (Singhi et al., 1 Apr 2025 ) investigates the optimal allocation of a fixed inference compute budget for enhancing LLM reasoning capabilities. It specifically contrasts two prominent test-time scaling strategies: Self-Consistency (SC), which involves generating multiple solutions () and selecting the most frequent answer, and methods employing Generative Reward Models (GenRM), which scale along both solution generation () and the generation of verification chains-of-thought () for each solution. The central research question revolves around whether compute is better spent generating more solutions (scaling via SC) or generating fewer solutions but verifying each more thoroughly (scaling and via GenRM).
Compute-Matched Evaluation Framework
To facilitate a fair comparison under budget constraints, the authors propose a compute-matched evaluation framework. Prior work often compared SC and GenRM at a fixed number of solutions (S), neglecting the compute cost of generating verifications for each solution in GenRM. This paper introduces a FLOPs-based cost estimate to enable comparisons at equivalent total compute expenditure.
Assuming the same LLM with P parameters generates solutions of average length T_S tokens and verifications of average length T_V tokens, the total inference FLOPs are approximated as proportional to the total number of generated tokens multiplied by the parameter count:
FLOPs(S, V) ∝ 2P(T_S · S + T_V · S · V)
Defining the relative length of a verification to a solution as λ = T_V / T_S, the compute cost (measured in solution-equivalent tokens) simplifies to C(S, V) ∝ S(1 + λV).
For SC, where V = 0, the cost reduces to C ∝ S. For GenRM, both S and V contribute to the cost. The analysis proceeds by evaluating the success rate (SR) of SC (by varying S) and GenRM (by varying combinations of S and V) across a spectrum of fixed compute budgets C. The parameter λ is empirically estimated or set based on typical generation lengths.
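To make the cost model concrete, here is a minimal sketch in Python, assuming costs are expressed in solution-equivalent tokens so that the common factor 2·P·T_S cancels when comparing configurations at the same budget; the function names and the default λ value are illustrative assumptions, not code from the paper:

```python
# Relative compute cost of a (solutions, verifications) configuration,
# in solution-equivalent tokens: C(S, V) = S * (1 + lambda * V).
def compute_cost(S: int, V: int, lam: float = 1.0) -> float:
    """Cost of generating S solutions plus V verifications per solution."""
    return S * (1 + lam * V)

def sc_budget_matched_solutions(S: int, V: int, lam: float = 1.0) -> int:
    """Number of SC solutions (V = 0) affordable at the same budget as GenRM(S, V)."""
    return int(compute_cost(S, V, lam))

# Example: GenRM with 8 solutions and 4 verifications each (lambda = 1)
# costs the same as Self-Consistency with 40 solutions.
print(compute_cost(8, 4))                 # 40.0
print(sc_budget_matched_solutions(8, 4))  # 40
```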
Performance Comparison: Self-Consistency vs. Generative Reward Models
The compute-matched analysis yields several key findings regarding the relative effectiveness of SC and GenRM:
- Dominance of SC at Lower Budgets: Across diverse experimental setups (Llama/Qwen models, 8B/70B sizes, MATH/AIME/GPQA datasets), SC consistently demonstrates superior compute efficiency at lower to moderate inference budgets. Generating additional solutions via SC provides a more significant performance uplift per unit of compute compared to allocating that compute towards verification via GenRM. The paper highlights that GenRM often requires substantially more compute, ranging from 4x to 8x, merely to match the SR achieved by SC with a smaller budget.
- Advantage of GenRM at Higher Budgets: GenRM begins to outperform SC only when operating at significantly higher compute budgets. This typically occurs in regimes where the performance gains from SC start to saturate (i.e., adding more solutions yields diminishing returns). In such high-budget scenarios, GenRM's ability to leverage verification chains-of-thought to discern correct answers among generated solutions, potentially identifying correct minority solutions or overcoming biases in majority voting, allows it to surpass the peak performance of SC. However, the compute cost associated with achieving these marginal gains over SC can be substantial, potentially requiring 64x to 512x more compute for relatively modest SR improvements (e.g., 1.7% to 5.4%).
- Robustness Across Conditions: The observed trend—SC superiority at low budgets and GenRM advantage at high budgets—proved robust across different model families, sizes, instruction-tuned vs. RL-tuned models ("Thinking Models"), reasoning domains (math, science), and problem difficulty levels. While GenRM showed larger relative gains over SC on harder problems (e.g., MATH Level 5), the fundamental compute budget crossover point persisted.
- Impact of Verifier Quality: The efficiency of GenRM is heavily influenced by the quality of the verifier. Using a base model as a zero-shot verifier (GenRM-Base) is significantly less compute-efficient than employing a model fine-tuned specifically for the verification task (GenRM-FT). GenRM-FT achieved comparable performance to GenRM-Base with up to 16x less compute, underscoring the value of specialized verifiers if pursuing GenRM strategies.
Inference Scaling Laws for GenRM
Given that GenRM can outperform SC at high compute budgets, the paper investigates the optimal allocation strategy within the GenRM paradigm. Specifically, it derives inference scaling laws to determine how the number of solutions (S) and verifications (V) should scale as the total compute budget C increases.
Adapting the methodology used for deriving training scaling laws (like Chinchilla), the authors performed the following steps:
- Executed GenRM across numerous (S, V) configurations.
- Mapped the resulting SR to the corresponding compute cost C(S, V).
- For various fixed compute budgets C, identified the optimal configuration (S_opt, V_opt) that maximized SR.
- Fitted power laws to model the relationship between the optimal parameters and the budget: S_opt ∝ C^a and V_opt ∝ C^b.
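A minimal sketch of this fitting step, assuming one already has a table of compute-optimal (budget, S_opt, V_opt) points; the exponents fall out of an ordinary least-squares fit in log-log space (the numbers below are illustrative placeholders, not the paper's measurements):

```python
import numpy as np

# Hypothetical compute-optimal configurations: budget C, S_opt, V_opt.
budgets = np.array([64, 128, 256, 512, 1024], dtype=float)
S_opt   = np.array([10, 18, 30, 52, 90], dtype=float)
V_opt   = np.array([5, 7, 8, 10, 11], dtype=float)

def fit_power_law(C, y):
    """Fit y ≈ k * C**exp via linear regression in log-log space."""
    exp, log_k = np.polyfit(np.log(C), np.log(y), deg=1)
    return np.exp(log_k), exp

k_s, a = fit_power_law(budgets, S_opt)   # S_opt ≈ k_s * C**a
k_v, b = fit_power_law(budgets, V_opt)   # V_opt ≈ k_v * C**b
print(f"a = {a:.2f}, b = {b:.2f}")       # expect a noticeably larger than b
```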
The consistent finding across different models and datasets was that the exponent for scaling solutions (a) is significantly larger than the exponent for scaling verifications (b). Typically, a was found to be 1.5 to 2 times larger than b. Specific examples include:
- Llama-3.1-8B on MATH: a = …, b = …
- Qwen-7B on MATH: a = …, b = …
- Llama-70B on MATH: a = …, b = …
This implies that for compute-optimal inference using GenRM, as the budget grows, resources should be allocated to increase the number of solutions S more aggressively than the number of verifications V. While both should increase, prioritizing solution coverage over verification depth per solution yields better performance for a given compute cost.
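To make the asymmetry concrete, a small sketch of how the compute-optimal S and V grow as the budget doubles; the exponents and unit proportionality constants here are hypothetical placeholders, not the paper's fitted values:

```python
# Hypothetical scaling-law exponents with a > b (placeholders, not fitted values).
a, b = 0.8, 0.4

def optimal_allocation(budget: float) -> tuple[int, int]:
    """Compute-optimal GenRM allocation: S_opt ~ C**a, V_opt ~ C**b (unit constants)."""
    return max(1, round(budget ** a)), max(1, round(budget ** b))

# Each doubling of the budget multiplies S by ~2**a (~1.74x) but V by only
# ~2**b (~1.32x), so solution count scales more aggressively than verification depth.
for C in (128, 256, 512, 1024):
    S, V = optimal_allocation(C)
    print(f"C={C:5d}  S_opt={S:4d}  V_opt={V:3d}")
```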
Practical Implementation Guidance
The paper provides clear, actionable guidance for practitioners seeking to optimize LLM reasoning performance under compute constraints:
- Budget Dictates Strategy: The primary factor determining whether to use SC or GenRM is the available inference compute budget. For low-to-moderate budgets, SC is the more compute-efficient choice and likely yields superior performance. GenRM should only be considered for high-budget scenarios where maximizing peak performance is critical and the saturation point of SC has been approached or surpassed.
- Optimal GenRM Scaling: When employing GenRM (in high-budget regimes), avoid naive scaling strategies like fixing S and only increasing V. Instead, scale both parameters concurrently according to the derived scaling laws: increase the number of solutions (S) roughly in proportion to C^a and the number of verifications (V) in proportion to C^b, with a > b (see the sketch after this list).
- Fine-Tuned Verifiers: If implementing GenRM, investing in a fine-tuned verification model (GenRM-FT) offers substantial compute efficiency gains compared to using a base model off-the-shelf. This can significantly lower the compute threshold at which GenRM becomes competitive with or superior to SC.
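Putting this guidance together, a minimal decision sketch, assuming the budget at which SC saturates and the scaling-law exponents have already been estimated for the target model and task; all names and numeric defaults here are illustrative assumptions, not values from the paper:

```python
def choose_test_time_strategy(budget: float,
                              sc_saturation_budget: float = 256.0,
                              a: float = 0.8, b: float = 0.4) -> dict:
    """Pick SC below the SC saturation budget; switch to scaling-law GenRM above it."""
    if budget < sc_saturation_budget:
        # Low/moderate budget: spend everything on solutions and majority-vote.
        return {"strategy": "SC", "S": int(budget), "V": 0}
    # High budget: allocate solutions and verifications per S ~ C**a, V ~ C**b.
    return {"strategy": "GenRM",
            "S": max(1, round(budget ** a)),
            "V": max(1, round(budget ** b))}

print(choose_test_time_strategy(64))    # -> SC with 64 solutions
print(choose_test_time_strategy(1024))  # -> GenRM with S=256, V=16 under these placeholders
```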
In conclusion, this work provides a rigorous, compute-aware analysis comparing SC and GenRM, demonstrating SC's efficiency at lower budgets and deriving optimal scaling laws for GenRM at higher budgets. The findings emphasize prioritizing solution generation (S) over verification depth (V) when allocating compute, offering valuable practical guidance for deploying LLM reasoning systems efficiently.