Scaling Test-Time Compute Without Verification or RL is Suboptimal (2502.12118v2)

Published 17 Feb 2025 in cs.LG and cs.CL

Abstract: Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled up to enable continued and efficient improvements with scaling. There are largely two approaches: first, distilling successful search or thinking traces; and second, using verification (e.g., 0/1 outcome rewards, reward models, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. Further, we show that as we scale test-time compute (measured as the output token length) and training data, suboptimality of VF methods scales poorly compared to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths, styles, etc.) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erdős, 1945]. This implies a stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF methods widening as test-time budget grows. We corroborate our theory empirically on both didactic and math reasoning problems with 3/8/32B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.

The paper addresses the problem of scaling test-time compute in LLMs and investigates the efficacy of verifier-based (VB) versus verifier-free (VF) methods. The paper provides a theoretical and empirical analysis of these two approaches, highlighting the conditions under which VB methods outperform VF methods.

The central claim is that VB methods, which incorporate verification signals such as rewards or verifiers, scale more effectively with test-time compute than VF methods that rely on distilling or cloning expert traces. This advantage is particularly pronounced when the base pre-trained LLM exhibits a heterogeneous distribution over correct solution traces and a non-sharp distribution over rewards.

Key contributions and concepts include:

  1. Theoretical Framework:
    • The paper formalizes the problem of scaling test-time compute, defining metrics to evaluate the performance of finetuned LLMs under a given compute budget $H$.
    • The framework introduces the concept of bi-level rewards, where $r(\tau) = \sum_{h=1}^H r(s_h, a_h)$ and $r(s_h, a_h)$ denotes the reward at step $h$ for state $s_h$ and action $a_h$.
    • It introduces the notion of scaling test-time compute by $H^\alpha$ (Definition 1), where $\alpha$ quantifies the asymptotic improvement in performance of one algorithm relative to another as $H$ increases.
  2. Heterogeneity and Anti-Concentration:
    • The paper identifies two key properties of pre-trained LLMs that influence the performance of VB and VF methods: heterogeneity and anti-concentration.
      • Heterogeneity, denoted $\sigma^2_{\pi, x} = \sum_{h=1}^{H} \mathbb{E}_{s_h \sim d^{\pi}_h}\big[\mathrm{Var}_{a \sim \pi(\cdot \mid s_h)}\, Q^{\pi_e}(s_h, a) \,\big|\, x\big]$, measures the variability in token sequences that lead to correct solutions.
        • $d^\pi_h$ is the distribution over states at time $h$ induced by policy $\pi$.
        • $Q^{\pi_e}(s_h, a_h)$ is the expected cumulative reward attained by the expert LLM $\pi_e$ from state $s_h$ and action $a_h$.
      • Anti-concentration, denoted $c_x(\varepsilon) = \Pr_{\tau \sim \pi(\cdot \mid x)}\big[\, r(\tau) \ge \mathbb{E}_{\tau \sim \pi(\cdot \mid x)}[r(\tau)] + \sigma_{b,x}\sqrt{\varepsilon} \,\big]$, is the probability mass with which the reward $r(\tau)$ exceeds the mean reward by a margin proportional to the heterogeneity $\sigma_{b,x}$.
    • These properties characterize the shape and dispersion of the reward distribution induced by the base LLM; a minimal Monte Carlo estimation sketch appears after this list.
  3. Theoretical Results:
    • It is shown that VF methods suffer when the base policy is highly heterogeneous, leading to a suboptimality gap that scales as $\Omega(H/\sqrt{n})$, where $n$ is the amount of training data.
    • Conversely, VB methods can achieve a suboptimality gap on the order of $H/n$ under the anti-concentration condition.
    • Theorem 3 formally states that the performance gap between VB and VF methods grows as $\tilde{\Omega}(H/\sqrt{n})$ when the base policy is heterogeneous and anti-concentrated.
  4. Simple Verifier-Based Algorithm:
    • The paper introduces a practical VB algorithm (Algorithm 1) that trains a verifier to predict the correctness of solution traces and finetunes the LLM to maximize verifier scores; a schematic sketch appears after this list.
    • The algorithm optimizes a pessimistic reward derived from the verifier, mitigating the risk of reward overoptimization.
  5. Empirical Validation:
    • The theoretical findings are corroborated through experiments on a contextualized planted subsequence problem and math reasoning problems using 3B/8B Llama models and the S1 model.
    • Results demonstrate that VB methods outperform VF methods as test-time compute increases, particularly when the base LLM is heterogeneous and satisfies the anti-concentration condition.
    • The paper includes ablation experiments that analyze the effects of varying the data budget, policy heterogeneity, and other factors.
  6. Practical Implications:
    • The paper suggests that practitioners should prioritize VB methods for scaling test-time compute, especially when dealing with heterogeneous pre-trained LLMs.
    • The findings underscore the importance of incorporating verification signals, such as rewards or trained verifiers, into the finetuning process.
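
The heterogeneity and anti-concentration quantities in item 2 can, in principle, be probed empirically by sampling from the base LLM. The sketch below is a minimal Monte Carlo estimator of a trace-level proxy for the reward spread $\sigma_{b,x}$ and of the anti-concentration mass $c_x(\varepsilon)$ for a single prompt; it is not code from the paper. `sample_reward` is a hypothetical stand-in for a rollout-and-scoring pipeline, and using the trace-level reward standard deviation is a simplification of the per-step $Q$-variance in the formal definition of heterogeneity.

```python
import numpy as np

def estimate_spread_and_anticoncentration(sample_reward, x, n_samples=1024, eps=1.0):
    """Monte Carlo estimates for a single prompt x.

    sample_reward: hypothetical callable that samples one full trace
        tau ~ pi(.|x) from the base LLM and returns its scalar reward r(tau).
    Returns (sigma_b, c_x):
        sigma_b -- standard deviation of r(tau), a trace-level proxy for the
                   reward spread / heterogeneity,
        c_x     -- estimated anti-concentration mass
                   Pr[ r(tau) >= E[r(tau)] + sigma_b * sqrt(eps) ].
    """
    rewards = np.array([sample_reward(x) for _ in range(n_samples)], dtype=float)
    mean_r = rewards.mean()
    sigma_b = rewards.std()                       # spread of rewards under the base policy
    threshold = mean_r + sigma_b * np.sqrt(eps)   # margin above the mean reward
    c_x = float((rewards >= threshold).mean())    # fraction of traces clearing the margin
    return sigma_b, c_x
```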

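The verifier-based recipe in item 4 (train a verifier on correctness labels, then finetune the policy against a pessimistic verifier score) can be summarized with the schematic, framework-agnostic sketch below. This is not the paper's Algorithm 1: `fit_classifier`, `sample_traces`, and `update_policy` are hypothetical callables, and subtracting a fixed penalty is only a crude placeholder for a principled pessimistic reward.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

@dataclass
class Trace:
    prompt: str
    tokens: List[str]
    correct: Optional[bool] = None  # 0/1 outcome label, when available

def train_verifier(
    labeled: Sequence[Trace],
    fit_classifier: Callable[[Sequence[Trace], List[int]], Callable[[Trace], float]],
) -> Callable[[Trace], float]:
    """Fit a verifier mapping a trace to an estimated probability of correctness.
    `fit_classifier` is a hypothetical stand-in for any binary-classifier trainer."""
    labels = [int(t.correct) for t in labeled]
    return fit_classifier(labeled, labels)

def pessimistic_score(verifier: Callable[[Trace], float], trace: Trace, penalty: float = 0.1) -> float:
    """Verifier score minus a fixed penalty: a crude form of pessimism meant to
    discourage over-optimizing traces the verifier is likely to mis-score."""
    return verifier(trace) - penalty

def vb_finetune_round(
    prompts: Sequence[str],
    sample_traces: Callable[[str, int], List[Trace]],
    verifier: Callable[[Trace], float],
    update_policy: Callable[[List[Trace]], None],
    k: int = 8,
) -> List[Trace]:
    """One round of verifier-guided finetuning: sample k candidate traces per
    prompt from the current policy, keep the trace with the highest pessimistic
    verifier score, and push the policy toward the selected traces."""
    selected: List[Trace] = []
    for x in prompts:
        candidates = sample_traces(x, k)  # k rollouts from the current policy
        best = max(candidates, key=lambda t: pessimistic_score(verifier, t))
        selected.append(best)
    update_policy(selected)  # e.g. a reward-weighted or RL-style update
    return selected
```

Best-of-$k$ selection against the verifier is only one way to consume the verification signal; the paper's analysis covers RL and search guided by rewards or verifiers more broadly.
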
The paper's formal results indicate a need for training verifiers, running RL, or at the very least, using rewards when finetuning LLMs for test-time scaling. The theorems show that VF algorithms are fundamentally limited by the heterogeneity of the base policy, whereas VB algorithms can overcome this limitation by exploiting the anti-concentration property.
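
Putting the rates quoted above side by side (log factors and problem-dependent constants suppressed, under the heterogeneity and anti-concentration conditions) makes the widening gap explicit; the snippet below only restates the bounds already given in this summary.

```latex
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
% Rates as quoted in this summary, for data budget n and test-time budget H:
\begin{align*}
\text{VF (cloning search traces):}\quad
  & \text{suboptimality} \;\gtrsim\; \frac{H}{\sqrt{n}} \\
\text{VB (reward/verifier-guided):}\quad
  & \text{suboptimality} \;\lesssim\; \frac{H}{n} \\
\text{gap (Theorem 3):}\quad
  & \tilde{\Omega}\!\left(\frac{H}{\sqrt{n}}\right),
    \;\text{since } \frac{H}{\sqrt{n}} \gg \frac{H}{n} \text{ for large } n.
\end{align*}
\end{document}
```

For a fixed data budget $n$, the VF lower bound dominates the VB upper bound, and the resulting gap grows linearly in $H$, so the advantage of VB widens as more test-time compute is spent.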

The empirical results support the theoretical claims, demonstrating that VB methods achieve superior performance and scaling behavior on both didactic and real-world tasks. Specifically, the experiments confirm that base LLMs exhibit heterogeneous and anti-concentrated reward distributions, and that the gap between VB and VF methods widens as test-time compute and training data increase.

Authors (4)
  1. Amrith Setlur (25 papers)
  2. Nived Rajaraman (21 papers)
  3. Sergey Levine (531 papers)
  4. Aviral Kumar (74 papers)