The paper addresses the problem of scaling test-time compute in LLMs and investigates the efficacy of verifier-based (VB) versus verifier-free (VF) methods. The paper provides a theoretical and empirical analysis of these two approaches, highlighting the conditions under which VB methods outperform VF methods.
The central claim is that VB methods, which incorporate verification signals such as rewards or verifiers, scale more effectively with test-time compute than VF methods that rely on distilling or cloning expert traces. This advantage is particularly pronounced when the base pre-trained LLM exhibits a heterogeneous distribution over correct solution traces and a non-sharp distribution over rewards.
Key contributions and concepts include:
- Theoretical Framework:
- The paper formalizes the problem of scaling test-time compute, defining metrics to evaluate the performance of finetuned LLMs under a given test-time compute budget $H$.
- The framework introduces the concept of bi-level rewards, where $r_h(s_h, a_h)$ represents the reward at step $h$ for state $s_h$ and action $a_h$.
- It introduces the notion of scaling test-time compute by a factor $F(H)$ (Definition 1), where $F(H)$ quantifies the asymptotic improvement in performance of one algorithm relative to another as the compute budget $H$ increases; a schematic version of this setup is sketched below.
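To make this framework concrete, the following is a schematic rendering in this summary's own notation ($J_H$, $\mathrm{Gap}_H$, $\pi_b$, $\pi_e$, and $F(H)$ are illustrative shorthand and may not match the paper's exact statement):

```latex
% Schematic setup (illustrative notation, not the paper's verbatim definitions).
% \pi_b : base (pre-trained) LLM,   \pi_e : expert LLM,   \pi : finetuned LLM
% H     : test-time compute budget, n : finetuning data budget
% r_h(s_h, a_h) : reward at step h for state s_h and action a_h
\[
  J_H(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{h=1}^{H} r_h(s_h, a_h)\right],
  \qquad
  \mathrm{Gap}_H(\pi) \;=\; J_H(\pi_e) - J_H(\pi).
\]
% "Scaling test-time compute by F(H)" (in the spirit of Definition 1): one method's
% suboptimality gap shrinks relative to another's by a factor of at least F(H) as H grows.
\[
  \frac{\mathrm{Gap}_H\!\left(\pi^{\mathrm{VF}}\right)}
       {\mathrm{Gap}_H\!\left(\pi^{\mathrm{VB}}\right)}
  \;=\; \Omega\!\left(F(H)\right)
  \quad \text{as } H \to \infty .
\]
```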
- Heterogeneity and Anti-Concentration:
- The paper identifies two key properties of pre-trained LLMs that influence the performance of VB and VF methods: heterogeneity and anti-concentration.
- Heterogeneity, denoted $\sigma^2$, measures the variability across the token sequences that lead to correct solutions.
- $d_h^{\pi_b}$ is the distribution over states at step $h$ induced by the base policy $\pi_b$.
- $Q_h^{\pi_e}(s_h, a_h)$ is the expected cumulative reward attained by the expert LLM $\pi_e$ given state $s_h$ and action $a_h$.
- Anti-concentration refers to the probability mass with which the reward exceeds the mean reward by a margin related to the heterogeneity $\sigma$; both properties are sketched formally after this list.
- These properties characterize the shape and dispersion of the reward distribution induced by the base LLM.
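A hedged formalization of these two properties, consistent with the descriptions above (the constant $c$ and the label $p_\sigma$ are illustrative, not necessarily the paper's symbols):

```latex
% Heterogeneity at step h: variance of the expert's value Q_h^{\pi_e} over the
% base policy's own states and actions.
\[
  \sigma_h^2
  \;=\;
  \mathbb{E}_{s_h \sim d_h^{\pi_b}}\!\left[
    \operatorname{Var}_{a_h \sim \pi_b(\cdot \mid s_h)}
      Q_h^{\pi_e}(s_h, a_h)
  \right].
\]
% Anti-concentration: a non-trivial fraction p_\sigma of the base policy's actions
% attain an expert value at least c * sigma_h above the mean, for some c > 0.
\[
  \Pr_{s_h \sim d_h^{\pi_b},\; a_h \sim \pi_b(\cdot \mid s_h)}
  \!\left[
    Q_h^{\pi_e}(s_h, a_h)
    \;\ge\;
    \mathbb{E}\!\left[Q_h^{\pi_e}(s_h, a_h)\right] + c\,\sigma_h
  \right]
  \;\ge\; p_\sigma .
\]
```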
- Theoretical Results:
- It is shown that VF methods suffer when the base policy is highly heterogeneous, incurring a suboptimality gap that grows with the heterogeneity $\sigma$ and shrinks only slowly with $n$, the amount of finetuning data.
- Conversely, under the anti-concentration condition, VB methods achieve a much smaller suboptimality gap that shrinks considerably faster with $n$.
- Theorem 3 formally states that the performance gap between VB and VF methods grows with the test-time compute budget $H$ when the base policy is heterogeneous and anti-concentrated; a toy simulation of this qualitative separation is given below.
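As a purely didactic illustration of this qualitative separation (not the paper's construction), the toy simulation below pits a verifier-based best-of-$N$ selector against a verifier-free policy whose single-sample accuracy is fixed; `P_CORRECT_BASE`, `P_CORRECT_VF`, and `VERIFIER_ACC` are made-up constants:

```python
"""Toy illustration: verifier-based best-of-N selection improves with test-time
compute, while a verifier-free (distilled) policy plateaus. All constants and
distributions here are assumptions of this summary, not the paper's experiment."""
import random

random.seed(0)

P_CORRECT_BASE = 0.2   # assumed: base policy samples a correct trace 20% of the time
P_CORRECT_VF   = 0.35  # assumed: single-sample accuracy after verifier-free distillation
VERIFIER_ACC   = 0.9   # assumed: verifier labels a trace correctly 90% of the time
N_TRIALS       = 20000

def sample_trace(p_correct: float) -> bool:
    """Sample one solution trace; True means the trace is actually correct."""
    return random.random() < p_correct

def verifier_approves(is_correct: bool) -> bool:
    """Noisy verifier: flips the true label with probability 1 - VERIFIER_ACC."""
    return is_correct if random.random() < VERIFIER_ACC else not is_correct

def vb_best_of_n(budget: int) -> bool:
    """Verifier-based: sample `budget` traces, return one the verifier approves."""
    traces = [sample_trace(P_CORRECT_BASE) for _ in range(budget)]
    approved = [t for t in traces if verifier_approves(t)]
    return approved[0] if approved else traces[0]

def vf_single_sample(_budget: int) -> bool:
    """Verifier-free: extra compute does not change the single-sample accuracy."""
    return sample_trace(P_CORRECT_VF)

for budget in (1, 4, 16, 64):
    vb = sum(vb_best_of_n(budget) for _ in range(N_TRIALS)) / N_TRIALS
    vf = sum(vf_single_sample(budget) for _ in range(N_TRIALS)) / N_TRIALS
    print(f"test-time budget {budget:3d}:  VB accuracy {vb:.3f}   VF accuracy {vf:.3f}")
```

In this toy, VF is better at the smallest budget, but VB overtakes it once the budget allows several verified samples, mirroring the widening gap described above.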
- Simple Verifier-Based Algorithm:
- The paper introduces a practical VB algorithm (Algorithm 1) that involves training a verifier to predict the correctness of solution traces and finetuning the LLM to maximize verifier scores.
- The algorithm optimizes a pessimistic reward derived from the verifier, mitigating the risk of reward over-optimization; a minimal sketch of this two-stage recipe is given below.
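Below is a minimal numpy sketch in the spirit of that two-stage recipe: a stand-in verifier that scores traces for correctness, followed by a REINFORCE-style update of a toy categorical policy against a pessimistic verifier reward. The 8-way action space, the uncertainty proxy, and every constant are assumptions of this summary; the paper's Algorithm 1 finetunes an actual LLM rather than a toy policy.

```python
"""Sketch of the two-stage verifier-based recipe: (1) a stand-in verifier scoring
traces, (2) REINFORCE on a pessimistic verifier reward. Toy values throughout."""
import numpy as np

rng = np.random.default_rng(0)
K = 8                                        # toy action space: 8 candidate solution traces
true_correct = np.zeros(K)
true_correct[[2, 5]] = 1.0                   # assumed: only traces 2 and 5 are truly correct

# Stage 1 stand-in for a trained verifier: trace 7 gets a spuriously high score
# but was rarely labeled, so its uncertainty (and pessimism penalty) is large.
verifier_score = true_correct.copy()
verifier_score[7] = 0.9
label_counts = np.full(K, 50.0)
label_counts[7] = 2.0
uncertainty = 1.0 / np.sqrt(label_counts)

def pessimistic_reward(a: int, beta: float = 0.5) -> float:
    """Verifier score minus an uncertainty penalty, to curb reward over-optimization."""
    return float(verifier_score[a] - beta * uncertainty[a])

# Stage 2: finetune the toy policy to maximize the pessimistic reward (REINFORCE).
logits = np.zeros(K)
lr = 0.5
all_rewards = np.array([pessimistic_reward(i) for i in range(K)])
for _ in range(500):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(K, p=probs)
    baseline = float(probs @ all_rewards)             # variance-reduction baseline
    onehot = np.zeros(K)
    onehot[a] = 1.0
    logits += lr * (pessimistic_reward(a) - baseline) * (onehot - probs)

probs = np.exp(logits - logits.max())
probs /= probs.sum()
print("probability mass on truly correct traces:", round(float(probs[true_correct == 1].sum()), 3))
print("probability mass on the spurious trace 7:", round(float(probs[7]), 3))
```

The pessimism penalty is what keeps the toy policy from collapsing onto the spuriously high-scoring trace 7, mirroring the over-optimization concern above.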
- Empirical Validation:
- The theoretical findings are corroborated through experiments on a contextualized planted subsequence problem and math reasoning problems using 3B/8B Llama models and the S1 model.
- Results demonstrate that VB methods outperform VF methods as test-time compute increases, particularly when the base LLM is heterogeneous and satisfies the anti-concentration condition.
- The paper includes ablation experiments analyzing the effects of varying the data budget, policy heterogeneity, and other factors; a sketch of how these properties might be estimated from sampled rollouts is given below.
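For completeness, here is a rough sketch of how the heterogeneity and anti-concentration of a base model might be estimated from sampled rollouts; the interface, the 0/1 rewards, and the margin constant are assumptions of this summary rather than the paper's measurement protocol:

```python
"""Rough estimators: per-prompt reward variance as a proxy for heterogeneity, and
the fraction of samples clearing the mean reward by a margin as a proxy for
anti-concentration."""
import numpy as np

def estimate_properties(rewards_per_prompt, margin_factor: float = 0.5):
    """rewards_per_prompt: list of 1-D arrays, each holding the rewards (e.g., 0/1
    correctness) of many sampled completions for a single prompt."""
    variances, exceed_fractions = [], []
    for rewards in rewards_per_prompt:
        rewards = np.asarray(rewards, dtype=float)
        mean, std = rewards.mean(), rewards.std()
        variances.append(float(rewards.var()))
        # fraction of samples whose reward clears the mean by margin_factor * std
        exceed_fractions.append(float((rewards >= mean + margin_factor * std).mean()))
    heterogeneity = float(np.mean(variances))           # average within-prompt variance
    anti_concentration = float(np.mean(exceed_fractions))
    return heterogeneity, anti_concentration

# Example with fabricated 0/1 correctness outcomes for three prompts:
fake_rollouts = [np.random.default_rng(i).integers(0, 2, size=64) for i in range(3)]
print(estimate_properties(fake_rollouts))
```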
- Practical Implications:
- The paper suggests that practitioners should prioritize VB methods for scaling test-time compute, especially when dealing with heterogeneous pre-trained LLMs.
- The findings underscore the importance of incorporating verification signals, such as rewards or trained verifiers, into the finetuning process.
The paper's formal results indicate a need for training verifiers, running RL, or, at the very least, using rewards when finetuning LLMs for test-time scaling. The theorems show that VF algorithms are fundamentally limited by the heterogeneity of the base policy, whereas VB algorithms can overcome this limitation by exploiting the anti-concentration property.
The empirical results support the theoretical claims, demonstrating that VB methods achieve superior performance and scaling behavior on both didactic and real-world tasks. Specifically, the experiments confirm that base LLMs exhibit heterogeneous and anti-concentrated reward distributions, and that the gap between VB and VF methods widens as test-time compute and training data increase.