
Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators (2503.19877v1)

Published 25 Mar 2025 in cs.CL

Abstract: As language model (LM) outputs get more and more natural, it is becoming more difficult than ever to evaluate their quality. Simultaneously, increasing LMs' "thinking" time through scaling test-time compute has proven an effective technique to solve challenging problems in domains such as math and code. This raises a natural question: can an LM's evaluation capability also be improved by spending more test-time compute? To answer this, we investigate employing reasoning models (LMs that natively generate long chain-of-thought reasoning) as evaluators. Specifically, we examine methods to leverage more test-time compute by (1) using reasoning models, and (2) prompting these models to evaluate not only the response as a whole (i.e., outcome evaluation) but also assess each step in the response separately (i.e., process evaluation). In experiments, we observe that the evaluator's performance improves monotonically when generating more reasoning tokens, similar to the trends observed in LM-based generation. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM's problem-solving capability.

Summary

  • The paper demonstrates that increasing evaluation-phase compute via detailed chain-of-thought reasoning improves language model performance on complex reasoning tasks.
  • It introduces reasoning evaluators that generate explicit step-by-step assessments, outperforming larger direct evaluators, including a 72B parameter model on process benchmarks.
  • The study shows that enhanced evaluation compute—using self-consistency and combined process-outcome scores—yields superior candidate solution reranking in diverse problem-solving benchmarks.

This research investigates the hypothesis that increasing computational expenditure during the evaluation phase, analogous to scaling compute during generation, enhances the performance of language models (LMs), particularly in complex reasoning tasks. The paper introduces "reasoning evaluators": LMs prompted to generate chain-of-thought (CoT) reasoning while assessing the quality of another LM's output. This approach contrasts with "direct evaluators," which output a score or classification without explicit reasoning steps. The core proposal is to leverage these reasoning evaluators for both outcome-level and process-level assessment, thereby scaling evaluation-time compute and improving evaluation accuracy.

Methodology: Reasoning Evaluators and Evaluation Frameworks

The paper defines two primary evaluation modes:

  1. Outcome Evaluation: Assesses the final result or correctness of the entire LM-generated response.
  2. Process Evaluation: Examines the intermediate reasoning steps within the response, identifying potential flaws or inaccuracies in the derivation, even if the final outcome is correct (or incorrect despite a partially sound process).

"Reasoning evaluators" are implemented by prompting LMs (specifically, instruction-tuned models like Mixtral-8x22B-Instruct-v0.1) to perform these evaluations. For process evaluation, the prompt guides the evaluator to analyze each paragraph or logical step of the generated solution, producing its own CoT explaining its assessment of that step. For outcome evaluation, the reasoning evaluator assesses the overall correctness and quality, again generating a CoT justification. This contrasts with "direct evaluators," such as dedicated Process Reward Models (PRMs) or Outcome Reward Models (ORMs), which are typically fine-tuned to predict scores directly.

To quantify the benefits, two experimental paradigms are employed:

  1. Meta-Evaluation on ProcessBench: This benchmark assesses an evaluator's ability to pinpoint the first incorrect step in a provided reasoning chain. Performance is measured using the F1 score. This setup directly measures the accuracy of process evaluation. Different evaluator types (direct PRMs, single-step reasoning evaluators, multi-step reasoning process evaluators) and compute scaling techniques (e.g., self-consistency by sampling multiple evaluation CoTs and aggregating via majority vote) are compared.
  2. Best-of-N Reranking for Problem Solving: This evaluates the downstream impact of improved evaluators on enhancing the problem-solving capabilities of generator LMs. N candidate solutions are generated for each problem instance using various LMs. Different evaluators are then used to score and rerank these N solutions, and the top-ranked solution's correctness is measured. This is tested across diverse benchmarks including AIME24, AMC23, Minerva Math, OlympiadBench, MATH500, LeetCode, and GPQA, using metrics like pass@1 and accuracy. Comparisons are made between direct ORMs, direct PRMs, reasoning outcome evaluators, reasoning process evaluators, and a combined approach.

The compute cost is carefully considered. The compute for reasoning evaluators scales with the number of tokens generated during the evaluation CoT. The paper explicitly compares scenarios with fixed compute budgets, allocating resources either towards generating more candidate solutions (increasing N) or towards more intensive evaluation of fewer candidates (using reasoning evaluators).
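A minimal Best-of-N reranking loop under these assumptions might look like the sketch below; `generate_candidates` and `evaluate` stand in for any generator LM and any of the evaluator types above, and are not APIs from the paper.

    from typing import Callable, List, Tuple

    def best_of_n(
        problem: str,
        generate_candidates: Callable[[str, int], List[str]],  # (problem, N) -> N candidate solutions
        evaluate: Callable[[str, str], float],                  # (problem, candidate) -> score
        n: int = 8,
    ) -> Tuple[str, float]:
        """Generate N candidates, score each with the evaluator, and return the top-ranked one.

        Spending more compute in `evaluate` (e.g., a reasoning process evaluator) with a small N
        can rival spending the same budget on a larger N paired with a cheap direct evaluator.
        """
        candidates = generate_candidates(problem, n)
        scored = [(candidate, evaluate(problem, candidate)) for candidate in candidates]
        return max(scored, key=lambda pair: pair[1])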

Key Findings and Results

The experiments yield several significant findings regarding the efficacy of scaling evaluation-time compute:

  • Monotonic Improvement with Evaluation Compute: Similar to generation, the performance of reasoning evaluators consistently improves as they generate more reasoning tokens. This holds for both process evaluation accuracy on ProcessBench and downstream problem-solving performance in Best-of-N settings. Increasing the length and detail of the evaluation CoT acts as a method for scaling evaluation compute.
  • Reasoning Evaluators Surpass Specialized Models: A 32B parameter reasoning process evaluator, when generating detailed step-by-step evaluation CoTs, outperforms a 72B parameter state-of-the-art direct PRM trained specifically for process supervision on ProcessBench. This highlights the effectiveness of leveraging the inherent reasoning capabilities of large LMs for evaluation tasks through appropriate prompting, even compared to larger, specialized models.
  • Evaluation Compute vs. Generation Compute: The paper provides evidence that allocating compute towards more sophisticated evaluation can be more effective than simply generating more candidate solutions. In Best-of-N reranking, using reasoning evaluators to assess N = 8 candidates achieved better performance than using direct evaluators on N = 64 candidates within a comparable compute budget across several math and coding benchmarks. This suggests that enhancing evaluation quality can be a more compute-efficient strategy for improving final performance, potentially mitigating issues like reward hacking or overoptimization associated with simpler reward models.
  • Process and Outcome Evaluation Complementarity: Reasoning process evaluators and reasoning outcome evaluators provide complementary signals. Process evaluation tends to be more conservative, focusing on step-by-step correctness, while outcome evaluation provides a holistic assessment. Combining judgments from both (e.g., via a weighted sum or heuristic) often yields the best reranking performance, indicating that both the final answer and the method used to reach it are important evaluation criteria. The paper notes the existence of "unfaithful reasoning" (correct answer via incorrect steps), underscoring the need for process-level scrutiny.
  • Effectiveness in Code Verification: Reasoning evaluators demonstrate strong performance in code verification tasks (LeetCode benchmark), outperforming direct evaluation approaches. This suggests the methodology is applicable beyond mathematical reasoning to other domains requiring step-wise verification.
  • Self-Consistency Benefits: Applying self-consistency to the reasoning evaluator (generating multiple evaluation CoTs and using majority voting) further improves evaluation accuracy, providing another axis for scaling evaluation compute.
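A sketch of self-consistency applied to a reasoning process evaluator is shown below, assuming each evaluation pass returns the index of the first erroneous step (or None if the solution is judged fully correct); the `evaluate_once` callable is a placeholder, not an interface defined in the paper.

    from collections import Counter
    from typing import Callable, List, Optional

    def self_consistent_first_error(
        problem: str,
        solution_steps: List[str],
        evaluate_once: Callable[[str, List[str]], Optional[int]],  # one sampled evaluation CoT
        num_samples: int = 8,
    ) -> Optional[int]:
        """Sample several evaluation CoTs and majority-vote the first-error-step judgment."""
        votes = Counter(evaluate_once(problem, solution_steps) for _ in range(num_samples))
        judgment, _ = votes.most_common(1)[0]
        return judgment  # None means the majority judged every step correct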

Implementation Considerations

Implementing reasoning evaluators involves several practical considerations:

  • Prompt Engineering: Crafting effective prompts is crucial. Prompts must clearly instruct the evaluator LM to adopt the evaluator persona, specify whether to perform process or outcome evaluation, define the format for the output (including the CoT justification and the final score/judgment), and potentially provide few-shot examples. For process evaluation, the prompt needs to guide the model to analyze the input solution step-by-step.
    # Illustrative process-evaluation prompt builder (a sketch, not the paper's exact prompt)
    def build_process_evaluation_prompt(solution_text: str) -> str:
        """Return a prompt asking an LM to critique a candidate solution step by step."""
        return f"""
    You are an expert evaluator assessing the reasoning process in a proposed solution.
    Analyze the following solution step-by-step. For each step (e.g., paragraph), provide a detailed critique, explaining whether the reasoning is sound, contains errors, or lacks clarity.
    Conclude your analysis with an overall assessment of the process quality and identify the first step containing a significant error, if any.

    Solution to analyze:
    {solution_text}

    Your step-by-step evaluation:
    [Your detailed critique and reasoning for each step]

    Overall assessment and first error step:
    [Your summary judgment]
    """
  • Computational Cost: Reasoning evaluators require significant compute, scaling with the length of the solution being evaluated and the length of the generated evaluation CoT. This cost must be balanced against the cost of generating candidate solutions (N in Best-of-N). The trade-off favors evaluation compute when evaluator accuracy is paramount or when generation compute is constrained. The inference cost is roughly $C_{\text{eval}} \approx N \times (\text{Tokens}_{\text{solution}} + \text{Tokens}_{\text{eval\_CoT}})$.
  • Parsing Evaluator Output: The structured output from the reasoning evaluator (CoT, scores, error localization) needs to be reliably parsed to be used programmatically, for instance, in a reranking mechanism. Designing prompts that yield easily machine-readable output is beneficial.
  • Model Choice: The effectiveness of reasoning evaluators depends heavily on the underlying LM's reasoning capabilities. Larger, more capable instruction-tuned models are likely necessary for complex evaluation tasks.
  • Latency: Generating detailed evaluation CoTs increases latency compared to direct evaluators. This may be a constraint in real-time applications but less critical in offline evaluation or batch processing scenarios.
  • Combining Process and Outcome Scores: A mechanism is needed to aggregate the potentially multi-faceted output of reasoning evaluators (e.g., step-wise scores, overall process score, outcome score) into a single ranking metric for Best-of-N. This could involve simple averaging, weighted sums, or more complex heuristics based on error identification; a minimal sketch of parsing and combining such scores follows this list.
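As one illustration of the parsing and score-combination points above, the sketch below extracts a structured verdict from an evaluator's free-text output and combines process-level and outcome-level scores with a weighted sum; the output format, regex, and weight are assumptions for illustration, not the paper's specification.

    import re
    from typing import Optional

    def parse_first_error_step(evaluation_text: str) -> Optional[int]:
        """Extract 'First error step: <k>' (or 'none') from the evaluator's output, if present."""
        match = re.search(r"First error step:\s*(\d+|none)", evaluation_text, re.IGNORECASE)
        if match is None or match.group(1).lower() == "none":
            return None
        return int(match.group(1))

    def combined_score(process_score: float, outcome_score: float, weight: float = 0.5) -> float:
        """Weighted sum of process-level and outcome-level scores used as a reranking metric."""
        return weight * process_score + (1.0 - weight) * outcome_score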

Conclusion

This work demonstrates that scaling compute at evaluation time, specifically by employing LLMs as reasoning process evaluators, significantly enhances evaluation accuracy. This improved evaluation capability translates directly into better problem-solving performance when used for reranking candidate solutions. The findings suggest that investing computational resources in sophisticated, CoT-based evaluation can be a highly effective, and sometimes more efficient, strategy than solely increasing generation-time compute for improving the quality and reliability of LM outputs in complex reasoning domains.