MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation (2502.12468v1)

Published 18 Feb 2025 in cs.LG and cs.AI

Abstract: The LLM-as-a-Judge paradigm shows promise for evaluating generative content but lacks reliability in reasoning-intensive scenarios, such as programming. Inspired by recent advances in reasoning models and shifts in scaling laws, we pioneer bringing test-time computation into LLM-as-a-Judge, proposing MCTS-Judge, a resource-efficient, System-2 thinking framework for code correctness evaluation. MCTS-Judge leverages Monte Carlo Tree Search (MCTS) to decompose problems into simpler, multi-perspective evaluations. Through a node-selection strategy that combines self-assessment based on historical actions in the current trajectory and the Upper Confidence Bound for Trees based on prior rollouts, MCTS-Judge balances global optimization and refinement of the current trajectory. We further designed a high-precision, unit-test-level reward mechanism to encourage the LLM to perform line-by-line analysis. Extensive experiments on three benchmarks and five LLMs demonstrate the effectiveness of MCTS-Judge, which improves the base model's accuracy from 41% to 80%, surpassing the o1-series models with 3x fewer tokens. Further evaluations validate the superiority of its reasoning trajectory in logic, analytics, thoroughness, and overall quality, while revealing the test-time scaling law of the LLM-as-a-Judge paradigm.

Summary

  • The paper introduces MCTS-Judge, a novel framework applying Monte Carlo Tree Search to significantly enhance Large Language Model accuracy in evaluating code correctness through efficient test-time computation.
  • MCTS-Judge improves evaluation depth and reliability by decomposing complex programming tasks into subproblems guided by a unit-test-level reward mechanism and simulated execution.
  • Empirical validation shows MCTS-Judge substantially elevates code correctness evaluation performance across multiple benchmarks, achieving significant accuracy gains (e.g., 41% to 80%) while using fewer resources than prior state-of-the-art methods.

The paper "MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation" introduces a novel framework, MCTS-Judge, which applies Monte Carlo Tree Search (MCTS) into LLMs for the task of automatic code correctness evaluation. The LLM-as-a-Judge paradigm, although promising for evaluating generative content, struggles in reasoning-intensive tasks like code accuracy assessment. The authors present MCTS-Judge as a System-2 thinking framework designed to improve the reliability and depth of code evaluation by incorporating test-time computation.

Key Contributions

  1. MCTS-Judge Framework:
    • MCTS-Judge utilizes MCTS to decompose complex programming evaluation tasks into simpler subproblems. This allows the LLM to perform evaluations from multiple perspectives, enhancing assessment accuracy.
    • The framework introduces a node-selection strategy that combines self-assessment of the historical actions in the current trajectory with the Upper Confidence Bound for Trees (UCT) computed from prior rollouts, balancing global exploration against refinement of the current trajectory (a minimal sketch of this selection rule appears after this list).
  2. High-Precision Reward Mechanism:
    • A unit-test-level reward mechanism guides the LLM through simulated code execution and encourages detailed line-by-line analysis: test cases are synthesized automatically and then evaluated with the LLM acting as an interpreter, aligning its reasoning with the expected outcomes (a schematic of this reward loop also follows the list).
  3. Empirical Validation:
    • Extensive experiments across three benchmarks (BigCodeBench, HumanEval-X, and APPS) demonstrate that MCTS-Judge markedly improves code correctness evaluation, raising the base model's accuracy from 41% to 80% in the best case.
    • The approach surpasses state-of-the-art reasoning models (the o1 series) on correctness assessment while using roughly three times fewer tokens and smaller base models.
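
To make the node-selection idea in item 1 concrete, here is a minimal sketch in Python. The paper's exact scoring formula is not reproduced here; the blend below of a standard UCT term with a model-provided self-assessment score, including the `JudgeNode` class, the `alpha` weight, and the constant `c`, is an illustrative assumption rather than the authors' implementation.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class JudgeNode:
    """One evaluation step (e.g., a sub-question about the code) in the search tree."""
    action: str                       # evaluation perspective taken at this node
    parent: Optional["JudgeNode"] = None
    children: List["JudgeNode"] = field(default_factory=list)
    visits: int = 0                   # rollouts that passed through this node
    value_sum: float = 0.0            # accumulated rewards from those rollouts
    self_assessment: float = 0.5      # LLM's confidence in the trajectory so far (0..1)

def uct_with_self_assessment(node: JudgeNode, c: float = 1.4, alpha: float = 0.5) -> float:
    """Blend the classic UCT score (based on prior rollouts) with the LLM's
    self-assessment of the current trajectory; alpha is an assumed weighting."""
    if node.visits == 0:
        return float("inf")           # always try unvisited children first
    exploit = node.value_sum / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return alpha * (exploit + explore) + (1 - alpha) * node.self_assessment

def select_child(node: JudgeNode) -> JudgeNode:
    """Greedy selection over the blended score."""
    return max(node.children, key=uct_with_self_assessment)
```

A full rollout would then expand the selected node, score the completed trajectory with the reward step sketched next, and backpropagate that reward up the tree; only the selection rule is shown here.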

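For the unit-test-level reward in item 2, the sketch below shows one plausible shape of the mechanism: synthesize test cases, ask the model to simulate execution of the candidate code on each input (LLM-as-an-interpreter), and reward agreement with the expected outputs. The `llm` callable, the prompt wording, and the helper names are placeholders introduced for illustration, not the paper's prompts or API.

```python
from typing import Callable, List, Tuple

# Placeholder: any text-in, text-out completion function (local model or API client).
LLM = Callable[[str], str]

def synthesize_tests(llm: LLM, problem: str, n: int = 5) -> List[Tuple[str, str]]:
    """Ask the model for (input, expected_output) pairs for the problem statement."""
    raw = llm(
        f"Write {n} unit tests for this problem as lines of the form INPUT => EXPECTED:\n{problem}"
    )
    tests = []
    for line in raw.splitlines():
        if "=>" in line:
            inp, expected = line.split("=>", 1)
            tests.append((inp.strip(), expected.strip()))
    return tests

def simulate_execution(llm: LLM, code: str, test_input: str) -> str:
    """LLM-as-an-interpreter: trace the code line by line and report the final output."""
    return llm(
        "Act as a Python interpreter. Trace the code line by line on the given input "
        f"and reply with only the final output.\nCode:\n{code}\nInput: {test_input}"
    ).strip()

def unit_test_level_reward(llm: LLM, problem: str, code: str) -> float:
    """Reward = fraction of synthesized tests whose simulated output matches the expectation."""
    tests = synthesize_tests(llm, problem)
    if not tests:
        return 0.0
    passed = sum(simulate_execution(llm, code, inp) == expected for inp, expected in tests)
    return passed / len(tests)
```
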
Detailed Analysis

  • Influence of Scaling Laws:
    • The methodology builds on the shift in scaling-law research from training-time compute to test-time compute, arguing that further gains now come more from better use of computation during inference than from expanding parameters during training.
  • Experimental Setup:
    • Evaluation involved five different LLMs, including code-specialized models such as Qwen2.5-Coder-14B and DeepSeek-Coder-V2-16B-Instruct, along with general-purpose models such as Llama-3.1-8B-Instruct.
  • Superior Reasoning Abilities:
    • In comparative studies, MCTS-Judge demonstrated enhanced reasoning capabilities across four dimensions: logic, analytics, thoroughness, and overall quality, positioning it favorably among advanced reasoning LLMs.
  • Robustness and Generalization:
    • In scenarios lacking reference code, MCTS-Judge showed greater robustness and flexibility in adapting to broader, less constrained settings while maintaining accuracy.
  • Efficiency:
    • MCTS-Judge outperformed larger reasoning models while using fewer computational resources, signaling a shift toward more efficient test-time computation that does not compromise capability.

The paper concludes by emphasizing that MCTS-Judge integrates deliberate LLM reasoning at test time, an important step toward strengthening LLMs' evaluative capabilities on programming tasks without increasing training costs. The findings also shed light on the test-time scaling behavior of the LLM-as-a-Judge paradigm, offering guidance for future enhancement strategies.