Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge (2501.18099v1)

Published 30 Jan 2025 in cs.AI and cs.CL

Abstract: LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of human-annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench (with a score of 93.9), despite being trained on fewer, and synthetically generated, preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.

Summary

  • The paper introduces EvalPlanner, a novel method that decouples planning and reasoning to improve the evaluation accuracy of LLM outputs.
  • It employs a preference optimization algorithm with self-training loops to generate synthetic evaluation plans, reducing dependence on costly human annotations.
  • Key results on RewardBench and other benchmarks demonstrate state-of-the-art performance and robust generalizability across diverse evaluation tasks.

Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

The paper introduces EvalPlanner, an approach to refining the LLM-as-a-Judge paradigm that improves the evaluation accuracy of LLM outputs by decoupling planning from reasoning. Its main contribution is a systematic three-stage method: generate an unconstrained evaluation plan, execute that plan against the candidate responses, and arrive at a final verdict. This structure is particularly valuable given the absence of human-annotated chains of thought (CoTs) for model-based evaluation, which has traditionally limited the efficacy of such judges.
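
To make the decoupled structure concrete, below is a minimal sketch of the plan-execute-verdict flow. The `generate` callable, the prompt wording, and the split into three separate calls are illustrative assumptions chosen for readability; they are not the paper's exact prompt templates or generation setup.

```python
from typing import Callable

def judge_with_plan(generate: Callable[[str], str],
                    instruction: str,
                    response_a: str,
                    response_b: str) -> tuple[str, str, str]:
    """Sketch of a Thinking-LLM-as-a-Judge pipeline: plan -> execute -> verdict."""
    # Step 1: draft an unconstrained evaluation plan from the instruction alone.
    plan = generate(
        "Write a step-by-step plan for how to evaluate responses to this "
        f"instruction:\n{instruction}"
    )
    # Step 2: execute the plan against both candidate responses.
    execution = generate(
        f"Plan:\n{plan}\n\nFollow the plan to analyze the two responses.\n"
        f"Response A:\n{response_a}\n\nResponse B:\n{response_b}"
    )
    # Step 3: produce the final verdict conditioned on the executed reasoning.
    verdict = generate(
        f"Reasoning:\n{execution}\n\nBased on this reasoning, answer with "
        "'A' or 'B': which response is better?"
    )
    return plan, execution, verdict
```
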

EvalPlanner uses a preference optimization algorithm to iteratively improve evaluation through a self-training loop that synthesizes both evaluation plans and their executions. The methodology addresses a significant challenge in the domain: reliably generating CoTs without extensive dependence on human labels, which are costly and domain-specific to collect. Starting from a seed model such as Llama-3.1-70B-Instruct, EvalPlanner iteratively improves it through self-training on synthetically constructed preference pairs.
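
As a rough illustration of how such a self-training loop can construct preference pairs, the sketch below samples several plan-plus-execution CoTs per example and pairs a CoT that reaches the known-better response with one that does not. The example format, prompt text, and verdict parsing are assumptions made for this sketch, not details taken from the paper.

```python
import random
from typing import Callable

def build_preference_pairs(generate: Callable[[str], str],
                           examples: list[dict],
                           n_samples: int = 4) -> list[tuple[str, str, str]]:
    """Build (prompt, chosen_cot, rejected_cot) triples for preference optimization.

    Each example is assumed to hold an instruction, two responses, and a label
    ('A' or 'B') identifying the better response, e.g. from a synthetic corruption.
    """
    pairs = []
    for ex in examples:
        prompt = ("Plan how to evaluate, execute the plan, then give a verdict.\n"
                  f"Instruction: {ex['instruction']}\n"
                  f"Response A: {ex['response_a']}\nResponse B: {ex['response_b']}")
        correct, incorrect = [], []
        for _ in range(n_samples):
            cot = generate(prompt)  # sampled plan + execution + verdict
            verdict = "A" if cot.strip().endswith("A") else "B"  # naive verdict parsing
            (correct if verdict == ex["label"] else incorrect).append(cot)
        # Keep only examples where both a correct and an incorrect CoT were sampled,
        # so each pair actually contrasts reasoning quality.
        if correct and incorrect:
            pairs.append((prompt, random.choice(correct), random.choice(incorrect)))
    return pairs
```

These triples could then be passed to any DPO-style preference trainer (for instance TRL's `DPOTrainer`), with the updated model becoming the sampler for the next self-training iteration; the specific trainer and iteration schedule are, again, not prescribed here.
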

Key numerical results demonstrate EvalPlanner's efficacy: it achieves state-of-the-art performance among generative reward models on RewardBench, with a score of 93.9, despite being trained on fewer, synthetically generated preference pairs rather than large human-annotated datasets. Its success extends to other benchmarks such as RM-Bench, JudgeBench, and FollowBenchEval, indicating robust generalization to tasks involving multi-level constraints and fine-grained evaluation criteria. Because training relies on synthetic data, the approach scales without traditional human-intensive annotation pipelines.

The implications of this research are significant in both practical and theoretical contexts. Practically, EvalPlanner's ability to handle complex, multi-faceted evaluation tasks suggests potential for deployment in diverse real-world scenarios, from education to automated content moderation. Theoretically, the research opens pathways for further investigations into self-training mechanisms and models' reasoning capabilities, fostering developments in LLMs' interpretability and transparency.

Future directions may include additional self-training iterations and new domains to further refine EvalPlanner's versatility, leading to more robust frameworks for automated evaluation. Integrating EvalPlanner into RLHF pipelines could also strengthen feedback mechanisms, improving the calibration and alignment of LLM outputs with human norms and expectations.

Overall, EvalPlanner presents a compelling advancement in LLM evaluation, underscoring the benefits of planning and reasoning integration for achieving more accurate and transparent AI judgments.