- The paper introduces EIS-GRPO, a novel reinforcement learning algorithm, and J4R, a 7B parameter model, to improve the robustness and accuracy of LLM judges in evaluating reasoning tasks by mitigating positional biases.
- A new benchmark called ReasoningJudgeBench is presented, featuring 1,483 diverse and challenging pairwise samples specifically designed to evaluate judges in reasoning-intensive settings.
- Experiments show J4R achieves superior performance, outperforming GPT-4o by 6.7% and other small judges by 9% in evaluation accuracy on reasoning benchmarks.
Learning to Judge with Equivalent Initial State Group Relative Policy Optimization: An Overview
The paper entitled "J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization" addresses the shortcomings of current model-output evaluation in reasoning-intensive domains. As LLMs are increasingly deployed on complex reasoning tasks, the demand for accurate and efficient evaluation methods has grown significantly. Traditional human evaluation, though accurate, is resource-intensive. Automatic evaluation using LLM judges offers scalability but is hampered by biases and a lack of robust reasoning capability.
Key Contributions
The authors make three principal contributions to enhance the evaluation capabilities of LLM judges:
- Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO): The paper introduces an innovative reinforcement learning algorithm that enhances LLM judges' robustness against positional biases in their assessments. EIS-GRPO enables judges to treat substantively equivalent inputs consistently, thereby reducing random guessing behavior when evaluating high-difficulty tasks.
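The core idea, treating position-swapped judging prompts as "equivalent initial states" and normalizing rewards over the combined group of rollouts, can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation; the function names and the binary-reward setup are hypothetical.

```python
import statistics

def swap_positions(prompt):
    # Hypothetical helper: the same pairwise judging prompt with the two
    # candidate responses presented in the opposite order, and the
    # ground-truth label flipped to match.
    return {**prompt,
            "response_a": prompt["response_b"],
            "response_b": prompt["response_a"],
            "label": "B" if prompt["label"] == "A" else "A"}

def eis_grpo_advantages(rewards_original, rewards_swapped):
    """Group-relative advantages computed over the *combined* group of
    rollouts from both orderings of the same pair (the equivalent
    initial states), rather than per-ordering as in vanilla GRPO.
    Normalizing jointly penalizes verdicts that flip with position."""
    group = rewards_original + rewards_swapped
    mean = statistics.mean(group)
    std = statistics.pstdev(group) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in group]

# Example: 3 rollouts per ordering, reward 1 for a correct verdict.
adv = eis_grpo_advantages([1, 0, 1], [0, 0, 1])
```

Because both orderings share one baseline, a judge that is rewarded only when its verdict tracks the correct answer in either order receives no advantage from position-dependent guessing.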
- ReasoningJudgeBench: Recognizing the limitations of existing benchmarks, the authors present ReasoningJudgeBench, a diverse and challenging benchmark specifically designed to assess judges in reasoning-intensive settings. This benchmark comprises 1,483 pairwise samples sourced from various reasoning tasks, offering a comprehensive testing ground for judge models.
- Judge for Reasoning (J4R): The authors develop J4R, a 7 billion parameter model trained with EIS-GRPO, which outperforms GPT-4o by 6.7% and other small judge models by 9% in evaluation accuracy.
Experimental Results
The paper provides substantial numerical evidence for the efficacy of EIS-GRPO. In evaluation, J4R consistently outperformed much larger judge models trained with standard methods. The results were particularly notable on benchmarks demanding complex reasoning, such as JudgeBench and ReasoningJudgeBench, indicating that EIS-GRPO effectively mitigates positional bias and improves evaluation accuracy.
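Robustness to positional bias is commonly measured by querying a judge with both response orderings and crediting only consistent, correct verdicts. A minimal sketch of such a metric (the function and metric names are assumptions, not necessarily those used in the paper):

```python
def consistent_accuracy(judge, pairs):
    """Fraction of pairs the judge gets right under *both* orderings.
    `judge(question, resp_a, resp_b)` returns "A" or "B"; each pair
    carries the ground-truth winner for the original ordering."""
    correct = 0
    for question, resp_x, resp_y, winner in pairs:
        v1 = judge(question, resp_x, resp_y)          # original order
        v2 = judge(question, resp_y, resp_x)          # swapped order
        swapped_winner = "B" if winner == "A" else "A"
        if v1 == winner and v2 == swapped_winner:
            correct += 1
    return correct / len(pairs)

# A judge that always answers "A" is maximally position-biased:
always_a = lambda q, a, b: "A"
pairs = [("q1", "x", "y", "A"), ("q2", "x", "y", "B")]
score = consistent_accuracy(always_a, pairs)  # 0.0: never consistent
```

Under this metric, a position-biased judge scores near chance or below even when one ordering looks accurate, which is exactly the failure mode EIS-GRPO is designed to train away.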
Implications and Future Directions
The implications of this research span both practical and theoretical domains. Practically, the adoption of EIS-GRPO could lead to more reliable automatic evaluations in machine learning systems, enhancing the deployment of LLMs in areas such as education, content generation, and decision support systems where reasoning is paramount. Theoretically, this work prompts further exploration into reinforcement learning approaches tailored to specific evaluation challenges, potentially bridging gaps between generative models and their capability to critique sophisticated reasoning tasks.
Looking ahead, there is potential for further research into refining EIS-GRPO's methodology, exploring its application across different model architectures, and expanding ReasoningJudgeBench to cover more nuanced reasoning tasks. Moreover, similar algorithms could be developed for other types of biases beyond positional ones, thereby improving LLMs' overall reasoning and evaluation capabilities.
In conclusion, this paper highlights a significant advancement in automatic evaluation methodologies for LLMs and offers a robust framework for future research aimed at overcoming existing limitations in model assessments.