
J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning (2505.10320v1)

Published 15 May 2025 in cs.CL, cs.AI, and cs.LG

Abstract: The progress of AI is bottlenecked by the quality of evaluation, and powerful LLM-as-a-Judge models have proved to be a core solution. Improved judgment ability is enabled by stronger chain-of-thought reasoning, motivating the need to find the best recipes for training such models to think. In this work we introduce J1, a reinforcement learning approach to training such models. Our method converts both verifiable and non-verifiable prompts to judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias. In particular, our approach outperforms all other existing 8B or 70B models when trained at those sizes, including models distilled from DeepSeek-R1. J1 also outperforms o1-mini, and even R1 on some benchmarks, despite training a smaller model. We provide analysis and ablations comparing Pairwise-J1 vs Pointwise-J1 models, offline vs online training recipes, reward strategies, seed prompts, and variations in thought length and content. We find that our models make better judgments by learning to outline evaluation criteria, comparing against self-generated reference answers, and re-evaluating the correctness of model responses.

Summary

  • The paper presents a novel RL framework that incentivizes explicit chain-of-thought reasoning to enhance judgment quality in LLMs.
  • It converts subjective and verifiable tasks into synthetic training pairs with verifiable rewards, enabling scalable LLM evaluation.
  • Benchmark evaluations show that both pairwise and pointwise J1 variants outperform existing reward models in consistency and accuracy.

The paper introduces J1, a novel reinforcement learning (RL) approach designed to train LLMs specifically for the role of an impartial judge (LLM-as-a-Judge). The core idea is to incentivize the LLM to generate explicit reasoning steps, often referred to as chain-of-thought (CoT) or "thinking", before making a judgment. This is motivated by the observation that improved reasoning leads to better judgment quality, which is crucial for tasks like evaluating LLM responses during development and deployment.

The primary technical contribution of J1 is a training recipe that converts standard judgment tasks, even subjective ones like evaluating chat responses, into verifiable tasks suitable for RL. This is achieved by constructing synthetic training data consisting of pairs of responses to a given instruction, where one response is designated as "high quality" and the other "low quality". For verifiable tasks (e.g., math problems), the gold answer determines the quality. For non-verifiable tasks (e.g., open-ended chat), lower quality responses are generated by perturbing the input or sampling from a less capable model, ensuring a ground truth preference exists for training purposes. This synthetic data generation process removes the dependency on costly human annotations.
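To make the data-construction recipe concrete, here is a minimal sketch in Python. The helper names (respond, corrupt_instruction) and the simple string check for correctness are illustrative assumptions, not APIs or logic from the paper's codebase.

```python
from dataclasses import dataclass
from typing import Callable, Optional
import random

@dataclass
class PreferencePair:
    instruction: str
    chosen: str    # designated "high quality" response
    rejected: str  # designated "low quality" response

def pair_from_verifiable(problem: str, gold_answer: str,
                         candidates: list[str]) -> Optional[PreferencePair]:
    """Verifiable task (e.g. MATH): the gold answer decides which response is better."""
    # A crude string-containment check stands in for proper answer verification.
    correct = [c for c in candidates if gold_answer in c]
    wrong = [c for c in candidates if gold_answer not in c]
    if not correct or not wrong:
        return None  # no ground-truth preference can be formed from these samples
    return PreferencePair(problem, random.choice(correct), random.choice(wrong))

def pair_from_nonverifiable(instruction: str,
                            respond: Callable[[str], str],
                            corrupt_instruction: Callable[[str], str]) -> PreferencePair:
    """Non-verifiable task (e.g. open-ended chat): the rejected response is made
    worse by construction, here by answering a corrupted version of the instruction."""
    chosen = respond(instruction)
    rejected = respond(corrupt_instruction(instruction))
    return PreferencePair(instruction, chosen, rejected)
```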

J1 utilizes an online RL algorithm, Group Relative Policy Optimization (GRPO) (2402.03300), to jointly optimize both the generation of thought tokens and the final judgment. The training objective is based on rule-based verifiable rewards. The primary reward is a Verdict Correctness Reward: the model receives a reward of 1 if its judgment (predicting the better response in a pair) matches the ground truth from the synthetic data, and 0 otherwise. To combat positional bias, a common issue in pairwise judgments where the order of responses influences the verdict [2024.acl-long.511, 2024.emnlp-main.474], J1 employs two strategies: training on position-agnostic batches (processing both (x, a, b) and (x, b, a) in the same batch) and incorporating a Verdict Consistency Reward, which grants a reward of 1 only if the model makes the correct judgment for both orderings of the response pair. The paper also experimented with negative rewards for incorrect verdicts and format rewards, but found simpler positive-only rewards for correctness to be most effective.
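As a concrete illustration of these rule-based rewards, the sketch below implements the correctness and consistency checks; the "Verdict: A/B" output format parsed here is an assumption for illustration, not the paper's exact prompt template.

```python
import re

def parse_verdict(generation: str) -> str | None:
    """Extract a final verdict 'A' or 'B' from the judge's output (assumed format)."""
    m = re.search(r"verdict:\s*([AB])", generation, flags=re.IGNORECASE)
    return m.group(1).upper() if m else None

def correctness_reward(generation: str, gold: str) -> float:
    """1.0 if the predicted verdict matches the ground-truth preference, else 0.0."""
    return 1.0 if parse_verdict(generation) == gold else 0.0

def consistency_reward(gen_ab: str, gen_ba: str, gold_ab: str) -> float:
    """1.0 only if the judge is correct for BOTH orderings of the pair.
    When the pair is presented as (x, b, a), the gold label flips."""
    gold_ba = "B" if gold_ab == "A" else "A"
    both_correct = (parse_verdict(gen_ab) == gold_ab and
                    parse_verdict(gen_ba) == gold_ba)
    return 1.0 if both_correct else 0.0
```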

The paper explores different variants of the J1 training recipe, including:

  1. Pairwise J1 with Verdict (PaV): The primary model, taking (x, a, b) as input and outputting a thought t and a verdict y.
  2. Pairwise J1 with Scores (PaS): Outputs a thought t and real-valued scores (s_a, s_b) for each response. The verdict is derived by comparing the scores.
  3. Pairwise J1 with Scores+Verdict (PaVS): Outputs a thought t, scores (s_a, s_b), and a verdict y.
  4. Pointwise J1 (PoS): Takes (x, a) as input and outputs a thought t and a single score s for the response. This variant is inherently position-consistent. Crucially, Pointwise J1 is trained using distant supervision derived from the same pairwise synthetic data used for the pairwise variants: a reward is assigned only if the generated scores for both responses in a pair are consistent with the pairwise preference (sketched below).
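A minimal sketch of the distant-supervision reward from item 4, assuming the pointwise judge emits a score in a parseable "Score: N" format and that ties receive no reward (both assumptions, not the paper's exact rules):

```python
import re

def parse_score(generation: str) -> float | None:
    """Extract a scalar score from the pointwise judge's output (assumed format)."""
    m = re.search(r"score:\s*([0-9]+(?:\.[0-9]+)?)", generation, flags=re.IGNORECASE)
    return float(m.group(1)) if m else None

def pointwise_pair_reward(gen_chosen: str, gen_rejected: str) -> float:
    """1.0 if the two independently produced scores are consistent with the
    known pairwise preference (chosen scored above rejected), else 0.0."""
    s_c, s_r = parse_score(gen_chosen), parse_score(gen_rejected)
    if s_c is None or s_r is None:
        return 0.0
    return 1.0 if s_c > s_r else 0.0
```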

For implementation, J1-Llama models were trained starting from Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct using the 22K synthetic preference pairs (17K WildChat, 5K MATH). Training used GRPO with specific learning rates, batch sizes, and sequence lengths, and included ablations on KL penalty and entropy bonus effects. Inference is performed with greedy decoding or test-time scaling via self-consistency over multiple thoughts/scores [2023.openreview.net].
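To illustrate the test-time scaling step, here is a minimal self-consistency sketch: judge_once and score_once are stand-in callables for one sampled generation from the model, and averaging pointwise scores is an assumed aggregation choice rather than a detail confirmed by the paper.

```python
from collections import Counter
from statistics import mean
from typing import Callable

def self_consistent_verdict(judge_once: Callable[[], str], k: int = 32) -> str:
    """Pairwise judge: majority vote over k independently sampled verdicts ('A' or 'B')."""
    votes = Counter(judge_once() for _ in range(k))
    return votes.most_common(1)[0][0]

def self_consistent_score(score_once: Callable[[], float], k: int = 32) -> float:
    """Pointwise judge: aggregate k independently sampled scores (here by averaging)."""
    return mean(score_once() for _ in range(k))
```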

The evaluation on multiple benchmarks (PPE [2025.openreview.net], RewardBench (2403.13787), JudgeBench [2025.openreview.net], RM-Bench [2025.openreview.net], FollowBenchEval [2024.acl-long.257]) demonstrates J1's effectiveness.

  • Pairwise J1-Llama-70B achieves state-of-the-art performance on the PPE benchmark, outperforming other generative reward models trained on significantly larger datasets (e.g., DeepSeek-GRM (2504.02495)) and models trained with DPO on the same data (EvalPlanner (2501.18099)).
  • J1 models, at both 8B and 70B scales, outperform distilled versions of the large DeepSeek-R1 model (2501.12948) across benchmarks. While R1 excels on verifiable tasks, J1 shows superior performance on non-verifiable tasks, positioning it as a strong generalist judge.
  • Test-time scaling with self-consistency (SC@32) further improves J1's accuracy and consistency.
  • Comparison of Pointwise-J1 and Pairwise-J1 highlights a trade-off: Pointwise-J1 exhibits significantly better positional consistency (lower verdict flips/ties) than Pairwise-J1, even with bias mitigation techniques applied to the latter. However, Pairwise-J1 tends to produce score distributions with clearer separation between preferred and rejected responses. Training Pointwise-J1 from pairwise supervision is shown to be a viable method for building consistent judges.
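As an illustration of how positional consistency can be quantified at evaluation time, the sketch below runs a pairwise judge on both orderings of each pair; judge is a hypothetical callable, and the bookkeeping is simplified relative to the paper's reported flip/tie metrics.

```python
from typing import Callable

def positional_consistency(judge: Callable[[str, str, str], str],
                           pairs: list[tuple[str, str, str]]) -> float:
    """Fraction of pairs for which the judge picks the same underlying response
    regardless of the order in which the two responses are presented."""
    consistent = 0
    for x, a, b in pairs:
        v1 = judge(x, a, b)  # 'A' means the first-listed response wins
        v2 = judge(x, b, a)
        # Consistent iff the winning response (not the slot) is the same both times.
        if (v1, v2) in {("A", "B"), ("B", "A")}:
            consistent += 1
    return consistent / max(len(pairs), 1)
```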

In practice, J1 offers a recipe for training powerful LLM-as-a-Judge models that can provide detailed reasoning for their evaluations. The use of synthetic data makes the training scalable, and the RL approach effectively encourages the generation of beneficial intermediate thoughts, such as outlining evaluation criteria, generating reference answers, and comparing responses. The distinct Pairwise and Pointwise variants cater to different evaluation needs, with Pointwise J1 being particularly useful where strict consistency is paramount, even when only pairwise preference data is available. The results suggest that RL with verifiable rewards on well-structured synthetic data is a promising avenue for developing capable and generalist evaluation models.
