Pairwise J1 with Verdict (PaV)
- The paper introduces a novel pairwise reinforcement learning approach that leverages chain-of-thought reasoning to directly compare candidate responses and mitigate positional bias.
- It employs Group Relative Policy Optimization (GRPO) to jointly optimize intermediate reasoning tokens and final binary verdicts, ensuring accuracy and consistency.
- Empirical results demonstrate superior benchmark performance in PPE accuracy and order-invariant judgments compared to traditional pointwise methods.
Pairwise J1 with Verdict (PaV) refers to a reinforcement learning methodology for training LLMs to perform comparative judgment tasks with high fidelity, particularly in the domain of LLM-as-a-Judge applications. Central to PaV is the explicit contrast of paired responses under a common instruction, incentivizing reasoning and mitigating bias through reward shaping and consistency constraints. The approach incorporates chain-of-thought reasoning, synthetic data generation, and positional robustness, leading to superior performance on evaluation benchmarks and more principled automation of subjective and objective judgment tasks.
1. Pairwise and Pointwise Judgment Models
Pairwise J1 with Verdict (PaV) operates by accepting an instruction $x$ and a pair of candidate responses $(a, b)$, outputting both intermediate “thinking” tokens $t$ (representing the model’s reasoning) and a final binary verdict $v$ indicating which response is preferred. Formally, the model executes:

$(t, v) \sim \pi_\theta(\cdot \mid x, a, b)$
In contrast, the Pointwise-J1 model receives a single response $a$ (along with the instruction $x$), generating a real-valued score $s$ and its own chain-of-thought $t$:

$(t, s) \sim \pi_\theta(\cdot \mid x, a)$
The pairwise format enables direct comparison between candidate responses, resolving judgment in a context-sensitive manner. Position-agnostic training—where both orderings $(a, b)$ and $(b, a)$ are presented—serves to counteract known positional bias: the tendency for verdicts to favor the response presented in a specific position. The imposition of consistency rewards further encourages the model to remain invariant to response order, a property not guaranteed by pointwise approaches.
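The order-swapping logic above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `judge` is a hypothetical stand-in for a pairwise judge model (here a trivial length heuristic), and the checking function queries it under both orderings to detect positional inconsistency.

```python
def judge(instruction, resp_a, resp_b):
    """Hypothetical stand-in for a pairwise judge model: returns
    'A' if the first-presented response is preferred, else 'B'.
    Placeholder heuristic: prefer the longer response."""
    return "A" if len(resp_a) >= len(resp_b) else "B"

def order_invariant_verdict(instruction, a, b):
    """Query the judge under both orderings (a, b) and (b, a);
    the verdict is order-invariant only if both agree on the winner."""
    v1 = judge(instruction, a, b)   # 'A' here means a wins
    v2 = judge(instruction, b, a)   # 'A' here means b wins
    winner1 = a if v1 == "A" else b
    winner2 = b if v2 == "A" else a
    if winner1 is winner2:
        return winner1              # consistent, position-robust verdict
    return None                     # positional inconsistency detected
```

With the toy heuristic, equal-length responses expose an inconsistency (the judge prefers whichever comes first), which is exactly the failure mode the consistency reward penalizes during training.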
2. Reinforcement Learning Optimization Via GRPO
PaV employs an online reinforcement learning framework using Group Relative Policy Optimization (GRPO). During optimization, the model generates both a chain-of-thought $t$ and a final verdict $v$. Two primary reward components are utilized:
- Verdict Correctness Reward: $r_{\text{correct}} = 1$ if the model’s verdict matches the gold label, $0$ otherwise.
- Verdict Consistency Reward: An additional reward $r_{\text{consistent}}$ is granted only if correct verdicts are produced for both orderings $(a, b)$ and $(b, a)$.
The joint reward is thus:

$r = r_{\text{correct}} + r_{\text{consistent}}$
GRPO permits direct optimization over both the thought tokens $t$ and the verdict $v$ without a separate critic, employing group-relative scores as baselines. This design ensures that the model is incentivized not only to produce the correct verdict but also to substantiate its reasoning—thereby fostering more reliable judgment.
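The two reward components can be combined as below. This is a hedged sketch: the verdict encoding (`"A"`/`"B"`) and the magnitude of the consistency bonus are assumptions for illustration, not values taken from the source.

```python
def verdict_reward(verdict_ab, verdict_ba, gold_winner, a, b,
                   consistency_bonus=0.5):
    """Sketch of the PaV joint reward: +1 for a correct verdict on the
    (a, b) ordering, plus a bonus (value assumed here) when the model
    is also correct on the swapped (b, a) ordering."""
    correct_ab = (a if verdict_ab == "A" else b) == gold_winner
    correct_ba = (b if verdict_ba == "A" else a) == gold_winner
    r = 1.0 if correct_ab else 0.0          # verdict correctness reward
    if correct_ab and correct_ba:
        r += consistency_bonus              # order-invariance bonus
    return r
```

The key design choice is that the consistency bonus is only reachable by being correct under both orderings, so the optimizer cannot exploit positional shortcuts.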
3. Evaluation Criteria and Chain-of-Thought Generation
Evaluation centers on the model’s ability to “think” before deciding. Each PaV instance invites the model to first outline relevant evaluation criteria—occasionally prompting reference answer generation—prior to comparison. Synthetic training data, spanning both verifiable (e.g., mathematics) and non-verifiable (e.g., conversational preferences) tasks, is used to reframe subjective judgment as a verifiable, rewardable challenge.
Thought tokens are structured to include explicit tags or criteria specification. Benchmark evaluations (PPE, JudgeBench, RM-Bench) measure both verdict accuracy and positional consistency, where the latter indicates reliability under response order permutation.
PaV models are observed to produce longer chains-of-reasoning (typically ~500 tokens) than pointwise models (~300–400 tokens), with longer chains correlating to more robust verdicts. Variations in seed prompt design and reward shape (e.g., exclusive positive rewards vs. mixed reward/penalty) are found to modulate verdict reliability.
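Extracting the structured thought and verdict from a tagged generation might look like the sketch below. The `<think>`/`<verdict>` tag names are illustrative assumptions; the actual tag scheme depends on the prompt template used in training.

```python
import re

def parse_judgment(output):
    """Split a tagged generation into its chain-of-thought and final
    binary verdict ('A' or 'B'). Returns (None, None) on malformed
    output, which would earn no reward during RL training."""
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    verdict = re.search(r"<verdict>\s*([AB])\s*</verdict>", output)
    if verdict is None:
        return None, None
    thought = think.group(1).strip() if think else ""
    return thought, verdict.group(1)
```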
4. Empirical Performance and Benchmark Results
Table: Benchmark Performance Comparisons
| Model/Version | PPE Accuracy (%) | JudgeBench Consistency | Median Thought Length |
|---|---|---|---|
| J1-Llama-70B PaV | ~69.6 | up to +14 pts over baselines | ~500 tokens |
| EvalPlanner | lower | lower | -- |
| DeepSeek-GRM | lower | lower | -- |
| OpenAI-o1-mini | lower | lower | -- |
| DeepSeek-R1 | lower (some tasks) | lower | -- |
The J1-Llama-70B PaV model achieves leading PPE accuracy and position-consistent performance, outperforming EvalPlanner, DeepSeek-GRM, existing Thinking-LLMs (OpenAI-o1-mini, DeepSeek-R1), and even larger baselines on several non-verifiable judgment benchmarks. Notably, test-time “self-consistency” techniques—sampling multiple chain-of-thought generations—are shown to further improve outcome stability by reducing ties or inconsistencies in verdict output.
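The test-time self-consistency technique mentioned above can be sketched as majority voting over sampled verdicts. `sample_fn` is a hypothetical callable wrapping one stochastic chain-of-thought generation; the sample count is an arbitrary choice here.

```python
from collections import Counter

def self_consistent_verdict(sample_fn, n=8):
    """Draw n independent chain-of-thought generations (sample_fn
    returns 'A' or 'B') and take the majority verdict; an exact tie
    across samples is reported explicitly rather than broken at random."""
    votes = Counter(sample_fn() for _ in range(n))
    (top, c1), *rest = votes.most_common()
    if rest and rest[0][1] == c1:
        return None                 # tie: no stable majority verdict
    return top
```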
5. Judgment Bias, Consistency, and Model Robustness
Positional bias—where a model’s verdict depends on response order—is a principal concern in pairwise judgment. PaV mitigates this through training batch construction containing both $(a, b)$ and $(b, a)$ instances, rewarding only verdicts that are correct and order-invariant.
Pointwise-J1, scoring responses independently, does not manifest positional bias; however, it can result in verdict ties (identical scores for different responses). Ablation studies indicate that while pointwise models excel in positional consistency, pairwise models are superior in absolute accuracy.
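The pointwise trade-off can be made concrete: deriving a pairwise verdict from two independent scores is position-bias-free by construction, but identical scores produce the ties noted above. `score_fn` is a hypothetical pointwise judge returning a real-valued score; the tie tolerance is an assumed parameter.

```python
def pointwise_to_pairwise(score_fn, instruction, a, b, eps=1e-6):
    """Score each response independently (no positional bias possible),
    then compare scores. Equal scores yield an explicit 'tie', an
    outcome a pairwise judge is forced to resolve."""
    sa = score_fn(instruction, a)
    sb = score_fn(instruction, b)
    if abs(sa - sb) <= eps:
        return "tie"
    return "A" if sa > sb else "B"
```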
6. Training Protocols: Data Synthesis, Offline and Online RL
PaV training initiates with synthetic datasets (~22K preference pairs), derived from WildChat (conversational) and MATH (mathematical) domains; this removes reliance on expensive human annotation. Subsequent online RL training (GRPO) jointly optimizes chain-of-thought and verdict output, integrating verifiable and consistency-driven rewards.
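A minimal sketch of annotation-free preference-pair construction follows, assuming (as the source suggests but does not specify in detail) that pairs are built by degrading a known-good response so the gold verdict is verifiable. `corrupt_fn` is a hypothetical perturbation, e.g. truncation or injecting a wrong final answer.

```python
def make_preference_pair(prompt, gold_response, corrupt_fn):
    """Pair a known-good response with a deliberately degraded variant,
    yielding a training example whose preferred side is known without
    human annotation."""
    rejected = corrupt_fn(gold_response)
    return {"instruction": prompt,
            "chosen": gold_response,   # gold label: chosen wins
            "rejected": rejected}
```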
A plausible implication is that synthetic data, if constructed with sufficient diversity and verifiability, enables scalable bootstrapping of robust LLM-as-a-Judge models for both objective and subjective tasks. Careful prompt and reward scheme selection is critical for maximizing verdict accuracy and chain-of-thought quality.
7. Significance, Limitations, and Applications
The PaV approach formalizes judgment as a position-robust, reward-driven comparative reasoning task for LLMs. Notable strengths include:
- Superior comparative reasoning: Contextual side-by-side response evaluation.
- Bias reduction: Position-agnostic reward and multiple-order training.
- Chain-of-thought justification: Intermediate reasoning tokens substantiate verdicts, supporting auditability.
- Synthetic data efficiency: Offline data synthesis for large-scale training.
Limitations include possible chain-of-thought verbosity (leading to longer inference times), residual subjectivity in benchmarks for non-verifiable tasks, and potential overfitting risks associated with synthetic data distributions. Expert deployments require attention to these limitations, particularly in high-risk applications (judicial verdicts, content moderation).
A plausible implication is that PaV, as a principled RL recipe for LLM-as-a-Judge, will underpin future systems tasked with automated evaluation, ranking, or judging in settings characterized by subjective, verifiable, and hybrid challenge formats.