
V-PairRL: Unified Generator and Verifier for LLMs

Updated 6 March 2026
  • The paper introduces a unified generator-verifier framework that co-trains a single LLM to both generate solutions and verify them via pairwise comparisons.
  • The approach employs a composite RL objective integrating generation and pairwise verification to enhance sample efficiency and Pass@1 performance on code and math benchmarks.
  • The unified training strategy leverages prompt engineering to switch roles without extra memory cost, ensuring in-distribution calibration and computational efficiency.

V-PairRL is a unified reinforcement learning framework for LLMs that co-trains a single policy both as a solution generator and as a pairwise self-verifier. In contrast to conventional approaches that use scalar, pointwise scoring for verification, V-PairRL exploits the empirical finding that models are substantially stronger at head-to-head, pairwise self-verification. The technique is instantiated in the $V_1$ framework, where a single decoder-only transformer, such as Qwen3-4B-Instruct-2507, is jointly optimized for both solution generation (e.g., code or math responses) and pairwise comparison of candidates via direct-reward and auxiliary clipped-PPO objectives. This joint training ensures that the verifier remains calibrated to the generator's evolving solution distribution, enabling more effective verification, improved sample efficiency, and state-of-the-art task performance on code-generation and math-reasoning benchmarks (Singh et al., 4 Mar 2026).

1. Unified Generator-Verifier Architecture

V-PairRL leverages a single pretrained, decoder-only LLM to serve two complementary roles via prompt engineering, with no change to the underlying model weights or architecture. The system is structured as follows:

  • Generator mode: Receives a prompt (code or math task), samples reasoning chains and outputs final solutions.
  • Verifier mode: Receives a pair of candidate solutions and produces two scalar ratings that reflect relative solution quality.

Both modes invoke the identical model with role-specific prompt templates: generation prompts for sampling, and verification prompts for emitting paired scalar ratings. The transformer backbone is unmodified: token embeddings, multi-head self-attention, and feedforward layers remain as in standard architectures (Vaswani et al., 2017).

This prompt-based bifurcation avoids additional memory cost and enables weight sharing across the two functional capacities, forming a tightly coupled generator-verifier policy (Singh et al., 4 Mar 2026, Fig. 1).
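
As a rough illustration of the prompt-based role switch, the sketch below builds mode-specific prompts for one shared model. The template wording is an assumption for illustration, not quoted from the paper.

```python
# Minimal sketch of role switching via prompt templates. The template text is
# illustrative, not the paper's exact prompts; both modes hit the same weights.

GEN_TEMPLATE = (
    "Solve the following problem. Think step by step, then give the final "
    "solution.\n\nProblem:\n{task}\n"
)

VERIF_TEMPLATE = (
    "You are given a problem and two candidate solutions. Rate each candidate "
    "on a 1-10 integer scale for correctness.\n\nProblem:\n{task}\n\n"
    "Candidate A:\n{sol_a}\n\nCandidate B:\n{sol_b}\n\nRatings (A, B):"
)

def build_prompt(mode: str, task: str, sol_a: str = "", sol_b: str = "") -> str:
    """Return a prompt for the shared model; only the template changes per role."""
    if mode == "generate":
        return GEN_TEMPLATE.format(task=task)
    if mode == "verify":
        return VERIF_TEMPLATE.format(task=task, sol_a=sol_a, sol_b=sol_b)
    raise ValueError(f"unknown mode: {mode}")
```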

2. Reinforcement-Learning Objective

V-PairRL employs a composite RL objective integrating both generation and pairwise verification terms:

$$J(\theta) = J_\text{Gen}(\theta) + \lambda\, J_\text{PairVerif}(\theta)$$

  • Generation Objective ($J_\text{Gen}$): Adopts Group-Relative PPO (GRPO). For each prompt $q$, $G$ candidate solutions $\{o_i\}$ are sampled from the current policy. Each is assigned a binary reward $r_i \in \{0, 1\}$ via automatic test-case execution, and the group mean $\bar{r}$ defines centered advantages. The clipped policy-gradient surrogate is computed per standard PPO, with advantage $A_i = r_i - \bar{r}$ and clip ratios $\epsilon \in \{0.2, 0.28\}$.
  • Pairwise Verification Objective ($J_\text{PairVerif}$): For each batch, $K$ pairs $\{(s_A, s_B)_k\}$ are formed with at least one correct candidate per pair. Each pair is fed to the LLM in a pairwise prompt; the model emits two integer ratings $r_A, r_B \in \{1, \dots, 10\}$, normalized to $[0,1]$ as $v_A, v_B$. Pairwise verifier rewards are assigned by matching each scalar rating to its ground-truth binary correctness label and applying a tolerance-indicator reward:

$$r_\text{verif}^{(k)} = \tfrac{1}{2} \left[ \mathbb{I}(|v_A - y_A| \leq 0.2)\,(1 - |v_A - y_A|) + \mathbb{I}(|v_B - y_B| \leq 0.2)\,(1 - |v_B - y_B|) \right]$$

The PPO-style policy update is then applied using the verifier advantages $A_\text{verif}^{(k)} = r_\text{verif}^{(k)} - \bar{R}_\text{verif}$, where $\bar{R}_\text{verif}$ is the batch mean.
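
To make the reward concrete, here is a small Python sketch of the tolerance-indicator reward and batch-centered advantages. It assumes the ratings have already been normalized from $\{1, \dots, 10\}$ to $[0,1]$; the exact normalization is not specified here.

```python
def pair_verif_reward(v_a: float, v_b: float, y_a: int, y_b: int,
                      tol: float = 0.2) -> float:
    """Tolerance-indicator reward from Section 2. Each normalized rating v
    (assumed already mapped from the 1-10 scale into [0, 1]) is compared to
    the binary correctness label y; ratings outside the tolerance band
    contribute zero."""
    def term(v: float, y: int) -> float:
        err = abs(v - y)
        return (1.0 - err) if err <= tol else 0.0
    return 0.5 * (term(v_a, y_a) + term(v_b, y_b))

def verif_advantages(rewards: list[float]) -> list[float]:
    """Center rewards by the batch mean: A_verif = r - R_bar."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```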

The complete loss is a weighted sum, with $\lambda = 1.0$ by default.
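
The composite objective can be sketched in PyTorch as below, assuming per-token log-probabilities and advantages are already gathered for both rollout types. Reading the $[0.2, 0.28]$ pair as asymmetric lower/upper clip bounds is an assumption of this sketch.

```python
import torch

def clipped_surrogate(logp_new: torch.Tensor, logp_old: torch.Tensor,
                      adv: torch.Tensor, eps_low: float = 0.2,
                      eps_high: float = 0.28) -> torch.Tensor:
    """Clipped PPO surrogate, aggregated as a mean over tokens."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return torch.minimum(ratio * adv, clipped * adv).mean()

def joint_objective(gen: tuple, ver: tuple, lam: float = 1.0) -> torch.Tensor:
    """J(theta) = J_Gen + lambda * J_PairVerif; each argument is a
    (logp_new, logp_old, advantages) triple for that rollout type."""
    return clipped_surrogate(*gen) + lam * clipped_surrogate(*ver)
```

Since PPO objectives are maximized, an optimizer would in practice minimize the negative of this quantity.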

3. Training Algorithm and Pair Sampling Strategy

Training iterates over batches as follows:

  1. Generation rollouts: For each prompt $q_j$, $G$ solutions $\{s_{j,i}\}$ are sampled and scored for correctness.
  2. Pair selection: Candidate solutions are partitioned into correct ($C_j$) and incorrect ($W_j$) sets. $K$ pairs are randomly selected from $C_j \times W_j \cup C_j \times C_j$, strictly excluding $W_j \times W_j$ pairs to prevent reward hacking and trivial reward loops.
  3. Verifier rollouts: Each selected pair is scored by the LLM in verification mode to obtain normalized ratings. Pairwise verification rewards and advantages are computed.
  4. Policy update: Clipped-PPO objectives are evaluated independently for generator and verifier rollouts. The model parameters are updated in the joint direction.

There is no uncertainty-driven pair prioritization: sampling is uniform over valid pairs containing at least one correct candidate (see the sketch below). The recipe uses no KL penalty, entropy bonus, or value-function standardization; losses are aggregated as a mean over tokens.
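
A minimal sketch of the pair-selection rule in step 2, assuming unordered $C \times C$ pairs and uniform sampling without replacement; ordering and deduplication details are assumptions here.

```python
import random

def sample_pairs(correct: list, wrong: list, k: int, rng=random) -> list:
    """Uniformly sample K pairs from (C x W) union (C x C), never W x W,
    so every pair contains at least one correct solution (Section 3, step 2)."""
    cw = [(c, w) for c in correct for w in wrong]
    cc = [(a, b) for i, a in enumerate(correct) for b in correct[i + 1:]]
    valid = cw + cc
    if not valid:
        return []  # group has no correct candidate; skip verification rollouts
    return rng.sample(valid, min(k, len(valid)))
```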

4. Hyperparameters and Training Protocol

Key training protocol and hyperparameters are shown below:

| Hyperparameter | Value |
|---|---|
| Model | Qwen3-4B-Instruct-2507 |
| Learning rate | $1 \times 10^{-6}$ |
| Batch size | 64 |
| Solve rollouts per prompt | 4 (8 for pure RL) |
| Verify pairs per prompt ($K$) | 4 (0 for pure RL) |
| Optimizer | AdamW |
| PPO clip ratios | [0.2, 0.28] |
| $\lambda$ | 1.0 |
| Sampling temperature | 0.6 |
| Sampling top-p | 0.95 |
| Max prompt length | 10,240 tokens |
| Max response length | 24,576 tokens |
| Training steps | 150 |
| Checkpointing | Best Pass@1 on validation |

All experiments used the rLLM/verl (HybridFlow) framework. The comparison baseline is pure RL with 8 generation rollouts per prompt; V-PairRL spends the same budget as 4 solve plus 4 verify rollouts per prompt.
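
For reference, the protocol above can be captured in a small config object; the field names are illustrative, not taken from the released code.

```python
from dataclasses import dataclass

@dataclass
class VPairRLConfig:
    """Training protocol from Section 4; field names are illustrative."""
    model: str = "Qwen3-4B-Instruct-2507"
    learning_rate: float = 1e-6
    batch_size: int = 64
    solve_rollouts: int = 4       # 8 for the pure-RL baseline
    verify_pairs: int = 4         # K; 0 for the pure-RL baseline
    clip_low: float = 0.2
    clip_high: float = 0.28
    lam: float = 1.0              # weight on J_PairVerif
    temperature: float = 0.6
    top_p: float = 0.95
    max_prompt_len: int = 10_240
    max_response_len: int = 24_576
    train_steps: int = 150
```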

5. Empirical Performance and Comparative Outcomes

V-PairRL was evaluated on code-generation (LiveCodeBench-v5/6, CodeContests, SWE-bench Lite) and math reasoning (AIME, HMMT) tasks. Performance metrics focus on Pass@1 (fraction of tasks solved by the top-ranked candidate) and test-time scaling (improvement by aggregating multiple independent LLM solutions with self-verification).

Main Results

| Model | LCB-v5 | LCB-v6 | CodeContests |
|---|---|---|---|
| RL baseline | 46.0% | 44.1% | 58.9% |
| V-PairRL | 48.9% | 46.8% | 67.6% |
| Δ | +2.9% | +2.7% | +8.7% |

Test-time scaling with pairwise self-verification (V-Infer) at 2× computation shows:

| Model + V-Infer | LCB-v5 | LCB-v6 | CodeContests |
|---|---|---|---|
| RL baseline | 47.5% | 45.1% | 61.4% |
| V-PointRL | 47.4% | 45.7% | 63.0% |
| V-PairRL | 53.9% | 52.5% | 70.3% |

These results correspond to test-time scaling gains of 7–9% over standard RL and pointwise joint training, including an 8.9% absolute gain over the RL baseline on CodeContests. On SWE-bench Lite, the resolve rate increases by 5.0% versus pointwise scoring. For math reasoning (AIME, HMMT), the V-Infer procedure (not RL-trained on math) yields a +10% improvement over pointwise verification on HMMT and +6.7% on AIME.

6. Computational Efficiency and Calibration

V-PairRL's unified approach offers strong computational efficiency:

  • Single-model pipeline: No additional memory for a separate verifier; generation and verification share the full transformer and checkpoint.
  • Training overhead: Total rollouts fixed at 8 per prompt (4 solve, 4 verify), matching the pure RL baseline in compute. Control experiments show that even doubling solver rollouts in pure RL cannot match the gains from co-training.
  • Inference overhead: Identical to other pairwise verification methods: N generations plus V pairwise calls. The V-Infer algorithm reaches a desired Pass@1 with up to 2× fewer LLM calls than pointwise or resampling approaches, due to improved candidate ordering and the higher base quality of the generator (a simplified selection sketch follows this list).
  • Calibration: The co-trained, in-distribution verifier is better matched to the generator, yielding monotonic test-time scaling, no diversity collapse, and higher sample efficiency per unit of verification budget.
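
The released V-Infer uses a Swiss-system tournament; as a simplification, a single knockout pass already illustrates pairwise test-time selection with N−1 verifier calls for N candidates. Here `compare` is a hypothetical helper standing in for one pairwise verification prompt.

```python
def select_best(candidates: list, compare) -> object:
    """Simplified pairwise-verification selection: a knockout pass where
    compare(a, b) returns the preferred candidate via one verifier call.
    This is a sketch, not the released Swiss-system V-Infer algorithm."""
    best = candidates[0]
    for challenger in candidates[1:]:
        best = compare(best, challenger)
    return best
```

With G = 4 candidates this costs three verifier calls; a Swiss-system tournament instead pairs candidates round by round against others with similar records, which is more robust to individual noisy comparisons.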

7. Significance and Release

By jointly optimizing for solution generation and pairwise verification in a single model, V-PairRL achieves stronger generation quality, superior in-distribution self-verification, and compute-efficient test-time scaling on code and math tasks. All code, prompts, and hyperparameters are provided in the public implementation, along with the full V-Infer Swiss-system tournament algorithm: https://github.com/HarmanDotpy/pairwise-self-verification

For an authoritative technical exposition, benchmarks, pseudocode, and empirical analysis, see "V_1: Unifying Generation and Self-Verification for Parallel Reasoners" (Singh et al., 4 Mar 2026).

References

  1. Singh et al. "V_1: Unifying Generation and Self-Verification for Parallel Reasoners." 4 Mar 2026.
