
Reinforcement Learning with Verifiable Rewards (RLVR): Bridging Objective and Subjective Domains

Last updated: June 11, 2025

Reinforcement Learning with Verifiable Rewards (RLVR) has been transformative for LLMs in strictly objective domains, but "Writing-Zero: Bridge the Gap Between Non-verifiable Problems and Verifiable Rewards" demonstrates how to extend these advantages to non-verifiable generative tasks such as creative writing and open-ended dialogue. Here is a detailed technical breakdown of the paper's major contributions and methods.


1. RLVR in Context

RLVR leverages rewards that are non-subjective: verifiable through programmatic means or against universally accepted references. For math and coding, correctness is binary and checkable. However, creative writing and open-ended dialogue lack objective references, presenting intrinsic challenges for reward construction and RL fine-tuning.
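For intuition, a verifiable reward in the objective setting can be as simple as an exact-match or unit-test check. The minimal sketch below is illustrative only (the normalization scheme is an assumption, not from the paper):

# Illustrative rule-based verifiable reward for a math-style answer.
# The whitespace/case normalization is an assumption for this sketch.
def math_reward(model_answer: str, reference_answer: str) -> float:
    normalize = lambda s: s.strip().lower().replace(" ", "")
    return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0

assert math_reward(" 42 ", "42") == 1.0
assert math_reward("41", "42") == 0.0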


2. Challenges in Non-verifiable Tasks

  • Subjectivity: Writing quality is assessed contextually (style, tone, impact) rather than by exact matches.
  • Reward Hacking: Scalar reward models trained on human preference data can be gamed, resulting in artifacts like over-explanation and length bias (a minimal contrast sketch of the scalar objective follows this list).
  • Poor Generalization: Human preferences, being inconsistent and task-specific, lead to unreliable RL signals if the reward model is trained naively.
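For contrast with the generative reward model introduced next, here is a minimal sketch of the standard Bradley-Terry pairwise objective behind scalar reward models; the function and tensor values are illustrative, not code from the paper:

import torch
import torch.nn.functional as F

# Standard Bradley-Terry preference loss used to train scalar reward models.
# r_chosen / r_rejected are scalar rewards for the preferred / dispreferred
# responses in a human-labeled pair (illustrative values below).
def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Maximizes log sigmoid(r_chosen - r_rejected): the model learns only to rank,
    # not why one response is better, which leaves room for length/verbosity bias.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

loss = bradley_terry_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))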

3. Unified RLVR-based Paradigm: Writing-Zero's Approach

The paper proposes a robust, scalable RLVR extension for non-verifiable domains, powered by two innovations:

a. Writing-Principle-Based Pairwise Generative Reward Model (GenRM)

  • Self-Principled Critique: GenRM harnesses pre-defined and dynamically selected writing principles (clarity, informativeness, coherence, etc.), prompting the LLM to produce critiques for pairs of candidate responses.
  • Pairwise Scoring: For a prompt $x$ and candidate/comparator pair $(y_c, y_r)$, GenRM generates:
    • A set of writing principles $\{p_i\} \sim p_\theta(x, y_c, y_r)$.
    • A combined critique $\mathcal{C}$ using those principles.
    • Two scores $S_c, S_r \in [0, 10]$, extracted with $f_{\text{extract}}(\mathcal{C})$, providing a direct comparative signal.
  • Verifiable Reward Extraction: The relative score is converted using:

$R_\text{acc} = \begin{cases} 1 & S_c > S_r \\ -1 & S_c < S_r \\ 0 & \text{otherwise} \end{cases}$

and additional sanity-check filters ensure reward stability (requiring a sufficient margin, valid format, etc.), as sketched below.
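A minimal sketch of this extraction step follows; the critique format ("Score A: x/10 ... Score B: y/10") and the margin value are assumptions for illustration, not the paper's exact prompt or thresholds:

import re

# Parse the two scores out of a GenRM critique and map them to a verifiable
# reward in {-1, 0, 1}. Format and margin checks guard reward stability.
def extract_scores(critique: str):
    match = re.search(r"Score A:\s*([\d.]+).*?Score B:\s*([\d.]+)", critique, re.S)
    return (float(match.group(1)), float(match.group(2))) if match else None

def pairwise_reward(critique: str, margin: float = 0.5) -> int:
    scores = extract_scores(critique)
    if scores is None:            # malformed critique -> no reward signal
        return 0
    s_c, s_r = scores
    if abs(s_c - s_r) < margin:   # insufficient margin -> treat as a tie
        return 0
    return 1 if s_c > s_r else -1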

b. Bootstrapped Relative Policy Optimization (BRPO)

  • Dynamic, Reference-Free Policy Evaluation: During RL, for each training batch, one response $o_\text{ref}$ is selected as a “bootstrapped” reference from among the rollouts.
  • Pairwise Comparison: Each output $o_i$ is scored against $o_\text{ref}$ using GenRM, yielding $R_i = 1$ if $S_i > S_\text{ref}$, and $-1$ otherwise.
  • Sample Validity Filtering: To avoid degenerate or biased groups (e.g., all wins or all losses), rollouts are filtered with (see the sketch after this list):

$\frac{|\sum_{i=1}^G R_i|}{G} > \tau_\text{filter}$

  • Advantage for RL Update: The result $R_i$ (or its normalized version) is used directly as the advantage for policy gradient methods (e.g., a GRPO-style loss):

$\nabla_\theta J = \mathbb{E}_{o_i, t} \left[ R_i \cdot \nabla_\theta \log \pi_\theta(o_{i,t} \mid x, o_{i,<t}) \right]$
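A minimal sketch of the group construction and validity filter, assuming a `genrm_reward(prompt, candidate, reference)` helper that returns a reward in {-1, 0, 1}; the helper name and the filtering direction follow the description above, not verified implementation details:

import random

# Bootstrap a reference from the rollout group, score the rest against it,
# and drop groups whose outcomes are almost entirely one-sided.
def brpo_group_rewards(prompt, rollouts, genrm_reward, tau_filter=0.8):
    ref_idx = random.randrange(len(rollouts))
    ref = rollouts[ref_idx]
    candidates = [o for i, o in enumerate(rollouts) if i != ref_idx]
    rewards = [genrm_reward(prompt, o, ref) for o in candidates]
    # |sum(R_i)| / G close to 1 means nearly all wins or all losses,
    # which yields a degenerate advantage signal, so the group is skipped.
    if abs(sum(rewards)) / len(rewards) > tau_filter:
        return None
    return candidates, rewards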


4. Technical Workflow: Step-by-Step Example

Here's how Writing-Zero’s RLVR pipeline works in practice:

  1. Batch Sampling: For each prompt, generate a group of candidate responses with the current model policy.
  2. Reference Bootstrapping: Randomly select one as the temporary “reference” $o_\text{ref}$.
  3. Pairwise GenRM Critique: For each other response $o_i$ in the group:
    • Evaluate both $o_i$ and $o_\text{ref}$ using the writing-principle-based GenRM.
    • Extract scores $(S_i, S_\text{ref})$ and compute $R_i$ (1, -1, or 0).
  4. Filter Batches: If responses are too similar or the group is too homogeneous (by $\tau_\text{filter}$), discard the batch for stability.
  5. Policy Gradient Update: Use $R_i$ as group-structured advantages for standard GRPO-style policy updates, including KL regularization to anchor exploration.
  6. Iteration: The policy is updated repeatedly, always comparing within the “current cohort” rather than against fixed references.

5. Results and Empirical Benefits

  • Reward Hacking Resistance: GenRM-based, pairwise/groupwise critique models are robust against length bias and over-explanation—issues plaguing scalar reward models.
  • Fine-Grained Improvement: The RLVR paradigm with GenRM and BRPO consistently outperforms SFT and scalar-reward RL (see the table below). Human evaluation further confirms improved writing quality.

Model                        | WritingBench | Writing Testset
Qwen3-32B-Base               | 6.89         | 1.23
ScalarRM-GRPO (baseline RL)  | 8.87         | 2.83
Writing-Zero (GenRM-BRPO)    | 8.29         | 3.84

  • Human Preference: In head-to-head human evaluation, Writing-Zero is favored over scalar-RM RL and SFT, especially for coherence, informativeness, and creativity.
  • Generalization: The pairwise, principle-driven reward generalizes well to new prompts and genres, even outperforming closed LLM-judge setups in non-English settings.

6. RLVR Unification and Broader Implications

With this approach:

  • Verifiable Reward Extension: By centering on principle-based, programmatic, and pairwise assessment, even subjective tasks admit a form of verifiability—introspectively checkable and less susceptible to accidental drift.
  • Unified RLVR Landscape:
    • Rule-based: Math/code (objective, hard reference).
    • Reference-based: QA, paraphrase (looser matching).
    • Reference-free, model-based pairwise: Writing-Zero’s domain, now made reliably trainable under RLVR principles.
  • Modularity & Test-time Scaling: GenRM-based approaches admit ensemble strategies for more robust choices at inference time (a simple majority-vote sketch follows below).
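As one illustration of such test-time scaling, the sketch below samples several independent GenRM judgments per pair and takes a majority vote on the preference; the k-vote scheme and the `genrm_reward` helper are assumptions, not a procedure specified in the paper:

# Majority-vote ensembling over repeated pairwise GenRM judgments.
def ensembled_preference(prompt, candidate, reference, genrm_reward, k=5):
    votes = [genrm_reward(prompt, candidate, reference) for _ in range(k)]  # each in {-1, 0, 1}
    total = sum(votes)
    return 1 if total > 0 else (-1 if total < 0 else 0)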

7. Implementation Considerations

  • Scalability: GenRM can be trained efficiently from existing LLMs with moderate data volumes, thanks to rich principle prompting.
  • Integration: Transitioning an existing RLHF pipeline to BRPO/GenRM involves slotting pairwise critique plus bootstrapped rollouts in place of scalar reward inference (see the interface sketch after this list).
  • Limitations: The main constraint is GenRM quality: stability and robustness depend on effective, well-covered principle suites and on judicious batch/group sampling.
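The sketch below illustrates the interface-level change when slotting BRPO/GenRM into an existing RLHF trainer: per-response scalar scoring is replaced by group-level scoring that needs the whole rollout group in order to bootstrap a reference. The protocol names are hypothetical:

from typing import Protocol, Sequence

# Before: the trainer calls a per-response scalar scorer.
class ScalarRewardProvider(Protocol):
    def score(self, prompt: str, response: str) -> float: ...

# After: BRPO needs the whole rollout group at once so it can pick a
# bootstrapped reference and return pairwise rewards in {-1, 0, 1}.
class GroupRewardProvider(Protocol):
    def score_group(self, prompt: str, responses: Sequence[str]) -> list[int]: ...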

8. Pseudocode for BRPO within RLVR

import random

# One BRPO step for a single prompt x with group size G.
# model, genrm, update_policy, margin, and tau_filter are assumed to be
# provided by the surrounding training code.
responses = [model.generate(x) for _ in range(G)]

# Bootstrap a reference from within the group (no fixed external reference).
ref_idx = random.choice(range(G))
o_ref = responses[ref_idx]

candidates, R = [], []
for i, o in enumerate(responses):
    if i == ref_idx:
        continue
    # Pairwise, principle-based critique of (candidate, reference).
    S_i, S_ref = genrm.score(x, o, o_ref)
    if abs(S_i - S_ref) < margin:
        reward = 0   # margin too small: uninformative comparison
    elif S_i > S_ref:
        reward = 1
    else:
        reward = -1
    candidates.append(o)
    R.append(reward)

# Validity filter: skip degenerate groups (nearly all wins or all losses).
if abs(sum(R)) / len(R) <= tau_filter:
    update_policy(candidates, R)   # R_i used as group-structured advantages


9. Key Mathematical Formulas

  • Group Advantage (for the GRPO-style loss; a minimal code sketch follows these formulas):

$\hat{A}_{i, t} = \frac{R_i - \mathrm{mean}(\{R_i\}_{i=1}^G)}{\mathrm{std}(\{R_i\}_{i=1}^G)}$

  • Policy Update (per GRPO):

$\mathcal{J}_\text{GRPO} = \mathbb{E}\left[\sum_{i=1}^G \sum_{t=1}^{|o_i|} \min\left(c_{i,t}(\theta)\, \hat{A}_{i,t},\ \text{clip}(\cdot)\right) - \beta\, \mathrm{KL}(\pi_\theta \,\|\, \pi_\text{ref})\right]$
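A minimal sketch of the group-advantage normalization above; the epsilon guard against an all-equal reward group is an implementation assumption:

import torch

# Normalize pairwise rewards within a rollout group into GRPO-style advantages.
def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: shape (G,), entries in {-1, 0, 1} from the pairwise GenRM
    return (rewards - rewards.mean()) / (rewards.std() + eps)

adv = group_advantages(torch.tensor([1.0, -1.0, 1.0, 0.0]))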


Summary Table: Extending RLVR to Subjective Tasks

Aspect                  | Baseline Scalar RM     | Writing-Zero (GenRM+BRPO)
Reward source           | Scalar (human/LLM)     | Pairwise, principled critique
Reference usage         | Fixed static reference | Dynamic, group-bootstrapped
Hacking susceptibility  | High                   | Low; resists verbosity and over-explanation
Robustness to domain    | Moderate               | High (principle-driven, cross-domain)
Human preference        | Variable               | Higher
Integration complexity  | Standard RL            | Simple slot-in for RLVR RL pipelines

Conclusion

Writing-Zero delivers a unifying, RLVR-compatible framework for subjective generation tasks, making rewards as verifiable as possible via self-principled pairwise critique and robust policy optimization. It bridges the gap between objectively checkable and subjective generation tasks, enabling reliable, hack-resistant, and scalable LLM training across the full spectrum of NLP applications.