Reinforcement Learning with Verifiable Rewards (RLVR): Bridging Objective and Subjective Domains
Last updated: June 11, 2025
Reinforcement Learning with Verifiable Rewards (RLVR) has been transformative for LLMs in strictly objective domains, but "Writing-Zero: Bridge the Gap Between Non-verifiable Problems and Verifiable Rewards" demonstrates how to extend these advantages to non-verifiable generative tasks such as creative writing and open-ended dialogue. Here is a detailed technical breakdown of the paper's major contributions and methods.
1. RLVR in Context
RLVR leverages rewards that are objective: verifiable through programmatic checks or universally accepted references. For math and coding, correctness is binary and checkable. Creative writing and open-ended dialogue, however, lack objective references, which presents intrinsic challenges for reward construction and RL fine-tuning.
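For contrast with the non-verifiable setting discussed next, here is a minimal sketch of a rule-based verifiable reward for a math-style task. The "Final answer:" output format and the exact-match check are illustrative assumptions, not tied to any specific RLVR implementation.

```python
# Minimal sketch of a rule-based verifiable reward, as used in classic RLVR
# for math/code. The answer format and exact-match check are assumptions.

def extract_final_answer(completion: str) -> str:
    """Pull the text after a 'Final answer:' marker (assumed output format)."""
    marker = "Final answer:"
    return completion.split(marker)[-1].strip() if marker in completion else completion.strip()

def math_reward(completion: str, reference: str) -> float:
    """Binary, programmatically checkable reward: 1.0 if answers match, else 0.0."""
    return 1.0 if extract_final_answer(completion) == reference.strip() else 0.0

# Example: math_reward("... so x = 4. Final answer: 4", "4") -> 1.0
```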
2. Challenges in Non-verifiable Tasks
- Subjectivity: Writing quality is assessed contextually (style, tone, impact) rather than by exact match.
- Reward Hacking: Scalar reward models trained on human preference data can be gamed, producing artifacts such as over-explanation and length bias.
- Poor Generalization: Human preferences are inconsistent and task-specific, so a naively trained reward model yields unreliable RL signals.
3. Unified RLVR-based Paradigm: Writing-Zero's Approach
The paper proposes a robust, scalable RLVR extension for non-verifiable domains, powered by two innovations:
a. Writing-Principle-Based Pairwise Generative Reward Model (GenRM)
- Self-Principled Critique: GenRM harnesses pre-defined and dynamically selected writing principles (clarity, informativeness, coherence, etc.), prompting the LLM to produce critiques for pairs of candidate responses.
- Pairwise Scoring: For a prompt $x$ and a candidate/comparator pair $(o_c, o_r)$, GenRM generates:
  - A set of writing principles $P$.
  - A combined critique based on those principles.
  - Two scores $(S_c, S_r)$ extracted from the critique, providing a direct comparative signal.
- Verifiable Reward Extraction: The relative score is converted using:
$R_\text{acc} = \begin{cases} 1 & S_c > S_r \\ -1 & S_c < S_r \\ 0 & \text{otherwise} \end{cases}$
and additional sanity-check filters (requiring sufficient score margin, well-formed output, etc.) ensure reward stability.
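The following is a minimal sketch of this extraction step, assuming the GenRM critique embeds scores in a "Score: N" format and that the margin threshold is a tunable hyperparameter; neither detail is specified by the paper.

```python
# Sketch: convert a pair of GenRM critiques into a verifiable reward in {-1, 0, 1}.
# The regex format and the margin filter are illustrative assumptions.
import re
from typing import Optional

SCORE_PATTERN = re.compile(r"Score:\s*(\d+(?:\.\d+)?)")  # assumed critique format

def parse_score(critique: str) -> Optional[float]:
    """Return the numeric score embedded in a GenRM critique, or None if malformed."""
    match = SCORE_PATTERN.search(critique)
    return float(match.group(1)) if match else None

def pairwise_reward(critique_c: str, critique_r: str, margin: float = 0.5) -> int:
    """Map (S_c, S_r) to R_acc in {1, -1, 0}, with format and margin sanity checks."""
    s_c, s_r = parse_score(critique_c), parse_score(critique_r)
    if s_c is None or s_r is None:      # format check failed -> no reward signal
        return 0
    if abs(s_c - s_r) < margin:         # insufficient margin -> treat as a tie
        return 0
    return 1 if s_c > s_r else -1
```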
b. Bootstrapped Relative Policy Optimization (BRPO)
- Dynamic, Reference-Free Policy Evaluation: During RL, for each training batch, one response $o_\text{ref}$ is selected as a "bootstrapped" reference from among the $G$ rollouts.
- Pairwise Comparison: Each other output $o_i$ is scored against $o_\text{ref}$ using GenRM, yielding $R_i = 1$ if $o_i$ wins, $R_i = -1$ if it loses, and $R_i = 0$ otherwise.
- Sample Validity Filtering: To avoid degenerate or biased groups (e.g., all wins or all losses), groups whose comparisons are uniformly one-sided or uninformative are filtered out.
- Advantage for RL Update: The resulting $R_i$ (or its group-normalized version) is used directly as the advantage in the policy gradient update (e.g., a GRPO-style loss).
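A minimal sketch of the validity filter and the normalized advantage, assuming (as is common in GRPO-style training) mean/std normalization within the group; the exact filtering criterion and thresholds are illustrative, not taken from the paper.

```python
# Sketch: filter degenerate rollout groups and turn pairwise rewards into advantages.
import numpy as np

def group_is_valid(rewards: list, min_informative: int = 2) -> bool:
    """Discard degenerate groups: all wins, all losses, or too few decisive comparisons."""
    decisive = [r for r in rewards if r != 0]
    if len(decisive) < min_informative:
        return False
    return not (all(r == 1 for r in decisive) or all(r == -1 for r in decisive))

def group_advantages(rewards: list, eps: float = 1e-6) -> np.ndarray:
    """Mean/std-normalize the pairwise rewards within a group to obtain advantages."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)

# Example: group_advantages([1, -1, 0, 1]) is roughly [0.90, -1.51, -0.30, 0.90]
```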
4. Technical Workflow: Step-by-Step Example
Here's how Writing-Zero’s RLVR pipeline works in practice:
- Batch Sampling: For each prompt, generate a group of candidate responses with the current model policy.
- Reference Bootstrapping: Randomly select one response as the temporary "reference" $o_\text{ref}$.
- Pairwise GenRM Critique: For each other response $o_i$ in the group (see the prompt sketch after this list):
  - Evaluate both $o_i$ and $o_\text{ref}$ using the writing-principle-based GenRM.
  - Extract the scores and compute $R_i$ (1, -1, or 0).
- Filter Batches: If responses are too similar or the group is too homogeneous (judged by the distribution of $R_i$), discard the batch for stability.
- Policy Gradient Update: Use the $R_i$ values as group-structured advantages for standard GRPO-style policy updates, including KL regularization to anchor exploration.
- Iteration: The policy is updated repeatedly, always comparing within the current cohort rather than against fixed references.
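For the pairwise GenRM critique step, the following is an illustrative prompt template; the wording, principle list, and score format are assumptions for this sketch and may differ from the paper's actual prompts.

```python
# Illustrative writing-principle-based pairwise critique prompt for GenRM.
GENRM_PROMPT = """You are a writing judge. Evaluate the two responses below
against these principles: {principles}.

Prompt:
{prompt}

Response A:
{response_a}

Response B:
{response_b}

First write a short critique of each response per principle, then output:
Score A: <1-10>
Score B: <1-10>
"""

def build_genrm_prompt(prompt: str, resp_a: str, resp_b: str,
                       principles=("clarity", "coherence", "informativeness")) -> str:
    """Fill the template; scores are later parsed as in the reward-extraction sketch."""
    return GENRM_PROMPT.format(principles=", ".join(principles), prompt=prompt,
                               response_a=resp_a, response_b=resp_b)
```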
5. Results and Empirical Benefits
- Reward Hacking Resistance: GenRM-based, pairwise/groupwise critique models are robust against length bias and over-explanation—issues plaguing scalar reward models.
- Fine-Grained Improvement: The RLVR paradigm with GenRM and BRPO outperforms SFT and scalar-reward RL on the held-out writing test set (see the table below). Human evaluation further confirms improved writing quality.
Model | WritingBench | Writing Testset |
---|---|---|
Qwen3-32B-Base | 6.89 | 1.23 |
ScalarRM-GRPO (baseline RL) | 8.87 | 2.83 |
Writing-Zero (GenRM-BRPO) | 8.29 | 3.84 |
- Human Preference: In head-to-head human evaluation, Writing-Zero is favored over scalar-RM RL and SFT, especially for coherence, informativeness, and creativity.
- Generalization: The pairwise, principle-driven reward generalizes well to new prompts and genres, even outperforming closed LLM-judge setups in non-English settings.
6. RLVR Unification and Broader Implications
With this approach:
- Verifiable Reward Extension: By centering on principle-based, programmatic, pairwise assessment, even subjective tasks admit a form of verifiability: critiques are introspectively checkable and less susceptible to accidental drift.
- Unified RLVR Landscape:
- Rule-based: Math/code (objective, hard reference).
- Reference-based: QA, paraphrase (looser matching).
- Reference-free/model-based pairwise: Writing-Zero's domain, now made reliably trainable under RLVR principles.
- Modularity & Test-time Scaling: GenRM-based approaches admit ensemble strategies for more robust choices at inference time.
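One simple test-time scaling strategy consistent with the point above is majority voting over repeated GenRM comparisons. The sketch below assumes a hypothetical `genrm_compare(prompt, a, b)` function returning 1, -1, or 0; the ensembling scheme is an illustration, not the paper's prescribed method.

```python
# Sketch: pick the stronger of two candidates by majority vote over GenRM comparisons.
from collections import Counter

def ensemble_pick(prompt: str, cand_a: str, cand_b: str, genrm_compare, n_votes: int = 5) -> str:
    """Return the candidate preferred by the majority of repeated GenRM comparisons."""
    votes = Counter(genrm_compare(prompt, cand_a, cand_b) for _ in range(n_votes))
    return cand_a if votes[1] >= votes[-1] else cand_b
```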
7. Implementation Considerations
- Scalability: GenRM can be trained efficiently from LLMs with just moderate data volume, thanks to rich principle prompting.
- Integration: Transitioning an existing RLHF pipeline to BRPO/GenRM involves slotting pairwise critique and bootstrapped rollouts in place of scalar reward inference (a rough adapter sketch appears after this list).
- Limitations: The main constraint is GenRM quality; stability and robustness depend on an effective, well-covered principle suite and on judicious batch/group sampling.
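As a rough illustration of the integration point above, a trainer that expects a per-response scalar reward function can be handed BRPO-style relative rewards instead. The `genrm_pairwise_reward` wrapper below is hypothetical; this is a sketch of the interface swap, not the paper's implementation.

```python
# Sketch: expose BRPO pairwise rewards through a scalar-reward-style interface.
import random

def brpo_reward_fn(prompt: str, responses: list, genrm_pairwise_reward) -> list:
    """Drop-in replacement for scalar reward inference over a group of rollouts."""
    ref_idx = random.randrange(len(responses))    # bootstrapped reference from the group
    reference = responses[ref_idx]
    rewards = []
    for i, resp in enumerate(responses):
        if i == ref_idx:
            rewards.append(0.0)                   # the reference gets a neutral reward
        else:
            rewards.append(float(genrm_pairwise_reward(prompt, resp, reference)))
    return rewards
```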
8. Pseudocode for BRPO within RLVR
```python
# BRPO reward computation for one prompt x with G rollouts. Assumes `model`,
# `genrm`, `margin`, and `update_policy` are provided by the training loop.
import random

responses = [model.generate(x) for _ in range(G)]

# Bootstrap a reference from the group itself.
ref_idx = random.choice(range(G))
o_ref = responses[ref_idx]

R = []
for i, o in enumerate(responses):
    if i == ref_idx:
        continue
    S_i, S_ref = genrm.score(x, o, o_ref)   # principle-based pairwise critique scores
    if abs(S_i - S_ref) < margin:
        reward = 0    # uninformative comparison
    elif S_i > S_ref:
        reward = 1
    else:
        reward = -1
    R.append(reward)

update_policy(responses, R)   # GRPO-style update using R as group advantages
```
9. Key Mathematical Formulas
- Group Advantage (for GRPO-style loss): $A_i = \dfrac{R_i - \text{mean}(\{R_1, \dots, R_G\})}{\text{std}(\{R_1, \dots, R_G\})}$
- Policy Update (per GRPO): $\mathcal{J}(\theta) = \mathbb{E}\Big[ \frac{1}{G} \sum_{i=1}^{G} \min\Big( \frac{\pi_\theta(o_i \mid x)}{\pi_{\theta_\text{old}}(o_i \mid x)} A_i,\ \text{clip}\Big( \frac{\pi_\theta(o_i \mid x)}{\pi_{\theta_\text{old}}(o_i \mid x)}, 1-\epsilon, 1+\epsilon \Big) A_i \Big) \Big] - \beta\, D_\text{KL}\big( \pi_\theta \,\|\, \pi_\text{ref} \big)$
Summary Table: Extending RLVR to Subjective Tasks
Aspect | Baseline Scalar RM | Writing-Zero (GenRM+BRPO) |
---|---|---|
Reward source | Scalar (human/LLM) | Pairwise, principled critique |
Reference usage | Fixed static ref | Dynamic, group bootstrapped |
Hacking susceptibility | High (length bias, over-explanation) | Low; resists verbosity and over-explanation |
Domain robustness | Moderate | High (principle-driven, cross-domain) |
Human preference | Variable | Higher |
Integration complexity | Standard RL | Simple slot-in for RLVR RL pipelines |
Conclusion
Writing-Zero delivers a unifying, RLVR-compatible framework for subjective generation tasks. By combining self-principled pairwise critique with robust policy optimization, it makes rewards as verifiable as possible, bridging the gap between objectively checkable and subjective generation tasks and enabling reliable, hack-resistant, and scalable LLM training across the full spectrum of NLP applications.