- The paper introduces a rubric-based RL framework that extends reward models to subjective, open-ended tasks using multi-dimensional evaluative anchors.
- The methodology employs advanced aggregation strategies, including veto mechanisms and saturation-aware weighting, to enhance style control and mitigate reward hacking.
- Experimental results demonstrate a +5.2% absolute improvement on open-ended, humanities-centric tasks (creative writing, emotional intelligence, style control) and greater human-likeness, while preserving general reasoning abilities.
Reinforcement Learning with Rubric Anchors: A Technical Analysis
Introduction
The paper "Reinforcement Learning with Rubric Anchors" (Rubicon) addresses a fundamental limitation in the prevailing paradigm of Reinforcement Learning from Verifiable Rewards (RLVR) for LLMs. RLVR, as exemplified by OpenAI's o-series, leverages deterministic, programmatically verifiable signals for reward assignment, which restricts its applicability to domains with clear, objective correctness (e.g., mathematics, code generation). This work extends RLVR to open-ended tasks by introducing rubric-based reward systems, enabling scalable RL in domains where outputs are inherently subjective or multidimensional.
Rubric-Based Reward System
Rubicon formalizes rubrics as multi-dimensional evaluative anchors, each comprising a criterion description, a set of score tiers, and an associated weight. The reward function R(y ∣ x, ℛ), where y is the response, x the prompt, and ℛ the rubric set, maps the response to a vector of scores across K rubric dimensions, which are then aggregated via advanced strategies:
- Veto Mechanisms: Critical dimensions can nullify the total reward if violated, serving as hard constraints.
- Saturation-Aware Aggregation: Diminishing returns are modeled to prevent over-optimization of single dimensions.
- Pairwise Interaction Modeling: Non-linear dependencies between criteria are explicitly captured.
- Targeted Reward Shaping: Non-linear mappings amplify score differentials in high-performance regions, enhancing gradient informativeness.
This framework unifies both programmatically verifiable and open-ended evaluation protocols, supporting granular, interpretable reward signals for policy optimization.
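To make the aggregation concrete, the sketch below combines per-rubric scores with a veto check, saturation-aware weighting, and a shaping map. It is a minimal illustration under assumed conventions (normalized scores in [0, 1], a power-law saturation term, a quadratic shaping map), not the paper's implementation, and it omits pairwise interaction terms for brevity.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Rubric:
    """A single evaluative anchor: criterion text, score tiers, weight, veto flag."""
    criterion: str
    tiers: List[str]                 # descriptions of the score levels
    weight: float = 1.0
    is_critical: bool = False        # critical rubrics can veto the whole reward

def aggregate_reward(scores: List[float], rubrics: List[Rubric],
                     saturation: float = 0.5, shaping_power: float = 2.0) -> float:
    """Combine normalized per-rubric scores in [0, 1] into a scalar reward.

    - Veto: any critical rubric scored at zero nullifies the reward.
    - Saturation-aware weighting: a concave transform (s ** saturation) yields
      diminishing returns, discouraging over-optimization of a single dimension.
    - Targeted shaping: a convex map on the weighted mean amplifies score
      differentials in the high-performance region.
    """
    assert len(scores) == len(rubrics)
    # Hard constraint: a violated critical dimension zeroes out the reward.
    if any(r.is_critical and s <= 0.0 for s, r in zip(scores, rubrics)):
        return 0.0
    # Saturation-aware weighted average.
    total_weight = sum(r.weight for r in rubrics)
    saturated = sum(r.weight * (s ** saturation) for s, r in zip(scores, rubrics))
    base = saturated / total_weight
    # Non-linear shaping to stretch the top of the score range.
    return base ** shaping_power
```

Raising the saturation exponent toward 1 removes the diminishing-returns effect, while a larger shaping power further sharpens differences among already-strong responses; both values here are placeholders chosen for illustration.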
Rubric Construction and Data Curation
Rubicon's rubric bank comprises over 10,000 rubrics, generated via human annotation, LLM synthesis, and hybrid workflows. Rubrics are constructed at multiple granularities: dataset-level, task-level, and instance-level. The rubric-first workflow exploits evaluative asymmetry (verifying a response against criteria is easier than generating one): data are curated to match rubric criteria, and the same rubrics are then reused for supervision, reward shaping, and evaluation.
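The snippet below illustrates what entries at the three granularities might look like; the field names, criteria, and tier labels are hypothetical stand-ins rather than the paper's actual schema.

```python
# Hypothetical rubric-bank entries; fields and wording are illustrative only.
rubric_bank = [
    {   # dataset-level: applies to every sample in a corpus
        "granularity": "dataset",
        "criterion": "Response avoids formulaic 'AI-speak' and boilerplate disclaimers.",
        "tiers": ["pervasive", "occasional", "absent"],
        "weight": 1.0,
    },
    {   # task-level: shared across one task family (e.g. creative writing)
        "granularity": "task",
        "criterion": "Narrative maintains a consistent point of view and voice.",
        "tiers": ["inconsistent", "mostly consistent", "fully consistent"],
        "weight": 1.5,
    },
    {   # instance-level: tied to a single prompt and its reference material
        "granularity": "instance",
        "criterion": "Reply addresses the sender's stated concern with specific reassurance.",
        "tiers": ["ignored", "mentioned", "explored in depth"],
        "weight": 2.0,
    },
]
```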
RL Training Protocol
Multi-Stage RL Pipeline
Rubicon employs a two-stage RL protocol:
- Stage 1: Focuses on instruction-following and constraint handling, using static, verifiable rubrics to build a robust foundation.
- Stage 2: Targets open-ended, creative, and socially grounded tasks, leveraging instance-specific rubrics and reference-based evaluation to foster adaptability and richer expression.
Offline data filtering is applied between stages, retaining only samples within a calibrated central quantile of critic scores to maximize learning signal and minimize noise.
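A minimal sketch of this between-stage filtering step, assuming one critic score per sample; the quantile bounds are placeholders rather than the paper's calibrated values.

```python
import numpy as np

def filter_central_quantile(samples, critic_scores, lower_q=0.25, upper_q=0.75):
    """Keep only samples whose critic score falls in a central quantile band.

    Both tails are discarded, following the paper's stated aim of maximizing
    learning signal while minimizing noise. The bounds here are illustrative,
    not the calibrated values used in the paper.
    """
    scores = np.asarray(critic_scores, dtype=float)
    lo, hi = np.quantile(scores, [lower_q, upper_q])
    return [s for s, c in zip(samples, scores) if lo <= c <= hi]
```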
Defense Against Reward Hacking
Reward hacking—specious maximization of rubric scores without substantive improvement—emerges as a significant challenge. Rubicon introduces an adaptive defense rubric, synthesized from empirical analysis of rollout data, to penalize superficial reward proxies (e.g., sycophancy, self-evaluation artifacts). This mechanism is integrated as a supervisory constraint in subsequent RL stages, stabilizing training and preventing policy collapse.
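One way to picture this is as a penalizing term layered on top of the rubric aggregate. The heuristic marker lists below are illustrative assumptions; the paper synthesizes its defense rubric from empirical analysis of rollout data rather than from fixed string checks.

```python
# Illustrative proxies for superficial reward hacking; not the paper's rubric.
SYCOPHANCY_MARKERS = ["what a great question", "as you wisely noted"]
SELF_EVAL_MARKERS = ["this response fully satisfies the rubric", "score: 10/10"]

def defense_penalty(response: str) -> float:
    """Return a multiplicative penalty in (0, 1] for detected reward proxies."""
    text = response.lower()
    hits = sum(marker in text for marker in SYCOPHANCY_MARKERS + SELF_EVAL_MARKERS)
    return 1.0 / (1.0 + hits)   # each detected artifact shrinks the reward

# Applied as a supervisory constraint on top of the rubric aggregate, e.g.:
# reward = aggregate_reward(scores, rubrics) * defense_penalty(response)
```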
Experimental Results
Quantitative Gains
Rubicon-preview (Qwen3-30B-A3B RL-trained with rubrics) demonstrates strong performance on open-ended, humanities-centric benchmarks:
- +5.2% absolute improvement over the base model across creative writing, emotional intelligence, and style control tasks.
- Outperforms DeepSeek-V3-671B by +2.4% on these benchmarks, despite a 22x smaller parameter count and only 5K training samples.
- General and reasoning abilities are preserved: no degradation on MMLU, HellaSwag, StoryCloze, CoQA, or SocialIQA, and modest improvements on math benchmarks (AIME24: +4.17%, AIME25: +0.83%).
Qualitative Analysis: Style Control
Rubrics serve as explicit anchors for output style, enabling fine-grained control over narrative voice, emotional expressiveness, and avoidance of formulaic "AI-speak." Case studies show Rubicon-preview produces responses with greater human-likeness and stylistic authenticity compared to baseline models, as evaluated by rubric-guided critics.
Seesaw Effect and Multi-Stage Mitigation
Joint RL training on conflicting task types (constraint-following vs. creativity/empathy) induces a "seesaw effect," with performance trade-offs between domains. Rubicon's multi-stage RL schedule mitigates this by sequentially layering capabilities, achieving balanced improvements without regression in core abilities.
Implementation Considerations
- Token Efficiency: Significant gains are achieved with only 5K training samples, suggesting a new post-training scaling law where rubric diversity compensates for limited data.
- Computational Requirements: The multi-stage protocol and rubric-based filtering reduce overhead compared to monolithic RL runs.
- Scalability: The framework is extensible to new domains by expanding the rubric bank and adapting aggregation strategies.
- Limitations: Optimal rubric granularity, hierarchical structure, and defenses against reward hacking require further systematic study.
Implications and Future Directions
Rubicon demonstrates that rubric-based RL can unlock scalable training for LLMs in non-verifiable domains, enabling controllable output style and enhanced human-likeness. The approach is complementary to RLVR and invites exploration of hybrid frameworks combining verifiable and rubric-based rewards. Open questions remain regarding the scaling laws of rubric diversity vs. token count, optimal rubric system design, and the management of reward hacking in increasingly complex RL settings.
Conclusion
"Reinforcement Learning with Rubric Anchors" establishes a principled framework for extending RL-based LLM training to open-ended tasks via structured, interpretable rubrics. The empirical results validate the efficacy of rubric-based RL in enhancing subjective and stylistic capabilities while maintaining general reasoning performance. The work provides a foundation for future research into scalable, controllable, and robust RL post-training for LLMs across diverse domains.