- The paper introduces a cooperative-critical framework (C2) that enhances reward model accuracy by jointly training a rubric generator and a verifier using binary preferences.
- It employs contrastive rubric synthesis and trains the verifier to explicitly reject misleading rubrics, yielding more robust and finer-grained reward modeling.
- Empirical results show significant accuracy gains (up to +6.5 points) and improved performance on RLHF benchmarks while reducing reliance on costly external annotations.
C2: A Scalable and Trustworthy Framework for Rubric-Augmented Reward Modeling
Rubric-augmented reward modeling has shown potential to increase reliability and interpretability in reinforcement learning from human feedback (RLHF), especially for tasks with implicit or subjective evaluation criteria. While single-model reward models (RMs) trained only on binary preferences often lack robustness and interpretability, rubric-guided approaches can decompose holistic evaluation into tractable sub-questions, offering finer-grained and more trustworthy judgment signals. However, previous rubric-augmented methods depend on expensive external rubric annotations or proprietary large models, which fundamentally restricts their scalability.
The paper identifies a central challenge: naively self-generated rubrics vary widely in quality. Some yield large improvements in RM accuracy, many are neutral, and low-quality rubrics actively harm verification. Past work largely overlooks this risk of verification degradation caused by misleading rubrics. The authors hypothesize that robust, scalable rubric-augmented verification from binary preferences alone is possible if the system deliberately incorporates a mechanism for critical communication: the generator is trained to provide helpful rubrics, while the verifier is explicitly trained to assess and reject misleading guidance.
Proposed Method: Cooperative yet Critical Reward Modeling (C2)
The C2 framework formalizes rubric generation and selection as a two-agent cooperative-critical process, jointly optimizing a cooperative rubric generator and a critical verifier using binary preference data alone. The pipeline consists of the following components:
- Contrastive Rubric Synthesis: For each preference example (a prompt with a chosen and a rejected response), multiple rubric candidates are generated and labeled as "helpful" or "misleading" depending on whether they shift the verifier's judgment toward or away from the gold-standard label, measured by the margin improvement in the verifier's confidence (see the sketch after this list).
- Cooperative Generator Training: The rubric generator is optimized with Direct Preference Optimization (DPO) on these contrastive pairs; it is trained to increase the likelihood of generating rubrics classified as helpful over misleading ones.
- Critical Verifier Training: The verifier is trained via Group Relative Policy Optimization (GRPO) to first assess rubric validity—outputting a binary decision (“helpful” or “misleading”)—before producing its verdict. If it rejects the rubric, it falls back to rubric-free evaluation.
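A minimal sketch of how the contrastive labeling and generator training above could be realized; the `generator.sample_rubrics` / `verifier.confidence` interfaces, the neutrality margin, and the helper names are illustrative assumptions rather than the paper's code, and the DPO loss shown is the standard formulation:

```python
# Sketch of contrastive rubric synthesis from binary preference data.
# Hypothetical interfaces: generator.sample_rubrics(prompt, n) returns rubric
# strings, and verifier.confidence(prompt, chosen, rejected, rubric=...) returns
# the probability assigned to the gold preference (chosen over rejected).

from dataclasses import dataclass

import torch.nn.functional as F


@dataclass
class LabeledRubric:
    rubric: str
    delta: float   # confidence shift relative to rubric-free verification
    helpful: bool  # True if the rubric moves the verifier toward the gold label


def label_rubrics(prompt, chosen, rejected, generator, verifier,
                  n_candidates=8, margin=0.05):
    """Label self-generated rubric candidates as helpful or misleading."""
    base = verifier.confidence(prompt, chosen, rejected, rubric=None)
    labeled = []
    for rubric in generator.sample_rubrics(prompt, n=n_candidates):
        guided = verifier.confidence(prompt, chosen, rejected, rubric=rubric)
        delta = guided - base                 # margin improvement from the rubric
        if abs(delta) >= margin:              # drop near-neutral rubrics
            labeled.append(LabeledRubric(rubric, delta, helpful=delta > 0))
    return labeled


def build_dpo_pairs(labeled):
    """Pair helpful (preferred) with misleading (dispreferred) rubrics."""
    helpful = [r.rubric for r in labeled if r.helpful]
    misleading = [r.rubric for r in labeled if not r.helpful]
    return [(h, m) for h in helpful for m in misleading]


def dpo_loss(logp_helpful, logp_misleading,
             ref_logp_helpful, ref_logp_misleading, beta=0.1):
    """Standard DPO objective applied to (helpful, misleading) rubric pairs."""
    logits = beta * ((logp_helpful - ref_logp_helpful)
                     - (logp_misleading - ref_logp_misleading))
    return -F.logsigmoid(logits).mean()
```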
This explicit design allows the generator to explore the rubric space in a way that iteratively improves verifier judgments, while the verifier retains control by dynamically choosing which rubrics to trust at inference time (sketched below).
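A sketch of the resulting inference-time control flow, assuming hypothetical `judge_rubric` and `verdict` methods; in the paper this logic lives inside a single GRPO-trained verifier rather than separate calls:

```python
# Sketch of the critical verifier's inference-time flow.
# Hypothetical interfaces: verifier.judge_rubric returns "helpful" or "misleading",
# and verifier.verdict scores the response pair with or without rubric guidance.

def verify(prompt, chosen, rejected, generator, verifier):
    rubric = generator.best_rubric(prompt)            # cooperative generator proposes a rubric
    decision = verifier.judge_rubric(prompt, rubric)  # critical step: accept or reject it
    if decision == "helpful":
        return verifier.verdict(prompt, chosen, rejected, rubric=rubric)
    # Reject misleading guidance and fall back to rubric-free evaluation.
    return verifier.verdict(prompt, chosen, rejected, rubric=None)
```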
Empirical Evaluation
Rubric Quality Analysis
The paper presents a thorough empirical dissection of the two-sided nature of self-generated rubrics. By scoring rubrics according to how well they reflect the task's core intent and enable correct preference discrimination, the authors demonstrate the following (a sketch of the analysis appears after the list):
- Most self-generated rubrics exert negligible effect (confidence shift Δ concentrated near zero).
- High-quality rubrics yield strong accuracy gains (e.g., +8.2% for Tulu3-8B-SFT and +13.6% for Qwen3-8B on RM-Bench).
- Low-quality rubrics drastically reduce verifier accuracy, evidencing that indiscriminate rubric usage can degrade performance below baseline.
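A rough sketch of how such an analysis could be reproduced, reusing the hypothetical generator/verifier interfaces from the earlier sketch; this is an illustration, not the paper's evaluation code:

```python
# Sketch of the rubric-quality analysis: distribution of confidence shifts and
# accuracy under oracle-selected best vs. worst rubrics per example.

import statistics

def rubric_effect_report(examples, generator, verifier, n_candidates=8):
    deltas, acc_base, acc_best, acc_worst = [], [], [], []
    for prompt, chosen, rejected in examples:
        base = verifier.confidence(prompt, chosen, rejected, rubric=None)
        shifts = [
            verifier.confidence(prompt, chosen, rejected, rubric=r) - base
            for r in generator.sample_rubrics(prompt, n=n_candidates)
        ]
        deltas.extend(shifts)
        acc_base.append(base > 0.5)                 # rubric-free correctness
        acc_best.append(base + max(shifts) > 0.5)   # oracle high-quality rubric
        acc_worst.append(base + min(shifts) > 0.5)  # worst-case misleading rubric
    return {
        "median_delta": statistics.median(deltas),  # typically concentrated near zero
        "acc_no_rubric": statistics.mean(acc_base),
        "acc_best_rubric": statistics.mean(acc_best),
        "acc_worst_rubric": statistics.mean(acc_worst),
    }
```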
Main Results
The main evaluation compares five settings:
- Baseline pretrained model
- Reasoning reward model (GRPO-trained, no rubric guidance)
- Self-rubric augmented reward model (naive)
- External large-model rubric augmentation
- C2's cooperative-critical training with dynamic rubric selection
Key benchmark findings include:
- C2 achieves up to +6.5 accuracy points over reasoning reward models on challenging benchmarks (e.g., RM-Bench).
- On Qwen3-8B, C2 matches the performance of external-rubric models that utilize rubrics from 4x larger LMs.
- In RLHF settings (AlpacaEval 2.0, Arena-Hard), DPO-trained policies using C2-aligned RMs yield up to +6.0 points improvement in length-controlled win rate.
- C2 remains robust, with only minor accuracy drops as the proportion of low-quality rubrics increases, whereas baselines degrade sharply under rubric noise.
Ablations show that every component matters: removing negative (misleading) rubrics during training causes the largest performance drop, underscoring the need for adversarial examples in robust verification.
Cost, Latency, and Compute
Analysis establishes that C2's gains do not arise solely from increased inference-time compute. Even under compute-matched settings (majority voting for reasoning RMs, as sketched below), C2 maintains a 2–3 point lead. Inference is roughly 2.3–2.4x slower than for reasoning RMs, but the gains are attributed primarily to the cooperative-critical training design and selective rubric filtering rather than to the extra compute.
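For illustration, a compute-matched majority-voting baseline of the kind referenced above might be implemented as follows; the `sample_verdict` interface and the choice of k are assumptions:

```python
# Sketch of a compute-matched majority-voting baseline for a reasoning RM.
# Hypothetical interface: reasoning_rm.sample_verdict returns "chosen" or
# "rejected" from one stochastic forward pass; k is set so the total number of
# passes roughly matches C2's rubric-generation plus verification cost.

from collections import Counter

def majority_vote_verdict(prompt, chosen, rejected, reasoning_rm, k=3):
    votes = Counter(
        reasoning_rm.sample_verdict(prompt, chosen, rejected) for _ in range(k)
    )
    return votes.most_common(1)[0][0]
```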
Theoretical and Practical Implications
From a theoretical standpoint, C2 operationalizes the principle of cooperative communication (à la Grice and Sperber): the generator and verifier continually calibrate signals and trust in pursuit of robust verification. This turns standard reward modeling into a dynamic, interaction-based system that leverages preference data without incurring annotation cost or leaving failures of cooperation unaddressed.
Practically, C2 enables more scalable and trustworthy RM construction. It replaces dependence on handcrafted or proprietary rubrics with a self-improving loop that maintains accuracy while reducing labeling overhead. This makes it feasible to exploit massive existing binary preference corpora for RLHF alignment and post-training evaluation in domains with implicit criteria. Moreover, enabling the verifier to reject misleading evaluation rubrics significantly increases trustworthiness and interpretability.
Limitations and Future Directions
Model performance still depends on the base model's capacity for reasoning: weaker LMs struggle more to distinguish between high- and low-quality rubric guidance, which may lead to under-utilization of helpful rubrics. Additionally, the compute and latency overhead from rubric generation and repeated inference remains nontrivial. Targeted research into more efficient rubric generation, better training signal selection, and adaptive inference schemes will be critical for deployment in resource-constrained environments.
Further, reliable rubric assessment remains challenging—hallucinations, model confusion, and adversarial input may still induce errors, especially in ambiguous contexts. Extensions toward robust uncertainty calibration, more granular rubric evaluation, and explicit out-of-distribution handling constitute promising avenues for future work.
Conclusion
C2 provides a methodologically rigorous, scalable, and robust strategy for rubric-augmented reward modeling using only inexpensive binary preferences. By formalizing and implementing deliberate cooperation and critical assessment between rubric generators and verifiers, the framework achieves significant gains in accuracy and downstream RLHF alignment without external rubric supervision. This work lays foundational groundwork for trustworthy, interpretable, and practically scalable reward modeling, an essential component of robust LLM alignment and evaluation pipelines.