- The paper introduces a cooperative-critical framework (C2) that enhances reward model accuracy by jointly training a rubric generator and a verifier using binary preferences.
- It employs contrastive rubric synthesis and trains the verifier to explicitly reject misleading rubrics, yielding more robust and finer-grained reward modeling.
- Empirical results show significant accuracy gains (up to +6.5 points) and improved performance on RLHF benchmarks while reducing reliance on costly external annotations.
C2: A Scalable and Trustworthy Framework for Rubric-Augmented Reward Modeling
Rubric-augmented reward modeling has shown potential to increase reliability and interpretability in reinforcement learning from human feedback (RLHF), especially for tasks with implicit or subjective evaluation criteria. While single-model reward models (RMs) trained only on binary preferences often lack robustness and interpretability, rubric-guided approaches can decompose holistic evaluation into tractable sub-questions, offering finer-grained and more trustworthy judgment signals. However, previous rubric-augmented methods depend on expensive external rubric annotations or proprietary large models, which fundamentally restricts their scalability.
The paper identifies a central challenge: naively self-generated rubrics vary widely in quality. Some yield large improvements in RM accuracy, many are neutral, and low-quality rubrics actively harm verification. Past work largely overlooks this risk of verification degradation caused by misleading rubrics. The authors hypothesize that robust, scalable rubric-augmented verification from binary preferences alone is possible if the system deliberately incorporates a mechanism for critical communication: the generator is trained to provide helpful rubrics, while the verifier is explicitly trained to assess and reject misleading guidance.
Proposed Method: Cooperative yet Critical Reward Modeling (C2)
The C2 framework formalizes rubric generation and selection as a two-agent cooperative-critical process, jointly optimizing a cooperative rubric generator and a critical verifier using binary preference data alone. The pipeline consists of the following components:
- Contrastive Rubric Synthesis: For each preference example (a prompt with a chosen and a rejected response), multiple rubric candidates are generated and labeled as "helpful" or "misleading" depending on whether they shift the verifier's judgment toward or away from the gold-standard label, measured by the margin improvement in the verifier's confidence (see the sketch after this list).
- Cooperative Generator Training: The rubric generator is optimized with Direct Preference Optimization (DPO) on these contrastive pairs; it is trained to increase the likelihood of generating rubrics classified as helpful over misleading ones.
- Critical Verifier Training: The verifier is trained via Group Relative Policy Optimization (GRPO) to first assess rubric validity—outputting a binary decision (“helpful” or “misleading”)—before producing its verdict. If it rejects the rubric, it falls back to rubric-free evaluation.
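A minimal sketch of how the contrastive labeling and generator training above could be realized; the `generator.sample_rubrics` / `verifier.confidence` interfaces, the neutrality margin, and the helper names are illustrative assumptions rather than the paper's code, and the DPO loss shown is the standard formulation:

```python
# Sketch of contrastive rubric synthesis from binary preference data.
# Hypothetical interfaces: generator.sample_rubrics(prompt, n) returns rubric
# strings, and verifier.confidence(prompt, chosen, rejected, rubric=...) returns
# the probability assigned to the gold preference (chosen over rejected).

from dataclasses import dataclass

import torch.nn.functional as F


@dataclass
class LabeledRubric:
    rubric: str
    delta: float   # confidence shift relative to rubric-free verification
    helpful: bool  # True if the rubric moves the verifier toward the gold label


def label_rubrics(prompt, chosen, rejected, generator, verifier,
                  n_candidates=8, margin=0.05):
    """Label self-generated rubric candidates as helpful or misleading."""
    base = verifier.confidence(prompt, chosen, rejected, rubric=None)
    labeled = []
    for rubric in generator.sample_rubrics(prompt, n=n_candidates):
        guided = verifier.confidence(prompt, chosen, rejected, rubric=rubric)
        delta = guided - base                 # margin improvement from the rubric
        if abs(delta) >= margin:              # drop near-neutral rubrics
            labeled.append(LabeledRubric(rubric, delta, helpful=delta > 0))
    return labeled


def build_dpo_pairs(labeled):
    """Pair helpful (preferred) with misleading (dispreferred) rubrics."""
    helpful = [r.rubric for r in labeled if r.helpful]
    misleading = [r.rubric for r in labeled if not r.helpful]
    return [(h, m) for h in helpful for m in misleading]


def dpo_loss(logp_helpful, logp_misleading,
             ref_logp_helpful, ref_logp_misleading, beta=0.1):
    """Standard DPO objective applied to (helpful, misleading) rubric pairs."""
    logits = beta * ((logp_helpful - ref_logp_helpful)
                     - (logp_misleading - ref_logp_misleading))
    return -F.logsigmoid(logits).mean()
```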
This explicit design allows the generator to explore the rubric space in a way that iteratively improves verifier judgments, while the verifier retains control by dynamically choosing which rubrics to trust at inference time (sketched below).
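A sketch of the resulting inference-time control flow, assuming hypothetical `judge_rubric` and `verdict` methods; in the paper this logic lives inside a single GRPO-trained verifier rather than separate calls:

```python
# Sketch of the critical verifier's inference-time flow.
# Hypothetical interfaces: verifier.judge_rubric returns "helpful" or "misleading",
# and verifier.verdict scores the response pair with or without rubric guidance.

def verify(prompt, chosen, rejected, generator, verifier):
    rubric = generator.best_rubric(prompt)            # cooperative generator proposes a rubric
    decision = verifier.judge_rubric(prompt, rubric)  # critical step: accept or reject it
    if decision == "helpful":
        return verifier.verdict(prompt, chosen, rejected, rubric=rubric)
    # Reject misleading guidance and fall back to rubric-free evaluation.
    return verifier.verdict(prompt, chosen, rejected, rubric=None)
```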
Empirical Evaluation
Rubric Quality Analysis
The paper presents a thorough empirical dissection of the two-sided nature of self-generated rubrics. By scoring rubrics according to how well they reflect the task's core intent and enable correct preference discrimination, the authors demonstrate the following (a sketch of the analysis appears after the list):
- Most self-generated rubrics exert negligible effect (confidence shift Δ concentrated near zero).
- High-quality rubrics yield strong accuracy gains (e.g., +8.2% for Tulu3-8B-SFT and +13.6% for Qwen3-8B on RM-Bench).
- Low-quality rubrics drastically reduce verifier accuracy, evidencing that indiscriminate rubric usage can degrade performance below baseline.
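A rough sketch of how such an analysis could be reproduced, reusing the hypothetical generator/verifier interfaces from the earlier sketch; this is an illustration, not the paper's evaluation code:

```python
# Sketch of the rubric-quality analysis: distribution of confidence shifts and
# accuracy under oracle-selected best vs. worst rubrics per example.

import statistics

def rubric_effect_report(examples, generator, verifier, n_candidates=8):
    deltas, acc_base, acc_best, acc_worst = [], [], [], []
    for prompt, chosen, rejected in examples:
        base = verifier.confidence(prompt, chosen, rejected, rubric=None)
        shifts = [
            verifier.confidence(prompt, chosen, rejected, rubric=r) - base
            for r in generator.sample_rubrics(prompt, n=n_candidates)
        ]
        deltas.extend(shifts)
        acc_base.append(base > 0.5)                 # rubric-free correctness
        acc_best.append(base + max(shifts) > 0.5)   # oracle high-quality rubric
        acc_worst.append(base + min(shifts) > 0.5)  # worst-case misleading rubric
    return {
        "median_delta": statistics.median(deltas),  # typically concentrated near zero
        "acc_no_rubric": statistics.mean(acc_base),
        "acc_best_rubric": statistics.mean(acc_best),
        "acc_worst_rubric": statistics.mean(acc_worst),
    }
```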
Main Results
The main evaluation compares five settings:
- Baseline pretrained model
- Reasoning reward model (GRPO-trained, no rubric guidance)
- Self-rubric augmented reward model (naive)
- External large-model rubric augmentation
- C2's cooperative-critical training with dynamic rubric selection
Key benchmark findings include:
- C2 achieves up to +6.5 accuracy points over reasoning reward models on challenging benchmarks (e.g., RM-Bench).
- On Qwen3-8B, C2 matches the performance of external-rubric models that utilize rubrics from 4x larger LMs.
- In RLHF settings (AlpacaEval 2.0, Arena-Hard), DPO-trained policies using C2-aligned RMs yield up to +6.0 points improvement in length-controlled win rate.
- C2 remains robust, with only minor accuracy drops as the proportion of low-quality rubrics increases, whereas baselines degrade sharply under rubric noise.
Ablations show that every component matters: removing negative (misleading) rubrics during training causes the largest performance drop, underscoring the need for adversarial examples in robust verification.
Cost, Latency, and Compute
Analysis establishes that C2's gains do not arise solely from increased inference-time compute. Even under compute-matched settings (majority voting for reasoning RMs, as sketched below), C2 maintains a 2–3 point lead. Inference is roughly 2.3–2.4x slower than for reasoning RMs, but the gains are attributed primarily to the cooperative-critical training design and selective rubric filtering rather than to the extra compute.
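For illustration, a compute-matched majority-voting baseline of the kind referenced above might be implemented as follows; the `sample_verdict` interface and the choice of k are assumptions:

```python
# Sketch of a compute-matched majority-voting baseline for a reasoning RM.
# Hypothetical interface: reasoning_rm.sample_verdict returns "chosen" or
# "rejected" from one stochastic forward pass; k is set so the total number of
# passes roughly matches C2's rubric-generation plus verification cost.

from collections import Counter

def majority_vote_verdict(prompt, chosen, rejected, reasoning_rm, k=3):
    votes = Counter(
        reasoning_rm.sample_verdict(prompt, chosen, rejected) for _ in range(k)
    )
    return votes.most_common(1)[0][0]
```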
Theoretical and Practical Implications
From a theoretical standpoint, C2 operationalizes the principle of cooperative communication (à la Grice and Sperber): the generator and verifier continually calibrate signals and trust in pursuit of robust verification. This turns standard reward modeling into a dynamic, interaction-based system that leverages preference data without incurring annotation cost or leaving failures of cooperation unaddressed.
Practically, C2 enables more scalable and trustworthy RM construction. It replaces dependence on handcrafted or proprietary rubrics with a self-improving loop that maintains accuracy while reducing labeling overhead. This makes it feasible to exploit massive existing binary preference corpora for RLHF alignment and post-training evaluation in domains with implicit criteria. Moreover, enabling the verifier to reject misleading evaluation rubrics significantly increases trustworthiness and interpretability.
Limitations and Future Directions
Model performance still depends on the base model's capacity for reasoning: weaker LMs struggle more to distinguish between high- and low-quality rubric guidance, which may lead to under-utilization of helpful rubrics. Additionally, the compute and latency overhead from rubric generation and repeated inference remains nontrivial. Targeted research into more efficient rubric generation, better training signal selection, and adaptive inference schemes will be critical for deployment in resource-constrained environments.
Further, reliable rubric assessment remains challenging—hallucinations, model confusion, and adversarial input may still induce errors, especially in ambiguous contexts. Extensions toward robust uncertainty calibration, more granular rubric evaluation, and explicit out-of-distribution handling constitute promising avenues for future work.
Conclusion
C2 provides a methodologically rigorous, scalable, and robust strategy for rubric-augmented reward modeling using only inexpensive binary preferences. By formalizing and implementing deliberate cooperation and critical assessment between rubric generators and verifiers, the framework achieves significant gains in accuracy and downstream RLHF alignment without external rubric supervision. This work lays foundational groundwork for trustworthy, interpretable, and practically scalable reward modeling, an essential component of robust LLM alignment and evaluation pipelines.