
Contrastive Rubric Generation (CRG) Algorithm

Updated 21 January 2026
  • The CRG algorithm is a method to generate contrastive natural language rubrics by comparing positive and negative responses to prompts.
  • It leverages a margin-based contrastive loss to extract interpretable criteria, enhancing reward modeling and LLM alignment on diverse benchmarks.
  • It integrates a dataset construction pipeline and rejection sampling to ensure rubric consistency and scalable deployment across domains.

OpenRubrics is a modular system for scalable synthetic rubric generation, designed to support reward modeling and alignment in LLMs. The architecture is optimized to elicit, structure, filter, and utilize multi-criteria natural language rubrics, with the aim of replacing or augmenting scalar/pairwise evaluations that inadequately capture the multifaceted nature of response quality. The system demonstrates significant gains in both the accuracy of reward models and downstream policy alignment on diverse instruction-following and biomedical tasks, substantiating the principle-driven paradigm for rubric-based LLM alignment (Liu et al., 9 Oct 2025).

1. Dataset Construction Pipeline

OpenRubrics synthesizes its foundational dataset from heterogeneous open-ended dialogue and instruction-following corpora. Data sources include UltraFeedback (Evol-Instruct, UltraChat, ShareGPT, TruthfulQA), Tulu 2.5 (AlpacaFarm, Chatbot Arena, Capybara, StackExchange, Nectar, SHP, HH-RLHF, HelpSteer, Orca), HelpSteer 3, Skywork-Preference, Tulu 3-IF, MegaScience, and Medical-o1.

Preference pairs are constructed through three primary mechanisms: selection of highest- and lowest-scoring responses (UltraFeedback), response generation and ranking using open-source reward models (Tulu 3-IF/MegaScience/Medical-o1; Athene-RM-8B, Skywork-Reward-V2), and programmatic verifiability checks (verifiable-IF). These pairs undergo preprocessing steps that include discarding trivial or format-violating pairs, token truncation/canonicalization, and deduplication by prompt-response fingerprints. The result is a dataset $\mathcal{D}=\{(x_i,\hat y_i^+,\hat y_i^-,\ell_i)\}_{i=1}^N$, where $\ell_i \in \{+1,-1\}$ encodes the direction of preference.
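The pairing and deduplication steps above can be sketched in a few lines. This is an illustrative skeleton only: the record fields (`text`, `score`) and the SHA-256 fingerprint are assumptions, not the paper's actual schema or hashing scheme.

```python
import hashlib

def build_preference_pairs(prompts_with_scored_responses):
    """Select highest- and lowest-scoring responses per prompt and
    deduplicate by a prompt-response fingerprint (illustrative sketch)."""
    seen, dataset = set(), []
    for prompt, scored in prompts_with_scored_responses:
        if len(scored) < 2:
            continue  # need at least two responses to form a pair
        ranked = sorted(scored, key=lambda r: r["score"], reverse=True)
        y_pos, y_neg = ranked[0]["text"], ranked[-1]["text"]
        if y_pos == y_neg:
            continue  # discard trivial pair (identical responses)
        fp = hashlib.sha256((prompt + y_pos + y_neg).encode()).hexdigest()
        if fp in seen:
            continue  # deduplicate by prompt-response fingerprint
        seen.add(fp)
        # label +1: the first response of the pair is the preferred one
        dataset.append({"x": prompt, "y_pos": y_pos, "y_neg": y_neg, "label": +1})
    return dataset
```

Reward-model ranking and verifiability checks would slot in where the `score` field is produced; only the select-then-filter structure is shown here.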

2. Contrastive Rubric Generation (CRG)

To capture discriminative and interpretable evaluation signals, CRG contrasts a chosen response $\hat y_i^+$ and a rejected response $\hat y_i^-$ to the same prompt $x_i$, producing a set of rubric criteria. Each CRG invocation consists of (1) extracting non-negotiable hard rules from $x_i$, (2) identifying concrete differences between the responses, and (3) abstracting those differences into domain-agnostic principles. An optional contrastive loss encourages the generation of criteria that effectively separate positive from negative responses:

$$\mathcal L_{\mathrm{CRG}} = \sum_{i=1}^{N}\sum_{j=1}^{K_i} \left[ -\log\sigma\big(s_\psi(c_{i,j}, x_i, \hat y_i^+) - s_\psi(c_{i,j}, x_i, \hat y_i^-)\big) \right]$$

where $s_\psi(c, x, y)$ denotes the compatibility score between criterion $c$ and response $y$. The generator backbone is a pretrained instruction-tuned LLM $h_\psi$.
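The loss above can be computed directly once the compatibility scores are available. A minimal sketch, assuming the per-criterion scores $s_\psi(c_{i,j}, x_i, \hat y_i^\pm)$ have already been computed as nested lists of floats (indexed first by example $i$, then by criterion $j$):

```python
import math

def crg_contrastive_loss(scores_pos, scores_neg):
    """Logistic contrastive loss over per-criterion compatibility scores.

    scores_pos[i][j] = s_psi(c_ij, x_i, y_i^+); scores_neg likewise.
    Criteria that score the chosen response above the rejected one
    (sp > sn) contribute a small loss; ties contribute log 2.
    """
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    loss = 0.0
    for sp_i, sn_i in zip(scores_pos, scores_neg):
        for sp, sn in zip(sp_i, sn_i):
            loss += -math.log(sigmoid(sp - sn))
    return loss
```

In practice $s_\psi$ would come from the generator backbone; here it is abstracted away so the summation structure of $\mathcal L_{\mathrm{CRG}}$ is visible.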

3. Rejection Sampling for Preference-Label Consistency

To enforce rubric reliability, OpenRubrics implements rejection sampling based on preference-label consistency. Each rubric $\mathcal R(x_i)$ generated from a labeled triplet $(x_i, \hat y_i^+, \hat y_i^-)$ with label $\ell_i$ is retained only if the generator $h_\psi$ correctly predicts the original preference label $\ell_i$ when given the rubric as context. This filtering yields a reliable set $\mathcal D_{\mathrm{rubric}} = \{(x_i, \hat y_i^+, \hat y_i^-, \mathcal R^*(x_i))\}_{i=1}^M$ with $M \le N$.
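The consistency filter is a simple accept/reject loop. In this sketch, `generate_rubric` and `predict_preference` are hypothetical callables standing in for the generator $h_\psi$ used in its two roles (rubric generation and rubric-conditioned preference prediction):

```python
def filter_rubrics(labeled_triplets, generate_rubric, predict_preference):
    """Retain a rubric only if the generator, conditioned on that rubric,
    reproduces the original preference label (rejection sampling)."""
    kept = []
    for x, y_pos, y_neg, label in labeled_triplets:
        rubric = generate_rubric(x, y_pos, y_neg)
        # consistency check: does the rubric-conditioned prediction
        # recover the known label?
        if predict_preference(x, y_pos, y_neg, rubric) == label:
            kept.append((x, y_pos, y_neg, rubric))
    return kept  # |kept| = M <= N
```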

4. Rubric-RM Reward Model Architecture

OpenRubrics centers its modeling around two supervised modules:

Rubric Generator $g_\theta$:

— Backbone: Qwen-3 (4B/8B)
— Input: `<PROMPT>…<RESPONSE+>…<RESPONSE−>…`
— Output: Structured rubric $\mathcal R(x)$
— Training: Next-token cross-entropy

$$\mathcal L_{\mathrm{SFT}}^{\mathrm{rubric}} = -\mathbb E_{(x, y^+, y^-, \mathcal R^*)} \sum_{t=1}^{|\mathcal R^*|} \log p_\theta\big(\mathcal R^*_t \mid x, y^+, y^-, \mathcal R^*_{<t}\big)$$

Rubric-Conditioned Judge $r_\phi$:

— Backbone: Qwen-3 (same scale)
— Input: $[x; y^+; y^-; \mathcal R(x)]$
— Output: Label in {"A is better", "B is better"}
— Training: Next-token cross-entropy

$$\mathcal L_{\mathrm{SFT}}^{\mathrm{rm}} = -\mathbb E_{(x, y^+, y^-, \mathcal R^*, \ell)} \sum_{t=1}^{|\ell|} \log p_\phi\big(\ell_t \mid x, y^+, y^-, \mathcal R^*, \ell_{<t}\big)$$
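Both SFT objectives are standard next-token cross-entropy over a target sequence (the rubric tokens for $g_\theta$, the short verdict string for $r_\phi$). Reduced to its core, the per-sequence loss is a sum of negative log-probabilities of the target tokens given their context; a minimal sketch, assuming the model's per-token probabilities are given:

```python
import math

def next_token_ce(target_token_probs):
    """Sum of -log p over target tokens, as in both SFT losses above.

    target_token_probs[t] is the model's probability of the t-th target
    token given the conditioning input and the preceding target tokens.
    """
    return -sum(math.log(p) for p in target_token_probs)
```

In a real training loop this is computed from logits in batch (e.g. via a framework's cross-entropy over the shifted target sequence); the scalar form here just makes the objective's structure explicit.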

Key hyperparameters for Rubric-RM-8B include a batch size of 64, a learning rate of $5\times10^{-6}$, 2 training epochs, and a maximum sequence length of 6144 tokens.

5. End-to-End Integration and Inference Workflow

The complete OpenRubrics workflow is as follows:

  1. Triplet collection: $\{(x, \hat y^+, \hat y^-, \ell)\}$.
  2. CRG plus rejection sampling yields reliable rubrics $\mathcal R^*(x)$.
  3. Supervised fine-tuning of $g_\theta$ yields a rubric generator.
  4. Supervised fine-tuning of $r_\phi$ yields a rubric-conditioned judge.
  5. For a new input $(x, y^A, y^B)$, $g_\theta$ generates a rubric $\hat{\mathcal R}(x)$, then $r_\phi$ predicts which of $y^A$ or $y^B$ is preferable given $\hat{\mathcal R}(x)$.
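The two-stage inference in step 5 can be sketched with the trained models abstracted as callables (`rubric_generator` for $g_\theta$, `judge` for $r_\phi$; both are placeholders, not an actual API):

```python
def rubric_rm_judge(x, y_a, y_b, rubric_generator, judge):
    """Two-stage inference: generate a rubric for the prompt, then judge
    the candidate pair under that rubric."""
    rubric = rubric_generator(x)           # g_theta: prompt -> rubric text
    verdict = judge(x, y_a, y_b, rubric)   # r_phi: -> "A is better" / "B is better"
    return rubric, verdict
```

Because the rubric depends only on $x$, it can be generated once per prompt and reused across every candidate pair for that prompt, which is the basis of the rubric amortization discussed below.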

6. Scalability, Efficiency, and Benchmarks

Rubric-RM displays competitive performance across reward-modeling and downstream alignment tasks. Across RewardBench, RM-Bench, and IFBench, Rubric-RM-4B achieves 65.6% accuracy, Rubric-RM-8B reaches 68.5%, and Rubric-RM-8B-voting@5 attains 71.2%, approaching larger proprietary models. In policy fine-tuning (DPO), Rubric-RM yields a +3–4 point improvement over standard RLHF policies on IFEval and InfoBench, and achieves top open-source win rates (50–57%) on Arena-Hard and AlpacaEval. On the biomedical HealthBench, Rubric-RM-8B scores 68.3% versus 63.3% for RRM-7B. In large-scale scoring, Rubric-RM-8B is substantially faster (130s for 100 pairs) than RRM-7B (203s) and RM-R1-14B (322–382s), with rubric amortization enabling further acceleration.
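The amortization gain follows from generating one rubric per prompt rather than one per pair. A toy cost model makes the arithmetic explicit; the per-rubric and per-judgment costs below are hypothetical illustration values, not the paper's measured timings:

```python
def amortized_cost(n_pairs, pairs_per_prompt, rubric_cost, judge_cost):
    """Total scoring cost when one generated rubric is reused across all
    pairs sharing a prompt (toy model: uniform per-call costs)."""
    n_prompts = n_pairs // pairs_per_prompt
    # one rubric generation per prompt + one judgment per pair
    return n_prompts * rubric_cost + n_pairs * judge_cost
```

As `pairs_per_prompt` grows, the rubric-generation term shrinks while the judgment term is fixed, so total cost approaches the judge-only floor.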

7. Significance and Implications

The OpenRubrics architecture synthesizes contrastively generated, consistency-filtered rubrics as interpretable, multi-criterion scaffolds for reward modeling. This paradigm narrows the gap between human evaluation and automated RLHF, offering scalable and principle-driven alignment for LLMs. A plausible implication is that rubrics provide more granular, transparent reward signals, advancing beyond scalar/pairwise judges and enabling reliable deployment in both generalist and domain-specific (e.g., biomedical) contexts (Liu et al., 9 Oct 2025).
