Contrastive Rubric Generation (CRG) Algorithm
- The CRG algorithm is a method to generate contrastive natural language rubrics by comparing positive and negative responses to prompts.
- It leverages a margin-based contrastive loss to extract interpretable criteria, enhancing reward modeling and LLM alignment on diverse benchmarks.
- It integrates a dataset construction pipeline and rejection sampling to ensure rubric consistency and scalable deployment across domains.
OpenRubrics is a modular system for scalable synthetic rubric generation, designed to support reward modeling and alignment in LLMs. The architecture is optimized to elicit, structure, filter, and utilize multi-criteria natural language rubrics, with the aim of replacing or augmenting scalar/pairwise evaluations that inadequately capture the multifaceted nature of response quality. The system demonstrates significant gains in both the accuracy of reward models and downstream policy alignment on diverse instruction-following and biomedical tasks, substantiating the principle-driven paradigm for rubric-based LLM alignment (Liu et al., 9 Oct 2025).
1. Dataset Construction Pipeline
OpenRubrics synthesizes its foundational dataset from heterogeneous open-ended dialogue and instruction-following corpora. Data sources include UltraFeedback (Evol-Instruct, UltraChat, ShareGPT, TruthfulQA), Tulu 2.5 (AlpacaFarm, Chatbot Arena, Capybara, StackExchange, Nectar, SHP, HH-RLHF, HelpSteer, Orca), HelpSteer 3, Skywork-Preference, Tulu 3-IF, MegaScience, and Medical-o1.
Preference pairs are constructed through three primary mechanisms: selecting the highest- and lowest-scoring responses (UltraFeedback), generating and ranking responses with open-source reward models (Tulu 3-IF/MegaScience/Medical-o1; Athene-RM-8B, Skywork-Reward-V2), and programmatic verifiability checks (verifiable-IF). These pairs undergo preprocessing that discards trivial or format-violating pairs, applies token truncation and canonicalization, and deduplicates by prompt-response fingerprints. The result is a dataset $\mathcal{D} = \{(x, y^+, y^-)\}$, where the ordering of each pair encodes the direction of preference.
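The preprocessing steps above can be sketched in Python. The record schema, character-length threshold, and whitespace-based token count are illustrative assumptions, not the paper's actual implementation:

```python
import hashlib

def fingerprint(prompt: str, response: str) -> str:
    """Stable fingerprint for prompt-response deduplication."""
    return hashlib.sha256((prompt + "\x1f" + response).encode("utf-8")).hexdigest()

def build_preference_pairs(records, min_len=8, max_tokens=6144):
    """Filter and deduplicate (prompt, chosen, rejected) triplets.

    `records` is assumed to be an iterable of dicts with keys
    'prompt', 'chosen', 'rejected' (hypothetical schema).
    """
    seen, pairs = set(), []
    for r in records:
        x, yp, yn = r["prompt"], r["chosen"], r["rejected"]
        # Discard trivial pairs: identical or near-empty responses.
        if yp == yn or len(yp) < min_len or len(yn) < min_len:
            continue
        # Crude token cap via whitespace splitting; a real pipeline
        # would use the model tokenizer.
        if len((x + " " + yp + " " + yn).split()) > max_tokens:
            continue
        # Deduplicate by prompt-response fingerprints.
        key = fingerprint(x, yp) + fingerprint(x, yn)
        if key in seen:
            continue
        seen.add(key)
        pairs.append((x, yp, yn))
    return pairs
```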
2. Contrastive Rubric Generation (CRG)
To capture discriminative and interpretable evaluation signals, CRG contrasts a chosen response $y^+$ and a rejected response $y^-$ to the same prompt $x$, producing a set of rubric criteria $\mathcal{C}$. Each CRG invocation consists of (1) extracting non-negotiable hard rules from $x$, (2) identifying concrete differences between the responses, and (3) abstracting those differences into domain-agnostic principles. An optional margin-based contrastive loss encourages criteria that effectively separate positive from negative responses:

$$\mathcal{L}_{\text{CRG}} = \sum_{c \in \mathcal{C}} \max\!\left(0,\; \gamma - s(c, y^+) + s(c, y^-)\right),$$

where $s(c, y)$ denotes the compatibility score between criterion $c$ and response $y$, and $\gamma$ is the margin. The generator backbone is a pretrained instruction-tuned LLM.
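Under the standard hinge formulation of a margin-based contrastive loss, the per-criterion penalty can be sketched as follows. The function name, score lists, and margin value are illustrative assumptions; the paper treats this loss as optional:

```python
def crg_margin_loss(scores_pos, scores_neg, margin=1.0):
    """Average hinge loss over rubric criteria.

    scores_pos[i] / scores_neg[i] are the compatibility scores
    s(c_i, y+) and s(c_i, y-) for criterion c_i (illustrative inputs).
    A criterion incurs zero loss once it scores the chosen response
    at least `margin` higher than the rejected one.
    """
    assert len(scores_pos) == len(scores_neg)
    losses = [max(0.0, margin - (sp - sn))
              for sp, sn in zip(scores_pos, scores_neg)]
    return sum(losses) / len(losses)
```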
3. Rejection Sampling for Preference-Label Consistency
To enforce rubric reliability, OpenRubrics implements rejection sampling based on preference-label consistency. Each rubric generated from a triplet $(x, y^+, y^-)$ is retained only if the generator, given that rubric as context, correctly predicts the original preference label. This filtering yields a reliable rubric set $\mathcal{D}_{\text{rubric}}$.
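A minimal sketch of the consistency filter, assuming hypothetical `generate_rubric` and `judge_with_rubric` callables that stand in for the LLM generator and rubric-conditioned judge:

```python
def consistency_filter(triplets, generate_rubric, judge_with_rubric):
    """Keep only rubrics whose rubric-conditioned prediction matches the label.

    generate_rubric(x, y_pos, y_neg) -> rubric string   (hypothetical API)
    judge_with_rubric(x, rubric, y_a, y_b) -> "a" | "b" (hypothetical API)
    """
    kept = []
    for x, y_pos, y_neg in triplets:
        rubric = generate_rubric(x, y_pos, y_neg)
        # The pair is presented with the originally-preferred response
        # in slot "a"; the rubric is kept only if the judge recovers it.
        if judge_with_rubric(x, rubric, y_pos, y_neg) == "a":
            kept.append((x, y_pos, y_neg, rubric))
    return kept
```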
4. Rubric-RM Reward Model Architecture
OpenRubrics centers its modeling around two supervised modules:
Rubric Generator $G_\phi$:
— Backbone: Qwen-3 (4B/8B) — Input: "<PROMPT>…<RESPONSE+>…<RESPONSE−>…" — Output: structured rubric $r$ — Training: next-token cross-entropy
Rubric-Conditioned Judge $J_\psi$:
— Backbone: Qwen-3 (same scale) — Input: prompt, rubric, and candidate responses $(x, r, y_a, y_b)$ — Output: label in $\{a, b\}$ — Training: next-token cross-entropy
Key hyperparameters for Rubric-RM-8B include batch size 64, learning rate , 2 training epochs, and a maximum sequence length of 6144 tokens.
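The tagged input format quoted above suggests SFT examples assembled roughly as follows; the exact tag vocabulary on the judge side is an assumption for illustration:

```python
def generator_example(x, y_pos, y_neg, rubric):
    """One SFT example for the rubric generator (tag schema from the spec)."""
    prompt = (f"<PROMPT>{x}</PROMPT>"
              f"<RESPONSE+>{y_pos}</RESPONSE+>"
              f"<RESPONSE->{y_neg}</RESPONSE->")
    return {"input": prompt, "target": rubric}

def judge_example(x, rubric, y_a, y_b, label):
    """One SFT example for the rubric-conditioned judge (hypothetical tags)."""
    prompt = (f"<PROMPT>{x}</PROMPT>"
              f"<RUBRIC>{rubric}</RUBRIC>"
              f"<A>{y_a}</A><B>{y_b}</B>")
    return {"input": prompt, "target": label}  # label in {"a", "b"}
```

Both modules are then trained with plain next-token cross-entropy on the `target` continuation, as stated above.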
5. End-to-End Integration and Inference Workflow
The complete OpenRubrics workflow is as follows:
- Triplet collection: assemble preference triplets $(x, y^+, y^-)$ from the constructed dataset.
- CRG plus rejection sampling yields the reliable rubric set $\mathcal{D}_{\text{rubric}}$.
- Supervised fine-tuning on these rubrics yields the rubric generator $G_\phi$.
- Supervised fine-tuning on rubric-conditioned preference labels yields the judge $J_\psi$.
- For a new input $x$, $G_\phi$ generates a rubric $r$; $J_\psi$ then predicts which of two candidate responses $y_a$ and $y_b$ is preferable given $(x, r)$.
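The inference step can be sketched with two stand-in callables for the fine-tuned models (their signatures are assumptions for illustration):

```python
def rank_pair(x, y_a, y_b, generator, judge):
    """Generate a rubric once, then judge the candidate pair under it.

    generator(x) -> rubric string            (hypothetical stand-in)
    judge(x, rubric, y_a, y_b) -> "a" | "b"  (hypothetical stand-in)
    """
    rubric = generator(x)  # one rubric per prompt
    verdict = judge(x, rubric, y_a, y_b)
    return verdict, rubric
```

Because the rubric here depends only on the prompt, it can be cached and reused across many candidate pairs for the same prompt, which is the amortization behind the large-scale scoring speedups reported below.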
6. Scalability, Efficiency, and Benchmarks
Rubric-RM displays competitive performance across reward-modeling and downstream alignment tasks. Across RewardBench, RM-Bench, and IFBench, Rubric-RM-4B achieves 65.6% accuracy, Rubric-RM-8B reaches 68.5%, and Rubric-RM-8B-voting@5 attains 71.2%, approaching larger proprietary models. In policy fine-tuning (DPO), Rubric-RM yields a +3–4 point improvement over standard RLHF policies on IFEval and InfoBench, and achieves top open-source win rates (50–57%) on Arena-Hard and AlpacaEval. On the biomedical HealthBench, Rubric-RM-8B scores 68.3% versus 63.3% for RRM-7B. In large-scale scoring, Rubric-RM-8B is substantially faster (130 s for 100 pairs) than RRM-7B (203 s) and RM-R1-14B (322–382 s), with rubric amortization enabling further acceleration.
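The voting@5 variant reported above is consistent with a simple majority vote over independently sampled judgments; the sketch below assumes that scheme rather than reproducing the paper's exact procedure:

```python
from collections import Counter

def vote_at_k(x, y_a, y_b, judge_sample, k=5):
    """Majority vote over k sampled judgments (assumed voting@k scheme).

    judge_sample(x, y_a, y_b) -> "a" | "b", stochastic per call
    (hypothetical stand-in for a sampled judge inference).
    """
    votes = Counter(judge_sample(x, y_a, y_b) for _ in range(k))
    return votes.most_common(1)[0][0]
```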
7. Significance and Implications
The OpenRubrics architecture synthesizes contrastively generated, consistency-filtered rubrics as interpretable, multi-criterion scaffolds for reward modeling. This paradigm narrows the gap between human evaluation and automated RLHF, offering scalable and principle-driven alignment for LLMs. A plausible implication is that rubrics provide more granular, transparent reward signals, advancing beyond scalar/pairwise judges and enabling reliable deployment in both generalist and domain-specific (e.g., biomedical) contexts (Liu et al., 9 Oct 2025).