OpenRubrics Architecture
- OpenRubrics is a scalable architecture that synthesizes rubrics via contrastive generation and rejection sampling to enhance LLM alignment.
- It integrates structured natural language evaluations within both supervised and reinforcement learning frameworks for reliable, multidimensional feedback.
- Empirical results demonstrate improved throughput and benchmark performance, offering a robust alternative to traditional human annotation methods.
OpenRubrics defines a scalable, synthetic rubric-generation and reward-modeling architecture designed to address key deficiencies in LLM alignment—specifically the limitations of scalar/pairwise judgments and static rubric schemas (Liu et al., 9 Oct 2025). The system is distinguished by its capacity for contrastive rubric generation, preference-label consistency via rejection sampling, and end-to-end integration in supervised and reinforcement learning paradigms. It enables the automatic construction of comprehensive (prompt, rubric) pairs, facilitating interpretable and multidimensional evaluation criteria for reward models, while maintaining high throughput and reliability compared to human annotation. OpenRubrics leverages structured natural language as scaffolding for alignment signals, demonstrating empirically superior performance both for reward models (Rubric-RM) and aligned LLM policies.
1. Dataset Construction Pipeline
OpenRubrics builds on a composite data pipeline sourcing preference and instruction-following samples from UltraFeedback (Evol-Instruct, UltraChat, ShareGPT, TruthfulQA), Tulu 2.5 (AlpacaFarm, Chatbot Arena, SHP, Capybara), HelpSteer 3, Skywork-Preference, MegaScience, and medical datasets. Preference triplets $(x, y^+, y^-)$ — a prompt with its chosen and rejected responses — are derived by selecting chosen and rejected responses either by human rating, by open-source reward-model ranking (e.g., Athene-RM-8B, Skywork-Reward-V2), or by programmatic verifiable instruction-following checks. The data are filtered for triviality (e.g., identical responses, formatting violations), truncated to ≤1024 tokens, and deduplicated by prompt-response fingerprints, yielding the final triplet dataset (Liu et al., 9 Oct 2025).
| Source | Preference Extraction Method | Filtering Steps |
|---|---|---|
| UltraFeedback | Human rating | Deduplication/truncation |
| Tulu 2.5 | Reward model ranking | Verifiable checks |
| HelpSteer 3 | Reward model ranking | Canonicalization |
This pipeline establishes the foundational triplet dataset for subsequent rubric synthesis and reward-model training.
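The filtering and deduplication stage described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the whitespace tokenizer, the SHA-256 fingerprint, and the function names are assumptions; only the stated rules (drop identical pairs, enforce the ≤1024-token cap, deduplicate by prompt-response fingerprint) come from the source.

```python
import hashlib

MAX_TOKENS = 1024  # truncation threshold from the pipeline description


def fingerprint(prompt: str, response: str) -> str:
    """Stable prompt-response fingerprint used for deduplication (assumed hash)."""
    return hashlib.sha256(f"{prompt}\x00{response}".encode()).hexdigest()


def build_triplets(raw, tokenize=str.split):
    """Filter trivial pairs, drop over-length samples, and deduplicate.

    `raw` is an iterable of (prompt, chosen, rejected) triplets.
    """
    seen, out = set(), []
    for prompt, chosen, rejected in raw:
        if chosen == rejected:  # trivial: identical responses
            continue
        if len(tokenize(chosen)) > MAX_TOKENS or len(tokenize(rejected)) > MAX_TOKENS:
            continue  # over-length sample (the paper truncates/filters at 1024 tokens)
        key = fingerprint(prompt, chosen)
        if key in seen:  # dedup by prompt-response fingerprint
            continue
        seen.add(key)
        out.append((prompt, chosen, rejected))
    return out
```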
2. Contrastive Rubric Generation (CRG) and Rejection Sampling
Contrastive Rubric Generation operationalizes the extraction of both "hard rules" (explicit constraints) and "principles" (implicit qualities) that distinguish chosen from rejected responses. A pretrained, instruction-tuned LLM $G$ is prompted with the triplet $(x, y^+, y^-)$ — prompt, chosen response, rejected response — producing a rubric $r$ that codifies discriminative evaluation criteria. The procedure involves:
- Extracting non-negotiable hard rules directly from prompt requirements.
- Abstracting concrete differences between $y^+$ and $y^-$ into principles.
- Optionally applying a contrastive-style margin-based loss,

$$\mathcal{L}_{\text{margin}} = \max\bigl(0,\ \gamma - s(c, y^+) + s(c, y^-)\bigr),$$

where $s(c, y)$ denotes the compatibility between criterion $c$ and response $y$, and $\gamma$ is the margin (Liu et al., 9 Oct 2025).

Label consistency is ensured via rejection sampling: a generated rubric is retained only if judging the pair under that rubric reproduces the known preference (chosen over rejected), directly mitigating label-flip and noise propagation.
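The rejection-sampling loop can be sketched as below. The LLM calls are abstracted behind two hypothetical callables (`generate_rubric`, `judge_preference`), and the fixed verdict convention ("A" = first response wins) and retry budget are assumptions; the keep-only-consistent-rubrics logic is what the source describes.

```python
from typing import Callable, Optional


def synthesize_rubric(
    prompt: str,
    chosen: str,
    rejected: str,
    generate_rubric: Callable[[str, str, str], str],
    judge_preference: Callable[[str, str, str, str], str],
    max_attempts: int = 4,
) -> Optional[str]:
    """Contrastive rubric generation with rejection sampling.

    A rubric drafted from (prompt, chosen, rejected) is kept only if judging
    with that rubric recovers the known preference label; otherwise it is
    resampled, mitigating label-flip noise.
    """
    for _ in range(max_attempts):
        rubric = generate_rubric(prompt, chosen, rejected)
        # Chosen response is passed first, so a consistent verdict is "A".
        if judge_preference(prompt, rubric, chosen, rejected) == "A":
            return rubric
    return None  # discard the triplet if no consistent rubric is found
```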
3. Rubric-RM Reward Model Architecture
Rubric-RM encapsulates two core modules: the rubric generator $G$ and the rubric-conditioned judge $J$, both implemented with Qwen-3 (4B/8B). The generator is supervised fine-tuned with next-token cross-entropy over the rubric tokens:

$$\mathcal{L}_{G} = -\sum_{t} \log p_{G}\bigl(r_t \mid x, y^+, y^-, r_{<t}\bigr).$$

The judge accepts $(x, r, y^+, y^-)$ and outputs the preference label $\ell$, similarly trained with cross-entropy over the label tokens:

$$\mathcal{L}_{J} = -\log p_{J}\bigl(\ell \mid x, r, y^+, y^-\bigr).$$
Key configuration parameters for Rubric-RM-8B include batch size 64, learning rate , epochs 2, and max tokens per sample 6144 (Liu et al., 9 Oct 2025).
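The two losses above reduce to ordinary cross-entropy over token distributions. A tiny self-contained numeric sketch (dict-based logits and a hand-rolled log-softmax are illustrative assumptions, not the training code):

```python
import math


def log_softmax(logits):
    """Log-softmax over a {token: logit} dict, with max-subtraction for stability."""
    m = max(logits.values())
    z = math.log(sum(math.exp(v - m) for v in logits.values())) + m
    return {k: v - z for k, v in logits.items()}


def generator_loss(step_logits, target_ids):
    """Next-token cross-entropy for the rubric generator G:
    mean over t of -log p(r_t | context, r_<t)."""
    total = sum(-log_softmax(logits)[tgt] for logits, tgt in zip(step_logits, target_ids))
    return total / len(target_ids)


def judge_loss(label_logits, gold_label):
    """Cross-entropy over the judge J's preference-label token."""
    return -log_softmax(label_logits)[gold_label]
```

For a single step with a uniform two-token distribution the generator loss is exactly ln 2; a judge that favors the gold label "A" by 2 logits incurs a small loss of about 0.13.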
4. End-to-End Workflow and Integration
The OpenRubrics pipeline proceeds as follows:
- Triplet collection: assemble $(x, y^+, y^-)$ — prompt, chosen response, rejected response — via the dataset pipeline.
- Application of CRG + rejection sampling yields the consistency-filtered rubric set $\{(x, y^+, y^-, r)\}$.
- Supervised fine-tuning of the generator $G$ on $(x, y^+, y^-) \rightarrow r$ pairs.
- Supervised fine-tuning of the judge $J$ on preference labels conditioned on rubrics.
- Inference:
  - Generate a rubric for a new response pair: $r = G(x, y_a, y_b)$.
  - Compute the preference verdict $J(x, r, y_a, y_b)$.
Integration enables interpretability, modular rubric updating, and inference-time amortization. This structure generalizes across standard RLHF and principle-driven alignment paradigms (Liu et al., 9 Oct 2025).
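At inference time the trained generator-judge pair acts as a drop-in preference annotator, for instance when building a DPO dataset from policy samples. A minimal sketch, assuming hypothetical callables for sampling and for the two trained models:

```python
def annotate_for_dpo(prompts, sample_two, generate_rubric, judge):
    """Use the trained (generator, judge) pair as a preference annotator:
    sample two policy responses per prompt, generate a rubric, and let the
    rubric-conditioned judge assign chosen vs. rejected for DPO training.
    """
    pairs = []
    for x in prompts:
        a, b = sample_two(x)               # two samples from the current policy
        rubric = generate_rubric(x, a, b)  # rubric r = G(x, y_a, y_b)
        verdict = judge(x, rubric, a, b)   # "A" or "B" from J(x, r, y_a, y_b)
        chosen, rejected = (a, b) if verdict == "A" else (b, a)
        pairs.append({"prompt": x, "chosen": chosen, "rejected": rejected})
    return pairs
```

The dict layout (`prompt`/`chosen`/`rejected`) is the conventional DPO triplet format; any preference-optimization trainer expecting such pairs can consume the output directly.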
5. Scalability, Benchmark Performance, and Policy Transfer
Empirical evaluation demonstrates Rubric-RM’s superiority across multiple reward-modeling benchmarks (RewardBench, RM-Bench, IFBench), with Rubric-RM-4B achieving an average 65.6% accuracy and Rubric-RM-8B reaching 68.5%. Ensemble voting (Rubric-RM-8B-voting@5) achieves 71.2%, closely approximating larger commercial RMs. Policy fine-tuning with DPO shows +3–4 point improvements on instruction-following (IFEval, InfoBench), and best open-source performance (∼ 50–57% wins) on Arena-Hard and AlpacaEval. Biomedical benchmarks (HealthBench) reflect similarly robust gains: Rubric-RM-8B records 68.3% vs. baseline 63.3%; ensemble voting approaches commercial results (72.9%) (Liu et al., 9 Oct 2025).
Amortized rubric generation substantially reduces wall-clock time per evaluation: Rubric-RM-8B requires 130 s per 100 pairs, versus 203 s for RRM-7B and 322–382 s for RM-R1-14B.
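One way to realize this amortization is to cache a rubric per prompt and reuse it across candidate pairs, so repeated judgments on the same prompt skip regeneration. Note this is an illustrative assumption: the paper conditions rubric generation on a response pair, so reusing the first pair's rubric for later pairs is an approximation.

```python
class AmortizedRubricRM:
    """Cache one rubric per prompt and reuse it for every later pair on that
    prompt, trading a little rubric specificity for large wall-clock savings."""

    def __init__(self, generate_rubric, judge):
        self._gen = generate_rubric
        self._judge = judge
        self._cache = {}  # prompt -> rubric

    def judge_pair(self, prompt, resp_a, resp_b):
        if prompt not in self._cache:  # generate the rubric only once per prompt
            self._cache[prompt] = self._gen(prompt, resp_a, resp_b)
        return self._judge(prompt, self._cache[prompt], resp_a, resp_b)
```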
6. Alignment Signal, Interpretability, and Principle-Driven Reward Modeling
Contrastively generated, consistency-filtered rubrics provide multifaceted, interpretable alignment signals compared to previous scalar or generative reasoning-based reward models. OpenRubrics scaffolds the transition toward principle-driven paradigms, narrowing the gap between costly human evaluation and automated alignment. Structured rubrics not only serve as reward functions but also inform model interpretability and debugging—each rubric is traceable to explicit and implicit response qualities. Rubric synthesis and integration protocols facilitate ongoing rubric refinement and transferability across domains, supporting robust evaluation under reinforcement learning and instruction-following (Liu et al., 9 Oct 2025). A plausible implication is that further scaling or hybridization with dynamic online rubric elicitation (cf. OnlineRubrics (Rezaei et al., 8 Oct 2025)) may yield even more adaptive and resilient alignment frameworks.