OpenRubrics Dataset Overview

Updated 10 May 2026

OpenRubrics is a structured dataset with 53,398 prompt–rubric pairs featuring both hard rules and principles for multi-dimensional evaluations.
Its Contrastive Rubric Generation framework extracts explicit and implicit criteria from response pairs, achieving 98.2% preference-label consistency.
Benchmark results demonstrate rubric-based models can improve performance by up to 9.5% in specialized tasks, strengthening LLM alignment.

OpenRubrics is a large-scale, diverse dataset for scalable synthetic rubric generation, developed to address the limitations of scalar or pairwise judgments in reward modeling for reinforcement learning from human feedback (RLHF). By providing a corpus of structured, multi-dimensional rubrics aligned to various prompt domains, OpenRubrics enables the training and evaluation of rubric-based reward models, with demonstrable gains in alignment and downstream policy optimization for LLMs (Liu et al., 9 Oct 2025).

1. Dataset Composition and Statistics

OpenRubrics comprises 53,398 unique prompt–rubric pairs. The dataset’s domain distribution is as follows:

Domain	# Prompt–Rubric Pairs	Percentage
Instruction-following	23,184	43.4%
Biomedical	12,817	24.0%
Open-domain QA	10,560	19.8%
Coding/Math	6,837	12.8%
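
As a quick arithmetic check, the percentages in the table can be recomputed from the pair counts. The snippet below is a minimal sketch; the names `domain_counts` and `domain_share` are illustrative and not part of the dataset's tooling.

```python
# Pair counts copied from the domain table above.
domain_counts = {
    "Instruction-following": 23184,
    "Biomedical": 12817,
    "Open-domain QA": 10560,
    "Coding/Math": 6837,
}

total = sum(domain_counts.values())

# Share of each domain, rounded to one decimal place as in the table.
domain_share = {
    name: round(100 * count / total, 1) for name, count in domain_counts.items()
}

print(total)
for name, share in domain_share.items():
    print(f"{name}: {share}%")
```

The counts sum to the stated 53,398 pairs, and the rounded shares match the Percentage column.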

Rubrics are decomposed into two types:

Hard rules: Explicit, often verifiable constraints (e.g., “The response must use only information present in the passage”).
Principles: Implicit, qualitative criteria (e.g., “The response should demonstrate clarity and logical structure”).

Each rubric specifies between 3 and 8 dimensions (mean = 5.2 per rubric), with domains such as biomedical and coding exhibiting more multi-faceted rubrics than general instruction-following tasks.

2. Data Format and Access

All OpenRubrics data is released in JSON format. Each sample consists of a prompt field, a rubric list, and associated metadata, structured as follows:

JSON fields: prompt, domain, rubric (list of objects with type and description), metadata (containing rubric_dimensions, source, preference_label_consistency).
Repository & access: The dataset and related code are available at https://github.com/OpenRubrics/OpenRubrics and https://huggingface.co/datasets/OpenRubrics/OpenRubrics.
Licensing: CC BY 4.0 (academic and commercial use permitted; attribution required).
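
To make the field list concrete, here is a minimal sketch of parsing and checking one record with Python's standard library. The field names follow the list above; every value in `sample_json`, the `hard_rule` / `principle` type strings, the shape of `metadata`, and the helper `validate_record` are hypothetical and may not match the released files exactly.

```python
import json

# Hypothetical record: field names come from the description above,
# but all values here are invented for illustration only.
sample_json = """
{
  "prompt": "Summarize the passage in two sentences.",
  "domain": "Instruction-following",
  "rubric": [
    {"type": "hard_rule", "description": "The response must use only information present in the passage."},
    {"type": "principle", "description": "The response should demonstrate clarity and logical structure."},
    {"type": "principle", "description": "The response should be relevant and informative."}
  ],
  "metadata": {
    "rubric_dimensions": 3,
    "source": "example-source",
    "preference_label_consistency": 1.0
  }
}
"""

REQUIRED_FIELDS = {"prompt", "domain", "rubric", "metadata"}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    for i, item in enumerate(record.get("rubric", [])):
        if not {"type", "description"} <= item.keys():
            problems.append(f"rubric[{i}] lacks type or description")
    return problems


record = json.loads(sample_json)
problems = validate_record(record)
print(problems)
```

A check like this is cheap to run over a full file before training, and it fails loudly if the actual schema differs from the assumed one.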

3. Generation Methodology

OpenRubrics employs the Contrastive Rubric Generation (CRG) framework. CRG decomposes rubric synthesis into the following steps:

Response Contrasting: For each prompt, a preferred and a rejected response are selected (using preexisting preference datasets or synthetic LLM outputs).
Component Extraction: CRG prompts an LLM to analyze the contrast between preferred and rejected responses to explicitly enumerate both:
- Hard rules (directly violated in the rejected response but satisfied in the preferred one)
- Principles (qualities more subtly present or absent, e.g., relevance, informativeness)
Rubric Assembly: The set of hard rules and principles is combined to form a comprehensive multi-dimensional rubric for the given prompt.
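
The three steps above can be sketched as a small pipeline. This is an illustrative outline under stated assumptions, not the authors' implementation: the prompt wording, the `HARD RULE:` / `PRINCIPLE:` line format, and the `fake_llm` stand-in are all hypothetical, and a real run would call an actual LLM in its place.

```python
from typing import Callable


def build_contrast_prompt(prompt: str, preferred: str, rejected: str) -> str:
    """Build the contrast request sent to the LLM (hypothetical wording)."""
    return (
        "Compare the two responses to the prompt and list evaluation criteria.\n"
        "Write one criterion per line, prefixed with 'HARD RULE:' or 'PRINCIPLE:'.\n\n"
        f"Prompt: {prompt}\n"
        f"Preferred response: {preferred}\n"
        f"Rejected response: {rejected}\n"
    )


def assemble_rubric(llm_output: str) -> list[dict]:
    """Combine extracted hard rules and principles into one rubric."""
    prefixes = {"HARD RULE:": "hard_rule", "PRINCIPLE:": "principle"}
    rubric = []
    for line in llm_output.splitlines():
        line = line.strip()
        for prefix, kind in prefixes.items():
            if line.startswith(prefix):
                rubric.append({"type": kind, "description": line[len(prefix):].strip()})
    return rubric


def generate_rubric(prompt: str, preferred: str, rejected: str,
                    llm: Callable[[str], str]) -> list[dict]:
    """Run contrast, extraction, and assembly for one response pair."""
    return assemble_rubric(llm(build_contrast_prompt(prompt, preferred, rejected)))


def fake_llm(_: str) -> str:
    """Stand-in for a real model call; returns a fixed, made-up answer."""
    return (
        "HARD RULE: The response must use only information present in the passage.\n"
        "PRINCIPLE: The response should demonstrate clarity and logical structure.\n"
    )


rubric = generate_rubric(
    "Summarize the passage.", "A faithful summary.", "An off-topic reply.", fake_llm
)
print(rubric)
```

Passing the model as a plain callable keeps the sketch testable without network access; lines that carry neither prefix are simply ignored.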

The CRG loss function is defined as:

$\mathcal{L}_{\text{CRG}} = \mathbb{E}_{(x, y^+, y^-)} \left[ - \sum_{k=1}^D \log P( r_k^+ > r_k^- \mid x ) \right]$

where $x$ is the prompt, $y^+$ and $y^-$ are the preferred and rejected responses, $r_k^+$ and $r_k^-$ represent rubric satisfaction on dimension $k$, and $D$ is the number of rubric dimensions per pair.
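
For a concrete feel of the objective, the sketch below evaluates the inner sum for a single triple. The source does not say how $P(r_k^+ > r_k^- \mid x)$ is parameterized, so this example assumes a Bradley-Terry-style sigmoid of the per-dimension score difference; that choice, the made-up scores, and the function name `crg_loss_single` are assumptions for illustration only.

```python
import math


def crg_loss_single(pos_scores: list[float], neg_scores: list[float]) -> float:
    """-sum_k log P(r_k^+ > r_k^-) for one (x, y+, y-) triple.

    ASSUMPTION: P is modeled as sigmoid(r_k^+ - r_k^-); the source does
    not specify this parameterization.
    """
    if len(pos_scores) != len(neg_scores):
        raise ValueError("both responses need a score on every dimension")
    loss = 0.0
    for r_pos, r_neg in zip(pos_scores, neg_scores):
        p = 1.0 / (1.0 + math.exp(-(r_pos - r_neg)))
        loss -= math.log(p)
    return loss


# Three rubric dimensions (D = 3) with made-up satisfaction scores.
tied = crg_loss_single([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
separated = crg_loss_single([2.0, 1.5, 3.0], [0.0, 0.5, 1.0])
print(tied, separated)
```

Under this assumption, a tie on every dimension gives each probability the value 0.5, so the loss equals $D \log 2$; it shrinks as the preferred response pulls ahead on each dimension.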

4. Quality Control and Reliability

To maximize the reliability of rubrics and prevent alignment drift or ambiguity, OpenRubrics employs a preference-label consistency framework:

Preference-Label Consistency: For each generated rubric, agreement is measured between the relative scorings of preferred and rejected responses. This is formalized as:

$\text{Consistency}(r) = \frac{1}{N} \sum_{i=1}^N \mathbb{I}\left[ r(x_i, y_i^+) > r(x_i, y_i^-) \right]$

where $r(\cdot,\cdot)$ is the rubric-based score function and $N$ is the number of evaluated prompt–response pairs.

Rejection Sampling: Rubrics whose preference-label consistency falls below a threshold (set at 95%) are filtered out, so that the induced scoring function reliably prefers the designated preferred responses.

This results in a measured label consistency of 98.2% across the dataset, with detailed rejection logs released as part of the metadata.
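
The consistency measure and the filtering step can be sketched in a few lines. The toy scoring function and evaluation triples below are made up, and the names `consistency` and `keep_rubric` are hypothetical; only the formula and the 95% threshold come from the text above.

```python
from typing import Callable, Sequence

Triple = tuple[str, str, str]  # (prompt, preferred response, rejected response)
ScoreFn = Callable[[str, str], float]  # r(x, y): rubric-based score


def consistency(score: ScoreFn, triples: Sequence[Triple]) -> float:
    """Fraction of triples where the preferred response scores strictly higher."""
    if not triples:
        raise ValueError("need at least one evaluated triple")
    hits = sum(score(x, y_pos) > score(x, y_neg) for x, y_pos, y_neg in triples)
    return hits / len(triples)


def keep_rubric(score: ScoreFn, triples: Sequence[Triple], threshold: float = 0.95) -> bool:
    """Rejection step: keep a rubric only if its consistency reaches the threshold."""
    return consistency(score, triples) >= threshold


# Toy data: this made-up "rubric" rewards responses that mention the passage.
triples = [
    ("Summarize.", "Per the passage, sales rose.", "Sales rose, I think."),
    ("Summarize.", "The passage reports a decline.", "No idea."),
    ("Summarize.", "Costs fell.", "The passage is long."),
    ("Summarize.", "The passage cites two studies.", "Two studies."),
]


def mentions_passage(x: str, y: str) -> float:
    """Toy score: 1.0 if the response mentions the passage, else 0.0."""
    return 1.0 if "passage" in y.lower() else 0.0


score_value = consistency(mentions_passage, triples)
print(score_value, keep_rubric(mentions_passage, triples))
```

In this toy run the rubric agrees with the preference label on three of four triples, so its consistency of 0.75 falls below the threshold and the rubric would be discarded.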

5. Benchmarking and Applications

OpenRubrics is used to train Rubric-RM, a rubric-based reward model designed for reward modeling and LLM alignment. The following are key results:

Reward-Modeling Benchmarks: Rubric-RM trained on OpenRubrics demonstrates a mean improvement of 6.8% over strong size-matched baselines on standard preference and reward modeling benchmarks.
Alignment Improvement: Rubric-based signals enable model alignment with nuanced human-like standards, outperforming scalar judgment regimes.
Transfer to Policy Models: Rubric-RM is incorporated for policy fine-tuning by using rubric-derived rewards in reinforcement learning protocols.

Performance transfer results include:

Instruction-Following Evaluation: On instruction-following benchmarks, Rubric-RM-aligned models achieve a 4.2% absolute gain in human agreement metrics over scalar reward modeling baselines.
Biomedical QA: In biomedical answer generation, rubric-trained models register a 9.5% incremental improvement in F1-based utility, evidencing rubric-derived rewards' efficacy in specialized domains.

6. Example Entries

Representative prompt–rubric pairs from OpenRubrics illustrate the diversity and granularity of alignment signals:

Example 1 (Instruction-Following): [example content missing from source]

Example 2 (Biomedical): [example content missing from source]

Example 3 (Coding/Math): [example content missing from source]

References

"OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment" (Liu et al., 9 Oct 2025)
