VerInstruct Dataset for RL Verification
- VerInstruct is a high-quality dataset designed for reinforcement learning fine-tuning of large language models, incorporating 22,000 instruction-response pairs with both hard and soft constraints.
- It integrates data from four reputable sources and employs automated Python functions alongside LLM-based verification to enforce and evaluate constraints.
- The dataset supports scalable RL pipelines with comprehensive tooling for implementing verifiable reward functions, enhancing model adherence and performance.
VerInstruct is a high-quality dataset designed for reinforcement learning (RL) in instruction following, addressing the need for verifiable reward signals when fine-tuning LLMs for improved constraint adherence. Developed as part of the VerIF methodology, VerInstruct contains approximately 22,000 instruction–response pairs, each annotated with a comprehensive set of “hard” and “soft” constraints alongside engineered verification signals suitable for RL with verifiable rewards (RLVR). The dataset and associated tooling support the systematic evaluation and scalable RL training of LLMs, facilitating enhancements in both constraint-following behavior and general task performance.
1. Dataset Construction
VerInstruct is assembled from a curated selection of 25,000 instruction–response pairs sampled from four publicly available, high-quality datasets: Alpaca-GPT4, Orca-Chat, Evol-Instruct, and OpenAssistant. Each “instruction” paired with its “response” functions as a seed for the generation of explicit constraints. Constraints are classified into:
- Soft constraints: Style, content, or format requirements, extracted using a constraint back-translation prompt on Llama3.1-70B-Instruct, which infers attributes implied in responses (e.g., tone, required keywords, output structure).
- Hard constraints: Automatically produced rules such as length limits (threshold sampled around the response’s word count) and keyword/phrase presence constraints (randomly sampled keywords from the seed responses).
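To make the hard-constraint generation concrete, the sketch below samples a length threshold around the response's word count and keywords from the response itself; the sampling window, word-length filter, and function name are assumptions for illustration, not the authors' exact procedure.

```python
import random

def generate_hard_constraints(response: str, max_keywords: int = 2) -> list[dict]:
    """Illustrative hard-constraint generation from a seed response.

    The +/-20% threshold window and the keyword filter are assumptions; the
    source only states that thresholds are sampled around the response's
    word count and keywords are sampled from the seed response.
    """
    words = response.split()
    constraints = []

    # Length constraint: threshold sampled near the actual word count.
    threshold = max(1, int(len(words) * random.uniform(0.8, 1.2)))
    constraints.append({
        "type": "length",
        "description": f"The response must contain at most {threshold} words.",
        "threshold": threshold,
    })

    # Keyword constraints: words from the response that must appear.
    candidates = list({w.strip(".,!?;:").lower() for w in words if len(w) > 4})
    for kw in random.sample(candidates, min(max_keywords, len(candidates))):
        constraints.append({
            "type": "keyword",
            "description": f'The response must include the keyword "{kw}".',
            "keyword": kw,
        })
    return constraints
```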
Any instruction containing fewer than two total constraints is omitted. For every hard constraint, Qwen2.5-72B-Instruct generates a self-contained Python function (signature: def check_following(response): ...) that returns a Boolean indicating satisfaction of the constraint; these functions are manually verified for correctness. Soft constraints are tagged for LLM-based verification, requiring a reasoning model (default: QwQ-32B) with chain-of-thought reasoning to assess them during RL.
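As an example of what such a generated verifier might look like, a length-limit check can be as simple as the function below; the specific 200-word limit is an invented example, but the signature and Boolean return follow the description above.

```python
def check_following(response):
    """Return True if the response satisfies the hard constraint:
    'The response must contain at most 200 words.'"""
    return len(response.split()) <= 200
```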
Final dataset metrics:
- Instructions: 22,000
- Average constraints per instruction: 6.2
- Constraints: ≈136,400 in total
- Constraint type distribution: 22.3% hard (length, keyword); 77.7% soft (format, content, style)
- Each record is a structured JSON object.
2. Verification Signal Engineering
VerInstruct provides two primary, complementary verification signals per response $y$ to an instruction $x$ with constraint set $C$:
- Rule-based (hard) verification: For each hard constraint $c_i \in C_{\text{hard}}$, the corresponding Python function $f_i$ is executed on $y$, yielding $v_i = f_i(y) \in \{0, 1\}$ (pass/fail). These are aggregated as $s_{\text{hard}} = \frac{1}{|C_{\text{hard}}|} \sum_i v_i$.
- LLM-based (soft) verification: The set of soft constraints $C_{\text{soft}}$ and the response $y$ are input to QwQ-32B (default), which outputs "Yes/No" to the query "does this output satisfy all soft constraints?". "Yes" is mapped to 1, "No" to 0, giving $s_{\text{soft}} \in \{0, 1\}$.
- Composite verification score: As per Equation (1) of the source, the final score is the equal-weight average $s = \tfrac{1}{2}\,(s_{\text{hard}} + s_{\text{soft}})$.
General weighting is also supported: $s = w_{\text{hard}}\, s_{\text{hard}} + w_{\text{soft}}\, s_{\text{soft}}$ with $w_{\text{hard}} + w_{\text{soft}} = 1$ (default: $w_{\text{hard}} = w_{\text{soft}} = 0.5$).
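In code, the composite score amounts to only a few lines. The sketch below assumes the hard score is the mean pass rate of the Python checks and the soft score is the LLM's binary judgment, matching the description above; the handling of records without hard constraints is an assumption.

```python
def composite_score(hard_results: list[bool], soft_pass: bool,
                    w_hard: float = 0.5, w_soft: float = 0.5) -> float:
    """Combine rule-based and LLM-based verification into a single score.

    hard_results: pass/fail outcomes of the Python verifier functions.
    soft_pass:    the LLM verifier's Yes/No judgment on all soft constraints.
    Defaulting to 1.0 when no hard constraints exist is an assumption of
    this sketch, not something stated by the source.
    """
    s_hard = sum(hard_results) / len(hard_results) if hard_results else 1.0
    s_soft = 1.0 if soft_pass else 0.0
    return w_hard * s_hard + w_soft * s_soft
```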
Each JSON record includes the instruction text, all constraints (with their verification methods/code), and an integer constraint count. Score fields (`hard_score`, `soft_score`, `verif_score`) are generated dynamically during RL training.
3. Constraint Taxonomy and Distribution
VerInstruct does not label instructions by classical task category (e.g., Q&A, transformation, code-generation) but provides a granular taxonomy of constraint types and distributions. Among approximately 136,400 constraints:
- Hard constraints: ≈30,400 (22.3%)
- Soft constraints: ≈106,000 (77.7%)
Constraint type breakdown:
| Constraint Type | Count | Percentage |
|---|---|---|
| length | 20,460 | 15.0 % |
| keyword | 9,940 | 7.3 % |
| format | 38,500 | 28.2 % |
| content | 42,900 | 31.5 % |
| style | 24,600 | 18.0 % |
Distribution of constraints per instruction (out of 22,000 records):
| Constraints per Instruction | Percentage |
|---|---|
| 2 | 8 % |
| 3 | 12 % |
| 4 | 18 % |
| 5–7 | 45 % |
| 8+ | 17 % |
This structure ensures a broad coverage of constraint complexity and variation across the dataset.
4. Dataset Format and Recommended Splits
The dataset is disseminated in the JSON Lines (.jsonl) format, where each line is a complete JSON object containing instruction text, a constraints array (with unique ids, type, description, and verifier specification), and the number of constraints. The source provides the entire corpus for RL purposes; no static training/validation/test splits are included.
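Loading the corpus takes a few lines of Python; the field names in the usage comment (instruction, constraints, id, type, description) follow the structure described above, but the exact keys should be verified against the released files.

```python
import json

def load_verinstruct(path: str) -> list[dict]:
    """Read VerInstruct-style JSONL records into memory, one dict per line."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Example usage (field names are assumptions based on the description above):
# records = load_verinstruct("verinstruct.jsonl")
# first = records[0]
# print(first["instruction"])
# for c in first["constraints"]:
#     print(c["id"], c["type"], c["description"])
```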
Recommended RL protocol:
- All 22,000 instructions are used for on-policy rollout generation.
- Model checkpoints are validated every 20 steps using the IFEval benchmark (not included in VerInstruct).
- Researchers may implement alternative splits (e.g., k-fold cross-validation, 80/10/10) for offline reporting, though this is not mandated in the official release.
5. Pipeline Integration and Usage Guidelines
Integrating VerInstruct into an RL training pipeline involves several key steps:
- Data preprocessing:
- Read JSONL; extract “instruction” and constraints array per record.
- Partition constraints by verification method (“python” for hard, “LLM” for soft).
- Compile all Python verifier snippets into memory for rapid evaluation.
- Reward function implementation: For a model-generated response $y$ (see the sketch after this list):
- Evaluate all hard constraints with their compiled Python verifiers; calculate $s_{\text{hard}}$.
- One-shot prompt an LLM to assess all soft constraints; parse the binary output into $s_{\text{soft}}$.
- Compute the reward $r = \tfrac{1}{2}\,(s_{\text{hard}} + s_{\text{soft}})$, implemented via the function `VerIF(x, y)`.
- Recommended RL algorithm configuration:
- VeRL framework with GRPO as the RL algorithm
- Hyperparameters: KL loss coefficient ; 16 rollouts/prompt; batch size 32; learning rate ; maximum generation length 4096 tokens
- Optional use of the distilled “IF-Verifier-7B” model for improved LLM-verifier throughput over QwQ-32B
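Putting the preprocessing and reward steps together, a minimal sketch of the reward computation might look like the following. The helper names (compile_verifiers, query_soft_verifier, verif_reward), the record keys (code, verifier, constraints, instruction), and the verifier prompt are assumptions for illustration; the released verl_reward.py should be treated as the reference implementation.

```python
from typing import Callable

def compile_verifiers(hard_constraints: list[dict]) -> list[Callable[[str], bool]]:
    """Exec each stored verifier snippet and collect its check_following function."""
    verifiers = []
    for c in hard_constraints:
        namespace: dict = {}
        exec(c["code"], namespace)  # assumes each snippet defines check_following(response)
        verifiers.append(namespace["check_following"])
    return verifiers

def query_soft_verifier(llm, instruction: str, response: str,
                        soft_constraints: list[dict]) -> bool:
    """Ask the LLM verifier whether the response satisfies all soft constraints."""
    listed = "\n".join(f"- {c['description']}" for c in soft_constraints)
    prompt = (
        f"Instruction:\n{instruction}\n\nResponse:\n{response}\n\n"
        f"Soft constraints:\n{listed}\n\n"
        "Does the response satisfy all soft constraints? Answer Yes or No."
    )
    answer = llm(prompt)  # llm is any text-in/text-out callable (e.g., a QwQ-32B wrapper)
    return answer.strip().lower().startswith("yes")

def verif_reward(record: dict, response: str, llm,
                 w_hard: float = 0.5, w_soft: float = 0.5) -> float:
    """Compute the composite verification reward for one rollout."""
    hard = [c for c in record["constraints"] if c.get("verifier") == "python"]
    soft = [c for c in record["constraints"] if c.get("verifier") != "python"]

    checks = [fn(response) for fn in compile_verifiers(hard)]
    s_hard = sum(checks) / len(checks) if checks else 1.0
    s_soft = 1.0 if query_soft_verifier(llm, record["instruction"], response, soft) else 0.0
    return w_hard * s_hard + w_soft * s_soft
```

In practice, compiling the verifier snippets once per record during preprocessing (rather than per rollout, as in this sketch) avoids repeated exec calls during training.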
Included tooling supports critical pipeline elements:
- `scripts/constraint_backtranslation.py` for replicating the soft-constraint extraction step.
- `scripts/generate_python_verifier.py` for orchestrating Python hard-constraint generation via Qwen2.5.
- `verl_reward.py`, a ready-made reward computation processor handling constraint verification in parallel.
- Example launch scripts for both verifier model configurations.
Integration of VerInstruct and the VerIF reward function enables immediate on-policy RL fine-tuning of any instruction-tuned base model, yielding empirically verified improvements in constraint-following across representative benchmarks and generalization to novel constraints. The authors report that general capabilities remain unaffected when RL with VerIF is incorporated, indicating compatibility with established RL training recipes.