
VerInstruct Dataset for RL Verification

Updated 13 November 2025
  • VerInstruct is a high-quality dataset designed for reinforcement learning fine-tuning of large language models, incorporating 22,000 instruction-response pairs with both hard and soft constraints.
  • It integrates data from four reputable sources and employs automated Python functions alongside LLM-based verification to enforce and evaluate constraints.
  • The dataset supports scalable RL pipelines with comprehensive tooling for implementing verifiable reward functions, enhancing model adherence and performance.

VerInstruct is a high-quality dataset designed for reinforcement learning (RL) in instruction following, addressing the need for verifiable reward signals when fine-tuning LLMs for improved constraint adherence. Developed as part of the VerIF methodology, VerInstruct contains approximately 22,000 instruction–response pairs, each annotated with a comprehensive set of “hard” and “soft” constraints alongside engineered verification signals suitable for RL with verifiable rewards (RLVR). The dataset and associated tooling support the systematic evaluation and scalable RL training of LLMs, facilitating enhancements in both constraint-following behavior and general task performance.

1. Dataset Construction

VerInstruct is assembled from a curated selection of 25,000 instruction–response pairs sampled from four publicly available, high-quality datasets: Alpaca-GPT4, Orca-Chat, Evol-Instruct, and OpenAssistant. Each “instruction” paired with its “response” functions as a seed for the generation of explicit constraints. Constraints are classified into:

  • Soft constraints: Style, content, or format requirements, extracted using a constraint back-translation prompt on Llama3.1-70B-Instruct, which infers attributes implied in responses (e.g., tone, required keywords, output structure).
  • Hard constraints: Automatically produced rules such as length limits (threshold sampled around the response’s word count) and keyword/phrase presence constraints (randomly sampled keywords from the seed responses).

Any instruction containing fewer than two total constraints is omitted. For every hard constraint, Qwen2.5-72B-Instruct generates a self-contained Python function (signature: def check_following(response): ...) that returns a Boolean indicating satisfaction of the constraint; these functions are manually verified for correctness. Soft constraints are tagged for LLM-based verification, requiring a reasoning model (default: QwQ-32B) with chain-of-thought reasoning to assess them during RL.
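For illustration, a generated hard-constraint verifier in the described `check_following` signature might look like the following (a hypothetical example for a length constraint; the actual generated functions vary per constraint):

```python
# Hypothetical auto-generated verifier for the hard constraint
# "the response must contain at most 150 words".
def check_following(response):
    # Count whitespace-delimited words and compare to the sampled threshold.
    return len(response.split()) <= 150
```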

Final dataset metrics:

  • Instructions: 22,000
  • Average constraints per instruction: 6.2
  • Constraints: ≈136,400 in total
  • Constraint type distribution: 22.3% hard (length, keyword); 77.7% soft (format, content, style)
  • Each record is a structured JSON object.

2. Verification Signal Engineering

VerInstruct provides two complementary verification signals per response $y$ for instruction $x$ with constraints $C = C_h \cup C_s$:

  • Rule-based (hard) verification: For each hard constraint $c_i \in C_h$, the corresponding Python function is executed on $y$, yielding $\mathrm{code}_i(y) \in \{0, 1\}$ (pass/fail). These are aggregated as

$$S_{\mathrm{rule}}(y) = \frac{1}{|C_h|} \sum_{c_i \in C_h} \mathrm{code}_i(y)$$

  • LLM-based (soft) verification: The set of soft constraints $C_s$ and $y$ are input to QwQ-32B (default), which answers “Yes/No” to the query “does this output satisfy all soft constraints?”. “Yes” is mapped to 1 and “No” to 0, giving $S_{\mathrm{LLM}}(y) \in \{0, 1\}$.
  • Composite verification score: As per Equation (1) of the source, the final score is an equal-average:

$$V(x, y) = \mathrm{VerIF}(x, y) = F(S_{\mathrm{rule}}, S_{\mathrm{LLM}}) = \frac{1}{2} S_{\mathrm{rule}}(y) + \frac{1}{2} S_{\mathrm{LLM}}(y)$$

General weighting is also supported: $V = \alpha \cdot S_{\mathrm{rule}} + \beta \cdot S_{\mathrm{LLM}}$ with $\alpha + \beta = 1$ (default: $\alpha = \beta = 0.5$).
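The weighted composite score can be sketched as a small helper (a minimal sketch; `hard_results` is assumed to hold the per-constraint 0/1 outputs of the Python verifiers and `llm_verdict` the parsed Yes/No of the soft verifier):

```python
def verif_score(hard_results, llm_verdict, alpha=0.5, beta=0.5):
    """Combine rule-based and LLM-based signals: V = alpha*S_rule + beta*S_LLM."""
    assert abs(alpha + beta - 1.0) < 1e-9, "weights must sum to 1"
    # Fraction of hard constraints passed (0.0 if there are none).
    s_rule = sum(hard_results) / len(hard_results) if hard_results else 0.0
    # Binary soft-constraint verdict from the LLM verifier.
    s_llm = 1.0 if llm_verdict else 0.0
    return alpha * s_rule + beta * s_llm
```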

Each JSON record includes all constraints (with their verification methods/code), the instruction text, and an integer constraint count. Score fields (hard_score, soft_score, verif_score) are generated dynamically during RL training.

3. Constraint Taxonomy and Distribution

VerInstruct does not label instructions by classical task category (e.g., Q&A, transformation, code-generation) but provides a granular taxonomy of constraint types and distributions. Among approximately 136,400 constraints:

  • Hard constraints: $|C_h| \approx 30{,}400$ (22.3%)
  • Soft constraints: $|C_s| \approx 106{,}000$ (77.7%)

Constraint type breakdown:

| Constraint Type | Count  | Percentage |
|-----------------|--------|------------|
| length          | 20,460 | 15.0 %     |
| keyword         | 9,940  | 7.3 %      |
| format          | 38,500 | 28.2 %     |
| content         | 42,900 | 31.5 %     |
| style           | 24,600 | 18.0 %     |

Distribution of constraints per instruction (out of 22,000 records):

| Constraints per Instruction | Percentage |
|-----------------------------|------------|
| 2                           | 8 %        |
| 3                           | 12 %       |
| 4                           | 18 %       |
| 5–7                         | 45 %       |
| 8+                          | 17 %       |

This structure ensures a broad coverage of constraint complexity and variation across the dataset.

4. Data Format and Splits

The dataset is disseminated in the JSON Lines (.jsonl) format, where each line is a complete JSON object containing the instruction text, a constraints array (with unique ids, type, description, and verifier specification), and the number of constraints. The source provides the entire corpus for RL purposes; no static training/validation/test splits are included.
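Under this schema, loading the corpus reduces to parsing one JSON object per line. A sketch (field names such as `instruction`, `constraints`, and `num_constraints` follow the description here and should be checked against the actual release):

```python
import json

def load_verinstruct(path):
    """Yield (instruction, constraints, count) per JSONL line.

    Assumes the record layout described above; field names are illustrative.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["instruction"], record["constraints"], record["num_constraints"]
```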

Recommended RL protocol:

  • All 22,000 instructions are used for on-policy rollout generation.
  • Model checkpoints are validated every 20 steps using the IFEval benchmark (not included in VerInstruct).
  • Researchers may implement alternative splits (e.g., $k$-fold, 80/10/10) for offline reporting, though this is not mandated in the official release.

5. Pipeline Integration and Usage Guidelines

Integrating VerInstruct into an RL training pipeline involves several key steps:

  • Data preprocessing:
  1. Read JSONL; extract “instruction” and constraints array per record.
  2. Partition constraints by verification method (“python” for hard, “LLM” for soft).
  3. Compile all Python verifier snippets into memory for rapid evaluation.
  • Reward function implementation: For a model-generated response $y$:
    • Evaluate all hard constraints; calculate $S_{\mathrm{rule}}$.
    • One-shot prompt an LLM to assess all soft constraints; parse the binary output into $S_{\mathrm{LLM}}$.
    • Compute the reward $R = \frac{1}{2} S_{\mathrm{rule}} + \frac{1}{2} S_{\mathrm{LLM}}$, implemented via the function VerIF(x, y).
  • Recommended RL algorithm configuration:
    • VeRL framework with GRPO as RL method
    • Hyperparameters: KL loss coefficient $10^{-3}$; 16 rollouts/prompt; batch size 32; learning rate $10^{-6}$; maximum generation length 4096 tokens
    • Optional use of the distilled “IF-Verifier-7B” model for improved LLM-verifier throughput over QwQ-32B
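Putting the preprocessing and reward steps above together, the per-rollout reward computation might be sketched as follows. The soft-constraint judgment is abstracted behind a caller-supplied `llm_judge` callable, since the actual QwQ-32B / IF-Verifier-7B invocation depends on the serving setup; function names here are illustrative, not the release's `verl_reward.py` API:

```python
def compute_reward(response, hard_verifiers, soft_constraints, llm_judge, alpha=0.5):
    """Return R = alpha*S_rule + (1-alpha)*S_LLM for one model response.

    hard_verifiers:   compiled check_following-style callables (one per hard constraint)
    soft_constraints: soft-constraint descriptions passed to the LLM verifier
    llm_judge:        callable(response, constraints) -> bool (the parsed "Yes"/"No")
    """
    # S_rule: fraction of hard constraints whose Python verifier passes.
    s_rule = (
        sum(1 for check in hard_verifiers if check(response)) / len(hard_verifiers)
        if hard_verifiers else 0.0
    )
    # S_LLM: single binary verdict over all soft constraints at once.
    s_llm = 1.0 if llm_judge(response, soft_constraints) else 0.0
    return alpha * s_rule + (1.0 - alpha) * s_llm
```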

Included tooling supports critical pipeline elements:

  • scripts/constraint_backtranslation.py for replicating the soft-constraint extraction step.
  • scripts/generate_python_verifier.py for orchestrating Python hard-constraint generation via Qwen2.5.
  • verl_reward.py, a ready-made reward computation processor handling constraint verification in parallel.
  • Example launch scripts for both verifier model configurations.

Integration of VerInstruct and the VerIF reward function enables immediate on-policy RL fine-tuning of any instruction-tuned base model, yielding empirically verified improvements in constraint-following across representative benchmarks and generalization to novel constraints. The authors report that general capabilities remain unaffected when RL with VerIF is incorporated, indicating compatibility with established RL training recipes.
