Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains (2503.23829v2)

Published 31 Mar 2025 in cs.CL

Abstract: Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of LLMs, especially when structured reference answers are accessible for verification. However, its extension to broader, less structured domains remains unexplored. In this work, we investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education, where structured reference answers are typically unavailable. We reveal that binary verification judgments on broad-domain tasks exhibit high consistency across various LLMs provided expert-written reference answers exist. Motivated by this finding, we utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications, especially in free-form, unstructured answer scenarios. We further demonstrate the feasibility of training cross-domain generative reward models using relatively small (7B) LLMs without the need for extensive domain-specific annotation. Through comprehensive experiments, our RLVR framework establishes clear performance gains, significantly outperforming state-of-the-art open-source aligned models such as Qwen2.5-72B and DeepSeek-R1-Distill-Qwen-32B across domains in free-form settings. Our approach notably enhances the robustness, flexibility, and scalability of RLVR, representing a substantial step towards practical reinforcement learning applications in complex, noisy-label scenarios.

Summary

  • The paper introduces a distilled generative reward model to extend RLVR beyond structured tasks by verifying free-form answers.
  • It employs model-based soft scoring and high LLM agreement to deliver nuanced reward signals, reducing the need for extensive human annotation.
  • Experimental results show that 7B models fine-tuned with this approach outperform larger state-of-the-art LLMs across diverse domains.

This work investigates the extension of Reinforcement Learning with Verifiable Rewards (RLVR) beyond its typical applications in mathematical reasoning and coding, where structured reference answers facilitate verification. The primary goal is to apply RLVR to more diverse domains such as medicine, chemistry, psychology, and economics, particularly in scenarios involving free-form answers where objective verification is less straightforward.

LLM Agreement and Reward Modeling

A key observation motivating this research is the high degree of agreement among different LLMs when making binary judgments (correct/incorrect) on tasks where objective reference answers are available, even across diverse domains. This high inter-annotator reliability suggests that LLMs can serve as reliable verifiers, potentially circumventing the need for extensive human annotation to train domain-specific reward models. The finding challenges the conventional approach, which often relies on large datasets annotated specifically for reward modeling within each target domain. The paper posits that a general-purpose LLM, or a distilled version thereof, could act as a universal verifier if properly prompted or fine-tuned for the verification task.
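
As a rough illustration of how such agreement could be checked (not the paper's exact protocol), the sketch below asks several LLMs for a binary judgment against a reference answer and computes mean pairwise agreement; query_llm is a hypothetical wrapper around whatever inference API is available.

from itertools import combinations

VERIFY_PROMPT = (
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Is the candidate answer correct with respect to the reference? "
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def binary_judgment(query_llm, model_name, reference, candidate):
    # query_llm(model_name, prompt) -> str is a hypothetical inference wrapper
    reply = query_llm(model_name, VERIFY_PROMPT.format(reference=reference, candidate=candidate))
    return reply.strip().upper() == "CORRECT"  # real parsing would be more forgiving


def mean_pairwise_agreement(query_llm, model_names, examples):
    # examples: list of (reference, candidate) pairs
    judgments = {
        m: [binary_judgment(query_llm, m, ref, cand) for ref, cand in examples]
        for m in model_names
    }
    rates = [
        sum(a == b for a, b in zip(judgments[m1], judgments[m2])) / len(examples)
        for m1, m2 in combinations(model_names, 2)
    ]
    return sum(rates) / len(rates)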

Handling Unstructured Answers with Soft Scoring

While binary rewards are effective for tasks with clear right/wrong answers (e.g., math problems with a single numerical solution), they prove insufficient for evaluating the quality of free-form, unstructured answers common in domains like psychology or economics. Simple binary verification fails to capture nuances, partial correctness, or different valid perspectives. To address this limitation, the paper proposes incorporating model-based soft scoring into the RLVR framework. Instead of a binary reward, the reward model outputs a continuous score reflecting the quality, relevance, or correctness of the generated answer relative to a (potentially unstructured) reference. This allows for finer-grained feedback during RL training, better guiding the policy model towards generating high-quality, nuanced responses. The mechanism for soft scoring likely involves using an LLM to evaluate the generated response against the reference, outputting a score based on criteria like factual accuracy, completeness, coherence, and relevance.
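
A minimal sketch of this kind of model-based soft scoring (the prompt wording and rubric are illustrative, not taken from the paper) might ask a judge LLM for a score in [0, 1] and parse it into a scalar reward; query_llm is the same hypothetical inference wrapper as above.

SOFT_SCORE_PROMPT = (
    "You are grading a free-form answer against an expert reference.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Considering factual accuracy, completeness, coherence, and relevance, "
    "reply with a single score between 0.0 and 1.0 and nothing else."
)


def soft_score(query_llm, judge_model, question, reference, candidate):
    # query_llm(model_name, prompt) -> str is a hypothetical inference wrapper
    reply = query_llm(
        judge_model,
        SOFT_SCORE_PROMPT.format(question=question, reference=reference, candidate=candidate),
    )
    try:
        score = float(reply.strip())
    except ValueError:
        score = 0.0  # unparseable judge output falls back to zero reward
    return min(max(score, 0.0), 1.0)  # clamp into [0, 1]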

Distilled Generative Reward Model as a Cross-Domain Verifier

The core technical contribution is the development and evaluation of a distilled generative reward model designed to function as an effective cross-domain verifier. This model aims to provide reliable reward signals for RL fine-tuning without necessitating domain-specific annotations. Distillation is likely employed to transfer the verification capabilities of a larger, more capable LLM (or an ensemble) into a smaller, more efficient model (e.g., a 7B parameter model, as used in the policy fine-tuning). This distilled model acts as the reward function R(s, a) in the RL loop, where s represents the context/prompt and a represents the generated answer. The generative nature of the reward model might imply it can not only score but potentially also provide explanations or critiques, although the primary function in the RL loop is providing the scalar reward signal.

The training process for this distilled reward model likely involves the following steps (a rough code sketch follows the list):

  1. Selecting a diverse dataset covering the target domains (medicine, chemistry, etc.) with prompts and reference answers (both structured and unstructured).
  2. Using one or more capable teacher LLMs to generate judgments (binary or soft scores) for model-generated answers compared against the references.
  3. Training the smaller student reward model to mimic the judgments of the teacher LLM(s) on this dataset.
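
Under the assumptions above, and simplifying the generative verifier to a scalar-head scorer regressed onto teacher scores (the paper's reward model is generative, so this is only an illustrative stand-in), step 3 might look roughly like:

import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class TeacherJudgmentDataset(Dataset):
    """Holds (prompt, response, teacher_score) triples with scores in [0, 1]."""

    def __init__(self, records, tokenizer, max_length=1024):
        self.records, self.tokenizer, self.max_length = records, tokenizer, max_length

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        prompt, response, score = self.records[idx]
        enc = self.tokenizer(
            prompt + "\n\n" + response,
            truncation=True, max_length=self.max_length,
            padding="max_length", return_tensors="pt",
        )
        return {k: v.squeeze(0) for k, v in enc.items()}, torch.tensor(score)


def distill_reward_model(records, student_name, epochs=1, device="cuda"):
    # student_name: any HF checkpoint (e.g., a ~7B base model) usable with a
    # sequence-classification head; chosen by the practitioner, not the paper
    tokenizer = AutoTokenizer.from_pretrained(student_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    student = AutoModelForSequenceClassification.from_pretrained(
        student_name, num_labels=1
    ).to(device)
    student.config.pad_token_id = tokenizer.pad_token_id

    loader = DataLoader(TeacherJudgmentDataset(records, tokenizer), batch_size=4, shuffle=True)
    optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

    student.train()
    for _ in range(epochs):
        for inputs, targets in loader:
            inputs = {k: v.to(device) for k, v in inputs.items()}
            preds = torch.sigmoid(student(**inputs).logits.squeeze(-1))
            loss = torch.nn.functional.mse_loss(preds, targets.to(device))  # regress onto teacher scores
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student, tokenizer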

Experimental Validation and Results

To validate the approach, the researchers fine-tuned a base 7B parameter LLM with reinforcement learning (the abstract does not specify the algorithm; PPO-style policy optimization is a common choice for LLM fine-tuning) against the rewards provided by their distilled generative reward model. The performance of these fine-tuned models was then evaluated on benchmark datasets across the target domains, focusing on free-form answer generation tasks.

The results indicate that the 7B models fine-tuned using this cross-domain RLVR approach significantly outperformed state-of-the-art open-source aligned LLMs, including substantially larger models like Qwen2.5-72B-Instruct and DeepSeek-R1-Distill-Qwen-32B. This is a strong claim, suggesting that targeted RL fine-tuning using an effective, albeit distilled, reward model can lead to performance gains that surpass those achieved by simply scaling model size or using standard alignment techniques like instruction tuning or rejection sampling fine-tuning (RSFT) alone. The experiments demonstrated the efficacy of the method across diverse domains and its ability to handle free-form answers, strengthening the case for RLVR's robustness and scalability.

Practical Implications and Implementation

The practical implications of this work are significant for applying RL to improve LLMs in real-world settings:

  1. Reduced Annotation Cost: By leveraging LLMs as verifiers and distilling a cross-domain reward model, the need for expensive, domain-specific human annotation for reward modeling can potentially be reduced.
  2. Broadened Applicability of RLVR: The method extends the applicability of RLVR beyond niche domains with structured answers to a wider range of tasks involving nuanced, free-form text generation.
  3. Handling Noisy/Weak Labels: The framework shows potential for applications where only noisy or weak labels are available, as the LLM-based verifier might be trained to handle such imperfections.
  4. Efficient High-Performance Models: It demonstrates a pathway to achieving high performance with smaller models (e.g., 7B) through targeted RL fine-tuning, potentially outperforming much larger models on specific task distributions defined by the reward model.

Implementation considerations would involve:

  • Choice of Base LLM: Selecting an appropriate base model for fine-tuning (e.g., Llama, Mistral 7B).
  • Reward Model Distillation: Implementing the distillation process, requiring access to capable teacher LLM(s) and a diverse dataset for generating reward signals.
  • RL Algorithm Selection: Choosing and implementing a suitable RL algorithm (e.g., PPO) for LLM fine-tuning, managing exploration-exploitation trade-offs, and ensuring training stability.
  • Computational Resources: RL fine-tuning requires significant computational resources, although potentially less than training large models from scratch. The reward model itself also needs to be served during training.
  • Evaluation: Rigorous evaluation across multiple domains and metrics is crucial to validate performance gains.
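
The pseudocode below sketches how these pieces could fit together in an RLVR fine-tuning loop. The policy, prompt dataset, and RL trainer are assumed interfaces rather than details taken from the paper, and the reward model is served as a simple scalar scorer.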

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def fine_tune_policy_with_rlvr(
    base_policy_model,
    distilled_reward_model,
    rl_algorithm,        # e.g., a PPO trainer exposing update_step()
    dataset,             # prompt source exposing sample_batch()
    num_training_steps,
    log_interval=100,
):
    """
    Fine-tunes a base policy LLM using RLVR with a distilled reward model.
    The policy, dataset, and RL trainer are assumed interfaces; only the
    reward-model side below is made concrete.
    """
    policy = base_policy_model
    optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_training_steps)

    for step in range(num_training_steps):
        # 1. Sample a batch of prompts from the dataset
        prompts = dataset.sample_batch()

        # 2. Generate responses with the current policy
        #    (tokenization and sampling parameters such as temperature/top_p
        #    are assumed to be handled inside generate())
        generated_responses = policy.generate(prompts)

        # 3. Score the responses with the distilled reward model, which maps
        #    (prompt, response) pairs to scalar soft rewards
        rewards = distilled_reward_model.score(prompts, generated_responses)

        # 4. Compute the RL loss (e.g., PPO clipped surrogate plus value loss);
        #    advantages and old log-probabilities live inside rl_algorithm
        loss = rl_algorithm.update_step(
            prompts=prompts,
            generated_responses=generated_responses,
            rewards=rewards,
            policy=policy,
        )

        # 5. Optimizer and scheduler steps
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

        if step % log_interval == 0:
            print(f"Step: {step}, Loss: {loss.item():.4f}, Avg Reward: {rewards.mean().item():.4f}")

    return policy  # the fine-tuned policy model


class DistilledRewardModel:
    """Serves the distilled verifier as a scalar scorer (one simple realization)."""

    def __init__(self, model_path, device="cuda"):
        # Load the distilled reward model with a single-logit classification head
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_path, num_labels=1
        ).to(device).eval()
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model.config.pad_token_id = self.tokenizer.pad_token_id
        self.device = device

    @torch.no_grad()
    def score(self, prompts, responses):
        # Concatenate each prompt with its response as the verification input
        texts = [f"{p}\n\n{r}" for p, r in zip(prompts, responses)]
        inputs = self.tokenizer(
            texts, padding=True, truncation=True, return_tensors="pt"
        ).to(self.device)
        # One logit per example; map raw logits to soft rewards in [0, 1]
        logits = self.model(**inputs).logits.squeeze(-1)
        return torch.sigmoid(logits)
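
Wiring the pieces together might look like the following; policy_7b, ppo_trainer, and prompt_dataset are hypothetical objects satisfying the interfaces assumed above, and the checkpoint path is a placeholder.

reward_model = DistilledRewardModel("path/to/distilled-verifier")
tuned_policy = fine_tune_policy_with_rlvr(
    base_policy_model=policy_7b,      # a 7B policy model with a generate() method
    distilled_reward_model=reward_model,
    rl_algorithm=ppo_trainer,         # any trainer exposing update_step()
    dataset=prompt_dataset,           # exposes sample_batch()
    num_training_steps=10_000,
)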

Conclusion

This paper presents a method for extending Reinforcement Learning with Verifiable Rewards (RLVR) to diverse domains lacking structured answers by utilizing LLMs for verification and employing model-based soft scoring. The use of a distilled generative reward model demonstrates potential for achieving significant performance improvements in free-form generation tasks with relatively smaller models, outperforming larger state-of-the-art counterparts and reducing reliance on domain-specific reward annotations. This enhances the scalability and practical applicability of RL for fine-tuning LLMs across a broader spectrum of real-world tasks.
