Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains (2503.23829v2)

Published 31 Mar 2025 in cs.CL

Abstract: Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of LLMs, especially when structured reference answers are accessible for verification. However, its extension to broader, less structured domains remains unexplored. In this work, we investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education, where structured reference answers are typically unavailable. We reveal that binary verification judgments on broad-domain tasks exhibit high consistency across various LLMs provided expert-written reference answers exist. Motivated by this finding, we utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications, especially in free-form, unstructured answer scenarios. We further demonstrate the feasibility of training cross-domain generative reward models using relatively small (7B) LLMs without the need for extensive domain-specific annotation. Through comprehensive experiments, our RLVR framework establishes clear performance gains, significantly outperforming state-of-the-art open-source aligned models such as Qwen2.5-72B and DeepSeek-R1-Distill-Qwen-32B across domains in free-form settings. Our approach notably enhances the robustness, flexibility, and scalability of RLVR, representing a substantial step towards practical reinforcement learning applications in complex, noisy-label scenarios.

Summary

  • The paper introduces a distilled generative reward model to extend RLVR beyond structured tasks by verifying free-form answers.
  • It employs model-based soft scoring and high LLM agreement to deliver nuanced reward signals, reducing the need for extensive human annotation.
  • Experimental results show that 7B models fine-tuned with this approach outperform larger state-of-the-art LLMs across diverse domains.

This work investigates the extension of Reinforcement Learning with Verifiable Rewards (RLVR) beyond its typical applications in mathematical reasoning and coding, where structured reference answers facilitate verification. The primary goal is to apply RLVR to more diverse domains such as medicine, chemistry, psychology, and economics, particularly in scenarios involving free-form answers where objective verification is less straightforward.

LLM Agreement and Reward Modeling

A key observation motivating this research is the high degree of agreement among different LLMs when making binary judgments (correct/incorrect) on tasks where objective reference answers are available, even across diverse domains. This high inter-annotator reliability suggests that LLMs can serve as reliable verifiers, potentially circumventing the need for extensive human annotation to train domain-specific reward models. The finding challenges the conventional approach, which often relies on large datasets annotated specifically for reward modeling within each target domain. The paper posits that a general-purpose LLM, or a distilled version thereof, could act as a universal verifier if properly prompted or fine-tuned for the verification task.
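
As a rough illustration of how such agreement could be checked (not the paper's exact protocol), the sketch below asks several LLMs for a binary judgment against a reference answer and computes mean pairwise agreement; query_llm is a hypothetical wrapper around whatever inference API is available.

from itertools import combinations

VERIFY_PROMPT = (
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Is the candidate answer correct with respect to the reference? "
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def binary_judgment(query_llm, model_name, reference, candidate):
    # query_llm(model_name, prompt) -> str is a hypothetical inference wrapper
    reply = query_llm(model_name, VERIFY_PROMPT.format(reference=reference, candidate=candidate))
    return reply.strip().upper() == "CORRECT"  # real parsing would be more forgiving


def mean_pairwise_agreement(query_llm, model_names, examples):
    # examples: list of (reference, candidate) pairs
    judgments = {
        m: [binary_judgment(query_llm, m, ref, cand) for ref, cand in examples]
        for m in model_names
    }
    rates = [
        sum(a == b for a, b in zip(judgments[m1], judgments[m2])) / len(examples)
        for m1, m2 in combinations(model_names, 2)
    ]
    return sum(rates) / len(rates)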

Handling Unstructured Answers with Soft Scoring

While binary rewards are effective for tasks with clear right/wrong answers (e.g., math problems with a single numerical solution), they prove insufficient for evaluating the quality of free-form, unstructured answers common in domains like psychology or economics. Simple binary verification fails to capture nuances, partial correctness, or different valid perspectives. To address this limitation, the paper proposes incorporating model-based soft scoring into the RLVR framework. Instead of a binary reward, the reward model outputs a continuous score reflecting the quality, relevance, or correctness of the generated answer relative to a (potentially unstructured) reference. This allows for finer-grained feedback during RL training, better guiding the policy model towards generating high-quality, nuanced responses. The mechanism for soft scoring likely involves using an LLM to evaluate the generated response against the reference, outputting a score based on criteria like factual accuracy, completeness, coherence, and relevance.
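
A minimal sketch of this kind of model-based soft scoring (the prompt wording and rubric are illustrative, not taken from the paper) might ask a judge LLM for a score in [0, 1] and parse it into a scalar reward; query_llm is the same hypothetical inference wrapper as above.

SOFT_SCORE_PROMPT = (
    "You are grading a free-form answer against an expert reference.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Considering factual accuracy, completeness, coherence, and relevance, "
    "reply with a single score between 0.0 and 1.0 and nothing else."
)


def soft_score(query_llm, judge_model, question, reference, candidate):
    # query_llm(model_name, prompt) -> str is a hypothetical inference wrapper
    reply = query_llm(
        judge_model,
        SOFT_SCORE_PROMPT.format(question=question, reference=reference, candidate=candidate),
    )
    try:
        score = float(reply.strip())
    except ValueError:
        score = 0.0  # unparseable judge output falls back to zero reward
    return min(max(score, 0.0), 1.0)  # clamp into [0, 1]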

Distilled Generative Reward Model as a Cross-Domain Verifier

The core technical contribution is the development and evaluation of a distilled generative reward model designed to function as an effective cross-domain verifier. This model aims to provide reliable reward signals for RL fine-tuning without necessitating domain-specific annotations. Distillation is likely employed to transfer the verification capabilities of a larger, more capable LLM (or an ensemble) into a smaller, more efficient model (e.g., a 7B parameter model, as used in the policy fine-tuning). This distilled model acts as the reward function R(s, a) in the RL loop, where s represents the context/prompt and a represents the generated answer. The generative nature of the reward model might imply it can not only score but potentially also provide explanations or critiques, although the primary function in the RL loop is providing the scalar reward signal.

The training process for this distilled reward model likely involves the following steps (a rough code sketch follows the list):

  1. Selecting a diverse dataset covering the target domains (medicine, chemistry, etc.) with prompts and reference answers (both structured and unstructured).
  2. Using one or more capable teacher LLMs to generate judgments (binary or soft scores) for model-generated answers compared against the references.
  3. Training the smaller student reward model to mimic the judgments of the teacher LLM(s) on this dataset.
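
Under the assumptions above, and simplifying the generative verifier to a scalar-head scorer regressed onto teacher scores (the paper's reward model is generative, so this is only an illustrative stand-in), step 3 might look roughly like:

import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class TeacherJudgmentDataset(Dataset):
    """Holds (prompt, response, teacher_score) triples with scores in [0, 1]."""

    def __init__(self, records, tokenizer, max_length=1024):
        self.records, self.tokenizer, self.max_length = records, tokenizer, max_length

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        prompt, response, score = self.records[idx]
        enc = self.tokenizer(
            prompt + "\n\n" + response,
            truncation=True, max_length=self.max_length,
            padding="max_length", return_tensors="pt",
        )
        return {k: v.squeeze(0) for k, v in enc.items()}, torch.tensor(score)


def distill_reward_model(records, student_name, epochs=1, device="cuda"):
    # student_name: any HF checkpoint (e.g., a ~7B base model) usable with a
    # sequence-classification head; chosen by the practitioner, not the paper
    tokenizer = AutoTokenizer.from_pretrained(student_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    student = AutoModelForSequenceClassification.from_pretrained(
        student_name, num_labels=1
    ).to(device)
    student.config.pad_token_id = tokenizer.pad_token_id

    loader = DataLoader(TeacherJudgmentDataset(records, tokenizer), batch_size=4, shuffle=True)
    optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

    student.train()
    for _ in range(epochs):
        for inputs, targets in loader:
            inputs = {k: v.to(device) for k, v in inputs.items()}
            preds = torch.sigmoid(student(**inputs).logits.squeeze(-1))
            loss = torch.nn.functional.mse_loss(preds, targets.to(device))  # regress onto teacher scores
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student, tokenizer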

Experimental Validation and Results

To validate the approach, the researchers fine-tuned a base 7B parameter LLM with reinforcement learning (the abstract does not specify the algorithm; PPO-style policy optimization is a common choice for LLM fine-tuning) against the rewards provided by their distilled generative reward model. The performance of these fine-tuned models was then evaluated on benchmark datasets across the target domains, focusing on free-form answer generation tasks.

The results indicate that the 7B models fine-tuned using this cross-domain RLVR approach significantly outperformed state-of-the-art open-source aligned LLMs, including substantially larger models like Qwen2.5-72B-Instruct and DeepSeek-R1-Distill-Qwen-32B. This is a strong claim, suggesting that targeted RL fine-tuning using an effective, albeit distilled, reward model can lead to performance gains that surpass those achieved by simply scaling model size or using standard alignment techniques like instruction tuning or rejection sampling fine-tuning (RSFT) alone. The experiments demonstrated the efficacy of the method across diverse domains and its ability to handle free-form answers, strengthening the case for RLVR's robustness and scalability.

Practical Implications and Implementation

The practical implications of this work are significant for applying RL to improve LLMs in real-world settings:

  1. Reduced Annotation Cost: By leveraging LLMs as verifiers and distilling a cross-domain reward model, the need for expensive, domain-specific human annotation for reward modeling can potentially be reduced.
  2. Broadened Applicability of RLVR: The method extends the applicability of RLVR beyond niche domains with structured answers to a wider range of tasks involving nuanced, free-form text generation.
  3. Handling Noisy/Weak Labels: The framework shows potential for applications where only noisy or weak labels are available, as the LLM-based verifier might be trained to handle such imperfections.
  4. Efficient High-Performance Models: It demonstrates a pathway to achieving high performance with smaller models (e.g., 7B) through targeted RL fine-tuning, potentially outperforming much larger models on specific task distributions defined by the reward model.

Implementation considerations would involve:

  • Choice of Base LLM: Selecting an appropriate base model for fine-tuning (e.g., Llama, Mistral 7B).
  • Reward Model Distillation: Implementing the distillation process, requiring access to capable teacher LLM(s) and a diverse dataset for generating reward signals.
  • RL Algorithm Selection: Choosing and implementing a suitable RL algorithm (e.g., PPO) for LLM fine-tuning, managing exploration-exploitation trade-offs, and ensuring training stability.
  • Computational Resources: RL fine-tuning requires significant computational resources, although potentially less than training large models from scratch. The reward model itself also needs to be served during training.
  • Evaluation: Rigorous evaluation across multiple domains and metrics is crucial to validate performance gains.
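
The pseudocode below sketches how these pieces could fit together in an RLVR fine-tuning loop. The policy, prompt dataset, and RL trainer are assumed interfaces rather than details taken from the paper, and the reward model is served as a simple scalar scorer.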

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def fine_tune_policy_with_rlvr(
    base_policy_model,
    distilled_reward_model,
    rl_algorithm,        # e.g., a PPO trainer exposing update_step()
    dataset,             # prompt source exposing sample_batch()
    num_training_steps,
    log_interval=100,
):
    """
    Fine-tunes a base policy LLM using RLVR with a distilled reward model.
    The policy, dataset, and RL trainer are assumed interfaces; only the
    reward-model side below is made concrete.
    """
    policy = base_policy_model
    optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_training_steps)

    for step in range(num_training_steps):
        # 1. Sample a batch of prompts from the dataset
        prompts = dataset.sample_batch()

        # 2. Generate responses with the current policy
        #    (tokenization and sampling parameters such as temperature/top_p
        #    are assumed to be handled inside generate())
        generated_responses = policy.generate(prompts)

        # 3. Score the responses with the distilled reward model, which maps
        #    (prompt, response) pairs to scalar soft rewards
        rewards = distilled_reward_model.score(prompts, generated_responses)

        # 4. Compute the RL loss (e.g., PPO clipped surrogate plus value loss);
        #    advantages and old log-probabilities live inside rl_algorithm
        loss = rl_algorithm.update_step(
            prompts=prompts,
            generated_responses=generated_responses,
            rewards=rewards,
            policy=policy,
        )

        # 5. Optimizer and scheduler steps
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

        if step % log_interval == 0:
            print(f"Step: {step}, Loss: {loss.item():.4f}, Avg Reward: {rewards.mean().item():.4f}")

    return policy  # the fine-tuned policy model


class DistilledRewardModel:
    """Serves the distilled verifier as a scalar scorer (one simple realization)."""

    def __init__(self, model_path, device="cuda"):
        # Load the distilled reward model with a single-logit classification head
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_path, num_labels=1
        ).to(device).eval()
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model.config.pad_token_id = self.tokenizer.pad_token_id
        self.device = device

    @torch.no_grad()
    def score(self, prompts, responses):
        # Concatenate each prompt with its response as the verification input
        texts = [f"{p}\n\n{r}" for p, r in zip(prompts, responses)]
        inputs = self.tokenizer(
            texts, padding=True, truncation=True, return_tensors="pt"
        ).to(self.device)
        # One logit per example; map raw logits to soft rewards in [0, 1]
        logits = self.model(**inputs).logits.squeeze(-1)
        return torch.sigmoid(logits)
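
Wiring the pieces together might look like the following; policy_7b, ppo_trainer, and prompt_dataset are hypothetical objects satisfying the interfaces assumed above, and the checkpoint path is a placeholder.

reward_model = DistilledRewardModel("path/to/distilled-verifier")
tuned_policy = fine_tune_policy_with_rlvr(
    base_policy_model=policy_7b,      # a 7B policy model with a generate() method
    distilled_reward_model=reward_model,
    rl_algorithm=ppo_trainer,         # any trainer exposing update_step()
    dataset=prompt_dataset,           # exposes sample_batch()
    num_training_steps=10_000,
)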

Conclusion

This paper presents a method for extending Reinforcement Learning with Verifiable Rewards (RLVR) to diverse domains lacking structured answers by utilizing LLMs for verification and employing model-based soft scoring. The use of a distilled generative reward model demonstrates potential for achieving significant performance improvements in free-form generation tasks with relatively smaller models, outperforming larger state-of-the-art counterparts and reducing reliance on domain-specific reward annotations. This enhances the scalability and practical applicability of RL for fine-tuning LLMs across a broader spectrum of real-world tasks.
