
OpenRM: Tool-Augmented Reward Model

Updated 19 February 2026
  • OpenRM is a tool-augmented generative reward model that evaluates large language models by integrating external evidence sources.
  • It employs a sequential decision process to dynamically select retrieval tools like Wikipedia and arXiv, ensuring evidence-based adjudication.
  • Trained via Group Relative Policy Optimization on 27K synthesized preference pairs, OpenRM achieves state-of-the-art performance on diverse benchmarks.

OpenRM (OpenReward) is a tool-augmented, generative reward model for the evaluation and alignment of LLMs on knowledge-intensive, long-form, and agentic tasks. Unlike traditional reward models, which rely exclusively on the judge model's parametric knowledge, OpenRM autonomously invokes external evidence sources (notably Wikipedia and arXiv) during adjudication. Evaluation is formulated as a sequential decision process in which the model dynamically selects retrieval tools, integrates retrieved evidence, and ultimately issues a preference judgment. OpenRM is trained using Group Relative Policy Optimization (GRPO) on a large corpus of synthesized preference data, with joint supervision of both tool use and final judgment accuracy, and achieves substantial empirical gains over previous approaches on both in-domain and out-of-domain benchmarks (Hu et al., 28 Oct 2025).

1. Sequential Tool-Augmented Evaluation and Core Architecture

OpenRM operates atop a frozen LLM backbone (Qwen-2.5-7B-Instruct), augmented with two specialized external retrieval tools: Wikipedia Search (2018 dump, ColBERT-v2 index) and arXiv Search (via LitSearch on scientific corpora). Given a query $q$ and two candidate responses $(x_1, x_2)$, the evaluation unfolds as a finite-horizon Markov decision process:

  • State $s_i$: $[q,\, x_1,\, x_2,\, \{(t_j, e_j)\}_{j < i}]$, capturing the query, the two responses, and the sequence of prior tool calls and retrieved evidence.
  • Action $a_i$: either selection of a tool $t_i \in \{\mathrm{WikiSearch}, \mathrm{ArxivSearch}\}$ or the “Stop” action (emitting a final preference $y \in \{\mathrm{A}, \mathrm{B}\}$).
  • Transition: Tool actions invoke external search; evidence is appended to the working context. Termination is triggered when “Stop” is chosen, yielding the model’s final preference verdict.

The policy $\pi_\theta$ samples an action at each step, $t_i \sim \pi_\theta(t \mid s_i)$. Tool invocation is bounded ($n$ steps maximum), ensuring tractability. This process allows OpenRM to perform context-driven evidence gathering, in contrast to both scalar RMs and LLM-as-judge baselines, which lack explicit external querying capacity.

2. Training via Group Relative Policy Optimization (GRPO)

OpenRM is trained with a group-based variant of Proximal Policy Optimization (PPO), termed Group Relative Policy Optimization (GRPO). For each query, $m$ trajectories are sampled, and rewards are defined as a composite function:

$$R(T) = R_{\mathrm{EM}}(T) + \mathrm{sign}\bigl(R_{\mathrm{EM}}(T)\bigr)\,\alpha\,R_{\mathrm{tool}}(T)$$

where:

  • $R_{\mathrm{EM}}(T)$: 1 if the final preference matches ground truth, else 0.
  • $R_{\mathrm{tool}}(T)$: count of appropriate tool calls within the trajectory.
  • $\alpha$: balancing parameter (default 0.5).

The group-relative advantage is computed as

$$A_i = \frac{R(T_i) - \frac{1}{m}\sum_{j=1}^m R(T_j)}{\mathrm{std}(\{R(T_j)\}_{j=1}^m)}$$

The policy-gradient objective with clipping and KL regularization is

$$J(\theta) = \mathbb{E}\Bigl[\min\bigl(r_i A_i,\ \mathrm{clip}(r_i, 1-\epsilon, 1+\epsilon)\,A_i\bigr) - \beta\,\mathrm{KL}(\pi_\theta \,\|\, \pi_{\theta_{\mathrm{ref}}})\Bigr]$$

where $r_i = \pi_\theta(a_i \mid s_i)/\pi_{\theta_{\mathrm{old}}}(a_i \mid s_i)$, and $\epsilon$, $\beta$ are hyperparameters. Supervision is exclusively via RL: no cross-entropy loss is imposed on tool steps.
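The reward composition and group normalization above can be sketched in a few lines. This is a minimal illustration of the formulas, not the authors' implementation; function names and the toy values are ours, and the KL term is omitted:

```python
import math

def composite_reward(em, tool_calls, alpha=0.5):
    """R(T) = R_EM(T) + sign(R_EM(T)) * alpha * R_tool(T)."""
    sign = (em > 0) - (em < 0)
    return em + sign * alpha * tool_calls

def group_relative_advantages(rewards):
    """Normalize each trajectory reward by its group's mean and (population) std."""
    m = len(rewards)
    mean = sum(rewards) / m
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / m)
    if std == 0:
        return [0.0] * m
    return [(r - mean) / std for r in rewards]

def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate for one action (KL penalty not shown)."""
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

Note that because $R_{\mathrm{EM}} \in \{0, 1\}$, the sign factor zeroes out the tool bonus whenever the final verdict is wrong, so tool use is rewarded only on trajectories that also adjudicate correctly.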

3. Data Synthesis Pipeline and Supervisory Signals

Reward model training relies on over 27,000 automatically synthesized preference pairs across three domains: Wikipedia QA, scientific surveys, and medical QA.

  • Data generation: Domain-specific documents are gathered; a high-capacity LLM (DeepSeek-V3) is prompted to produce self-contained, document-grounded queries.
  • Label construction: For each query $q$ and reference document, $x^+$ is generated with document access and $x^-$ without, assigning $x^+ \succ x^-$.
  • Scale: ≥9,000 training pairs per domain; evaluation sets cover both in-domain and out-of-domain tasks.
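The contrastive label-construction step can be sketched as follows. This is an illustrative skeleton under our own naming, with `generate` standing in for the synthesis LLM (DeepSeek-V3 in the paper):

```python
def synthesize_pair(query, document, generate):
    """Build one preference pair: the generator sees the reference document
    for the chosen response and answers blind for the rejected one, so
    x+ > x- holds by construction. `generate` is any text-in/text-out LLM call."""
    chosen = generate(f"Document:\n{document}\n\nQuestion: {query}")
    rejected = generate(f"Question: {query}")
    return {"query": query, "chosen": chosen, "rejected": rejected}
```

The design relies on the assumption that document-grounded generations are reliably better than blind ones, which lets labels be assigned without human annotation.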

The reward signal jointly supervises:

  1. Intermediate tool usage ($R_{\mathrm{tool}}$): provides dense, trajectory-level reward for appropriate tool calls.
  2. Final outcome accuracy ($R_{\mathrm{EM}}$): serves as the sparse, ground-truth supervisory anchor.

4. Empirical Results, Integration, and Comparative Performance

OpenRM achieves state-of-the-art results on knowledge-intensive, long-form evaluation. On in-domain tasks (Wikipedia, scientific, medical):

Model                                 Wiki   Scientific   Medical   Avg.
Best direct LLM-judge (GPT-4o)        70.0   48.2         44.0      54.1
Best train-based RM (RM-R1)           55.4   54.8         52.3      54.2
Agentic LLM-judge (GPT-4o + tools)    76.4   58.6         53.4      62.8
OpenRM (27K pairs)                    93.0   90.0         91.0      91.3

On out-of-domain benchmarks (PandaLM, RewardBench), OpenRM outperforms larger-scale reward models even with fewer training samples:

Model      Train Data   PandaLM   RewardBench
RM-R1      72K          72.7      68.3
JudgeLRM   100K         72.3      74.4
RRM        420K         77.7      78.5
OpenRM     27K          79.4      77.7

Integration into LLM pipelines occurs both at inference (pairwise response selection, with majority voting over OpenRM preferences) and at training (filtering trajectories for RLHF with Direct Preference Optimization). Algorithmic schemas for each case are provided in the source (Hu et al., 28 Oct 2025).
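The inference-time selection step can be sketched as a pairwise tournament with majority voting. This is a hedged illustration under our own naming, not the paper's algorithm verbatim; `judge(query, a, b)` stands in for an OpenRM call returning "A" or "B":

```python
from collections import Counter

def pairwise_select(query, responses, judge, n_votes=3):
    """Pick one response by sequential pairwise comparison. Because a sampled
    generative RM is stochastic, each duel is decided by majority over
    n_votes judge calls; the winner advances as the incumbent."""
    best = responses[0]
    for challenger in responses[1:]:
        votes = Counter(judge(query, best, challenger) for _ in range(n_votes))
        if votes["B"] > votes["A"]:
            best = challenger
    return best
```

A tournament of $k$ candidates thus costs $(k-1) \cdot n_{\mathrm{votes}}$ judge calls, each of which may itself trigger several retrieval-tool invocations.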

5. Model Limitations and Open Research Directions

OpenRM is subject to several concrete limitations:

  • Tool dependence: Accuracy is contingent on search tool coverage and retrieval fidelity; noisy or latent evidence can degrade judgment quality.
  • Text-only focus: Current system operates exclusively on textual evidence; extension to multimodal (image, tabular) data remains open.
  • Scalability: Managing additional tools, orchestrating tool selection policies, and introducing persistent memory for evidence pose algorithmic and systems challenges.
  • External bias vulnerability: Reliance on third-party repositories introduces possible bias and error propagation from external sources.
  • Unexplored areas: Open questions include automated construction of tool pipelines, more granular supervisory reward signal structures, and mitigation of external knowledge-induced bias.

6. Technical Summary: Key Formulas and Decision Loop

  • Composite reward:

$$R(T) = R_{\mathrm{EM}}(T) + \mathrm{sign}\bigl(R_{\mathrm{EM}}(T)\bigr)\,\alpha\,R_{\mathrm{tool}}(T)$$

  • Group-relative advantage:

$$A_i = \frac{R(T_i) - \frac{1}{m}\sum_{j=1}^m R(T_j)}{\mathrm{std}(\{R(T_j)\}_{j=1}^m)}$$

  • GRPO objective:

$$J(\theta) = \mathbb{E}\Bigl[\min\bigl(r_i A_i,\ \mathrm{clip}(r_i, 1-\epsilon, 1+\epsilon)\,A_i\bigr) - \beta\,\mathrm{KL}(\pi_\theta \,\|\, \pi_{\theta_{\mathrm{ref}}})\Bigr]$$

  • Decision process (pseudocode):

state = [q, x1, x2]                      # query plus both candidate responses
for i in range(n):                       # tool budget: at most n steps
    a_i = policy.sample(state)           # a_i ~ πθ(· | state)
    if a_i != STOP:                      # tool action: WikiSearch or ArxivSearch
        evidence = exec_tool(a_i, state)
        state.append(evidence)           # retrieved evidence joins the context
    else:                                # “Stop”: terminate and adjudicate
        y = policy.sample_preference(state)  # y ~ πθ(y | state), y ∈ {A, B}
        break

Empirical results, formal methodology, and the technical framework are presented in detail by the OpenReward authors (Hu et al., 28 Oct 2025).
