
OpenRM: Tool-Augmented Reward Model

Updated 19 February 2026
  • OpenRM is a tool-augmented generative reward model that evaluates large language models by integrating external evidence sources.
  • It employs a sequential decision process to dynamically select retrieval tools like Wikipedia and arXiv, ensuring evidence-based adjudication.
  • Trained via Group Relative Policy Optimization on 27K synthesized preference pairs, OpenRM achieves state-of-the-art performance on diverse benchmarks.

OpenRM (OpenReward) is a tool-augmented, generative reward model for the evaluation and alignment of LLMs on knowledge-intensive, long-form, and agentic tasks. Unlike traditional reward models, which rely exclusively on the judge model's parametric knowledge, OpenRM autonomously invokes external evidence sources (notably Wikipedia and arXiv) during adjudication. Evaluation is formulated as a sequential decision process in which the model dynamically selects retrieval tools, integrates retrieved evidence, and ultimately issues a preference judgment. OpenRM is trained using Group Relative Policy Optimization (GRPO) on a large corpus of synthesized preference data, with joint supervision of both tool use and final judgment accuracy, and achieves substantial empirical gains over previous approaches on both in-domain and out-of-domain benchmarks (Hu et al., 28 Oct 2025).

1. Sequential Tool-Augmented Evaluation and Core Architecture

OpenRM operates atop a frozen LLM backbone (Qwen-2.5-7B-Instruct), augmented with two specialized external retrieval tools: Wikipedia Search (2018 dump, ColBERT-v2 index) and arXiv Search (via LitSearch on scientific corpora). Given a query $q$ and two candidate responses $(x_1, x_2)$, the evaluation unfolds as a finite-horizon Markov decision process:

  • State $s_i$: $[q,\, x_1,\, x_2,\, \{(t_j, e_j)\}_{j < i}]$, capturing the query, the two responses, and the sequence of prior tool calls and retrieved evidence.
  • Action $a_i$: either selection of a tool $t_i \in \{\mathrm{WikiSearch}, \mathrm{ArxivSearch}\}$ or the “Stop” action (emitting a final preference $y \in \{\mathrm{A}, \mathrm{B}\}$).
  • Transition: Tool actions invoke external search; evidence is appended to the working context. Termination is triggered when “Stop” is chosen, yielding the model’s final preference verdict.

The policy $\pi_\theta$ samples an action at each step, $t_i \sim \pi_\theta(t \mid s_i)$. Tool invocation is bounded ($n$ steps maximum), ensuring tractability. This process allows OpenRM to perform context-driven evidence gathering, in contrast to both scalar RMs and LLM-as-judge baselines, which lack explicit external querying capacity.

2. Training via Group Relative Policy Optimization (GRPO)

OpenRM is trained with a group-based variant of Proximal Policy Optimization (PPO), termed Group Relative Policy Optimization (GRPO). For each query, $m$ trajectories are sampled, and rewards are defined as a composite function:

$$R(T) = R_{\mathrm{EM}}(T) + \mathrm{sign}\bigl(R_{\mathrm{EM}}(T)\bigr)\,\alpha\,R_{\mathrm{tool}}(T)$$

where:

  • $R_{\mathrm{EM}}(T)$: 1 if the final preference matches ground truth, else 0.
  • $R_{\mathrm{tool}}(T)$: count of appropriate tool calls within the trajectory.
  • $\alpha$: balancing parameter (default 0.5).

The group-relative advantage is computed as

$$A_i = \frac{R(T_i) - \frac{1}{m}\sum_{j=1}^m R(T_j)}{\mathrm{std}(\{R(T_j)\}_{j=1}^m)}$$

The policy-gradient objective with clipping and KL regularization is

$$J(\theta) = \mathbb{E}\Bigl[\min\bigl(r_i A_i,\ \mathrm{clip}(r_i, 1-\epsilon, 1+\epsilon)\,A_i\bigr) - \beta\,\mathrm{KL}(\pi_\theta \,\|\, \pi_{\theta_{\mathrm{ref}}})\Bigr]$$

where $r_i = \pi_\theta(a_i \mid s_i)/\pi_{\theta_{\mathrm{old}}}(a_i \mid s_i)$, and $\epsilon$, $\beta$ are hyperparameters. Supervision is exclusively via RL: no cross-entropy loss is imposed on tool steps.
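The reward composition and group normalization above can be sketched in a few lines. This is a minimal illustration of the formulas, not the authors' implementation; function names and the toy values are ours, and the KL term is omitted:

```python
import math

def composite_reward(em, tool_calls, alpha=0.5):
    """R(T) = R_EM(T) + sign(R_EM(T)) * alpha * R_tool(T)."""
    sign = (em > 0) - (em < 0)
    return em + sign * alpha * tool_calls

def group_relative_advantages(rewards):
    """Normalize each trajectory reward by its group's mean and (population) std."""
    m = len(rewards)
    mean = sum(rewards) / m
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / m)
    if std == 0:
        return [0.0] * m
    return [(r - mean) / std for r in rewards]

def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate for one action (KL penalty not shown)."""
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

Note that because $R_{\mathrm{EM}} \in \{0, 1\}$, the sign factor zeroes out the tool bonus whenever the final verdict is wrong, so tool use is rewarded only on trajectories that also adjudicate correctly.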

3. Data Synthesis Pipeline and Supervisory Signals

Reward model training relies on over 27,000 automatically synthesized preference pairs across three domains: Wikipedia QA, scientific surveys, and medical QA.

  • Data generation: Domain-specific documents are gathered; a high-capacity LLM (DeepSeek-V3) is prompted to produce self-contained, document-grounded queries.
  • Label construction: For each query $q$ and reference document, $x^+$ is generated with document access and $x^-$ without, assigning $x^+ \succ x^-$.
  • Scale: ≥9,000 training pairs per domain; evaluation sets cover both in-domain and out-of-domain tasks.
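The contrastive label-construction step can be sketched as follows. This is an illustrative skeleton under our own naming, with `generate` standing in for the synthesis LLM (DeepSeek-V3 in the paper):

```python
def synthesize_pair(query, document, generate):
    """Build one preference pair: the generator sees the reference document
    for the chosen response and answers blind for the rejected one, so
    x+ > x- holds by construction. `generate` is any text-in/text-out LLM call."""
    chosen = generate(f"Document:\n{document}\n\nQuestion: {query}")
    rejected = generate(f"Question: {query}")
    return {"query": query, "chosen": chosen, "rejected": rejected}
```

The design relies on the assumption that document-grounded generations are reliably better than blind ones, which lets labels be assigned without human annotation.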

The reward signal jointly supervises:

  1. Intermediate tool usage ($R_{\mathrm{tool}}$): provides dense, trajectory-level reward for appropriate tool calls.
  2. Final outcome accuracy ($R_{\mathrm{EM}}$): serves as the sparse, ground-truth supervisory anchor.

4. Empirical Results, Integration, and Comparative Performance

OpenRM achieves state-of-the-art results on knowledge-intensive, long-form evaluation. On in-domain tasks (Wikipedia, scientific, medical):

Model                                 Wiki   Scientific   Medical   Avg.
Best direct LLM-judge (GPT-4o)        70.0   48.2         44.0      54.1
Best train-based RM (RM-R1)           55.4   54.8         52.3      54.2
Agentic LLM-judge (GPT-4o + tools)    76.4   58.6         53.4      62.8
OpenRM (27K pairs)                    93.0   90.0         91.0      91.3

On out-of-domain benchmarks (PandaLM, RewardBench), OpenRM outperforms larger-scale reward models even with fewer training samples:

Model      Train Data   PandaLM   RewardBench
RM-R1      72K          72.7      68.3
JudgeLRM   100K         72.3      74.4
RRM        420K         77.7      78.5
OpenRM     27K          79.4      77.7

Integration into LLM pipelines occurs both at inference (pairwise response selection, with majority voting over OpenRM preferences) and at training (filtering trajectories for RLHF with Direct Preference Optimization). Algorithmic schemas for each case are provided in the source (Hu et al., 28 Oct 2025).
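The inference-time selection step can be sketched as a pairwise tournament with majority voting. This is a hedged illustration under our own naming, not the paper's algorithm verbatim; `judge(query, a, b)` stands in for an OpenRM call returning "A" or "B":

```python
from collections import Counter

def pairwise_select(query, responses, judge, n_votes=3):
    """Pick one response by sequential pairwise comparison. Because a sampled
    generative RM is stochastic, each duel is decided by majority over
    n_votes judge calls; the winner advances as the incumbent."""
    best = responses[0]
    for challenger in responses[1:]:
        votes = Counter(judge(query, best, challenger) for _ in range(n_votes))
        if votes["B"] > votes["A"]:
            best = challenger
    return best
```

A tournament of $k$ candidates thus costs $(k-1) \cdot n_{\mathrm{votes}}$ judge calls, each of which may itself trigger several retrieval-tool invocations.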

5. Model Limitations and Open Research Directions

OpenRM is subject to several concrete limitations:

  • Tool dependence: Accuracy is contingent on search tool coverage and retrieval fidelity; noisy or latent evidence can degrade judgment quality.
  • Text-only focus: Current system operates exclusively on textual evidence; extension to multimodal (image, tabular) data remains open.
  • Scalability: Managing additional tools, orchestrating tool selection policies, and introducing persistent memory for evidence pose algorithmic and systems challenges.
  • External bias vulnerability: Reliance on third-party repositories introduces possible bias and error propagation from external sources.
  • Unexplored areas: Open questions include automated construction of tool pipelines, more granular supervisory reward signal structures, and mitigation of external knowledge-induced bias.

6. Technical Summary: Key Formulas and Decision Loop

  • Composite reward:

$$R(T) = R_{\mathrm{EM}}(T) + \mathrm{sign}\bigl(R_{\mathrm{EM}}(T)\bigr)\,\alpha\,R_{\mathrm{tool}}(T)$$

  • Group-relative advantage:

$$A_i = \frac{R(T_i) - \frac{1}{m}\sum_{j=1}^m R(T_j)}{\mathrm{std}(\{R(T_j)\}_{j=1}^m)}$$

  • GRPO objective:

$$J(\theta) = \mathbb{E}\Bigl[\min\bigl(r_i A_i,\ \mathrm{clip}(r_i, 1-\epsilon, 1+\epsilon)\,A_i\bigr) - \beta\,\mathrm{KL}(\pi_\theta \,\|\, \pi_{\theta_{\mathrm{ref}}})\Bigr]$$

  • Decision process (pseudocode):

state = [q, x1, x2]                      # query plus both candidate responses
for i in range(n):                       # tool budget: at most n steps
    a_i = policy.sample(state)           # a_i ~ πθ(· | state)
    if a_i != STOP:                      # tool action: WikiSearch or ArxivSearch
        evidence = exec_tool(a_i, state)
        state.append(evidence)           # retrieved evidence joins the context
    else:                                # “Stop”: terminate and adjudicate
        y = policy.sample_preference(state)  # y ~ πθ(y | state), y ∈ {A, B}
        break

Empirical results, formal methodology, and the technical framework are presented in detail by the OpenReward authors (Hu et al., 28 Oct 2025).
