OpenRM: Tool-Augmented Reward Model
- OpenRM is a tool-augmented generative reward model that evaluates large language models by integrating external evidence sources.
- It employs a sequential decision process to dynamically select retrieval tools like Wikipedia and arXiv, ensuring evidence-based adjudication.
- Trained via Group Relative Policy Optimization on 27K synthesized preference pairs, OpenRM achieves state-of-the-art performance on diverse benchmarks.
OpenRM (OpenReward) is a tool-augmented, generative reward model for the evaluation and alignment of LLMs on knowledge-intensive, long-form, and agentic tasks. Unlike traditional reward models that operate exclusively on the LLM’s internal state, OpenRM autonomously invokes external evidence sources (notably Wikipedia and arXiv) during the adjudication process. Evaluation is formulated as a sequential decision process wherein the model dynamically selects retrieval tools, integrates retrieved evidence, and ultimately issues a preference judgment. OpenRM is trained using Group Relative Policy Optimization (GRPO) on a large corpus of synthesized preference data, with joint supervision of both tool use and final judgment accuracy, and achieves substantial empirical gains over previous approaches in both in-domain and out-of-domain benchmarks (Hu et al., 28 Oct 2025).
1. Sequential Tool-Augmented Evaluation and Core Architecture
OpenRM operates atop a frozen LLM backbone (Qwen-2.5-7B-Instruct), augmented with two specialized external retrieval tools: Wikipedia Search (2018 dump, ColBERT-v2 index) and arXiv Search (via LitSearch on scientific corpora). Given a query $q$ and two candidate responses $x_1$ and $x_2$, the evaluation unfolds as a finite-horizon Markov decision process:
- State $s_i$: the tuple $(q, x_1, x_2, (a_1, e_1), \dots, (a_{i-1}, e_{i-1}))$, capturing the query, the two responses, and the sequence of prior tool calls and retrieved evidence.
- Action $a_i$: either the selection of a tool or the “Stop” action (emitting a final preference $y$).
- Transition: Tool actions invoke external search; evidence is appended to the working context. Termination is triggered when “Stop” is chosen, yielding the model’s final preference verdict.
The policy $\pi_\theta$ samples an action at each step; tool invocation is bounded (at most $n$ steps), ensuring tractability. This process allows OpenRM to perform context-driven evidence gathering, in contrast to both scalar RMs and LLM-as-judge baselines, which lack explicit external querying capacity.
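The two-tool setup can be sketched as a simple dispatch table. This is a minimal illustration, not the authors' API: the function names, registry layout, and `max_passages` parameter are assumptions, and the actual retrieval backends (ColBERT-v2 over the 2018 Wikipedia dump; LitSearch over arXiv) are stubbed out.

```python
from typing import Callable, Dict, List

# Illustrative tool registry: each tool maps a query string to a list
# of retrieved evidence passages. The real backends are stubbed.
def wikipedia_search(query: str) -> List[str]:
    raise NotImplementedError  # ColBERT-v2 lookup over the 2018 Wikipedia dump

def arxiv_search(query: str) -> List[str]:
    raise NotImplementedError  # LitSearch retrieval over scientific corpora

TOOLS: Dict[str, Callable[[str], List[str]]] = {
    "wikipedia_search": wikipedia_search,
    "arxiv_search": arxiv_search,
}

def exec_tool(action: str, query: str, max_passages: int = 3) -> List[str]:
    """Dispatch a tool action and truncate the evidence it returns."""
    return TOOLS[action](query)[:max_passages]
```

Because tools share one signature, adding a new evidence source is a one-line registry entry rather than a change to the decision loop.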
2. Training via Group Relative Policy Optimization (GRPO)
OpenRM is trained with a group-based variant of Proximal Policy Optimization (PPO), termed Group Relative Policy Optimization (GRPO). For each query, a group of $G$ trajectories is sampled, and the reward of trajectory $k$ is defined as the composite $R_k = R_{\text{judge}} + \alpha \cdot R_{\text{tool}}$, where:
- $R_{\text{judge}}$: 1 if the final preference matches the ground truth, else 0.
- $R_{\text{tool}}$: count of appropriate tool calls within the trajectory.
- $\alpha$: balancing parameter (default 0.5).
The group-relative advantage is computed as $\hat{A}_k = \big(R_k - \operatorname{mean}(\{R_j\}_{j=1}^{G})\big) / \operatorname{std}(\{R_j\}_{j=1}^{G})$. The policy-gradient objective with clipping and KL regularization is $\mathcal{J}(\theta) = \mathbb{E}\big[\tfrac{1}{G}\sum_{k=1}^{G}\min\big(r_k(\theta)\hat{A}_k,\ \operatorname{clip}(r_k(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_k\big)\big] - \beta\,\mathbb{D}_{\mathrm{KL}}[\pi_\theta \,\|\, \pi_{\mathrm{ref}}]$, where $r_k(\theta)$ is the importance ratio between the current and sampling policies, and $\epsilon$ (clipping range) and $\beta$ (KL weight) are hyperparameters. Supervision is exclusively via RL: no cross-entropy loss is imposed on tool steps.
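The composite reward and group-relative advantage can be illustrated numerically. This is a minimal sketch under the assumption $R = R_{\text{judge}} + \alpha \cdot R_{\text{tool}}$ with standard per-group normalization; the function names and the example group are illustrative.

```python
import statistics

ALPHA = 0.5  # balancing parameter between judgment and tool rewards

def composite_reward(judge_correct: bool, n_good_tool_calls: int,
                     alpha: float = ALPHA) -> float:
    """R = R_judge + alpha * R_tool for a single trajectory."""
    r_judge = 1.0 if judge_correct else 0.0
    return r_judge + alpha * n_good_tool_calls

def group_relative_advantages(rewards: list) -> list:
    """GRPO advantage: normalize each reward against its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example group of 4 trajectories sampled for one query:
# (correct final judgment?, number of appropriate tool calls)
group = [(True, 2), (True, 1), (False, 1), (False, 0)]
rewards = [composite_reward(c, t) for c, t in group]     # [2.0, 1.5, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
```

The normalization means a trajectory is rewarded only relative to its siblings for the same query, so no learned value function (critic) is needed.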
3. Data Synthesis Pipeline and Supervisory Signals
Reward model training relies on over 27,000 automatically synthesized preference pairs across three domains: Wikipedia QA, scientific surveys, and medical QA.
- Data generation: Domain-specific documents are gathered; a high-capacity LLM (DeepSeek-V3) is prompted to produce self-contained, document-grounded queries.
- Label construction: For each query and reference document, a preferred response $x^{+}$ is generated with document access and a dispreferred response $x^{-}$ without, assigning the preference $x^{+} \succ x^{-}$.
- Scale: ≥9,000 training pairs per domain; evaluation sets cover both in-domain and out-of-domain tasks.
The reward signal jointly supervises:
- Intermediate tool usage ($R_{\text{tool}}$): provides dense, trajectory-level reward for appropriate tool calls.
- Final outcome accuracy ($R_{\text{judge}}$): serves as the sparse, ground-truth supervisory anchor.
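The document-grounded labeling scheme can be sketched as follows. This is an illustrative assumption about the pipeline's shape, not the authors' code: `generate` stands in for a call to the synthesis LLM (DeepSeek-V3 in the paper), and the prompt formats are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    query: str
    chosen: str    # response generated WITH access to the reference document
    rejected: str  # response generated WITHOUT document access

def build_pair(query: str, doc: str, generate) -> PreferencePair:
    """Synthesize one preference pair; `generate` is a hypothetical
    LLM call (prompt string in, response string out)."""
    with_doc = generate(f"Document:\n{doc}\n\nQuestion: {query}")
    without_doc = generate(f"Question: {query}")
    return PreferencePair(query, chosen=with_doc, rejected=without_doc)
```

The design choice is that document access is the only difference between the two generations, so the preference label reflects grounding rather than style.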
4. Empirical Results, Integration, and Comparative Performance
OpenRM achieves state-of-the-art results on knowledge-intensive, long-form evaluation. On in-domain tasks (Wikipedia, scientific, medical):
| Model | Wiki | Scientific | Medical | Avg. |
|---|---|---|---|---|
| Best direct LLM-judge (GPT-4o) | 70.0 | 48.2 | 44.0 | 54.1 |
| Best train-based RM (RM-R1) | 55.4 | 54.8 | 52.3 | 54.2 |
| Agentic LLM-judge (GPT-4o+tools) | 76.4 | 58.6 | 53.4 | 62.8 |
| OpenRM (27K pairs) | 93.0 | 90.0 | 91.0 | 91.3 |
On out-of-domain benchmarks, OpenRM surpasses larger-scale reward models on PandaLM and remains competitive on RewardBench, despite training on far fewer preference pairs:
| Model | Train Data | PandaLM | RewardBench |
|---|---|---|---|
| RM-R1 | 72k | 72.7 | 68.3 |
| JudgeLRM | 100k | 72.3 | 74.4 |
| RRM | 420k | 77.7 | 78.5 |
| OpenRM | 27k | 79.4 | 77.7 |
Integration into LLM pipelines occurs both at inference time (pairwise response selection via majority vote over OpenRM preferences) and at training time (filtering trajectories for RLHF using Direct Preference Optimization). Algorithmic schemas for each case are provided in the source (Hu et al., 28 Oct 2025).
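The inference-time integration can be sketched as a pairwise selection routine. This is a minimal illustration under stated assumptions: `judge(query, first, second)` is a hypothetical wrapper around an OpenRM call that returns the preferred of its two response arguments, and the order-swapping heuristic for position bias is a common practice, not a detail confirmed by the source.

```python
from collections import Counter

def select_response(query, cand_a, cand_b, judge, n_votes=5):
    """Pick the winner of a pairwise comparison by majority vote.

    `judge` is a hypothetical reward-model call; candidate order is
    alternated across votes to reduce position bias.
    """
    votes = Counter()
    for i in range(n_votes):
        first, second = (cand_a, cand_b) if i % 2 == 0 else (cand_b, cand_a)
        votes[judge(query, first, second)] += 1
    return votes.most_common(1)[0][0]
```

The same routine can serve as a filter for DPO-style training data: keep the winner as the chosen response and the loser as the rejected one.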
5. Model Limitations and Open Research Directions
OpenRM is subject to several concrete limitations:
- Tool dependence: Accuracy is contingent on search tool coverage and retrieval fidelity; noisy or incomplete evidence can degrade judgment quality.
- Text-only focus: Current system operates exclusively on textual evidence; extension to multimodal (image, tabular) data remains open.
- Scalability: Managing additional tools, orchestrating tool selection policies, and introducing persistent memory for evidence pose algorithmic and systems challenges.
- External bias vulnerability: Reliance on third-party repositories introduces possible bias and error propagation from external sources.
- Unexplored areas: Open questions include automated construction of tool pipelines, more granular supervisory reward signal structures, and mitigation of external knowledge-induced bias.
6. Technical Summary: Key Formulas and Decision Loop
- Composite reward: $R_k = R_{\text{judge}} + \alpha \cdot R_{\text{tool}}$
- Group-relative advantage: $\hat{A}_k = \big(R_k - \operatorname{mean}(\{R_j\}_{j=1}^{G})\big) / \operatorname{std}(\{R_j\}_{j=1}^{G})$
- GRPO objective: $\mathcal{J}(\theta) = \mathbb{E}\big[\tfrac{1}{G}\sum_{k=1}^{G}\min\big(r_k(\theta)\hat{A}_k,\ \operatorname{clip}(r_k(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_k\big)\big] - \beta\,\mathbb{D}_{\mathrm{KL}}[\pi_\theta \,\|\, \pi_{\mathrm{ref}}]$
- Decision process (pseudocode):
```
state = [q, x1, x2]
for i in 1...n:
    a_i ~ πθ(· | state)
    if a_i is a tool:
        evidence = exec_tool(a_i, state)
        state.append(evidence)
    else:              # Stop action
        y ~ πθ(y | state)
        break
```
Empirical results, formal methodology, and the technical framework are presented in detail by the OpenReward authors (Hu et al., 28 Oct 2025).