- The paper introduces agentic reward modeling, which integrates verifiable correctness signals (like factuality and instruction following) alongside human preferences to create more reliable reward systems for training and using LLMs.
- The proposed RewardAgent architecture includes a Router to select relevant Verification Agents (Factuality and Instruction-Following) and a Judger to combine their signals with base human preference scores.
- Experiments demonstrate that RewardAgent significantly outperforms traditional reward models on relevant benchmarks and improves performance in downstream applications like best-of-n generation and DPO training.
The paper introduces agentic reward modeling, a reward system designed to enhance the reliability of reward models (RMs) used in training and inference with LLMs. It addresses the limitations of existing RMs that primarily focus on human preferences, which can be subjective and overlook verifiable correctness signals such as factuality and instruction following.
The central claim is that integrating verifiable correctness signals into reward modeling can lead to more reliable rewards, thereby improving the performance and trustworthiness of LLMs.
Key aspects of the agentic reward modeling approach:
- It combines traditional reward models based on human preferences with verifiable correctness signals.
- It enhances reliability through multi-dimensional correctness signals.
- It enables flexible integration of diverse verification agents.
- It improves the interpretability of the final reward.
To empirically validate the proposed approach, the authors implement a reward agent called RewardAgent. RewardAgent integrates a conventional human preference-based reward model with correctness signals from two aspects:
- Factuality: Assesses the factual correctness of the claimed facts in the response.
- Instruction-following: Evaluates whether the response adheres to the hard constraints in the instruction, such as length constraints.
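For intuition, here is a minimal sketch of the kind of Python constraint checker the instruction-following agent might generate and execute for a length constraint; the function name and the specific constraint are illustrative assumptions, not code from the paper.

```python
# Hypothetical example of a generated checker for a hard constraint such as
# "answer in at most 50 words"; the agent would produce code like this and
# run it against the response.
def check_word_limit(response: str, max_words: int = 50) -> bool:
    """Return True if the response satisfies the word-count constraint."""
    return len(response.split()) <= max_words

response = "Agentic reward modeling combines preference scores with verifiable correctness signals."
print(check_word_limit(response))  # True: well under 50 words
```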
The architecture of RewardAgent includes three modules:
- Router: Analyzes the instruction to determine the appropriate verification agents to invoke.
- Verification Agents: Evaluate the correctness of responses in different aspects, including factuality and instruction-following. The factuality agent uses pairwise comparison, query generation, evidence generation (using either a search engine or the model's parametric knowledge), and verification. The instruction-following agent extracts hard constraints, generates constraint checker code (Python code script), and executes the code for verification.
- Judger: Integrates the correctness signals from the verification agents and human preference scores from the reward models to provide an overall reward score.
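A minimal sketch of how the three modules could be wired together is shown below. Class and function names are hypothetical; in the paper each component is implemented with an LLM backbone rather than the fixed rules used here.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical verification agents: each maps (instruction, response) to a
# correctness score. In the paper these are LLM-based agents (factuality
# verification and constraint checking), not constant functions.
VERIFICATION_AGENTS: Dict[str, Callable[[str, str], float]] = {
    "factuality": lambda x, y: 1.0,             # placeholder scorer
    "instruction_following": lambda x, y: 1.0,  # placeholder scorer
}

def router(instruction: str) -> List[str]:
    """Pick the verification agents relevant to this instruction.
    The paper's Router is itself an LLM; this keyword rule is only illustrative."""
    selected = ["factuality"]
    if any(kw in instruction.lower() for kw in ("words", "sentences", "paragraphs", "format")):
        selected.append("instruction_following")
    return selected

def judger(instruction: str, response: str, rm_score: float,
           lam: float = 1.0, weights: Optional[Dict[str, float]] = None) -> float:
    """Combine the base reward model score with the selected agents' signals."""
    weights = weights or {name: 1.0 for name in VERIFICATION_AGENTS}
    correctness = sum(
        weights[name] * VERIFICATION_AGENTS[name](instruction, response)
        for name in router(instruction)
    )
    return lam * rm_score + correctness

# Example call with a made-up base reward score.
print(judger("Summarize DPO in at most 100 words.", "DPO is ...", rm_score=0.42))
```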
The base reward model for computing human preference scores in RewardAgent is ArmoRM. GPT-4o mini and Llama3-8B Instruct serve as the backbone LLMs for all modules, yielding two variants of RewardAgent. The instruction-following agent uses Qwen2.5-Coder 7B as its LLM backbone for code generation.
The paper evaluates the effectiveness of RewardAgent through comprehensive experiments:
- Evaluation on reward model benchmarks: RewardAgent is evaluated on RM-Bench and JudgeBench, which contain response pairs involving factual correctness, and on IFBench, a newly constructed instruction-following benchmark of 444 instances. Each IFBench instance consists of an instruction with several hard constraints, a chosen response that satisfies all constraints, and a rejected response that violates some of them. RewardAgent significantly outperforms other advanced reward models on these benchmarks.
- Application in real-world downstream tasks: RewardAgent is applied to inference-time best-of-n search and constructing training preference pairs.
- Best-of-n search: Evaluated on the factuality question answering dataset TriviaQA and the instruction-following datasets IFEval and CELLO, using Llama3-8B Instruct and GPT-4o as policy models to generate 32 responses per instruction at a sampling temperature of 1.0. RewardAgent significantly outperforms the base reward model ArmoRM in best-of-n search (see the selection sketch after this list).
- Training with Direct Preference Optimization (DPO): RewardAgent is used to construct training preference pairs from UltraFeedback and on-policy data. Zephyr-7B is adopted as the policy model and trained with DPO. The LLM trained on RewardAgent-constructed data consistently outperforms those trained on ArmoRM annotations across several NLP benchmarks.
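A minimal sketch of best-of-n selection with a reward function such as RewardAgent's score; the policy and reward callables here are mocks for illustration only.

```python
import random

def best_of_n(instruction, policy_sample, reward, n=32):
    """Sample n candidate responses and return the one with the highest reward."""
    candidates = [policy_sample(instruction) for _ in range(n)]
    return max(candidates, key=lambda y: reward(instruction, y))

# Mock policy and reward; in the experiments the policy is an LLM
# (e.g., Llama3-8B Instruct) and the reward is RewardAgent's overall score.
mock_policy = lambda x: f"candidate answer {random.randint(0, 9999)}"
mock_reward = lambda x, y: random.random()
print(best_of_n("Who wrote 'On the Origin of Species'?", mock_policy, mock_reward, n=8))
```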
The fundamental concept of agentic reward modeling is formulated as:
$$r(x, y) = \lambda \cdot r_{\text{RM}}(x, y) + \sum_{i \in \mathcal{A}_x} w_i \cdot a_i(x, y)$$

Where:
- $r(x, y)$ is the overall reward score for a given instruction $x$ and response $y$.
- $\lambda$ is the weight of the base reward model.
- $r_{\text{RM}}(x, y)$ is the reward score from the base reward model.
- $\mathcal{A}_x$ is an index subset of the complete set of verification agents $\mathcal{A}$, determined based on the instruction $x$.
- $w_i$ is the weight for each verification agent.
- $a_i(x, y)$ is the verifiable correctness signal provided by verification agent $i$.
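As a made-up numeric instantiation (values are illustrative only): with $\lambda = 1.0$, a base reward $r_{\text{RM}}(x, y) = 0.6$, and both agents selected with weights $w_i = 0.5$, where the factuality agent returns $1.0$ and the instruction-following agent returns $0.0$ (a violated constraint), the overall reward is

$$r(x, y) = 1.0 \cdot 0.6 + 0.5 \cdot 1.0 + 0.5 \cdot 0.0 = 1.1$$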
The ablation study shows that removing the well-designed verification agents leads to a significant performance decrease. The oracle-setting results demonstrate the effectiveness of the verification agents and suggest that the Router in RewardAgent still has considerable room for improvement.
In the best-of-n experiments, RewardAgent significantly improves performance over the base reward model ArmoRM, and the oracle setting improves the results further.
In DPO training, LLMs trained with data constructed by RewardAgent generally outperform those trained with ArmoRM, especially on the factuality question answering and instruction-following datasets. Furthermore, models trained with RewardAgent-annotated data consistently outperform those trained on the original UltraFeedback, whose annotations were constructed with GPT-4.
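For illustration, a minimal sketch of how a reward scorer such as RewardAgent could be used to build the preference pairs described above; the function name and data layout are assumptions, not the paper's code.

```python
def build_preference_pair(instruction, responses, reward):
    """Score sampled responses and keep the highest-scoring one as 'chosen'
    and the lowest-scoring one as 'rejected' for DPO training."""
    scored = sorted(responses, key=lambda y: reward(instruction, y))
    return {"prompt": instruction, "chosen": scored[-1], "rejected": scored[0]}

# Example with a mock reward; in practice the reward is RewardAgent's score
# over on-policy samples or UltraFeedback candidate responses.
pair = build_preference_pair(
    "List three causes of the French Revolution.",
    ["response A", "response B", "response C"],
    reward=lambda x, y: len(y),  # mock scorer
)
print(pair)
```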