Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning

Published 20 May 2025 in cs.IR | (2505.14069v2)

Abstract: Retrieval-augmented generation (RAG) enhances the text generation capabilities of LLMs by integrating external knowledge and up-to-date information. However, traditional RAG systems are limited by static workflows and lack the adaptability required for multistep reasoning and complex task management. To address these limitations, agentic RAG systems (e.g., DeepResearch) have been proposed, enabling dynamic retrieval strategies, iterative context refinement, and adaptive workflows for handling complex search queries beyond the capabilities of conventional RAG. Recent advances, such as Search-R1, have demonstrated promising gains using outcome-based reinforcement learning, where the correctness of the final answer serves as the reward signal. Nevertheless, such outcome-supervised agentic RAG methods face challenges including low exploration efficiency, gradient conflict, and sparse reward signals. To overcome these challenges, we propose to utilize fine-grained, process-level rewards to improve training stability, reduce computational costs, and enhance efficiency. Specifically, we introduce a novel method ReasonRAG that automatically constructs RAG-ProGuide, a high-quality dataset providing process-level rewards for (i) query generation, (ii) evidence extraction, and (iii) answer generation, thereby enhancing model inherent capabilities via process-supervised reinforcement learning. With the process-level policy optimization, the proposed framework empowers LLMs to autonomously invoke search, generate queries, extract relevant evidence, and produce final answers. Compared to existing approaches such as Search-R1 and traditional RAG systems, ReasonRAG, leveraging RAG-ProGuide, achieves superior performance on five benchmark datasets using only 5k training instances, significantly fewer than the 90k training instances required by Search-R1.

Abstract PDF Upgrade to Chat

Authors (12)

Summary

The paper introduces ReasonRAG, a reinforcement learning framework that leverages process-level rewards to optimize agentic RAG systems.
It employs Monte Carlo Tree Search and Shortest Path Reward Estimation to efficiently explore decision spaces and evaluate reasoning trajectories.
Experimental results demonstrate that ReasonRAG improves training stability, reduces computational costs, and outperforms traditional outcome-supervised methods.

Process vs. Outcome Reward in Agentic RAG Reinforcement Learning

The paper "Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning" explores a novel approach to reinforcement learning (RL) in retrieval-augmented generation (RAG) systems. It introduces the ReasonRAG method, which utilizes process-level rewards to address challenges in agentic RAG systems, offering improvements over traditional outcome-supervised RL methods.

Introduction to ReasonRAG

The integration of external knowledge sources in retrieval-augmented generation (RAG) systems enhances the capabilities of LLMs by allowing them to incorporate accurate, up-to-date information. Traditional RAG systems often utilize static workflows that limit their adaptability in complex multi-step reasoning tasks. Agentic RAG systems mitigate these limitations through dynamic retrieval strategies and adaptive workflows, managing intricate search queries more adeptly.

ReasonRAG is introduced as a reinforcement learning framework utilizing process-level rewards. This approach aims to enhance training stability, reduce computational costs, and improve efficiency by employing a novel dataset called RAG-ProGuide. This dataset provides process-level supervision in query generation, evidence extraction, and answer generation, facilitating improved model capabilities via process-supervised reinforcement learning.

Framework and Implementation

ReasonRAG employs Monte Carlo Tree Search (MCTS) for efficient exploration of decision spaces and identification of high-reward intermediate steps in RAG processes.

Figure 1: Framework of ReasonRAG, illustrating policy optimization and inference example.

Shortest Path Reward Estimation (SPRE)

A key innovation in ReasonRAG is SPRE, which offers fine-grained reward estimation at each reasoning step to optimize efficiency. It simulates possible outcomes of partial reasoning sequences and applies penalties for unnecessarily long trajectories, enabling the model to choose paths that lead directly to correct answers.

Data Generation and MCTS Exploration

RAG-ProGuide dataset construction utilizes MCTS to explore potential reasoning trajectories. This process collects high-quality supervision data for enhancing agentic RAG policy optimization. MCTS assists in efficiently navigating the vast decision space by selectively expanding promising reasoning paths.

Performance Evaluation

ReasonRAG demonstrates substantial performance improvements over existing methods, such as Search-R1, across multiple benchmarks. It achieves these results using significantly fewer training instances, exemplifying the efficiency of process-supervised RL.

Figure 2: Training cost and convergence speed comparison for ReasonRAG and Search-R1.

Retrieval Iterations and Document Effectiveness

The model's performance improves with the number of retrieval iterations up to a saturation point, highlighting ReasonRAG's adaptive reasoning capabilities. Additionally, the number of top-k retrieved documents impacts performance, with more retrieved documents contributing to enhanced accuracy in multi-hop reasoning scenarios.

Figure 3: EM performance across varying retrieval iterations on benchmarks.

Figure 4: Effect of top-k retrieved documents on ReasonRAG's performance across datasets.

Conclusion

ReasonRAG signifies a meaningful advancement in agentic RAG systems by leveraging process-level rewards for efficient reinforcement learning. This innovative approach provides nuanced guidance for policy optimization, displaying robust improvements in both in-domain and out-of-domain generalization tasks. The reduction in computational requirements and data efficiency further underscores the potential of process-supervised RL in enhancing real-world applications of RAG systems. This research contributes towards evolving methodologies in AI that integrate external knowledge sources dynamically for advanced problem-solving. Overall, ReasonRAG's methodology and findings hold significant implications for future developments in adaptive RAG frameworks, underscoring the importance of optimizing not just the outcome, but the journey to it within AI systems.

Markdown Report Issue