- The paper presents a novel framework that leverages RL-derived reward functions as effective process reward models to guide search algorithms.
- It integrates AIRL and GRPO to train policies with dense rewards, eliminating the need for costly labeled data.
- Experiments across eight benchmarks show a 9% average boost in LLM reasoning performance over the base model, with the learned PRM outperforming baseline PRMs when used to guide search.
Unifying Reinforcement Learning (RL) and Search-Based Test-Time Scaling (TTS) for LLMs
The paper "Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS" presents a pioneering methodology that integrates RL-based and search-based TTS for improving the reasoning performance of LLMs. This work introduces a new framework that utilizes the reward function obtained from RL as an effective process reward model (PRM) for guiding search algorithms at test time.
Method Overview
The core idea is to repurpose the reward function, which is learned as a byproduct of RL training, as a PRM for guiding search procedures at test time. This unification is implemented by combining Adversarial Inverse Reinforcement Learning (AIRL) with Group Relative Policy Optimization (GRPO). The paper demonstrates that the learned reward model serves dual purposes: it provides dense rewards during training and guides search strategies during inference, reducing the dependence on static PRMs that require extensive labeled step-level data.
Figure 1: Overview of the unification framework. During training, the AIRL discriminator learns a PRM while the policy is optimized with both dense rewards from AIRL and outcome rewards from GRPO. At test time, the trained policy and PRM jointly guide downstream search algorithms.
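To make the dual role concrete, here is a minimal sketch (not the authors' code) of how an AIRL-style discriminator logit can act as both a dense per-step reward and a PRM score. The `StepScorer` module, feature dimensions, and the assumption that states and candidate steps are encoded as fixed-size vectors are all illustrative.

```python
# Sketch: an AIRL-style scorer whose logit f(s, a) doubles as a step-level PRM.
import torch
import torch.nn as nn

class StepScorer(nn.Module):
    """f(s, a): scores a (partial reasoning state, candidate step) pair."""
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, state_feat: torch.Tensor, step_feat: torch.Tensor) -> torch.Tensor:
        return self.f(torch.cat([state_feat, step_feat], dim=-1)).squeeze(-1)

def airl_discriminator_logit(f_sa: torch.Tensor, log_pi_a: torch.Tensor) -> torch.Tensor:
    # AIRL parameterizes the discriminator as
    #   D(s, a) = exp(f(s, a)) / (exp(f(s, a)) + pi(a|s)),
    # so its logit is f(s, a) - log pi(a|s); training D with a binary
    # cross-entropy loss on expert vs. policy steps shapes f.
    return f_sa - log_pi_a

def dense_step_reward(f_sa: torch.Tensor, log_pi_a: torch.Tensor) -> torch.Tensor:
    # r(s, a) = log D - log(1 - D) reduces to the same quantity, so the
    # trained scorer yields dense per-step rewards during RL and can be
    # reused as a PRM to rank candidate steps at test time.
    return f_sa - log_pi_a
```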
Experimental Results
The experiments, conducted on eight benchmarks spanning mathematics, scientific reasoning, and code generation, show that the unified approach improves model performance by 9% on average over the base model. Furthermore, when integrated into various search algorithms, the learned PRM consistently outperforms all baseline PRMs trained with labeled data.
Figure 2: Average performance of four PRMs applied to four generative LLMs using Best-of-N with 64 rollouts on AIME2024, AMC, and MATH500. The AIRL-S-PRM consistently delivers the highest test-time search performance.
Implementation Details
Training Process
The RL training process uses AIRL to learn a step-wise PRM directly from correct reasoning traces. This eliminates the need for labeled intermediate-step data, reducing annotation costs and avoiding the reward-hacking risks associated with static PRMs. The policy model is updated using a combined objective derived from AIRL and GRPO.
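The following sketch illustrates one way the two signals could be blended, assuming per-step dense rewards from the AIRL scorer and a single outcome reward per sampled trace; the blending weight `alpha` and the tensor shapes are illustrative assumptions, not details from the paper.

```python
# Sketch: combining AIRL dense step rewards with GRPO outcome advantages.
import torch

def grpo_advantages(outcome_rewards: torch.Tensor) -> torch.Tensor:
    # Group-relative advantage: normalize each rollout's outcome reward
    # against the other rollouts sampled for the same prompt.
    mean, std = outcome_rewards.mean(), outcome_rewards.std()
    return (outcome_rewards - mean) / (std + 1e-8)

def combined_step_signal(dense_step_rewards: torch.Tensor,   # [num_traces, num_steps]
                         outcome_rewards: torch.Tensor,       # [num_traces]
                         alpha: float = 0.5) -> torch.Tensor:
    group_adv = grpo_advantages(outcome_rewards)
    # Broadcast the trace-level advantage to every step and blend it with the
    # per-step dense reward; the policy gradient then uses this per-step signal.
    return alpha * dense_step_rewards + (1 - alpha) * group_adv.unsqueeze(-1)
```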
Test-Time Search
At inference, the learned PRM guides search procedures such as Monte Carlo Tree Search (MCTS), beam search, and Best-of-N sampling, scoring candidate steps in real time so the policy model can extend its reasoning chains more reliably.
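As a rough illustration of PRM-guided search, the sketch below runs a beam search over reasoning steps; `propose_steps` (the policy's step generator) and `prm_score` (the learned PRM applied to a partial chain) are hypothetical stand-ins for the actual model calls.

```python
# Sketch: beam search over reasoning steps, ranked by a learned PRM.
from typing import Callable, List, Tuple

def prm_beam_search(question: str,
                    propose_steps: Callable[[str, List[str]], List[str]],
                    prm_score: Callable[[str, List[str]], float],
                    beam_width: int = 4,
                    max_steps: int = 8) -> List[str]:
    # Each beam is a (partial reasoning chain, cumulative PRM score) pair.
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for _ in range(max_steps):
        expanded = []
        for chain, score in beams:
            for step in propose_steps(question, chain):
                new_chain = chain + [step]
                # Score the extended partial chain with the PRM; higher is better.
                expanded.append((new_chain, score + prm_score(question, new_chain)))
        if not expanded:
            break
        # Keep only the top-scoring partial chains.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]  # best chain under the PRM
```

Best-of-N sampling is the degenerate case of the same idea: generate N complete chains and return the one the PRM scores highest.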

Figure 3: Comparison of test-time search performance with the PRM applied to MCTS, Beam Search, and Best-of-N across varying rollout counts. The PRM consistently improves performance for all search techniques.
Implications and Future Work
The implications of this work extend to several areas:
- Scientific and Educational Tools: The integrated framework can serve as a foundation for developing advanced educational AI tools and scientific computing platforms, lowering barriers to access in under-resourced regions.
- Software Development: In software engineering, incorporating such models could lead to more efficient and reliable code generation systems.
- Future Research Directions: Expanding this unification framework to other architectures and broader datasets could address scalability and generalization, potentially paving the way for more robust AI systems.
Conclusion
This paper establishes a novel unification of RL-based and search-based TTS methodologies, demonstrating practical improvements in LLM performance across diverse reasoning tasks. By repurposing the RL-derived reward function as a versatile PRM, the approach not only enhances inference capabilities but also reduces the reliance on costly labeled datasets, thus offering a cost-efficient and scalable solution for TTS in LLMs. Future work could explore scalability aspects and test the approach on more diverse architectures and applications.