- The paper introduces a modular framework that separates a trainable search planner from a frozen answer generator, enhancing efficiency and scalability.
- It employs a dual-reward mechanism aligning outcome quality and process coherence to achieve a Pareto-optimal balance between utility and cost.
- Empirical results show a 10.76% improvement over non-planning baselines, demonstrating strong generalization across various models and datasets.
Modular Agentic Search via Pareto-Optimal Multi-Objective Reinforcement Learning: The AI-SearchPlanner Framework
Motivation and Background
The integration of large language models (LLMs) with search engines has become a central paradigm for knowledge-intensive reasoning tasks. While retrieval-augmented generation (RAG) architectures have enabled LLMs to access external information, their static retrieval mechanisms and lack of adaptive planning limit performance, especially for multi-hop and compositional queries. Reinforcement learning (RL) based approaches have recently shown promise in enabling LLMs to learn search strategies through reward-driven multi-turn interactions. However, most prior RL-based search agents employ a monolithic LLM for both search planning and question answering (QA), constraining the optimization of each component and impeding scalability in real-world deployments where frozen, large LLMs are preferred for QA.
Framework Overview and Architectural Decoupling
AI-SearchPlanner introduces a modular agentic search framework that explicitly decouples the search planner and the answer generator. The search planner is a small, trainable LLM responsible for orchestrating multi-turn search interactions, while the generator is a large, frozen LLM dedicated to answer synthesis. This separation enables efficient training and deployment, allowing the planner to specialize in search strategy optimization and the generator to leverage its pre-trained QA capabilities.
Figure 1: Overview of the AI-SearchPlanner framework.
The planner iteratively decides whether to issue sub-queries to the search engine or to terminate and invoke the generator with the accumulated context. This modularity supports plug-and-play integration with various frozen LLMs and retrievers, facilitating transferability across domains and models.
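A minimal sketch of this decoupled loop is shown below, assuming hypothetical interfaces; the `plan_next_action`, `search`, and `generate_answer` methods and the action format are illustrative placeholders, not the paper's actual API.

```python
# Minimal sketch of the decoupled planner-generator loop, assuming
# hypothetical interfaces for the trainable planner, the search engine,
# and the frozen generator. Names are illustrative, not the paper's API.

def agentic_search(question, planner, search_engine, generator, max_turns=8):
    context = []  # accumulated (sub-query, retrieved documents) pairs
    for _ in range(max_turns):
        # The small, trainable planner decides the next action:
        # either a new sub-query or a terminate signal.
        action = planner.plan_next_action(question, context)
        if action["type"] == "terminate":
            break
        docs = search_engine.search(action["sub_query"])
        context.append({"sub_query": action["sub_query"], "docs": docs})
    # The large, frozen generator synthesizes the answer from the
    # accumulated search context; its weights are never updated.
    return generator.generate_answer(question, context)
```

Because the planner and generator interact only through this loop, either component can be swapped without retraining the other.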
Dual-Reward Alignment for Search Planning
A key innovation is the dual-reward mechanism for search planning, which aligns the planner's policy with both outcome-level and process-level objectives:
- Outcome Reward quantifies the performance gain of search planning over non-planning baselines (direct inference, naive RAG), using an LLM-based scoring function for answer correctness.
- Process Reward evaluates the rationality and coherence of the search trajectory, penalizing unclear references, redundant queries, and inefficient strategies via a prompt-based rubric.
The combined utility reward ensures that the planner not only improves answer accuracy but also generates interpretable, efficient search trajectories.
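A rough sketch of how such a utility reward could be assembled is given below; the `judge` scorer, its methods, and the unweighted sum of the two terms are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative sketch of the dual-reward utility signal, assuming
# hypothetical LLM-judge scorers. The exact prompts, scales, and
# weighting used in the paper are not reproduced here.

def utility_reward(question, answer, trajectory, baseline_answers, judge):
    # Outcome reward: gain of the planned answer over the best
    # non-planning baseline (direct inference, naive RAG), as judged
    # by an LLM-based correctness scorer in [0, 1].
    planned_score = judge.score_answer(question, answer)
    baseline_score = max(judge.score_answer(question, b) for b in baseline_answers)
    outcome_reward = planned_score - baseline_score

    # Process reward: prompt-based rubric scoring of the trajectory's
    # coherence (clear references, no redundant or aimless sub-queries).
    process_reward = judge.score_trajectory(question, trajectory)

    return outcome_reward + process_reward
```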
Pareto Optimization of Utility and Cost
AI-SearchPlanner formalizes search planning as a multi-objective RL problem, balancing planning utility (answer quality, trajectory rationality) against planning cost (number of search turns, sub-query frequency). The overall reward is:
$$R_{\text{pareto}} = R_{\text{utility}} + \alpha \cdot R_{\text{cost}} + R_{\text{format}}$$
where α controls the trade-off between utility and cost, and $R_{\text{format}}$ rewards correctly formatted outputs. By varying α, the framework explores the Pareto frontier, enabling practitioners to select operating points that optimize for either performance or efficiency.
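The sketch below illustrates one way this reward and an α sweep might look, assuming a cost term proportional to the number of search turns; the paper's exact cost definition and the α values used are assumptions here.

```python
# Sketch of the overall reward and an alpha sweep to trace the Pareto
# frontier. R_cost is modeled as a negative penalty proportional to the
# number of search turns; the paper's exact cost definition may differ.

def pareto_reward(r_utility, num_turns, r_format, alpha, turn_penalty=1.0):
    r_cost = -turn_penalty * num_turns  # fewer turns -> higher reward
    return r_utility + alpha * r_cost + r_format

# Exploring utility-cost trade-offs by evaluating (or training separate
# planners) at different cost coefficients.
for alpha in (0.0, 0.05, 0.1, 0.2):
    r = pareto_reward(r_utility=0.8, num_turns=3, r_format=0.0, alpha=alpha)
    print(f"alpha={alpha:.2f} -> reward={r:.3f}")
```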
Figure 2: Utility-Cost tradeoffs on Wikipedia-based datasets. Blue points: non-planning baselines; orange points: AI-SearchPlanner with different cost coefficient α.
Reinforcement Learning Training and Implementation
The planner is trained using Proximal Policy Optimization (PPO), with loss masking applied to retrieved tokens to prevent gradient interference from environmental observations. Each rollout consists of a full search trajectory, culminating in answer generation by the frozen LLM and reward computation. The RL objective is:
$$\mathcal{L}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \min\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)} A_t,\ \operatorname{clip}\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)},\, 1-\epsilon,\, 1+\epsilon\right) A_t\right)\right]$$
This approach is agnostic to the choice of retriever and generator, supporting flexible deployment.
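The following sketch shows the clipped objective with loss masking over retrieved tokens, written in PyTorch; the tensor shapes, masking convention, and normalization over planner tokens are assumptions for illustration rather than the paper's implementation.

```python
import torch

# Sketch of the clipped PPO objective with loss masking over retrieved
# tokens, so gradients flow only through tokens the planner generated.

def masked_ppo_loss(logp_new, logp_old, advantages, planner_token_mask, eps=0.2):
    # All inputs have shape [batch, seq_len]. planner_token_mask is 1 for
    # planner-generated tokens and 0 for tokens copied from search results
    # (environmental observations), which are excluded from the loss.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    per_token = torch.min(unclipped, clipped)
    # Zero out retrieved-token positions and average over planner tokens only.
    masked = per_token * planner_token_mask
    return -masked.sum() / planner_token_mask.sum().clamp(min=1.0)
```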
Figure 3: Training dynamics of AI-SearchPlanner with cost coefficient α=0.
Empirical Results and Analysis
Experiments on seven Wikipedia-based and two web-based datasets demonstrate that AI-SearchPlanner consistently outperforms non-planning and planning baselines in both answer accuracy and efficiency. Notably:
- With α=0, AI-SearchPlanner achieves a 10.76% improvement in average QA accuracy over the best non-planning baseline on Wikipedia datasets.
- The model exhibits strong generalization across frozen generators (Qwen3-32B, DeepSeek-V3, DeepSeek-R1) and domains, with robust transferability to web-based QA tasks.
- Increasing α reduces planning cost (number of turns) with only marginal initial loss in accuracy, enabling practitioners to tune for latency or resource constraints.
- Ablation studies confirm that both outcome and process rewards are necessary for optimal performance, and RL training of the planner is essential for generalization to complex multi-hop queries.
Implementation Considerations
- Resource Requirements: The decoupled architecture allows the planner to be lightweight, reducing inference latency and memory footprint compared to monolithic RL agents.
- Deployment: The modular design supports integration with production QA systems using frozen LLMs, retrievers, and search APIs.
- Limitations: The framework relies on the quality of the retriever and the frozen generator; suboptimal retrieval or answer synthesis can bottleneck overall performance. The LLM-based scoring function for answer correctness may introduce evaluation bias if not calibrated.
Implications and Future Directions
AI-SearchPlanner advances the agentic search paradigm by enabling modular, efficient, and transferable search planning. The Pareto optimization framework provides a principled approach to balancing answer quality and computational cost, which is critical for real-world applications with latency or resource constraints. The dual-reward alignment mechanism ensures that search trajectories are both effective and interpretable, supporting downstream auditing and debugging.
Future research may extend this framework to multi-modal search tasks, incorporate dynamic reward shaping for improved generalization, and explore hierarchical planner architectures for even more complex reasoning scenarios. The modularity of AI-SearchPlanner positions it as a foundation for scalable, adaptive agentic search systems in both academic and industrial settings.
Conclusion
AI-SearchPlanner presents a modular RL-based framework for agentic search, decoupling search planning from answer generation and optimizing for both utility and cost via Pareto frontier exploration. Empirical results demonstrate superior performance and efficiency over existing baselines, with strong generalization across models and domains. The framework's innovations in dual-reward alignment and multi-objective optimization have significant implications for the design of scalable, resource-efficient AI search systems. Future work should investigate extensions to multi-modal reasoning and dynamic reward mechanisms to further enhance adaptability and robustness.