
AI-SearchPlanner: Modular RL for Search & QA

Updated 29 August 2025
  • AI-SearchPlanner is a modular reinforcement learning framework that decouples search planning from answer generation in LLM-augmented retrieval systems.
  • It introduces dual-reward alignment, using outcome and process rewards to ensure both high-quality answers and coherent multi-turn search strategies.
  • The system employs Pareto optimization to balance answer performance against computational cost, achieving robust multi-hop reasoning and cross-domain generalization.

AI-SearchPlanner is a modular, reinforcement learning-based agent architecture that explicitly separates search planning from answer generation in LLM–augmented information retrieval and question answering systems. It introduces specialized design principles—architectural decoupling, dual-reward alignment, and Pareto optimization—addressing limitations in existing LLM-based agents and achieving superior answer accuracy, search efficiency, and cross-domain generalization in information-seeking tasks (Mei et al., 28 Aug 2025).

1. Decoupled Modular Architecture

AI-SearchPlanner employs a dual-agent architecture:

  • Search Planner: a small, trainable LLM specialized exclusively for multi-turn interaction with external search engines, generating sub-queries and accumulating external evidence.
  • Answer Generator ("Frozen QA model"): a large, high-quality, frozen LLM (e.g., GPT-4, DeepSeek-R1) responsible for synthesizing the final answer, using both retrieved documents and the dialogue history.

The search planner executes an iterative reasoning loop, at each step deciding whether to:

  1. Generate one or more search queries to a retrieval system, ingest external results, and continue planning, or
  2. Terminate the planning episode, handing all collected content (query/response pairs, retrieved passages) to the answer generator.

The generator model remains unmodified and is not updated during planner training. All RL optimization is focused exclusively on the planner. This strict decoupling enables independent specialization, increases computational efficiency, and avoids interference between planning and language-modeling objectives.
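
The decoupled control flow can be summarized in a short sketch. This is illustrative Python, not the paper's implementation: `SearchPlanner`, `SearchEngine`, and `FrozenGenerator` (with `propose`, `retrieve`, and `answer` methods) are hypothetical interfaces standing in for the trainable planner LLM, the retrieval backend, and the frozen QA model.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Accumulated planning context handed to the frozen generator."""
    question: str
    steps: list = field(default_factory=list)  # (sub_queries, retrieved_docs) pairs


def run_episode(question, planner, search_engine, generator, max_turns=4):
    """One planning episode: the planner searches, the frozen generator answers."""
    traj = Trajectory(question=question)
    for _ in range(max_turns):
        # The trainable planner either emits sub-queries or decides to stop.
        action = planner.propose(traj)
        if action.terminate or not action.sub_queries:
            break
        docs = [search_engine.retrieve(q) for q in action.sub_queries]
        traj.steps.append((action.sub_queries, docs))
    # The frozen generator synthesizes the final answer; it is never updated.
    return generator.answer(traj)
```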

2. Dual-Reward Alignment Mechanism

The framework introduces two complementary reward channels, enforcing coherent planning and outcome-oriented behavior:

  • Outcome Reward ($R_{\mathrm{outcome}}$): Evaluates the quality improvement of the final answer $a$ given by the planner+generator over simpler baselines ($a_I$: direct inference; $a_R$: naive retrieval-augmented LLM). This is realized via an automatic LLM-based scoring function:

$$R_{\mathrm{outcome}} = \frac{1}{2} + \mathrm{Score}(a, gt) - \frac{1}{2} \cdot \max\{\mathrm{Score}(a_I, gt),\ \mathrm{Score}(a_R, gt)\}$$

where $\mathrm{Score}$ is an LLM-powered metric computing answer similarity against the ground truth ($gt$).

  • Process Reward ($R_{\mathrm{process}}$): Independently, the logical coherence of the entire planning trajectory $T$ (the sequence of tool calls and retrieved documents) is evaluated. The process reward is computed by prompting the frozen generator with a prompt $P_T$ that assesses whether the planning steps constitute a reasonable and meaningful search sequence:

$$R_{\mathrm{process}} = \mathrm{LLM}_{\mathrm{gen}}(T, P_T)$$

  • The overall planning utility is:

$$R_{\mathrm{utility}} = R_{\mathrm{outcome}} + R_{\mathrm{process}}$$

This dual-reward alignment ensures the planner not only boosts answer quality but does so following rational, interpretable multi-turn search procedures.
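
Both reward channels and their combination reduce to simple scalar computations. The sketch below is illustrative Python under stated assumptions: the score arguments are outputs of the LLM-based $\mathrm{Score}$ metric, and `frozen_generator.judge` is a hypothetical wrapper around prompting the frozen QA model with $P_T$; neither name comes from the paper.

```python
def outcome_reward(score_full: float, score_direct: float, score_rag: float) -> float:
    """R_outcome = 1/2 + Score(a, gt) - 1/2 * max(Score(a_I, gt), Score(a_R, gt)).

    score_full:   LLM score of the planner+generator answer a
    score_direct: LLM score of the direct-inference baseline a_I
    score_rag:    LLM score of the naive RAG baseline a_R
    """
    return 0.5 + score_full - 0.5 * max(score_direct, score_rag)


def process_reward(frozen_generator, trajectory, judge_prompt: str) -> float:
    """R_process: the frozen generator rates the coherence of trajectory T.

    `frozen_generator.judge` is a hypothetical call that prompts the frozen
    QA model with the trajectory and the assessment prompt P_T and returns
    a scalar coherence score.
    """
    return frozen_generator.judge(trajectory, judge_prompt)


def utility_reward(r_outcome: float, r_process: float) -> float:
    """R_utility = R_outcome + R_process."""
    return r_outcome + r_process
```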

3. Pareto Optimization of Planning Utility and Cost

To balance planning efficacy against computational and resource overhead, AI-SearchPlanner formulates a multi-objective reward function that trades off between utility and resource consumption. Planning cost is split into:

  • Turn Cost

$$R_{\mathrm{cost}}^{\mathrm{turn}} = \max\!\left(0,\ 1 - \frac{L}{M_t}\right)$$

where $L$ is the number of planning turns and $M_t$ is a user-defined turn cap.

  • Query Cost

$$R_{\mathrm{cost}}^{\mathrm{query}} = \max\!\left(0,\ 1 - \frac{\sum_{i} |\{\mathsf{sq}\}^i|}{M_q}\right)$$

where $|\{\mathsf{sq}\}^i|$ is the number of sub-queries issued in turn $i$ and $M_q$ is the maximum allowed per episode.

The composite Pareto-optimized reward is:

$$R_{\mathrm{pareto}} = R_{\mathrm{utility}} + \alpha \cdot R_{\mathrm{cost}} + R^{\mathrm{format}}$$

where $\alpha$ allows precise control of the utility/cost weighting, and $R^{\mathrm{format}}$ is a formatting reward. Planning policies are trained with PPO, and loss masking ensures gradients affect only planner-generated tokens.
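
As a worked illustration, the cost terms and composite reward translate directly into code. The sketch below assumes the two cost components are simply summed into a single $R_{\mathrm{cost}}$ and uses placeholder values for $\alpha$ and the caps; the paper's exact aggregation and settings may differ.

```python
def turn_cost_reward(num_turns: int, max_turns: int) -> float:
    """R_cost^turn = max(0, 1 - L / M_t)."""
    return max(0.0, 1.0 - num_turns / max_turns)


def query_cost_reward(sub_query_counts: list[int], max_queries: int) -> float:
    """R_cost^query = max(0, 1 - sum_i |sq^i| / M_q)."""
    return max(0.0, 1.0 - sum(sub_query_counts) / max_queries)


def pareto_reward(r_utility: float, r_cost: float, r_format: float, alpha: float = 0.5) -> float:
    """R_pareto = R_utility + alpha * R_cost + R_format."""
    return r_utility + alpha * r_cost + r_format


# Example: a 3-turn episode issuing (2, 1, 1) sub-queries, with caps M_t = 5, M_q = 8.
r_cost = turn_cost_reward(3, 5) + query_cost_reward([2, 1, 1], 8)  # 0.4 + 0.5
print(pareto_reward(r_utility=1.2, r_cost=r_cost, r_format=0.1, alpha=0.5))  # 1.75
```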

4. Empirical Evaluation and Generalization

AI-SearchPlanner is evaluated on both open-domain and multi-hop QA datasets (Wikipedia-based, WebShaper, WebWalker). Extensive ablation and baseline comparisons show:

  • Significant improvements in answer accuracy and search efficiency compared to single-LLM search agents (e.g., Search-R1, IRCoT) and traditional direct inference or naive RAG baselines.
  • Superior multi-hop reasoning: The benefit is amplified on complex queries requiring reasoning over multiple retrieved facts, as the planner autonomously determines optimal sub-query decomposition and evidence accumulation.
  • Domain and model transferability: Training on Wikipedia QA, the planner generalizes to web domains and new frozen generators, indicating robust architecture-level generality.

5. Technical Formulation

All key components are mathematically formalized. Relevant objectives:

| Component | Formula |
|---|---|
| Outcome Reward | $R_{\mathrm{outcome}} = \frac{1}{2} + \mathrm{Score}(a, gt) - \frac{1}{2}\max\{\mathrm{Score}(a_I, gt),\ \mathrm{Score}(a_R, gt)\}$ |
| Process Reward | $R_{\mathrm{process}} = \mathrm{LLM}_{\mathrm{gen}}(T, P_T)$ |
| Utility | $R_{\mathrm{utility}} = R_{\mathrm{outcome}} + R_{\mathrm{process}}$ |
| Planning Costs | $R_{\mathrm{cost}}^{\mathrm{turn}},\ R_{\mathrm{cost}}^{\mathrm{query}}$ |
| Pareto Reward | $R_{\mathrm{pareto}} = R_{\mathrm{utility}} + \alpha R_{\mathrm{cost}} + R^{\mathrm{format}}$ |
| PPO RL Objective | $L(\theta) = \mathbb{E}_\tau\!\left[\sum_t \min\!\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{old}}(a_t \mid s_t)} A_t,\ \mathrm{clip}\!\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{old}}(a_t \mid s_t)},\ 1-\epsilon,\ 1+\epsilon\right) A_t\right)\right]$ |

All retrieval content (sub-queries, tool outputs) is handled as external context and is not included in gradient computation, focusing optimization strictly on the planner policy.
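
A minimal sketch of the loss-masking idea, in PyTorch-style Python: only tokens emitted by the planner contribute to the clipped PPO surrogate, while tokens originating from retrieved documents and tool outputs are zeroed out of the loss. Tensor names and the masking convention here are illustrative assumptions, not the paper's implementation.

```python
import torch


def masked_ppo_loss(logp_new, logp_old, advantages, planner_token_mask, clip_eps=0.2):
    """Clipped PPO surrogate restricted to planner-generated tokens.

    logp_new, logp_old:  per-token log-probs under the current / behavior policy
    advantages:          per-token advantage estimates A_t
    planner_token_mask:  1.0 for tokens the planner generated, 0.0 for
                         retrieved/tool-output tokens (excluded from gradients)
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token = torch.min(unclipped, clipped)
    # Average only over planner tokens; external context contributes nothing.
    return -(per_token * planner_token_mask).sum() / planner_token_mask.sum().clamp(min=1.0)
```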

6. Research Impact and Future Directions

AI-SearchPlanner advances modular agent-based information retrieval in several respects:

  • Establishes the separation of planning and generation as a central systems design principle for LLM-augmented search.
  • Introduces a dual-reward structure aligning both outcome quality and procedural interpretability.
  • Explicitly models and optimizes the tradeoff between answer performance and planning/querying cost via multi-objective Pareto optimization.

Proposed extensions include incorporating multi-modal (image/text) search, more adaptive dynamic reward tuning, and refined management of the utility/cost balance for diverse application scenarios. This modular, agentic framework defines a trajectory for future RL-based LLM search agents that can deliver high retrieval accuracy, transparency, and computational efficiency across a wide range of domains and large pretrained models (Mei et al., 28 Aug 2025).
