AI-SearchPlanner: Modular RL for Efficient QA
- AI-SearchPlanner is a modular reinforcement learning framework that integrates a trainable search planner with a frozen QA generator to optimize both accuracy and cost.
- The framework employs a Pareto-optimal, multi-objective reinforcement learning approach with dual-reward alignment to balance outcome performance and process efficiency.
- Empirical results demonstrate significant accuracy improvements and reduced search turns across diverse datasets, highlighting its robust, plug-and-play deployment.
AI-SearchPlanner Framework
AI-SearchPlanner defines a principled, modular reinforcement learning framework for agentic, cost-sensitive information-seeking that integrates a small, trainable LLM as a search planner with a large, frozen LLM generator for high-quality question answering. By structurally decoupling search planning from answer generation and by formulating search trajectory optimization as a Pareto-optimal, multi-objective RL problem, AI-SearchPlanner achieves high answer accuracy and substantially reduced search/inference cost compared to prior end-to-end RL agents. The framework introduces dual-reward alignment (outcome and process) to govern planner behavior along with modular interaction protocols, and it generalizes robustly across frozen QA backends and data domains (Mei et al., 28 Aug 2025).
1. Architecture and System Workflow
AI-SearchPlanner operationalizes search-based QA using two distinct modules:
- Search Planner (LLMₚₗₐₙ): A lightweight, trainable LLM responsible solely for planning search actions. At each timestep, it decides whether to issue subqueries to a search engine or to terminate and invoke the generator.
- QA Generator (LLMgₑₙ): A large, frozen LLM (e.g., Qwen3-32b, GPT-4) tasked with producing the final answer, conditioned on the entire accumulated trajectory, including previous queries, retrieved snippets, and planner reasoning.
Block-level Dataflow:
- The input question is passed to LLMₚₗₐₙ.
- At each turn, LLMₚₗₐₙ emits either a "search" action, whose subqueries are sent to the search engine and whose retrieved documents are appended to the trajectory context, or a "call_answer_llm" action, which packs the accumulated trajectory into a prompt for LLMgₑₙ to produce the final answer.
- Termination occurs when the planner chooses "call_answer_llm".
By explicitly separating the reasoning-about-search from answer generation, AI-SearchPlanner avoids the performance tradeoffs associated with end-to-end training over both capacities (Mei et al., 28 Aug 2025).
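A minimal sketch of this planner-generator loop follows, assuming hypothetical `planner_llm`, `generator_llm`, and `search_engine` objects with simple `generate`/`retrieve` interfaces; it is an illustration, not the paper's implementation.

```python
import json

MAX_TURNS = 8  # illustrative hard cap on planner turns

def answer_question(question, planner_llm, generator_llm, search_engine):
    """Decoupled planner/generator loop (all interfaces are illustrative)."""
    trajectory = [{"role": "user", "content": question}]
    for _ in range(MAX_TURNS):
        # The small, trainable planner proposes the next action as a JSON tool call.
        tool_call = json.loads(planner_llm.generate(trajectory))
        if tool_call["name"] == "search":
            # Issue sub-queries and append the retrieved snippets to the context.
            docs = search_engine.retrieve(tool_call["arguments"]["query_list"])
            trajectory.append({"role": "tool", "content": docs})
        elif tool_call["name"] == "call_answer_llm":
            # Terminate: pack the full trajectory into a prompt for the frozen generator.
            return generator_llm.generate(trajectory)
    # Fallback: force an answer if the planner never terminates within the budget.
    return generator_llm.generate(trajectory)
```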
2. Mathematical Formulation
Search planning in AI-SearchPlanner is formalized as a Markov decision process (MDP) in which the state is the full trajectory accumulated so far and the action is either a search step or termination. The framework seeks to optimize two orthogonal objectives: end-to-end QA utility and search/inference cost.
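As a hedged illustration (the symbols $R_{\text{utility}}$, $R_{\text{cost}}$, and $\alpha$ are placeholder notation, not necessarily the paper's), the scalarized form of this two-objective problem can be written as

$$\max_{\theta}\;\mathbb{E}_{\tau \sim \pi_{\theta}}\big[\,R_{\text{utility}}(\tau) - \alpha\,R_{\text{cost}}(\tau)\,\big],$$

where $\tau$ is a complete planning trajectory produced by the planner policy $\pi_{\theta}$, and sweeping $\alpha \ge 0$ traces the Pareto frontier between QA utility and search/inference cost.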
Multi-Objective Reward Structure
- Outcome Reward: measures the net QA gain of the planned trajectory over two reference baselines, direct inference (no retrieval) and naive RAG.
- Process Reward: rewards coherent, rational planning trajectories, as evaluated by the frozen generator.
- Aggregate Utility Reward: combines the outcome and process rewards into a single utility signal.
- Cost Reward: penalizes long trajectories, with hard upper bounds on the number of search turns and the number of sub-queries issued per turn.
Pareto Objective
The overall objective balances utility and cost through a scalar trade-off coefficient, with a non-negative syntax-correctness check applied to planner outputs. Sweeping the coefficient generates a Pareto frontier of utility versus cost.
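The sketch below illustrates how such a scalarized reward might be assembled; the function signature, default values, and specific combination of terms are illustrative assumptions, not the paper's exact formulas.

```python
def trajectory_reward(acc_planned, acc_direct, acc_rag,
                      process_score, n_turns, n_subqueries,
                      alpha=0.1, max_turns=8, max_subqueries=4,
                      syntax_ok=True):
    """Illustrative scalar reward combining utility and cost (not the paper's exact formula)."""
    if not syntax_ok:
        return 0.0  # malformed tool calls earn no reward
    # Outcome: net QA gain of the planned trajectory over the stronger baseline
    # (direct inference or naive RAG).
    outcome = acc_planned - max(acc_direct, acc_rag)
    # Process: generator-judged coherence of the planning trajectory, assumed in [0, 1].
    utility = outcome + process_score
    # Cost: penalize long trajectories, normalized by hard caps on turns and sub-queries.
    cost = min(n_turns / max_turns, 1.0) + min(n_subqueries / max_subqueries, 1.0)
    # Pareto trade-off: sweeping alpha traces the utility-vs-cost frontier.
    return utility - alpha * cost
```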
3. Reinforcement Learning Methods
AI-SearchPlanner optimizes the planner policy using Proximal Policy Optimization (PPO), structured as follows:
- Surrogate objective (standard clipped PPO form):
  $$J_{\text{PPO}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right]$$
  where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the policy probability ratio and $\hat{A}_t$ is the advantage computed with respect to the combined utility-cost reward.
- Dual-Reward Alignment:
The planner receives feedback on both the outcome reward and the process reward, which ensures that trajectories are effective for QA while remaining rational at the step level.
- Loss Masking:
Environment tokens (retrieved documents) are masked out so that gradients propagate only through planner-generated tokens, stabilizing RL updates (a minimal sketch follows this list).
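A minimal sketch of token-level loss masking combined with the clipped PPO surrogate; tensor names, shapes, and the helper signature are illustrative assumptions rather than the paper's implementation.

```python
import torch

def masked_ppo_policy_loss(logprobs, old_logprobs, advantages,
                           planner_token_mask, clip_eps=0.2):
    """Clipped PPO surrogate averaged only over planner-generated tokens.

    All inputs are tensors of shape (batch, seq_len); planner_token_mask is 1 for
    planner tokens and 0 for environment tokens (retrieved documents), which
    therefore contribute no gradient.
    """
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token_loss = -torch.min(unclipped, clipped)
    # Zero out environment tokens so only planner outputs drive the update.
    masked = per_token_loss * planner_token_mask
    return masked.sum() / planner_token_mask.sum().clamp(min=1)
```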
4. Model Integration and Deployment Protocol
AI-SearchPlanner achieves decoupled, modular integration via the following protocols:
- Tool-Call API: The planner emits JSON "tool_call" objects, e.g. `{"name": "search", "arguments": {"query_list": ["..."]}}` to issue sub-queries or `{"name": "call_answer_llm", "arguments": {}}` to terminate (a dispatch sketch follows this list).
- Prompt Engineering: Each planner turn appends reasoning, external search results, and tool calls to the context. Termination triggers packing the full trajectory into a final prompt for the generator.
- Plug-and-Play QA Model: Post-training, LLMₚₗₐₙ can be paired with different frozen generators (e.g., Qwen3-32b, Deepseek-V3, Deepseek-R1) without retraining, yielding robust generalization.
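A sketch of how a serving layer might parse and route these tool calls; the function names and return structure are illustrative assumptions, not part of the framework's published interface.

```python
import json

def dispatch_tool_call(raw_planner_output, search_fn, answer_fn):
    """Parse a planner tool call and route it (illustrative, not the paper's code)."""
    try:
        call = json.loads(raw_planner_output)
    except json.JSONDecodeError:
        return {"status": "error", "reason": "malformed tool call"}

    if call.get("name") == "search":
        queries = call.get("arguments", {}).get("query_list", [])
        return {"status": "continue", "docs": search_fn(queries)}  # keep searching
    if call.get("name") == "call_answer_llm":
        return {"status": "done", "answer": answer_fn()}           # terminate and answer
    return {"status": "error", "reason": "unknown tool: " + str(call.get("name"))}
```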
5. Empirical Results and Ablations
Comprehensive experiments demonstrate superior accuracy and efficiency against contemporary agents:
| Setting | Baseline / Generator | Baseline Acc. | Baseline Turns | AI-SearchPlanner Acc. | AI-SearchPlanner Turns |
|---|---|---|---|---|---|
| Wikipedia (Qwen3) | Naive RAG | 0.539 | – | 0.597 | 2.26 |
| Wikipedia (Qwen3) | Search-R1 | 0.519 | – | 0.597 | 2.26 |
| Web QA (WebShaper) | RAG | 0.188 | – | 0.366 | – |
| Web QA (WebWalker) | RAG | 0.297 | – | 0.375 | – |
| Generator transfer | Deepseek-V3 | – | – | 0.610 | – |
| Generator transfer | Deepseek-R1 | – | – | 0.648 | – |
Ablation studies highlight individual contributions:
- Removing the outcome reward: −15.2% accuracy
- Removing the process reward: −1.5%
- Freezing the planner (no RL): −8.4%
- Increasing the cost weight traces a Pareto frontier: small values favor high accuracy at the cost of more search turns, while very large values drive the planner toward a single search turn at below-baseline accuracy (a sweep sketch follows).
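A sketch of how such a sweep could be run; `train_planner` and `evaluate` are hypothetical helpers standing in for the RL training loop and the QA evaluation harness.

```python
def trace_pareto_frontier(alphas, train_planner, evaluate):
    """Sweep the cost coefficient and record (accuracy, avg. search turns) pairs.

    train_planner(alpha) returns a planner policy trained with cost weight alpha;
    evaluate(policy) returns (accuracy, average search turns) on a held-out set.
    Both are hypothetical helpers, not part of the published framework.
    """
    frontier = []
    for alpha in alphas:
        policy = train_planner(alpha)        # larger alpha -> stronger cost penalty
        accuracy, avg_turns = evaluate(policy)
        frontier.append({"alpha": alpha, "accuracy": accuracy, "avg_turns": avg_turns})
    return frontier

# Example usage: trace_pareto_frontier([0.0, 0.05, 0.1, 0.5, 1.0], train_planner, evaluate)
```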
6. Key Principles and Practical Implications
- Modularity and Decoupling: Specializing planner and generator roles eliminates catastrophic trade-offs and facilitates independent model upgrades for QA (Mei et al., 28 Aug 2025).
- Fine-Grained Reward Alignment: Disentangling outcome and process rewards suppresses degenerate search behaviors (indefinite searching, premature termination).
- Parameterizable Cost-Sensitivity: Exposing the cost-weight coefficient for cost-utility tuning allows deployment in latency- or resource-constrained production settings, granting operators direct control over search behavior.
- Seamless Integration: The JSON tool-call API, prompt templates for reasoning trajectories, and selective loss masking make the framework deployable in existing LLM+search infrastructures.
AI-SearchPlanner thus defines a generalizable recipe for high-accuracy, cost-aware search agents: train only the planner module under multi-objective RL, freeze the answer generator, and maintain a clean, composable system architecture. This results in enhanced accuracy, reduced latency/cost, and robust generalization across answer models and domains (Mei et al., 28 Aug 2025).