AI-SearchPlanner: Modular RL for Efficient QA
- AI-SearchPlanner is a modular reinforcement learning framework that integrates a trainable search planner with a frozen QA generator to optimize both accuracy and cost.
- The framework employs a Pareto-optimal, multi-objective reinforcement learning approach with dual-reward alignment to balance outcome performance and process efficiency.
- Empirical results demonstrate significant accuracy improvements and reduced search turns across diverse datasets, highlighting its robust, plug-and-play deployment.
AI-SearchPlanner Framework
AI-SearchPlanner defines a principled, modular reinforcement learning framework for agentic, cost-sensitive information-seeking that integrates a small, trainable LLM as a search planner with a large, frozen LLM generator for high-quality question answering. By structurally decoupling search planning from answer generation and by formulating search trajectory optimization as a Pareto-optimal, multi-objective RL problem, AI-SearchPlanner achieves high answer accuracy and substantially reduced search/inference cost compared to prior end-to-end RL agents. The framework introduces dual-reward alignment (outcome and process) to govern planner behavior along with modular interaction protocols, and it generalizes robustly across frozen QA backends and data domains (Mei et al., 28 Aug 2025).
1. Architecture and System Workflow
AI-SearchPlanner operationalizes search-based QA using two distinct modules:
- Search Planner (LLMₚₗₐₙ): A lightweight, trainable LLM responsible solely for planning search actions. At each timestep, it decides whether to issue subqueries to a search engine or to terminate and invoke the generator.
- QA Generator (LLMgₑₙ): A large, frozen LLM (e.g., Qwen3-32b, GPT-4) tasked with producing the final answer, conditioned on the entire accumulated trajectory, including previous queries, retrieved snippets, and planner reasoning.
Block-level Dataflow:
- The input question is passed to LLMₚₗₐₙ.
- At each turn, LLMₚₗₐₙ emits either a "search" action, whose subqueries are sent to the search engine and whose retrieved documents are appended to the trajectory context, or a "call_answer_llm" action, which packs the accumulated trajectory into a prompt for LLMgₑₙ to produce the final answer.
- Termination occurs when the planner chooses "call_answer_llm".
By explicitly separating the reasoning-about-search from answer generation, AI-SearchPlanner avoids the performance tradeoffs associated with end-to-end training over both capacities (Mei et al., 28 Aug 2025).
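A minimal sketch of this planner-generator loop follows, assuming hypothetical `planner_llm`, `generator_llm`, and `search_engine` objects with simple `generate`/`retrieve` interfaces; it is an illustration, not the paper's implementation.

```python
import json

MAX_TURNS = 8  # illustrative hard cap on planner turns

def answer_question(question, planner_llm, generator_llm, search_engine):
    """Decoupled planner/generator loop (all interfaces are illustrative)."""
    trajectory = [{"role": "user", "content": question}]
    for _ in range(MAX_TURNS):
        # The small, trainable planner proposes the next action as a JSON tool call.
        tool_call = json.loads(planner_llm.generate(trajectory))
        if tool_call["name"] == "search":
            # Issue sub-queries and append the retrieved snippets to the context.
            docs = search_engine.retrieve(tool_call["arguments"]["query_list"])
            trajectory.append({"role": "tool", "content": docs})
        elif tool_call["name"] == "call_answer_llm":
            # Terminate: pack the full trajectory into a prompt for the frozen generator.
            return generator_llm.generate(trajectory)
    # Fallback: force an answer if the planner never terminates within the budget.
    return generator_llm.generate(trajectory)
```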
2. Mathematical Formulation
Search planning in AI-SearchPlanner is formalized as a Markov decision process (MDP) in which the state is the full trajectory accumulated so far and the action is either a search step or termination. The framework seeks to optimize two orthogonal objectives: end-to-end QA utility and search/inference cost.
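As a hedged illustration (the symbols $R_{\text{utility}}$, $R_{\text{cost}}$, and $\alpha$ are placeholder notation, not necessarily the paper's), the scalarized form of this two-objective problem can be written as

$$\max_{\theta}\;\mathbb{E}_{\tau \sim \pi_{\theta}}\big[\,R_{\text{utility}}(\tau) - \alpha\,R_{\text{cost}}(\tau)\,\big],$$

where $\tau$ is a complete planning trajectory produced by the planner policy $\pi_{\theta}$, and sweeping $\alpha \ge 0$ traces the Pareto frontier between QA utility and search/inference cost.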
Multi-Objective Reward Structure
- Outcome Reward: measures the net QA gain of the planned trajectory over two reference baselines, direct inference (no retrieval) and naive RAG.
- Process Reward: rewards coherent, rational planning trajectories, as evaluated by the frozen generator.
- Aggregate Utility Reward: combines the outcome and process rewards into a single utility signal.
- Cost Reward: penalizes long trajectories, with hard upper bounds on the number of search turns and the number of sub-queries issued per turn.
Pareto Objective
The overall objective balances utility and cost through a scalar trade-off coefficient, with a non-negative syntax-correctness check applied to planner outputs. Sweeping the coefficient generates a Pareto frontier of utility versus cost.
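The sketch below illustrates how such a scalarized reward might be assembled; the function signature, default values, and specific combination of terms are illustrative assumptions, not the paper's exact formulas.

```python
def trajectory_reward(acc_planned, acc_direct, acc_rag,
                      process_score, n_turns, n_subqueries,
                      alpha=0.1, max_turns=8, max_subqueries=4,
                      syntax_ok=True):
    """Illustrative scalar reward combining utility and cost (not the paper's exact formula)."""
    if not syntax_ok:
        return 0.0  # malformed tool calls earn no reward
    # Outcome: net QA gain of the planned trajectory over the stronger baseline
    # (direct inference or naive RAG).
    outcome = acc_planned - max(acc_direct, acc_rag)
    # Process: generator-judged coherence of the planning trajectory, assumed in [0, 1].
    utility = outcome + process_score
    # Cost: penalize long trajectories, normalized by hard caps on turns and sub-queries.
    cost = min(n_turns / max_turns, 1.0) + min(n_subqueries / max_subqueries, 1.0)
    # Pareto trade-off: sweeping alpha traces the utility-vs-cost frontier.
    return utility - alpha * cost
```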
3. Reinforcement Learning Methods
AI-SearchPlanner optimizes the planner policy using Proximal Policy Optimization (PPO), structured as follows:
- Surrogate objective (standard clipped PPO form):
  $$J_{\text{PPO}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right]$$
  where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the policy probability ratio and $\hat{A}_t$ is the advantage computed with respect to the combined utility-cost reward.
- Dual-Reward Alignment:
The planner receives feedback on both the outcome reward and the process reward, which ensures that trajectories are effective for QA while remaining rational at the step level.
- Loss Masking:
Environment tokens (retrieved documents) are masked out so that gradients propagate only through planner-generated tokens, stabilizing RL updates (a minimal sketch follows this list).
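A minimal sketch of token-level loss masking combined with the clipped PPO surrogate; tensor names, shapes, and the helper signature are illustrative assumptions rather than the paper's implementation.

```python
import torch

def masked_ppo_policy_loss(logprobs, old_logprobs, advantages,
                           planner_token_mask, clip_eps=0.2):
    """Clipped PPO surrogate averaged only over planner-generated tokens.

    All inputs are tensors of shape (batch, seq_len); planner_token_mask is 1 for
    planner tokens and 0 for environment tokens (retrieved documents), which
    therefore contribute no gradient.
    """
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token_loss = -torch.min(unclipped, clipped)
    # Zero out environment tokens so only planner outputs drive the update.
    masked = per_token_loss * planner_token_mask
    return masked.sum() / planner_token_mask.sum().clamp(min=1)
```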
4. Model Integration and Deployment Protocol
AI-SearchPlanner achieves decoupled, modular integration via the following protocols:
- Tool-Call API: The planner emits JSON "tool_call" objects, e.g. `{"name": "search", "arguments": {"query_list": ["..."]}}` to issue sub-queries or `{"name": "call_answer_llm", "arguments": {}}` to terminate (a dispatch sketch follows this list).
- Prompt Engineering: Each planner turn appends reasoning, external search results, and tool calls to the context. Termination triggers packing the full trajectory into a final prompt for the generator.
- Plug-and-Play QA Model: Post-training, LLMₚₗₐₙ can be paired with different frozen generators (e.g., Qwen3-32b, Deepseek-V3, Deepseek-R1) without retraining, yielding robust generalization.
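A sketch of how a serving layer might parse and route these tool calls; the function names and return structure are illustrative assumptions, not part of the framework's published interface.

```python
import json

def dispatch_tool_call(raw_planner_output, search_fn, answer_fn):
    """Parse a planner tool call and route it (illustrative, not the paper's code)."""
    try:
        call = json.loads(raw_planner_output)
    except json.JSONDecodeError:
        return {"status": "error", "reason": "malformed tool call"}

    if call.get("name") == "search":
        queries = call.get("arguments", {}).get("query_list", [])
        return {"status": "continue", "docs": search_fn(queries)}  # keep searching
    if call.get("name") == "call_answer_llm":
        return {"status": "done", "answer": answer_fn()}           # terminate and answer
    return {"status": "error", "reason": "unknown tool: " + str(call.get("name"))}
```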
5. Empirical Results and Ablations
Comprehensive experiments demonstrate superior accuracy and efficiency against contemporary agents:
| Setting | Baseline / Generator | Baseline Acc. | Baseline Turns | AI-SearchPlanner Acc. | AI-SearchPlanner Turns |
|---|---|---|---|---|---|
| Wikipedia (Qwen3) | Naive RAG | 0.539 | – | 0.597 | 2.26 |
| Wikipedia (Qwen3) | Search-R1 | 0.519 | – | 0.597 | 2.26 |
| Web QA (WebShaper) | RAG | 0.188 | – | 0.366 | – |
| Web QA (WebWalker) | RAG | 0.297 | – | 0.375 | – |
| Generator transfer | Deepseek-V3 | – | – | 0.610 | – |
| Generator transfer | Deepseek-R1 | – | – | 0.648 | – |
Ablation studies highlight individual contributions:
- Removing the outcome reward: −15.2% accuracy
- Removing the process reward: −1.5%
- Freezing the planner (no RL): −8.4%
- Increasing the cost weight traces a Pareto frontier: small values favor high accuracy at the cost of more search turns, while very large values drive the planner toward a single search turn at below-baseline accuracy (a sweep sketch follows).
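A sketch of how such a sweep could be run; `train_planner` and `evaluate` are hypothetical helpers standing in for the RL training loop and the QA evaluation harness.

```python
def trace_pareto_frontier(alphas, train_planner, evaluate):
    """Sweep the cost coefficient and record (accuracy, avg. search turns) pairs.

    train_planner(alpha) returns a planner policy trained with cost weight alpha;
    evaluate(policy) returns (accuracy, average search turns) on a held-out set.
    Both are hypothetical helpers, not part of the published framework.
    """
    frontier = []
    for alpha in alphas:
        policy = train_planner(alpha)        # larger alpha -> stronger cost penalty
        accuracy, avg_turns = evaluate(policy)
        frontier.append({"alpha": alpha, "accuracy": accuracy, "avg_turns": avg_turns})
    return frontier

# Example usage: trace_pareto_frontier([0.0, 0.05, 0.1, 0.5, 1.0], train_planner, evaluate)
```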
6. Key Principles and Practical Implications
- Modularity and Decoupling: Specializing planner and generator roles eliminates catastrophic trade-offs and facilitates independent model upgrades for QA (Mei et al., 28 Aug 2025).
- Fine-Grained Reward Alignment: Disentangling outcome and process rewards suppresses degenerate search behaviors (indefinite searching, premature termination).
- Parameterizable Cost-Sensitivity: Exposing the cost-weight coefficient for cost-utility tuning allows deployment in latency- or resource-constrained production settings, granting operators direct control over search behavior.
- Seamless Integration: The JSON tool-call API, prompt templates for reasoning trajectories, and selective loss masking make the framework deployable in existing LLM+search infrastructures.
AI-SearchPlanner thus defines a generalizable recipe for high-accuracy, cost-aware search agents: train only the planner module under multi-objective RL, freeze the answer generator, and maintain a clean, composable system architecture. This results in enhanced accuracy, reduced latency/cost, and robust generalization across answer models and domains (Mei et al., 28 Aug 2025).