
AI-SearchPlanner: Modular RL for Efficient QA

Updated 31 December 2025
  • AI-SearchPlanner is a modular reinforcement learning framework that integrates a trainable search planner with a frozen QA generator to optimize both accuracy and cost.
  • The framework employs a Pareto-optimal, multi-objective reinforcement learning approach with dual-reward alignment to balance outcome performance and process efficiency.
  • Empirical results demonstrate significant accuracy improvements and reduced search turns across diverse datasets, highlighting its robust, plug-and-play deployment.

AI-SearchPlanner Framework

AI-SearchPlanner defines a principled, modular reinforcement learning framework for agentic, cost-sensitive information seeking that pairs a small, trainable LLM search planner with a large, frozen LLM generator for high-quality question answering. By structurally decoupling search planning from answer generation and by formulating search trajectory optimization as a Pareto-optimal, multi-objective RL problem, AI-SearchPlanner achieves high answer accuracy and substantially reduced search/inference cost compared to prior end-to-end RL agents. The framework introduces dual-reward alignment (outcome and process rewards) to govern planner behavior, defines modular interaction protocols, and generalizes robustly across frozen QA backends and data domains (Mei et al., 28 Aug 2025).

1. Architecture and System Workflow

AI-SearchPlanner operationalizes search-based QA using two distinct modules:

  • Search Planner (LLM_plan): A lightweight, trainable LLM responsible solely for planning search actions. At each timestep $t$, it decides between issuing subqueries to a search engine $S(\cdot)$ or terminating and invoking the generator.
  • QA Generator (LLM_gen): A large, frozen LLM (e.g., Qwen3-32b, GPT-4) tasked with producing the final answer, conditioned on the entire accumulated trajectory, including previous queries, retrieved snippets, and planner reasoning.

Block-level Dataflow:

  1. Input question $q$ → LLM_plan.
  2. At turn $t$, LLM_plan emits either a "search" action with subqueries $\{sq\}^t$ → search engine $S(\{sq\}^t)$ → retrieved docs appended to the trajectory context $T$, or a "call_answer_llm" action that packs $T$ into a prompt $P_t$ → LLM_gen($P_t$) → answer $a$.
  3. Termination occurs when the planner chooses "call_answer_llm".

By explicitly separating the reasoning-about-search from answer generation, AI-SearchPlanner avoids the performance tradeoffs associated with end-to-end training over both capacities (Mei et al., 28 Aug 2025).
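The following is a minimal sketch of this dataflow. The planner, generator, and search_engine callables and their interfaces are illustrative assumptions, not the paper's API:

import json

def answer_question(question, planner, generator, search_engine, max_turns=4):
    # planner(trajectory) -> (reasoning_text, tool_call_json_string)   [stands in for LLM_plan]
    # generator(trajectory) -> final answer string                     [stands in for frozen LLM_gen]
    # search_engine(query_list) -> retrieved document snippets         [stands in for S(.)]
    trajectory = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reasoning, tool_call = planner(trajectory)
        trajectory.append({"role": "assistant", "content": reasoning})
        call = json.loads(tool_call)
        if call["name"] == "search":
            docs = search_engine(call["arguments"]["query_list"])
            trajectory.append({"role": "tool", "content": docs})
        elif call["name"] == "call_answer_llm":
            break
    # Pack the full trajectory (queries, snippets, planner reasoning) into the final prompt.
    return generator(trajectory)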

2. Mathematical Formulation

Search planning in AI-SearchPlanner is formalized as a Markov decision process (MDP) with state $s_t$ (the full trajectory so far) and action $a_t$ (search or terminate). The framework seeks to optimize two orthogonal objectives: end-to-end QA utility and search/inference cost.

Multi-Objective Reward Structure

  • Outcome Reward $R_{outcome}$: Measures net QA gain from planning over baselines (direct inference $a_I$, naive RAG $a_R$):

$$R_{outcome} = \tfrac{1}{2} + Score(a, gt) - \tfrac{1}{2}\max\{Score(a_I, gt),\ Score(a_R, gt)\} \in [0, 1.5]$$

  • Process Reward $R_{process}$: Rewards coherent, rational planning trajectories as evaluated by the frozen generator:

$$R_{process} = LLM_{gen}(T, P_T) \in [0, 0.5]$$

  • Aggregate Utility Reward:

$$R_{utility} = R_{outcome} + R_{process}$$

  • Cost Reward $R_{cost}$: Penalizes long trajectories:

$$R_{cost} = R_{cost}^{turn} + R_{cost}^{query}$$

where

$$R_{cost}^{turn} = \max\!\left(0,\ 1 - \tfrac{L}{M_t}\right), \qquad R_{cost}^{query} = \max\!\left(0,\ 1 - \tfrac{\sum_{i=1}^{L} |\{sq\}^i|}{M_q}\right)$$

Here $L$ is the number of search turns in the trajectory, $|\{sq\}^i|$ is the number of sub-queries issued at turn $i$, and $M_t$ and $M_q$ are hard upper bounds on turns and sub-queries, respectively.
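As a worked illustration with hypothetical budgets (not values from the paper): take $M_t = 4$ and $M_q = 10$. A trajectory with $L = 2$ turns issuing 3 sub-queries in total receives $R_{cost}^{turn} = 1 - 2/4 = 0.5$ and $R_{cost}^{query} = 1 - 3/10 = 0.7$, so $R_{cost} = 1.2$; a trajectory that exhausts either budget earns 0 on that component.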

Pareto Objective

The overall objective balances accuracy and cost via a scalar coefficient $\alpha$:

$$R_{pareto} = R_{utility} + \alpha\, R_{cost} + R_{format}$$

where $R_{format}$ is a syntax-correctness check (≥ 0). Sweeping $\alpha$ generates a Pareto frontier of utility versus cost.
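A minimal sketch of this reward aggregation under the formulas above, assuming QA scores in [0, 1] and a generator-produced process score in [0, 0.5]; the argument names, the default $\alpha$, and the magnitude of the format bonus are illustrative assumptions, not values from the paper:

def pareto_reward(score_a, score_direct, score_rag, process_score,
                  num_turns, num_subqueries, M_t, M_q,
                  alpha=0.5, format_ok=True):
    # Outcome reward: net gain of planned search over the stronger of the
    # direct-inference and naive-RAG baselines, shifted into [0, 1.5].
    r_outcome = 0.5 + score_a - 0.5 * max(score_direct, score_rag)
    r_utility = r_outcome + process_score          # R_utility = R_outcome + R_process

    # Cost reward: penalize long trajectories and large sub-query budgets.
    r_cost_turn = max(0.0, 1.0 - num_turns / M_t)
    r_cost_query = max(0.0, 1.0 - num_subqueries / M_q)
    r_cost = r_cost_turn + r_cost_query

    r_format = 1.0 if format_ok else 0.0           # syntax-correctness bonus (assumed magnitude)
    return r_utility + alpha * r_cost + r_format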

3. Reinforcement Learning Methods

AI-SearchPlanner optimizes the planner policy using Proximal Policy Optimization (PPO), structured as follows:

  • Surrogate objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T} \min\big( r_t(\theta)\, A_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_t \big) \right]$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{old}(a_t \mid s_t)$ and $A_t$ is the advantage computed with respect to $R_{pareto}$.

  • Dual-Reward Alignment:

The planner receives feedback on both $R_{outcome}$ and $R_{process}$, which ensures that trajectories are effective for QA and maintain rational step-wise planning.

  • Loss Masking:

Environment tokens (retrieved documents) are masked out of the loss, so gradients propagate only through planner-generated tokens; this stabilizes RL updates.
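A minimal PyTorch-style sketch of the clipped surrogate combined with this loss masking, assuming token-level log-probabilities and per-token advantages derived from $R_{pareto}$ (tensor names and shapes are illustrative):

import torch

def masked_ppo_loss(logp_new, logp_old, advantages, planner_mask, eps=0.2):
    # logp_new / logp_old: [T] log-probs of sampled tokens under the current / old policy.
    # advantages: [T] advantage estimates computed with respect to R_pareto.
    # planner_mask: [T] 1.0 for planner-generated tokens, 0.0 for retrieved-document tokens.
    ratio = torch.exp(logp_new - logp_old)                          # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    per_token = torch.min(unclipped, clipped)
    # Mask out environment tokens so gradients flow only through planner tokens.
    denom = planner_mask.sum().clamp(min=1.0)
    return -(per_token * planner_mask).sum() / denom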

4. Model Integration and Deployment Protocol

AI-SearchPlanner achieves decoupled, modular integration via the following protocols:

  • Tool-Call API: The planner emits JSON "tool_call" objects (a parsing sketch follows after this list):
    {"name": "search", "arguments": {"query_list": ["..."]}}
    {"name": "call_answer_llm", "arguments": {}}
  • Prompt Engineering: Each planner turn appends reasoning, external search results, and tool calls to the context. Termination triggers packing the full trajectory into a final prompt for the generator.
  • Plug-and-Play QA Model: Post-training, LLM_plan can be paired with different frozen generators (e.g., Qwen3-32b, Deepseek-V3, Deepseek-R1) without retraining, yielding robust generalization.
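A self-contained sketch of parsing and dispatching these tool calls; the two tool names follow the protocol above, while the return convention and error handling are assumptions:

import json

def dispatch_tool_call(raw: str):
    # Parse a planner tool-call string and route it to the right module.
    call = json.loads(raw)
    if call["name"] == "search":
        queries = call["arguments"]["query_list"]
        if not isinstance(queries, list) or not queries:
            raise ValueError("search call must carry a non-empty query_list")
        return ("search", queries)            # forward to the search engine S(.)
    if call["name"] == "call_answer_llm":
        return ("answer", None)               # terminate and invoke the frozen generator
    raise ValueError(f"unknown tool call: {call['name']}")

# Example:
# dispatch_tool_call('{"name": "search", "arguments": {"query_list": ["..."]}}')
# -> ("search", ["..."])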

5. Empirical Results and Ablations

Comprehensive experiments demonstrate superior accuracy and efficiency against contemporary agents:

Dataset / Setting | Baseline | Baseline Accuracy | AI-SearchPlanner Accuracy | AI-SearchPlanner Search Turns
Wikipedia (Qwen3) | Naive RAG | 0.539 | 0.597 | 2.26
Wikipedia (Qwen3) | Search-R1 | 0.519 | 0.597 | 2.26
Web QA (WebShaper) | RAG | 0.188 | 0.366 | —
Web QA (WebWalker) | RAG | 0.297 | 0.375 | —
Generator transfer (Deepseek-V3) | — | — | 0.610 | —
Generator transfer (Deepseek-R1) | — | — | 0.648 | —

Ablation studies highlight individual contributions:

  • Removing $R_{outcome}$: −15.2% accuracy
  • Removing $R_{process}$: −1.5%
  • Freezing the planner (no RL): −8.4%
  • Increasing $\alpha$ (cost weight) traces a Pareto frontier: low $\alpha$ gives high accuracy and low cost; very large $\alpha$ drives the planner to a single search turn but below-baseline accuracy.

6. Key Principles and Practical Implications

  • Modularity and Decoupling: Specializing planner and generator roles eliminates catastrophic trade-offs and facilitates independent model upgrades for QA (Mei et al., 28 Aug 2025).
  • Fine-Grained Reward Alignment: Disentangling outcome and process rewards suppresses degenerate search behaviors (indefinite searching, premature termination).
  • Parameterizable Cost-Sensitivity: Exposing $\alpha$ for cost-utility tuning allows deployment in latency- or resource-constrained production settings, granting operators direct control over search behavior.
  • Seamless Integration: The JSON tool-call API, prompt templates for reasoning trajectories, and selective loss masking make the framework deployable in existing LLM+search infrastructures.

AI-SearchPlanner thus defines a generalizable recipe for high-accuracy, cost-aware search agents: train only the planner module under multi-objective RL, freeze the answer generator, and maintain a clean, composable system architecture. This results in enhanced accuracy, reduced latency/cost, and robust generalization across answer models and domains (Mei et al., 28 Aug 2025).
