Dual-Reward Alignment for Search Planning

Updated 29 August 2025

The paper demonstrates that dual-reward alignment optimizes both answer quality and planning efficiency using outcome and process rewards.
It employs Pareto optimization to balance search accuracy with computational cost, mitigating redundant actions in multi-step tasks.
Empirical results show significant QA improvements and reduced planning turns, validating the framework's multi-objective strategy.

Dual-reward alignment for search planning refers to frameworks and methodologies that integrate two distinct reward signals to optimize agentic search trajectories or outputs, particularly within reinforcement learning (RL) and generative model contexts. The dual reward formulation typically encompasses an outcome-based reward (reflecting objective quality, e.g., correctness or human alignment) and a process-based reward (reflecting the efficiency, logicality, or consistency of the planning trajectory). Recent research has codified dual-reward alignment in a variety of architectures, from RL-driven active learning to modular search agents augmented with dynamic multi-objective optimization, achieving state-of-the-art results and strong generalization across model architectures and data domains.

1. Definition and Scope of Dual-Reward Alignment

Dual-reward alignment designates systems in which the search planning agent is trained and evaluated using two complementary reward signals. In the context of multi-step information retrieval and reasoning tasks, one reward is often dedicated to the outcome (i.e., the accuracy or correctness of final answers), while a second reward targets the quality or efficiency of search planning trajectories. This dual mechanism enables fine-grained control over both what is achieved (end result) and how it is achieved (planning process).

Papers such as "AI-SearchPlanner: Modular Agentic Search via Pareto-Optimal Multi-Objective Reinforcement Learning" (Mei et al., 28 Aug 2025) and "DynaSearcher: Dynamic Knowledge Graph Augmented Search Agent via Multi-Reward Reinforcement Learning" (Hao et al., 23 Jul 2025) have formalized this architecture by independently quantifying search planning utility and cost, and jointly optimizing these objectives via Pareto optimization or multi-reward RL.

2. Dual-Reward Formulation: Outcome and Process Objectives

The dual-reward mechanism typically consists of:

Outcome Reward $(R_{\text{outcome}})$ : Measures the improvement in answer quality yielded by search planning versus baseline approaches, such as direct inference or naive retrieval-augmented generation. For example, in AI-SearchPlanner:

$R_{\text{outcome}} = \frac{1}{2} + \mathrm{Score}(a,\mathrm{gt}) - \frac{1}{2} \max\left\{\mathrm{Score}(a_I,\mathrm{gt}),\ \mathrm{Score}(a_R,\mathrm{gt})\right\}$

where $a$ is the planner answer, $a_I$ is direct inference, $a_R$ is retrieval-augmented generation, and $\mathrm{Score}(\cdot)$ reflects correctness.

Process Reward $(R_{\text{process}})$ : Assesses the rationality or efficiency of the reasoning trajectory $T$ using a frozen generator model and dedicated prompt $P_T$ , such as:

$R_{\text{process}} = \mathrm{LLM}_{\mathrm{gen}}(T, P_T)$

Composite Utility: The aggregate planning utility is formed by summing both rewards:

$R_{\text{utility}} = R_{\text{outcome}} + R_{\text{process}}$

DynaSearcher applies a similar decomposition, aligning retrieval accuracy ( $F_1$ , CEM scores), information gain (document recall), and penalty for redundant search actions into the overall multi-reward RL objective.

3. Pareto Optimization and Planning Cost Trade-offs

An essential aspect of dual-reward alignment is balancing planning effectiveness and resource cost. AI-SearchPlanner models this as a Pareto optimization problem, where utility is traded against search cost metrics:

Planning Cost Terms:
- Turns: $R_{\text{cost}}^{\text{turn}} = \max(0, 1 - L/M_t)$ , where $L$ is the number of planning turns and $M_t$ a turn threshold.
- Queries: $R_{\text{cost}}^{\text{query}} = \max(0, 1 - (\sum | \{\text{sq}^i\} |)/M_q )$ , with $M_q$ a query threshold.
Pareto Reward:

$R_{\text{pareto}} = R_{\text{utility}} + \alpha R_{\text{cost}} + R_{\text{format}}$

$\alpha$ modulates the trade-off, permitting efficient exploration of diverse efficiency/quality frontiers. This architecture generalizes across search planning models and supports robust reasoning under resource constraints.

4. Architectural Decoupling and Modular Agent Design

Contemporary frameworks, such as AI-SearchPlanner (Mei et al., 28 Aug 2025), emphasize decoupled architectures wherein a small, trainable search planner LLM iteratively interacts with external search engines, while a large, frozen generator LLM produces the final answers. This division of labor allows:

Specialization: The planner focuses on decision-making, query formulation, and termination, while the generator leverages robust pretraining for fluency and factuality.
Efficiency: The planner can be independently retrained or optimized, facilitating transfer across QA models and retrieval domains.
Generalization: Experiments demonstrate that decoupled planners pair well with multiple frozen generators (Qwen3-32b, Deepseek-R1), consistently resulting in improved QA performance and reduced planning costs.

5. Composite Reward Functions and Multi-Reward RL

DynaSearcher (Hao et al., 23 Jul 2025) provides a canonical example of composite multi-reward RL formulation, integrating outcome reward (answer accuracy and format correctness), gain (information recall), and penalty (excessive search actions):

Accuracy reward:

$r_{\text{acc}} = \begin{cases} \max(0.1, r_{\text{ans}}), & \text{if output format correct} \ 0, & \text{otherwise} \end{cases}$

$r_{\text{ans}} = \begin{cases} F_1(a_{\text{pred}}, a_{\text{gt}}), & \text{if } L_{\text{pred}} \geq n \cdot L_{\text{gt}} \ \mathrm{CEM}(a_{\text{pred}}, a_{\text{gt}}), & \text{otherwise} \end{cases}$

Retrieval recall:

$r_{\text{recall}} = TP / (TP + FN)$

Penalty for redundancy:

$r_{\text{penalty}} = \max(\beta, 1 - \gamma^{t - i})$

Overall objective:

$r_{\text{overall}} = r_{\text{outcome}} + r_{\text{gain}}$

Multi-reward RL thereby operationalizes dual alignment by enabling the agent to balance factual accuracy, information richness, and search cost throughout the planning process.

6. Empirical Performance and Generalization

Experimental results validate dual-reward frameworks as highly effective:

AI-SearchPlanner achieves higher QA performance and lower planning costs compared to end-to-end RL agents (Search-R1, IRCoT, naive RAG), particularly for multi-hop questions.
The dual-reward mechanism successfully mitigates inefficient search trajectories and redundant computation, often exceeding baselines by over 10% in accuracy metrics and improving efficiency.
DynaSearcher matches or surpasses frontier LLM performance (GPT-4.1, DeepSeek-R1) even with smaller models, demonstrating pronounced generalization and robustness across locally hosted dense retrieval, KG-based retrieval, and online search contexts.

7. Prospects and Extensions

The dual-reward alignment paradigm supports modular and flexible search planning architectures, opening avenues for:

Adaptive reward weighting and frontier exploration (Pareto-optimal reasoning strategies).
Integration of structured external knowledge sources (e.g., dynamic knowledge graphs) to enhance factual consistency and planning efficiency.
Strong empirical and theoretical guarantees of efficiency and effectiveness under sample and cost constraints.
Transferability across diverse domains, frozen QA models, and varying retrieval environments, supporting broad applicability.

Ongoing research may further refine dual-reward balancing, dynamic scheduler adaptation, and reward deconfounding to accommodate more intricate multi-objective planning tasks and evolving alignment requirements.

Table: Summary of Reward Components in Dual-Reward Search Planning

Reward Component	Definition / Metric	Role in Alignment
Outcome Reward	Final answer correctness/comparison	Maximizes task objective (QA accuracy)
Process Reward	Planning trajectory quality/rationality	Promotes efficient, logical reasoning
Cost Reward	Planning turns, query count	Penalties for excessive computation

Dual-reward alignment operationalizes these signals to simultaneously optimize search effectiveness and efficiency, ensuring robustness and generalization in complex multi-step reasoning environments.