
DynaSearcher: Multi-Step Search Agent

Updated 29 July 2025
  • DynaSearcher is a multi-step search agent that integrates LLM reasoning with dynamic knowledge graphs and multi-reward reinforcement learning to optimize complex information retrieval.
  • It employs an iterative 'Think → Search → Retrieve → Answer' loop to enforce structured reasoning and mitigate hallucinations and redundant computations.
  • The system demonstrates superior retrieval efficiency and accuracy across multi-hop QA benchmarks, showcasing robust scalability and adaptability in diverse search environments.

DynaSearcher is a multi-step search agent for complex information retrieval that augments LLM reasoning with dynamic knowledge graph context and a multi-reward reinforcement learning framework. Its design addresses the core challenges of factually inconsistent intermediate queries and inefficient search trajectories in agentic retrieval, enforcing structured reasoning and evidence-based intermediate steps to minimize hallucination and redundant computations. The architecture fuses structured (knowledge graph) and unstructured (document retrieval) evidence, employing fine-grained multi-objective optimization to improve retrieval accuracy, efficiency, and answer quality while promoting generalization and scalability across retrieval environments and model scales (Hao et al., 23 Jul 2025).

1. System Architecture

DynaSearcher employs an iterative agentic reasoning cycle, alternating between LLM-driven stepwise planning and external retrieval. The system pipeline comprises:

  • Reflection and Planning Module: Given a user query, the LLM decomposes it into sub-questions and formulates structured JSON requests containing extracted entities and predicted relations. This explicit decomposition, as shown in Table 1 of the original paper, provides the search tools with precise query semantics at each step.
  • Retrieval Tools:
    • Document Search Tool: Utilizes dense vector retrieval (on a local corpus or via web search APIs such as Tavily) to surface unstructured textual passages.
    • Knowledge Graph (KG) Search Tool: Interfaces with Wikidata5M, performing entity/relation matching and retrieving relevant one-hop subgraphs capturing explicit entity relationships.
  • Iterative "Think → Search → Retrieve → Answer" Loop: In each cycle, the agent reflects, generates the next subquery, invokes the search tools, integrates the returned evidence (from both documents and the KG), and continues until an answer is produced. A dedicated KG filter module prunes spurious graph triples before reintegration, mitigating distraction from noisy or irrelevant subgraph expansions.

The architecture is visualized in the system illustration (Figure 1 in the paper), where each iteration fuses newly gathered evidence into the evolving context for subsequent reasoning stages.
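The cycle can be sketched as follows; every callable, the JSON-style request schema, and the step budget are illustrative assumptions rather than the paper's actual interfaces:

```python
from typing import Callable, Optional

def dynasearcher_loop(
    question: str,
    plan: Callable[[str, list], dict],              # LLM reflection/planning step
    doc_search: Callable[[str], list],              # dense or web passage retrieval
    kg_search: Callable[[list, list], list],        # one-hop KG subgraph lookup
    kg_filter: Callable[[list, dict], list],        # prunes noisy/irrelevant triples
    answer: Callable[[str, list], Optional[str]],   # returns answer, or None to continue
    max_steps: int = 8,
) -> Optional[str]:
    """Sketch of the iterative Think -> Search -> Retrieve -> Answer cycle.

    All callables are stand-ins for the paper's modules, not its API.
    """
    context: list = []  # evidence accumulated across iterations
    for _ in range(max_steps):
        # Think: decompose into the next sub-query as a structured request,
        # e.g. {"sub_question": "...", "entities": [...], "relations": [...]}
        request = plan(question, context)

        # Search + Retrieve: fuse unstructured passages and structured triples.
        passages = doc_search(request["sub_question"])
        triples = kg_filter(
            kg_search(request["entities"], request["relations"]), request
        )
        context.append(
            {"request": request, "passages": passages, "triples": triples}
        )

        # Answer: emit a final answer once the evidence suffices.
        final = answer(question, context)
        if final is not None:
            return final
    return answer(question, context)  # best-effort answer at the step budget
```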

2. Reinforcement Learning Framework

The agent is fine-tuned with a multi-reward reinforcement learning (RL) objective that integrates several distinct signal components:

  • Accuracy Reward: Assessed on the final answer by evaluating both output format correctness (enforcing adherence to the required stepwise protocol) and answer content, using a conditional combination of word-level F1 and Cover Exact Match (CEM):

$$r_{\text{acc}} = \begin{cases} \max(0.1,\ r_{\text{ans}}), & \text{if format is correct} \\ 0, & \text{otherwise} \end{cases}$$

$$r_{\text{ans}} = \begin{cases} F_1(a_{\text{pred}}, a_{\text{gt}}), & L_{\text{pred}} \geq n \cdot L_{\text{gt}} \\ \mathrm{CEM}(a_{\text{pred}}, a_{\text{gt}}), & \text{otherwise} \end{cases}$$

  • Information Gain Reward: Positively rewards generation of intermediate queries that retrieve new, relevant evidence. The retrieval reward uses document recall, $r_{\text{recall}} = \frac{TP}{TP + FN}$, with $TP$ the count of relevant ground-truth documents surfaced.
  • Penalty Reward: Imposes a decaying penalty for unnecessary search hops or excessive retrieval:

$$r_{\text{penalty}} = \max\left(\beta,\ 1 - \gamma^{\,t - i}\right)$$

where $t$ is the total number of retrieval actions and $i$ is the minimum number of hops required by the ground truth.

  • Combined RL Objective: The overall RL reward is

$$r_{\text{overall}} = r_{\text{outcome}} + r_{\text{gain}}$$

so that the overall reward combines final answer quality ($r_{\text{outcome}}$) with the efficiency of intermediate search steps ($r_{\text{gain}}$).
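A minimal Python sketch of these reward components follows; the threshold $n$, floor $\beta$, and decay $\gamma$ defaults are assumptions, and the exact composition of $r_{\text{outcome}}$ from $r_{\text{acc}}$ and $r_{\text{penalty}}$ is not specified in this summary, so it is left to the caller:

```python
from collections import Counter

def f1_score(pred_tokens, gt_tokens):
    """Word-level F1 between predicted and gold answer tokens."""
    common = Counter(pred_tokens) & Counter(gt_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

def cover_exact_match(pred, gt):
    """CEM: 1.0 if the gold answer string is contained in the prediction."""
    return float(gt.strip().lower() in pred.strip().lower())

def answer_reward(pred, gt, n=3):
    """r_ans: word-level F1 when the prediction is at least n times the
    gold length, CEM otherwise (n=3 is an assumed hyperparameter)."""
    pred_toks, gt_toks = pred.split(), gt.split()
    if len(pred_toks) >= n * len(gt_toks):
        return f1_score(pred_toks, gt_toks)
    return cover_exact_match(pred, gt)

def accuracy_reward(pred, gt, format_ok):
    """r_acc: floored at 0.1 when the output format is correct, else 0."""
    return max(0.1, answer_reward(pred, gt)) if format_ok else 0.0

def recall_reward(retrieved_ids, gold_ids):
    """Information-gain signal via document recall, TP / (TP + FN)."""
    if not gold_ids:
        return 0.0
    tp = len(set(retrieved_ids) & set(gold_ids))
    return tp / len(gold_ids)

def penalty_reward(t, i, beta=0.5, gamma=0.9):
    """r_penalty = max(beta, 1 - gamma**(t - i)) for t retrieval actions
    against a ground-truth minimum of i hops (beta, gamma assumed)."""
    return max(beta, 1 - gamma ** (t - i))

def overall_reward(r_outcome, r_gain):
    """r_overall = r_outcome + r_gain, per the combined objective above."""
    return r_outcome + r_gain
```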

Optimization is performed with GRPO (Group Relative Policy Optimization), including a regularization term that enforces policy proximity to a reference model.
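For reference, a standard form of the GRPO objective (the paper's exact variant may differ) samples a group of $G$ rollouts per query, normalizes their rewards into advantages, and regularizes with a KL term toward the reference policy; the KL coefficient is written $\lambda$ here to avoid clashing with the penalty floor $\beta$ above:

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{g=1}^{G} \min\left(\rho_g \hat{A}_g,\ \operatorname{clip}(\rho_g,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_g\right)\right] - \lambda\, \mathbb{D}_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right), \qquad \hat{A}_g = \frac{r_g - \operatorname{mean}(r_{1:G})}{\operatorname{std}(r_{1:G})}$$

where $\rho_g$ is the importance ratio between the current and old policies for rollout $g$.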

3. Knowledge Graph Integration

Explicit exploitation of dynamic knowledge graphs is central to DynaSearcher’s strategy for improving factual consistency and reasoning discipline:

  • Entity/Relation Extraction: In each step, entity mentions and relation hypotheses are extracted and matched fuzzily to nodes/edges in Wikidata5M, retrieving one-hop knowledge graph subgraphs aligned with the current subquery.
  • Guiding Intermediate Reasoning: KG evidence (triples) is used to constrain the space of subsequent queries by providing factual anchor points. For example, when determining the author of a specific work, the agent reasons with concrete author-entity edges, reducing generation of spurious or inconsistent intermediate queries.
  • Filtering Noisy Structure: A KG filter module removes irrelevant or low-salience triples, only supplying high-confidence, contextually pertinent subgraphs to the agent.

Structured information from the KG thus actively curbs LLM hallucination, biases search away from distractors, and minimizes redundant exploration. The approach is illustrated in a multi-hop QA case study, where KG entities and relations are iteratively pieced together to fulfill complex relational constraints.
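A simplified sketch of this pipeline is shown below; `fuzzy_match`, `one_hop_subgraph`, and `filter_triples` are stand-ins for the paper's components, and the toy triples are hypothetical:

```python
from difflib import SequenceMatcher

def fuzzy_match(mention, labels, threshold=0.85):
    """Match an extracted entity mention to a KG node label by string
    similarity (a simple stand-in for the paper's matcher)."""
    def sim(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    best = max(labels, key=lambda lab: sim(mention, lab), default=None)
    return best if best is not None and sim(mention, best) >= threshold else None

def one_hop_subgraph(entity, triples):
    """All (head, relation, tail) triples touching the matched entity."""
    return [(h, r, t) for (h, r, t) in triples if entity in (h, t)]

def filter_triples(subgraph, predicted_relations):
    """KG-filter stand-in: keep triples whose relation matches one
    predicted for the current sub-query, discarding low-salience edges."""
    wanted = {r.lower() for r in predicted_relations}
    return [(h, r, t) for (h, r, t) in subgraph if r.lower() in wanted]

# Toy usage with a hypothetical triple store:
triples = [("George Orwell", "author of", "Nineteen Eighty-Four"),
           ("George Orwell", "born in", "Motihari"),
           ("Nineteen Eighty-Four", "genre", "dystopian fiction")]
entity = fuzzy_match("george orwell", ["George Orwell", "Nineteen Eighty-Four"])
hop = one_hop_subgraph(entity, triples)
print(filter_triples(hop, ["author of"]))
# -> [('George Orwell', 'author of', 'Nineteen Eighty-Four')]
```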

4. Experimental Results and Performance Benchmarks

DynaSearcher outperforms both prompt-based and supervised (fine-tuned) agentic search baselines across six multi-hop question answering datasets, including HotpotQA, 2WikiMultiHopQA, and MuSiQue. Key findings include:

  • Superior F1, CEM, and EM scores relative to methods such as Search-o1 and DeepSeek-R1.
  • Robustness in data-limited regimes: DynaSearcher achieves strong scores in constrained retrieval settings (e.g., a 4K context window with top-1 retrieved results), often matching or outperforming larger-context and multi-pass baselines.
  • Reduced computational footprint: Multi-reward RL discourages surplus retrieval, yielding lower step counts and improved efficiency.
  • High generalization and stability across both local dense retrieval and online web search environments (via Tavily), as well as scalability from small models (Qwen2.5-7B) upwards, demonstrating the method's versatility.

LLM-as-Judge (LasJ) assessments on out-of-domain data confirm the approach’s retrieval efficiency and answer correctness under unseen distributions.

5. Generalization and Robustness

The agentic design and multi-objective RL confer several robustness properties:

  • Cross-environment applicability: DynaSearcher maintains accuracy and efficiency when ported across dense document corpora, dynamic web indices, and different KG backends.
  • Scalability with model and corpus size: The penalty mechanism and structured KG filtering adapt to increased information volumes, preserving efficient reasoning trajectories under scale.
  • Ablation Analysis: Removal of KG filtering or information gain rewards demonstrably degrades both retrieval efficiency and answer quality, confirming that both components are essential to robust performance. The agent's ability to autonomously modulate retrieval granularity avoids overfitting to static IR environments.

6. Applications and Implications

Several practical and research implications arise from DynaSearcher’s design:

  • Complex Multi-Hop QA: DynaSearcher is particularly suited for multi-hop settings that require explicit entity/relation chaining (e.g., biomedical, legal, scientific, or encyclopedic question answering).
  • Retrieval-Augmented Generation: The framework can be adapted for generation tasks requiring structured factual grounding, such as knowledge-based summarization or answer synthesis.
  • Extensible to Multimodal/Structured Sources: The system architecture admits extension to richer structured resources (tables, graphs, images), enabling broader types of agentic reasoning.
  • Agentic Search Automation: DynaSearcher’s autonomous reasoning and search strategy selection point toward future search agents that balance answer quality, factuality, and efficiency in real-world and resource-constrained settings.
  • Research Directions: The demonstrated utility of multi-reward RL in training retrieval-augmented agents suggests further exploration of richer objective spaces, potentially including uncertainty or user preference signals.

Summary Table: Core DynaSearcher Innovations

| Component | Role | Impact |
|-----------|------|--------|
| Dynamic KG Integration | Guides reasoning with explicit facts/entities | Factual consistency, reduced hallucination |
| Multi-Reward RL | Balances answer, retrieval, and efficiency goals | High QA accuracy, efficient trajectories |
| Iterative Agentic Loop | "Think → Search → Retrieve → Answer" cycles | Minimized redundancy, subgoal composition |
| KG Filtering Module | Prunes irrelevant subgraphs before LLM reasoning | Robustness, information focus |

DynaSearcher exemplifies a retrieval-augmented LLM agent distinguished by explicit knowledge graph integration, multi-reward RL optimization, and robust adaptation to diverse search environments and tasks (Hao et al., 23 Jul 2025).
