
LocalSearchBench: Agentic Search Benchmark

Updated 9 December 2025
  • LocalSearchBench is an offline evaluation framework for multi-hop search in local service domains, integrating 150,031 anonymized merchant profiles and 300 complex user queries.
  • It targets agentic reasoning in realistic scenarios by simulating multi-constraint searches across major cities, with high data fidelity (augmentation score: 0.8596, privacy score: 0.9217).
  • The benchmark employs iterative planning and retrieval via LocalPlayground, where responses are produced through multi-round tool calls and judged on correctness, completeness, and faithfulness.

LocalSearchBench is an offline evaluation framework and benchmark designed to rigorously assess the agentic reasoning and multi-hop search capabilities of large reasoning models (LRMs) within the vertical domain of local life services, such as dining, shopping, accommodation, travel, healthcare, and tourism. By providing a large-scale, high-fidelity dataset of merchant profiles and multi-hop user queries, together with a unified evaluation environment and domain-specific retrieval tools, LocalSearchBench targets real-world scenarios that demand complex, multi-constraint reasoning not covered by prior general-domain or single-hop QA benchmarks (He et al., 8 Dec 2025).

1. Formal Definition, Scope, and Objectives

LocalSearchBench is defined formally as the tuple

$$\mathbb{B} = (\mathcal{D}, \mathcal{Q}, \mathcal{M})$$

where $\mathcal{D}$ is a structured database containing $N = 150{,}031$ anonymized merchant profiles, $\mathcal{Q}$ is a set of 300 multi-hop user queries grounded in actual local service interactions, and $\mathcal{M}$ is the automated evaluation framework (LocalPlayground). The benchmark is specialized to evaluate the capacity of LRMs to perform agentic, multi-stage search, integrating constraints over entities (merchants), attributes (hours, ratings, price, location), and events (promotions, temporal dynamics).
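
A minimal way to picture this tuple in code; attribute names and types below are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class LocalSearchBench:
    """B = (D, Q, M); attribute names here are illustrative, not from the paper."""
    merchants: list[dict]                  # D: 150,031 anonymized merchant profiles
    queries: list[dict]                    # Q: 300 multi-hop user queries
    evaluate: Callable[[dict, str], dict]  # M: LocalPlayground, judging a query/answer pair
```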

The objectives are:

  • To supply a realistic, geographically and semantically diverse source of local life service data.
  • To impose evaluation via real-world, ambiguous, and multi-step queries mimicking complex user intents.
  • To enable systematic assessment using agent workflows integrated with domain-specific tools and reporting of correctness, completeness, and faithfulness.

2. Dataset Composition and Characteristics

The merchant dataset $\mathcal{D}$ covers three major Chinese cities (Shanghai: 52,314; Beijing: 48,007; Guangzhou: 49,710), spanning six life service categories with proportional representation (Dining: 35%, Lifestyle: 25%, Shopping: 20%, Accommodation: 10%, Healthcare: 5%, Tourism: 5%). Each merchant profile includes 29 anonymized and augmented fields (up from an initial 12): expansion was performed via LLM-based synthesis, and anonymization used a privacy rewriting agent. The database spans 859 landmarks and 43 city districts.

Quality was validated both by LLM judgment and human assessment, resulting in augmentation scores of 0.8596 and privacy rewriting scores of 0.9217 (scale: 0–1). This ensures high data fidelity and privacy compliance throughout the benchmark.
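
For concreteness, a single merchant record might resemble the sketch below. The source does not reproduce the full 29-field schema, so the field names here are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class MerchantProfile:
    """Illustrative subset of the 29 anonymized, augmented fields (hypothetical names)."""
    merchant_id: str
    name: str                 # anonymized by the privacy rewriting agent
    city: str                 # "Shanghai", "Beijing", or "Guangzhou"
    district: str             # one of 43 city districts
    nearest_landmark: str     # one of 859 landmarks
    category: str             # Dining / Lifestyle / Shopping / Accommodation / Healthcare / Tourism
    rating: float
    opening_hours: str
    avg_price_cny: float
    promotions: list[str] = field(default_factory=list)
```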

3. Multi-Hop Query and Answer Pipeline

The set Q\mathcal{Q} was constructed via a multistage process:

  • Seed Collection: 1,200 single-hop user questions were sampled (90% from logs, 10% manual), stratified into intelligence levels L1–L5. LocalSearchBench emphasizes L3 (“Composite multi-turn”) and L4 (“Personalized planning”).
  • Question Instantiation: For each city, ~100 queries were grounded using specific landmarks, pricing, and local events. Annotators designed each instance to require 3–5 reasoning hops involving cross-merchant comparison, spatiotemporal planning, event bundling, and multi-constraint resolution.
  • Answer Collection: A two-stage pipeline in which candidates are retrieved with LocalRAG (a semantic/geographic retriever), followed by LLM generation (GPT-5, Claude-4.1) and expert revision to ensure factual grounding and clarity.
  • Validation: Triple-expert revision, with filtering to enforce answerability from $\mathcal{D}$ and balanced coverage.

A prototypical 3-hop question: “Find a restaurant near the Forbidden City open after 9 PM with a rating ≥ 4.5, then identify a dessert shop within 500 m that offers dine-in and costs ≤ ¥50, and finally recommend a tea house on the same street with a loyalty discount.”
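
Such a question can be read as a chain of per-hop constraints that the agent must satisfy sequentially. The sketch below shows that decomposition using an assumed structure, not the benchmark's actual query format:

```python
# Hypothetical decomposition of the 3-hop example above.
query = {
    "level": "L3",
    "hops": [
        {"entity": "restaurant",
         "constraints": {"near": "Forbidden City", "open_after": "21:00", "rating_min": 4.5}},
        {"entity": "dessert shop",
         "constraints": {"within_m_of_hop": (1, 500), "dine_in": True, "max_price_cny": 50}},
        {"entity": "tea house",
         "constraints": {"same_street_as_hop": 2, "has_loyalty_discount": True}},
    ],
}
```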

The final query set comprises 70% L3 and 30% L4 queries, with 45% 3-hop, 35% 4-hop, and 20% 5-hop questions.

4. LocalPlayground Evaluation Framework

LocalPlayground is the agentic environment for LocalSearchBench. It provides:

  • Tool interfaces:
    • LocalRAG: Dense retrieval (cosine similarity, reranking) over merchant data; a minimal retrieval sketch follows this list.
    • Web Search: Live Baidu Search API for facts such as weather, news, events.
  • Agent workflow: Each LRM processes up to $N = 5$ rounds: observes the query and prior context, plans actions (<web_search>, <rag_search>), executes at most one retrieval call of each type per round, and aggregates evidence for answer generation.
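
The following numpy sketch illustrates the dense-retrieval step as described: cosine-similarity recall over embedded merchant profiles followed by reranking. The embedding model and reranker are unspecified in the source, so `rerank_fn` is a hypothetical stand-in:

```python
import numpy as np

def rag_search(query_vec, merchant_vecs, merchants, rerank_fn, top_k=20, final_k=5):
    """Dense retrieval over merchant profiles: cosine-similarity recall,
    then a reranking step to select the final candidates."""
    # Cosine similarity reduces to a dot product after L2-normalization.
    q = query_vec / np.linalg.norm(query_vec)
    m = merchant_vecs / np.linalg.norm(merchant_vecs, axis=1, keepdims=True)
    sims = m @ q
    candidate_idx = np.argsort(-sims)[:top_k]
    candidates = [merchants[i] for i in candidate_idx]
    scores = np.asarray(rerank_fn(query_vec, candidates))  # hypothetical reranker
    keep = np.argsort(-scores)[:final_k]
    return [candidates[i] for i in keep]
```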

The interaction protocol follows a recurrent planning–retrieval–reflection loop, with agents required to chain together reasoning steps, integrate multi-source evidence, and handle ambiguous or incomplete information paths.

  • Validation: An independent LLM judge (Claude-Sonnet-4) evaluates answers against the reference, scoring seven dimensions from which the three main metrics (correctness, completeness, and faithfulness) are computed.
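
A compact sketch of the planning–retrieval–reflection loop described above, assuming hypothetical `lrm.plan` / `lrm.answer` interfaces and a `tools` mapping for the two retrieval types:

```python
def run_agent(query, lrm, tools, max_rounds=5):
    """Up to five rounds of plan -> retrieve -> reflect, then answer.
    `lrm` and `tools` are assumed interfaces, not the paper's actual API."""
    context = [{"role": "user", "content": query}]
    for _ in range(max_rounds):
        plan = lrm.plan(context)                 # decide which tools (if any) to call
        if plan.get("final_answer_ready"):
            break
        # At most one retrieval call of each type per round.
        for tool_name in ("web_search", "rag_search"):
            call = plan.get(tool_name)
            if call is not None:
                evidence = tools[tool_name](call)
                context.append({"role": "tool", "tool": tool_name, "content": evidence})
    return lrm.answer(context)                   # aggregate evidence into the final answer
```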

5. Evaluation Metrics and Efficiency Reporting

For $M = 300$ benchmark questions, the scores are defined as:

  • Correctness: $\frac{1}{M} \sum_{j=1}^{M} c_j$, where $c_j \in \{0, 1\}$
  • Completeness: $\frac{1}{10M} \sum_{j=1}^{M} \alpha_j$, where $\alpha_j \in [0, 10]$
  • Faithfulness: $\frac{1}{10M} \sum_{j=1}^{M} \phi_j$, where $\phi_j \in [0, 10]$
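
These aggregations are straightforward to compute from per-question judge outputs; the sketch below assumes hypothetical key names for $c_j$, $\alpha_j$, and $\phi_j$:

```python
def aggregate_metrics(judgments: list[dict]) -> dict:
    """Aggregate per-question judge scores into the three benchmark metrics.
    Each judgment is assumed to carry c in {0, 1} and alpha, phi in [0, 10]."""
    M = len(judgments)  # 300 benchmark questions
    return {
        "correctness":  sum(j["c"] for j in judgments) / M,
        "completeness": sum(j["alpha"] for j in judgments) / (10 * M),
        "faithfulness": sum(j["phi"] for j in judgments) / (10 * M),
    }
```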

Average tool calls and average rounds per question are tracked to quantify agentic search efficiency.

| Model | Avg. Tool Calls | Avg. Rounds | Correctness | Completeness | Faithfulness |
|---|---|---|---|---|---|
| DeepSeek-V3.1 | 3.43 | 4.02 | 34.34% | 80.00% | 60.80% |
| GLM-4.5 | 2.73 | 3.66 | 33.78% | 76.76% | 73.12% |
| LongCat-L (32K) | 2.73 | 3.22 | 33.19% | 80.51% | 60.80% |
| Qwen-Plus | 2.59 | 3.12 | 32.79% | 80.94% | 68.68% |
| Gemini-2.5-Pro | 1.89 | 2.86 | 26.09% | 77.93% | 78.26% |
| Average | 2.43 | 3.13 | 29.95% | 77.33% | 61.99% |

Notable findings: enabling Web Search boosts correctness (+4.37 pp) and completeness (+3.95 pp), but at the cost of diminished faithfulness (–3.64 pp). Closed-source LRMs outperform open-source variants in completeness while using fewer tool calls.

Error analysis reveals failure modes centered on constraint handling, missed retrievals, hallucinations from web search, and incomplete step chaining.

6. Core Challenges and Proposed Research Directions

LocalSearchBench exposes major challenges, including the integration of location, temporal, and promotional constraints, the management of noisy multi-source evidence (especially from web search), and the limited domain adaptation of LRMs trained on general corpora.

Proposed trajectories for advancing agentic search in this context include:

  • Domain-specific pretraining of retrieval and reasoning modules on local life corpora.
  • Joint optimization of retrieval and reasoning stages to mitigate hallucination.
  • Simulation of dynamic, real-time, and cross-platform planning tasks (towards AGI-level evaluation).
  • Multi-modal integration, incorporating structured and unstructured data (maps, menus, imagery).
  • Reinforcement learning for adaptive query refinement and self-reflection loops (He et al., 8 Dec 2025).

This specialization is necessitated by the inadequacy of general agentic search benchmarks, which typically fail to model vertical domain complexity, multi-hop planning, and the multi-constraint, multi-entity semantics that define local life services.

The design of LocalSearchBench draws on foundational principles identified in benchmarking search algorithms for NLP adversarial generation (Yoo et al., 2020), emphasizing the modular isolation of search primitives: algorithm, transformation, constraint, and query budget. In adversarial settings, fair comparisons require a constant search space when evaluating algorithms, strict budget enforcement, and reproducibility through fixed random seeds and detailed per-step logging.

Although application domains differ, the guidance for robust benchmark design—in particular, isolating evaluation axes, facilitating extensibility to novel search paradigms (e.g., MCTS, submodular greedy), and supporting multi-modality—translates directly. This modularity enables researchers to analyze trade-offs in agentic search efficiency, accuracy, and semantic fidelity, and systematically compare agent architectures under controlled, reproducible protocols (Yoo et al., 2020).

A plausible implication is that, as with NLP adversarial search, future LocalSearchBench variants may benefit from integrating alternative reasoning and retrieval pipelines, variable budget constraints, and extended task formats (e.g., cross-platform, real-time simulation), further enabling fine-grained empirical analysis of agentic search performance and domain transferability.


References:

  • "LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services" (He et al., 8 Dec 2025)
  • "Searching for a Search Method: Benchmarking Search Algorithms for Generating NLP Adversarial Examples" (Yoo et al., 2020)