Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?

Published 11 May 2026 in cs.IR, cs.AI, and cs.CL | (2605.10848v1)

Abstract: Does a lexical retriever suffice as LLMs become more capable in an agentic loop? This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have better reasoning and tool-use abilities. To support researchers asking the same question, we introduce Pi-Serini, a search agent equipped with three tools for retrieving, browsing, and reading documents. Our results show that, on BrowseComp-Plus, a well-configured lexical retriever with sufficient retrieval depth can support effective deep research when paired with more capable LLMs. Specifically, Pi-Serini with gpt-5.5 achieves 83.1% answer accuracy and 94.7% surfaced evidence recall, outperforming released search agents that use dense retrievers. Controlled ablations further show that BM25 tuning improves answer accuracy by 18.0% and surfaced evidence recall by 11.1% over the default BM25 setting, while increasing retrieval depth further improves surfaced evidence recall by 25.3% over the shallow-retrieval setting. Source code is available at https://github.com/justram/pi-serini.

Summary

  • The paper demonstrates that a well-tuned BM25 retriever paired with advanced LLM agents can outperform dense retriever systems in answer accuracy and cost efficiency.
  • It introduces PI-SERINI, a minimal agentic deep-research system that decouples retriever configuration from LLM interaction by exposing granular tools for evidence retrieval and navigation.
  • Empirical results show that tuning BM25 parameters and increasing retrieval depth yield significant improvements in evidence recall, shifting focus to agentic evidence navigation.

Rethinking Agentic Search: Evaluating Lexical Retrieval with PI-SERINI

Problem Statement and Motivation

The paper, "Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?" (2605.10848), revisits the role of lexical retrieval in deep research systems increasingly driven by frontier LLMs operating in an agentic loop. The prevailing consensus in IR/RAG treats the retriever as a hard ceiling on system effectiveness, which has motivated dense and neural retrievers aimed at improving evidence recall for complex queries. However, as LLMs substantially improve in reasoning and tool use, enabling sophisticated multi-step search, inspection, and evidence-synthesis pipelines, the marginal utility of further optimizing the retriever component itself warrants reexamination.

System Architecture: PI-SERINI

PI-SERINI is introduced as a deliberately minimal agentic deep-research system, designed to separate the effects of retriever configuration, surface area (retrieval depth), and agentic interaction on end-to-end task accuracy. The system exposes three granular tools: search (BM25-backed, up to 1000 docs, with cached rankings), read_search_results (ranked excerpt pagination), and read_document (line-based doc chunks). Together they let the LLM explicitly retrieve, browse, and consume evidence. Distinct logging of surfaced, previewed, opened, and cited doc tiers provides precise behavioral attribution. Agentic execution is subject to wall-clock timeouts with internal steer mechanisms to ensure cost-effective and consistent evaluation aligned with user-facing latency/compute constraints.
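The three-tool split can be sketched in a few lines of Python. This is a toy illustration, not Pi-Serini's code: the class name `SearchSession`, its fields, the page and excerpt sizes, and the term-overlap scorer (a stand-in for BM25) are all assumptions made for the example. Only the tool names and the surfaced/previewed/opened tiers come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class SearchSession:
    """Toy stand-in for the three-tool interface (hypothetical, for illustration)."""
    corpus: dict[str, str]                                         # doc_id -> full text
    rankings: dict[str, list[str]] = field(default_factory=dict)   # cached rankings
    surfaced: set[str] = field(default_factory=set)
    previewed: set[str] = field(default_factory=set)
    opened: set[str] = field(default_factory=set)

    def search(self, query: str, k: int = 1000) -> str:
        """Rank by naive term overlap (placeholder for BM25) and cache the ranking."""
        terms = set(query.lower().split())
        scored = [
            (sum(w in terms for w in text.lower().split()), doc_id)
            for doc_id, text in self.corpus.items()
        ]
        ordered = sorted(scored, key=lambda s: (-s[0], s[1]))
        ranked = [doc_id for score, doc_id in ordered if score > 0][:k]
        search_id = f"s{len(self.rankings)}"
        self.rankings[search_id] = ranked
        self.surfaced.update(ranked)          # tier 1: present in a ranking
        return search_id

    def read_search_results(self, search_id: str, page: int = 0,
                            page_size: int = 2, excerpt_chars: int = 40):
        """Page through a cached ranking without issuing a new backend query."""
        window = self.rankings[search_id][page * page_size:(page + 1) * page_size]
        self.previewed.update(window)         # tier 2: excerpt shown to the agent
        return [(doc_id, self.corpus[doc_id][:excerpt_chars]) for doc_id in window]

    def read_document(self, doc_id: str, start_line: int = 0, num_lines: int = 2):
        """Return a line-based chunk of one document."""
        self.opened.add(doc_id)               # tier 3: document actually opened
        return self.corpus[doc_id].splitlines()[start_line:start_line + num_lines]
```

The point of the design is that a deep ranking costs the agent nothing in context until it chooses to page through excerpts or open a document, and each choice leaves a trace in the corresponding tier.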

The retriever is ANSERINI BM25, tuned for long-document retrieval, mitigating the limitations of prior baselines using default, short-passage-oriented parameters. The system is evaluated across a spectrum of competitive LLMs (GPT-5.* series, DeepSeek, Claude) to decouple agent/retriever interactions from model-specific limitations.

Experimental Evaluation and Results

On the deep research benchmark BrowseComp-Plus (830 queries, 100k docs), PI-SERINI (BM25 + gpt-5.5) achieves 83.1% answer accuracy and 94.7% surfaced evidence recall, outperforming dense retriever-based agents and substantially exceeding previously reported BM25 baselines, which are shown (via ablation) to be confounded by shallow retrieval depth and suboptimal parameterization rather than intrinsic lexical retrieval limitations.

BM25 tuning (e.g., k1=25, b=1) improves answer accuracy by 18.0% and surfaced recall by 11.1% over default settings; increasing retrieval depth from shallow (k=5) to deep (k=1000) yields a further 25.3% improvement in surfaced recall, confirming that prior lexical retrieval benchmarks significantly understate ceiling performance due to configuration artifacts.
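To see what k1 and b control, the sketch below implements the textbook Okapi BM25 per-term score. It is an illustration only: the exact formula inside Anserini/Lucene may differ in constants, the function names are made up for this example, and the "default" values k1=0.9, b=0.4 are an assumption about a common default rather than a figure taken from the paper.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, df, num_docs, k1, b):
    """Textbook BM25 contribution of one query term to one document's score."""
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # b controls how strongly document length rescales the saturation point.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding score.
    return idf * tf * (k1 + 1) / (tf + norm)


def tf_gain(k1, b, tf_high=20, tf_low=1):
    """How much more a term seen tf_high times is worth than one seen tf_low times,
    for an average-length document (corpus statistics are arbitrary placeholders)."""
    args = dict(doc_len=1000, avg_doc_len=1000, df=10, num_docs=1000, k1=k1, b=b)
    return bm25_term_score(tf_high, **args) / bm25_term_score(tf_low, **args)


if __name__ == "__main__":
    print(f"assumed default (k1=0.9, b=0.4): {tf_gain(0.9, 0.4):.2f}x")
    print(f"long-doc tuning (k1=25,  b=1.0): {tf_gain(25, 1.0):.2f}x")
```

With a small k1, a term repeated 20 times counts for less than twice a single mention. With k1=25, repeated mentions keep adding score, which is one plausible reason a large k1 helps when the evidence sits inside very long documents.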

Comparative analysis with dense retriever agents demonstrates that, when paired with capable LLMs, a well-configured BM25 retriever not only matches but often exceeds dense alternatives in both answer accuracy and evidence recall while simultaneously reducing evaluation cost by factors of 3.3x to 10x. For instance, PI-SERINI (gpt-5.5+BM25) runs are both more accurate and cheaper than dense retriever/gpt-5.2 pipelines, which can incur prohibitive costs due to high iteration/tool-call counts.

Behavioral Analysis and Ablations

Controlled ablation identifies critical factors driving BM25's resurgence: (1) long-document tuning, (2) large retrieval depth with explicit agent-controlled context management, and (3) prefix-caching in agentic loops for cost efficiency. Importantly, answer accuracy is not monotonically determined by surfaced recall; effective agentic evidence navigation is essential, as shown by the behavioral gap between the gold evidence that is exposed and the gold evidence the agent actually consumes or uses.
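The gap between exposed and consumed evidence can be measured by computing recall separately over each logged tier. The following is a minimal sketch, assuming each tier is available as a set of document IDs. The function names and the example IDs are hypothetical; the behavior-recall definition (opened united with cited) follows the glossary quoted from the paper.

```python
def recall(found: set[str], gold: set[str]) -> float:
    """Fraction of gold documents that appear in `found`."""
    if not gold:
        return 0.0
    return len(found & gold) / len(gold)


def tiered_recall(gold, surfaced, previewed, opened, cited):
    """Recall at each tier of the evidence log for one query."""
    return {
        "surfaced": recall(surfaced, gold),         # appeared in some ranking
        "previewed": recall(previewed, gold),       # excerpt was shown to the agent
        "behavior": recall(opened | cited, gold),   # opened or cited by the agent
    }


if __name__ == "__main__":
    # Made-up document IDs for a single query.
    print(tiered_recall(
        gold={"a", "b", "c", "d"},
        surfaced={"a", "b", "c", "x"},
        previewed={"a", "b"},
        opened={"a"},
        cited={"a", "y"},
    ))
```

In this invented example the retriever surfaces three of four gold documents, but the agent previews two and acts on only one, so most of the loss happens after retrieval.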

The error analysis of agent LLMs (e.g., gpt-5.5 versus claude-opus-4.7) reveals different navigation policies for managing evidence hypotheses: gpt-5.5 probes reversibly and returns to the original constraints when a branch proves unproductive, while some agents prematurely commit to weak hypotheses, compounding search errors. Such findings suggest the principal leverage point for improvement may be in the agent's context-allocation and evidence judgment strategy, not the retriever's semantic sophistication.

Implications and Future Directions

The results challenge the prevailing view that dense or neural retrievers are universally superior for deep research. Instead, well-configured lexical retrieval paired with advanced LLM agents suffices for state-of-the-art performance in the studied setting. This shifts the locus of necessary research investment toward agentic evidence navigation, context management, and adaptive querying strategies, rather than further retriever overparameterization.

Future directions include: (1) generalization to multilingual and domain-specialized corpora, (2) richer tool and interface design for more effective evidence disposition/allocation by agents, and (3) flexible, adaptive termination strategies to better match query complexity heterogeneity. Extension to open-domain, dynamic, or streaming corpora presents additional challenges and may partially reinstate the need for semantic matching, but the demonstrated efficacy of tuned BM25 sets a new baseline expectation for lexical retriever utility in agentic deep-RAG-style settings.

Conclusion

The paper empirically substantiates that the underperformance of BM25 in prior deep research benchmarks arises primarily from suboptimal tuning and shallow retrieval depth, not from structural limitations of lexical matching. In the era of highly capable LLMs, PI-SERINI demonstrates that lexical retrieval can be sufficient for deep research tasks under proper agentic orchestration, substantially reducing evaluation/serving cost and increasing interpretability. The bottleneck has shifted from retrieval to evidence navigation, establishing new directions for agent design and evaluation in information-seeking systems.

Explain it Like I'm 14

What is this paper about?

This paper asks a simple but important question: When today's AI chatbots get better at thinking and using tools, do we still need fancy, complicated search systems, or can a good "keyword search" be enough? To explore this, the authors build a search helper called Pi-Serini and test whether a classic keyword-based method (BM25) can support deep, step-by-step research when paired with strong AI models.

What questions were they trying to answer?

  • If an AI can search, read, and think in steps, is a simple keyword search (called "lexical retrieval") good enough to find the right information?
  • Do we need expensive, meaning-based search systems ("dense retrievers"), or can we tune and use keyword search correctly to get similar or better results?
  • How can we make deep research both accurate and cost-effective (not too expensive or slow)?

How did they test it?

Pi-Serini: a simple search helper for AIs

Think of Pi-Serini like a smart research assistant that gives an AI three clear tools, so it acts more like a careful detective:

  • search: run a keyword search and save the ranked list of results for later.
  • read_search_results: browse the saved list in pages, looking at short snippets to decide what's promising.
  • read_document: open a specific document and read only the parts that seem relevant.

This setup separates three steps (finding, skimming, and reading) so the AI doesn't stuff everything into its short memory at once. It helps the AI choose what to read and avoid wasting space and time.

Time budgeting

The AI has a fixed time per question (for example, 5 minutes). Once 70% of that time has passed, it's told to stop searching and write the best answer it can with the evidence it has. This keeps the system realistic and affordable.
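The time-budget rule is simple enough to write down directly. This is a sketch under stated assumptions: the function name and the three action labels are invented for illustration, and the 300-second budget with a steer at 70% mirrors the example figures in this summary rather than Pi-Serini's actual controller code.

```python
def budget_action(elapsed_s: float, budget_s: float = 300.0,
                  steer_fraction: float = 0.7) -> str:
    """Decide what the controller should tell the agent at this point in time."""
    if elapsed_s >= budget_s:
        return "timeout"            # hard stop: the wall-clock budget is used up
    if elapsed_s >= steer_fraction * budget_s:
        return "steer_to_answer"    # stop searching, answer with current evidence
    return "continue"               # keep searching, browsing, and reading


if __name__ == "__main__":
    for t in (100, 240, 300):
        print(t, budget_action(t))
```

Leaving the last 30% of the budget for writing the answer means a hard question ends with a best-effort answer instead of a timeout with nothing to show.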

The test set and measurements

They use a standard deep research benchmark called BrowseComp-Plus. It has hundreds of questions and a large collection of long documents. The AI must find the right evidence and produce the correct final answer.

They measure:

  • Accuracy: How often the final answer is judged correct.
  • Recall: How often the system surfaces the needed evidence in search results and how often the AI actually previews or reads it.
  • Cost: How much money it takes to run the full test.
  • Tool usage: How many tool calls the AI makes (searches, browses, reads).

They also run "ablation" tests. That means they change one thing at a time (like how many results to fetch, or how to tune BM25) to see what really matters.

What did they find?

  • A tuned keyword search worked very well when paired with a strong AI.
    • With Pi-Serini and a strong model (GPT-5.5), the system reached about 83% accuracy and about 95% evidence recall. That means it answered correctly most of the time and usually surfaced the right documents.
    • It outperformed systems that used fancy meaning-based search ("dense retrievers") in this benchmark.
  • Tuning BM25 mattered a lot.
    • Changing BM25's settings to better handle long documents boosted accuracy by about 18 percentage points and improved evidence recall by about 11 points compared to the default setup.
  • Looking deeper into the search results helped recall.
    • Increasing how many ranked results the system retrieved improved surfaced evidence recall by about 25 points compared to shallow search.
  • Cost was much lower.
    • Pi-Serini made full benchmark runs about 3 to 10 times cheaper than some dense-retriever agents, thanks to:
      • Smart time budgeting (don't over-search).
      • "Prefix caching" (reusing repeated text to cut token costs).
      • Better search that avoids wasted steps.
  • Different AI models behave differently.
    • Even strong models can chase the wrong lead. GPT-5.5 was better at backing out when a hypothesis looked weak, while another model, Claude Opus, sometimes kept digging into a wrong guess. The better behavior led to better accuracy.

Why is this important?

  • It challenges the idea that we must always improve complex search components first. When the AI is good at reasoning and tool use, a well-configured keyword search can be enough for deep research in many cases.
  • It makes deep research more practical. Good performance at much lower cost means researchers can run more tests, try more ideas, and improve systems faster.
  • It shows where to focus next: teaching the AI to pick, browse, and read the right evidence from a large, saved list is now a key step.

What does this mean for the future?

  • Don't overlook simple tools. Classic keyword search (BM25), when tuned and used deeply, can be powerful with modern AIs.
  • Build better tool interfaces. Separating "find, skim, read" helps the AI handle long documents and limited memory.
  • Spend wisely. Time budgets and caching keep costs low. This makes large experiments and real-world use more affordable.
  • Keep improving AI behavior. The AI still needs guidance to avoid "chasing the wrong clue." Better strategies for browsing and verifying evidence will push accuracy even higher.

Key terms explained

  • LLM: A powerful AI that can read, write, and reason with text.
  • Agentic loop: The AI works in steps (think, use a tool, look at the result, think again), like a detective gathering clues.
  • Retriever: A search system that finds relevant documents.
    • Lexical retrieval: Finds documents using keywords (matching the exact words).
    • Dense retriever: Finds documents using meaning (embeddings that capture similarity in meaning).
  • BM25: A popular keyword search method. It scores documents by how well they match the query, adjusting for document length and repeated words.
  • Retrieval depth: How far down the ranked list you look (top-5 vs. top-1000).
  • Recall: Of all the documents you needed, how many did you find or preview?
  • Prefix caching: Reusing repeated parts of prompts or context so you pay less for tokens sent to the AI.
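To see why prefix caching matters in an agentic loop, here is a small arithmetic sketch. Everything in it is an assumption made for illustration: the function name, the token counts, the unit price, and the 10% cached-token rate are invented, and it ignores output tokens and provider-specific cache rules.

```python
def loop_input_cost(turn_tokens, price_per_token, cached_rate=None):
    """Total input cost when every turn resends the whole history.

    cached_rate: fraction of the normal price charged for tokens that were
    already sent in an earlier turn; None means no caching at all.
    """
    total = 0.0
    history = 0
    for new_tokens in turn_tokens:
        if cached_rate is None:
            # Without caching, the full history is billed again on every turn.
            total += (history + new_tokens) * price_per_token
        else:
            # With caching, only the new tokens pay full price.
            total += history * price_per_token * cached_rate
            total += new_tokens * price_per_token
        history += new_tokens
    return total


if __name__ == "__main__":
    turns = [1000, 1000, 1000]      # hypothetical new tokens added per turn
    print(loop_input_cost(turns, price_per_token=1.0))
    print(loop_input_cost(turns, price_per_token=1.0, cached_rate=0.1))
```

Without caching, cost grows roughly with the square of the number of turns, because the history is paid for again each time. That is why a loop with many tool calls benefits so much from a stable, cacheable prompt prefix.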

Knowledge Gaps

Below is a consolidated list of concrete knowledge gaps, limitations, and open questions that remain unresolved and could guide follow-up research:

  • External validity beyond BrowseComp-Plus: Does PI-SERINI's BM25-centric approach generalize to open-web, dynamic, or much larger corpora (10^6 to 10^9 docs), where retrieval depth, latency, and index size become more constraining?
  • Domain and language generalization: How do results transfer to multilingual, low-resource, and domain-specific settings (e.g., biomedical, legal) where lexical mismatch is more severe?
  • Scalability of "depth-1000" at web scale: What retrieval depths are feasible on very large indexes without prohibitive latency/cost, and how does surfaced/previewed/behavior recall degrade with scale?
  • Fair interface-controlled comparisons: How do dense and hybrid retrievers perform when paired with the same PI-SERINI tool interface (search/read/browse, cached rankings, pagination) and time-budget policy, controlling for confounds present in released baselines?
  • Hybrid retrieval and reranking: What are the gains from adding (1) dense first-stage, (2) sparse-dense hybrids (e.g., SPLADE/uniCOIL), or (3) cross-encoder rerankers atop tuned BM25 within the same agentic loop?
  • Passage- vs document-level indexing: Does switching from whole-document to passage-level indexing improve previewed/behavior recall and reduce unnecessary reading calls on long, noisy documents?
  • Query expansion strategies: How do classical expansions (e.g., RM3, Rocchio), pseudo-relevance feedback, or agent-learned term selection affect surfaced/previewed/behavior recall and final accuracy?
  • Structured lexical querying: Would allowing limited Lucene syntax, fielded queries, phrase/proximity operators, or synonyms/lemmatization/stemming tuning materially improve retrieval for long documents?
  • Reading granularity and chunking: Are line-based pagination and fixed defaults optimal? Would semantic chunking, section detection, or table/figure-aware readers increase behavior recall and reduce tool calls?
  • Evidence navigation and ranking exploration: How can the large surfaced-previewed gap be narrowed (e.g., result clustering/diversification, learning-to-browse policies, "next-best-evidence" suggestions)?
  • Adaptive time budgeting: Can per-query difficulty estimation and adaptive termination (rather than a fixed 0.7T steer) deliver better accuracy-latency-cost trade-offs?
  • Confidence calibration and stopping: How should agents calibrate confidence to decide when to stop searching or invest additional retrieval/reading, and how does this interact with accuracy and calibration error?
  • Judge reliability and metric robustness: How sensitive are accuracy labels to the chosen LLM judge? What is inter-judge agreement, and how do results change with alternative judging prompts/models or human adjudication?
  • Variance and reproducibility: What is the run-to-run variability across random seeds/temperatures, and how stable are gains from BM25 tuning and retrieval depth across repeated trials?
  • Cost portability and deployment constraints: How do results change under different pricing models (no prefix cache, on-prem inference), hardware, and latency SLAs typical of production search systems?
  • Corpus drift and freshness: How robust is the approach under frequent index updates, temporal drift, and newly emerging entities, especially when lexical priors lag reality?
  • Adversarial/noisy corpora: How resilient are agents to misleading content, noise, and near-duplicate saturation, and can retrieval or browsing policies be hardened accordingly?
  • Attribution fidelity and citation quality: Beyond document-level recall, how often are citations faithful and minimal (no over-citation), and do cited spans actually support the claimed answer?
  • Query-type stratification: For which question categories (multi-hop, fuzzy/semantic, entity disambiguation, list aggregation) does lexical retrieval remain sufficient vs where dense/hybrid methods are necessary?
  • Learning the search policy: Can agents be trained (via RL or offline imitation) to issue better lexical queries, browse cached rankings more effectively, and adopt reversible probing to avoid premature branch commitment?
  • BM25 tuning robustness: Do k1/b settings tuned on a 100-query subset overfit? How do the optimal parameters shift across domains, document-length distributions, or after index refreshes?
  • Pagination defaults and UI affordances: How sensitive are previewed/behavior recalls to default page sizes, snippet lengths, and ordering; would UI-like affordances (e.g., "jump to sections", quick summaries) help?
  • Interaction with longer context windows: How do larger windows (or retrieval-augmented summarization) change the optimal balance between retrieval depth, browsing, and reading?
  • "To retrieve or not" under a unified framework: In a controlled setup, when does file-system-style local navigation outperform retrieval, and can a single agent switch between the two regimes based on diagnosable conditions?
  • Ethical and environmental considerations: What are the carbon and energy implications of deeper retrieval, larger LLMs, and prefix caching, and can cost/energy-aware policies be learned without hurting accuracy?

Practical Applications

Immediate Applications

The following applications can be deployed now by adapting the paper's findings and Pi-Serini's design patterns to real systems.

  • Enterprise RAG upgrade: high-recall evidence retrieval with cost control
    • Sectors: software, enterprise search, legal, finance, healthcare
    • What to do now:
    • Replace "retrieve top-k and dump into context" with a three-tool flow: search(k≈1000) → read_search_results (paginate excerpts) → read_document (paginate lines)
    • Tune BM25 for long documents (e.g., k1≈16–25, b≈1) and index all relevant corpora (Anserini)
    • Add time-budget steering (e.g., 300s with submission steer at 0.7T) and enable provider prefix caching
    • Tools/products/workflows: Retrieval Controller microservice; "Browse-then-Read" agent plugin for LangChain/LlamaIndex; BM25 index pack for long-doc corpora
    • Assumptions/dependencies: Access to an LLM with strong tool-use (frontier class); provider supports prefix caching; corpora are indexable and permissions-cleared; BM25 parameters must be tuned per corpus/domain
  • Cost-managed AI research assistants for analysts and journalists
    • Sectors: media, consulting, finance, academia
    • What to do now: Deploy Pi-Serini-style agents that operate under time budgets and cache-friendly loops to bound expense while surfacing more evidence before answering
    • Tools/products/workflows: "Research Mode" in knowledge platforms; evidence preview pane with pagination; cost dashboard tied to time budgets and token cache hit-rate
    • Assumptions/dependencies: Stable provider pricing and cache economics; acceptable latency with deeper retrieval; curated or well-scoped corpora
  • Evidence provenance auditing and compliance logging
    • Sectors: healthcare (literature reviews), legal (e-discovery), finance (compliance), public policy
    • What to do now: Use the four-tier evidence log (surfaced/previewed/opened/cited) to audit how answers were formed and to support "show your work" requirements
    • Tools/products/workflows: Evidence Log schema; auditor dashboards; per-answer citation bundles
    • Assumptions/dependencies: Storage and governance for logs; human review processes; clear policy on citation sufficiency; sensitive-data handling
  • Developer documentation and support search assistants
    • Sectors: software/SaaS
    • What to do now: Index long-form docs and tickets; adopt browse/read pagination to keep tokens low while increasing recall; add time-budget steering to cap support costs
    • Tools/products/workflows: IDE/ChatOps bot with "Search→Preview→Open" flow; tuned BM25 index for developer docs; prefix-cache-aware prompts
    • Assumptions/dependencies: Up-to-date doc indexing; content chunking strategy for line-based reads; reliable auth to private repos
  • Library/education research helpers
    • Sectors: education, libraries
    • What to do now: Integrate a Pi-Serini-style agent into library portals to teach students "search, preview, then read" with explicit citations
    • Tools/products/workflows: LMS/library plugin with rank browsing and reading excerpts; educator-facing analytics on evidence coverage
    • Assumptions/dependencies: Licensed access to collections; alignment with academic integrity policies; oversight for answer quality
  • Procurement and benchmarking for public-sector AI systems
    • Sectors: policy, government IT
    • What to do now: Use Pi-Serini's metrics (accuracy vs. surfaced/previewed/behavior recall, tool-call counts, cost) to evaluate vendors and set cost/quality SLAs
    • Tools/products/workflows: Standardized test corpora; retrieval-depth and BM25-parameter compliance checklists; mandated time budgets
    • Assumptions/dependencies: Representative benchmarks; availability of a neutral LLM judge or human adjudication; procurement frameworks that allow tool-level requirements
  • Knowledge-base support and deflection in customer service
    • Sectors: customer support, telecom, retail, SaaS
    • What to do now: Tune BM25 for long KB articles; paginate previews to triage relevant answers; enforce time budgets to limit escalation costs
    • Tools/products/workflows: Support agent plugin with evidence previews; auto-citation in responses; routing when previewed recall is low
    • Assumptions/dependencies: Regular index refresh; multilingual considerations if KB is mixed language; escalation policies

Long-Term Applications

These opportunities will benefit from further research, scaling, or productization beyond what the paper directly demonstrates.

  • Web-scale agentic search with lexical-first retrieval
    • Sectors: consumer search, enterprise web monitoring
    • Future direction: Extend the retrieval controller to the open web (crawling/API federation), keeping cached rankings and browse/read decisions; add query reformulation and source quality filters
    • Potential products: "Agentic Web Research" browser/extension; enterprise web-intelligence monitors
    • Dependencies: Robust, compliant web access; deduplication and freshness; multilingual indexing; anti-scrape and ToS constraints
  • Auto-tuning retriever/controller for long documents and tasks
    • Sectors: software, MLOps, platform teams
    • Future direction: Automated selection of BM25 (k1,b), retrieval depth k, excerpt/line sizes, and submission-steer timing, driven by telemetry and small labeled sets
    • Potential products: "Retriever Auto-Tuner" service; adaptive retrieval-depth scheduler
    • Dependencies: Telemetry collection; offline evaluation labels or weak supervision; safe exploration policies
  • Safety, audit, and regulatory standards for agentic search
    • Sectors: public policy, healthcare, finance, legal
    • Future direction: Standardize evidence logging (surfaced/previewed/opened/cited) and calibration metrics as audit artifacts; certification regimes for agentic answers
    • Potential products: Audit toolkit; compliance reports; third-party attestations
    • Dependencies: Cross-vendor adoption; data retention and privacy frameworks; legal clarity on provenance requirements
  • Domain-grade assistants (clinical, legal, and compliance research)
    • Sectors: healthcare, legal, finance
    • Future direction: Combine tuned lexical retrieval with domain ontologies, de-identification, and human-in-the-loop review; explore when lexical suffices vs. hybrid/dense reranking
    • Potential products: Clinical guideline scanner; e-discovery triage agent; regulatory change monitor
    • Dependencies: Regulatory approval and risk management; high-quality domain corpora; strict provenance and human oversight
  • On-prem/edge private research agents using lexical retrieval
    • Sectors: defense, highly regulated industries, SMEs with privacy constraints
    • Future direction: Deploy Pi-Serini-like stacks on-prem with local LLMs and Anserini; leverage caching and lexical retrieval to minimize compute costs
    • Potential products: "Private Research Appliance" with retrieval controller and audit logs
    • Dependencies: Sufficient local compute; secure indexing pipelines; private model/tooling support
  • Human-agent collaborative UIs for evidence triage
    • Sectors: knowledge work across domains
    • Future direction: New interfaces that let users steer the cached ranking (expand/contract, tag, backtrack), addressing premature branch commitment and improving trust
    • Potential products: Evidence maps; interactive rank browsers; "branch management" panels
    • Dependencies: UX research; user training; integration with existing research workflows
  • Agent orchestration and cost-aware scheduling
    • Sectors: platform engineering, FinOps
    • Future direction: Controllers that dynamically switch LLMs, retrieval depth, and browsing aggressiveness under time/cost budgets; escalate only when needed
    • Potential products: Time-Budget Middleware; model-switching policy engine; cache-optimization layer
    • Dependencies: Multi-model access; reliable cost/latency signals; acceptance of graceful degradation
  • Multilingual and cross-modal retrieval extensions
    • Sectors: global enterprises, media, academic publishers
    • Future direction: Lexical retrieval enhanced with morphological analyzers and query translation; integrate OCR/ASR and image/table extraction with paginated read tools
    • Potential products: Multilingual research agent; cross-modal evidence reader
    • Dependencies: Language resources; high-quality OCR/ASR; indexing pipelines for non-text assets
  • Robust search policies to mitigate premature branch commitment
    • Sectors: all agentic systems relying on iterative search
    • Future direction: Meta-controllers that detect weak hypotheses, enforce reversible probes, and trigger backtracking; learnable search policies
    • Potential products: "Branch Manager" policy module; failure-mode detectors
    • Dependencies: Behavioral telemetry; training data for failure modes; evaluation benchmarks beyond fixed corpora
  • Hybrid retrieval stacks and plug-and-play evaluation
    • Sectors: IR research, industry labs
    • Future direction: Compose tuned BM25 with light rerankers or sparse+dense hybrids; use Pi-Serini's logging to measure marginal benefits under cost budgets
    • Potential products: Hybrid retrieval SDK; experiment harnesses and leaderboards
    • Dependencies: Additional compute for reranking; robust ablation protocols; open benchmarks

Cross-cutting assumptions and dependencies

  • Capable LLMs with strong tool-use are central to observed gains; results vary across model families and may degrade with smaller models.
  • Prefix caching materially affects cost; requires provider support and stable prompting.
  • BM25 parameters and retrieval depth must be tuned to the corpus (especially for long/noisy documents); defaults are often suboptimal.
  • BrowseComp-Plus is a fixed-corpus benchmark; real-world web or dynamic corpora introduce freshness, noise, and scale challenges.
  • Provenance, privacy, and licensing constraints may limit indexing and logging in regulated settings.
  • Automated LLM judging used in the paper should be complemented with human evaluation in high-stakes deployments.

Glossary

  • AgentIR: A reasoning-intensive retriever designed for deep research tasks. "Finally, we report the numbers of AgentIR~\cite{chen2026AgentIR}, a reasoning-intensive retriever trained for deep research."
  • Agentic loop: An iterative interaction pattern where an LLM reasons, takes actions, observes feedback, and updates its behavior. "these systems increasingly operate through an agentic loop, where LLMs receive feedback from their environments"
  • Anserini: An open-source IR toolkit providing BM25 and other retrieval capabilities. "Documents are indexed using BM25 via Anserini~\cite{10.1145/3239571} over the BrowseComp-Plus corpus."
  • Behavior Recall: A recall metric computed over the union of documents the agent opened or cited. "Behavior Recall: recall computed over the union of the document sets $D_{\text{opened}} \cup D_{\text{cited}}$."
  • BM25: A classic lexical ranking function for information retrieval based on term frequency and length normalization. "We verify the BM25 parameter settings and the retrieval depth to increase the likelihood that relevant documents remain in retrieved results"
  • BM25 tuning: The process of adjusting BM25 parameters (e.g., k1, b) to better suit a corpus or task. "BM25 tuning improves answer accuracy by 18.0\% and surfaced evidence recall by 11.1\% over the default BM25 setting"
  • BrowseComp-Plus: A fixed-corpus deep research benchmark used to evaluate search agents. "On BrowseComp-Plus~\cite{chen2025browsecompplusfairtransparentevaluation}, under time-budget steering, Pi-Serini with gpt-5.5 achieves 83.1\% answer accuracy"
  • Calibration Error: The discrepancy between a model's stated confidence and its empirical correctness. "Calibration Error: the discrepancy between the model's confidence and empirical correctness."
  • Dense retriever: A retrieval model that uses learned dense embeddings to match queries and documents semantically. "outperforming released search agents that use dense retrievers."
  • Evidence documents: Documents required to answer a query, not necessarily containing the exact final answer span. "Evidence documents are documents required to answer the query"
  • Gold documents: A stricter subset of evidence documents that semantically contain the final answer. "gold documents are a stricter subset that both support answering and semantically contain the final answer."
  • Interaction trajectory: The sequence of thoughts, actions, and observations an agent accumulates during its loop. "The agent operates over an interaction trajectory:"
  • Length normalization: A BM25 component that adjusts scores to account for document length, important for long documents. "making length normalization and term-frequency saturation matter more than in passage retrieval."
  • Lexical retriever: A retriever that matches queries to documents using exact or approximate term overlap rather than learned semantics. "Does a lexical retriever suffice as LLMs become more capable in an agentic loop?"
  • LLM judge: An LLM used to evaluate answer correctness by comparing a model's output with a gold answer. "For answer evaluation, we use an LLM judge."
  • Long-document evidence search: Retrieval over very long documents where relevant information is embedded in extensive text. "whereas tuned parameters better match long-document evidence search."
  • Multi-hop information needs: Queries requiring reasoning over multiple pieces of evidence connected through intermediate steps. "resolving multi-hop information needs"
  • Pareto frontier: The set of solutions that optimally trade off two objectives (e.g., accuracy and cost) where improving one worsens the other. "Pareto frontier"
  • Prefix caching: Caching repeated input prefixes in an LLM session to reduce cost and latency. "making prefix caching central to its cost efficiency."
  • Previewed Recall: A recall metric over documents whose excerpts were shown to the agent when browsing search results. "Previewed Recall: recall computed over the document set $D_{\text{previewed}}$;"
  • Rank pagination: Browsing a cached ranking in pages to inspect results without issuing new backend queries. "using rank pagination"
  • ReAct loop: An agent framework where reasoning (thought) and acting (tool use) alternate iteratively. "The LLM agent runs a ReAct loop"
  • Reasoning-aware retriever: A retriever that incorporates reasoning signals to handle complex queries. "Recent reasoning-aware retrievers further target complex queries by capturing implicit intent and resolving multi-hop information needs"
  • Retrieval-Augmented Generation (RAG): A paradigm where generation is grounded in documents retrieved from an external corpus. "information-seeking systems such as Retrieval-Augmented Generation (RAG)~\cite{10.5555/3495724.3496517}"
  • Retrieval controller: A component mediating all retrieval access and exposing constrained tool APIs to the agent. "a retrieval controller mediates all access to an Anserini BM25 backend."
  • Retrieval depth: How far down the ranked list a system retrieves (e.g., top-1000), affecting recall. "increasing retrieval depth further improves surfaced evidence recall by 25.3\% over the shallow-retrieval setting."
  • Reranker: A model that reorders initial retrieval results to improve ranking quality. "dense retrievers, sparse retrievers, and rerankers to improve early-stage ranking accuracy"
  • Reranking: The process of reordering retrieved documents, often using more expensive models or features. "studies how reranking and document ordering affect search agents in deep research."
  • Shallow retrieval: A setting that retrieves only a small top portion of results, often hurting recall in complex tasks. "Naive search agents often use shallow retrieval settings that insert the full texts of all ranked documents directly into the context window."
  • Sparse retriever: A lexical or term-based retriever that relies on sparse representations (e.g., term frequencies). "dense retrievers, sparse retrievers, and rerankers to improve early-stage ranking accuracy"
  • Submission steer: A runtime directive prompting the agent to stop using tools and finalize an answer before timeout. "the system injects a submission steer that instructs the agent to stop using tools and produce its best answer from the evidence collected so far."
  • Surfaced Recall: A recall metric over documents initially returned by the search tool. "Surfaced Recall: recall computed over the document set $D_{\text{surfaced}}$;"
  • Time-budget steering: A policy that guides the agent to complete within a fixed wall-clock time rather than a fixed iteration count. "we use time-budget steering instead of the fixed iteration cap used in prior work"
  • Tool call: An agent action invoking an external function (e.g., search or read) during the loop. "the action $a_t$ is a tool call or reasoning"
  • Top-k: The number of top-ranked results to return from retrieval, controlled by the parameter k. "returns the top-$k$ search results, with $k=5$."
  • Wall-clock time budget: A hard real-time limit for completing a query or experiment. "allowing the agent to terminate under a hard wall-clock time budget."
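
The BM25, BM25 tuning, and length normalization entries above all turn on two parameters, k1 (term-frequency saturation) and b (length normalization). The sketch below is a textbook-style BM25 scorer in plain Python, included only to make those two knobs concrete. It is not the Anserini implementation and not the paper's configuration: the function name `bm25_scores`, the toy documents, and the parameter values are all assumptions chosen for illustration, and production toolkits differ in details such as the exact IDF form.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score tokenized docs against a tokenized query (illustrative BM25).

    k1 controls term-frequency saturation; b controls how strongly
    scores are normalized by document length (b=0 disables it).
    The default values here are illustrative, not the paper's settings.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of docs containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization factor: 1.0 for an average-length doc.
        norm = 1 - b + b * len(d) / avg_len
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # A common smoothed, always-positive IDF variant.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores


# Toy corpus (hypothetical): a short doc, a long doc with the same single
# match, and a doc with no match.
docs = [
    ["bm25", "retrieval"],
    ["bm25", "retrieval"] + ["filler"] * 8,
    ["other", "words"],
]
query = ["bm25"]

no_length_norm = bm25_scores(query, docs, b=0.0)
with_length_norm = bm25_scores(query, docs, b=0.75)
print(no_length_norm)
print(with_length_norm)
```

With b=0.0 the short and long documents receive the same score because each contains the query term once. With b=0.75 the long document is penalized relative to the short one. This is the kind of effect the glossary quote points to when it says length normalization matters more for long documents than for passage retrieval.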
