
Pi-Serini: Lexical Retrieval for Deep Research

Updated 2 July 2026
  • Pi-Serini is a research search agent that uses a minimal tool-based interface with BM25-driven lexical retrieval to enable deep, multi-step investigation.
  • It employs a ReAct-style agent loop and a dedicated retrieval controller to separate search, browsing, and reading tasks effectively.
  • Empirical results show that deep retrieval depth and tuned BM25 parameters boost accuracy and evidence recall while reducing operational costs.

Pi-Serini is a research search agent that operationalizes a minimal tool-based interface for retrieval, browsing, and reading documents, designed to investigate whether advanced LLMs in an agentic loop can achieve state-of-the-art deep research performance when paired solely with a well-tuned lexical retriever. Developed and evaluated in "Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?" (Hsu et al., 11 May 2026), Pi-Serini provides comprehensive evidence that, under proper configuration and sufficient retrieval depth, classical lexical retrieval (BM25) suffices to power deep multi-step research tasks in settings previously dominated by dense or neural retriever architectures.

1. System Architecture and Agentic Workflow

Pi-Serini is built on a ReAct-style agentic framework in which an LLM such as GPT-5.5 orchestrates multi-step tool use. At each invocation:

  • The LLM reasons based on the interaction history and selects among three atomic tools: search, read_search_results, and read_document.
  • A dedicated Retrieval Controller mediates between the LLM and a BM25 backend (Anserini). This controller caches up to 1,000 retrieved documents per query and exposes a minimal API, enforcing strictly iterative (“no dump”) information acquisition.
  • Control flow executes as: (1) agent prompt and query, (2) LLM emission of a tool call, (3) Retrieval Controller action, (4) information returned as partial snippets or document chunks, (5) looped reasoning until the agent emits Explanation, Exact Answer, and Confidence.
  • A time-budget steering policy imposes a 300 s latency cap with an auto-submit trigger at 70% of this window, balancing empirical cost against real-world research constraints.
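The control flow and time-budget policy above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `choose_action` and `execute_tool` are hypothetical callables standing in for the LLM and the Retrieval Controller, and the action dictionary layout is an assumption. Only the 300 s cap and the 70% auto-submit trigger come from the text.

```python
import time

LATENCY_CAP_S = 300.0        # latency cap stated in the text
AUTO_SUBMIT_FRACTION = 0.70  # auto-submit trigger stated in the text


def should_auto_submit(elapsed_s, cap_s=LATENCY_CAP_S, fraction=AUTO_SUBMIT_FRACTION):
    """Return True once elapsed time reaches the auto-submit threshold."""
    return elapsed_s >= cap_s * fraction


def run_agent(question, choose_action, execute_tool, clock=time.monotonic):
    """ReAct-style loop: reason, call a tool, observe, repeat until an answer.

    choose_action(history, force_answer) is a hypothetical LLM wrapper that
    returns either {"type": "tool", "tool": ..., "args": ...} or
    {"type": "answer", ...}; it is expected to answer when force_answer is True.
    """
    history = [("question", question)]
    start = clock()
    while True:
        force_answer = should_auto_submit(clock() - start)
        action = choose_action(history, force_answer)
        if action["type"] == "answer":
            # Final output carries explanation, exact answer, and confidence.
            return action
        observation = execute_tool(action["tool"], action["args"])
        history.append((action["tool"], observation))
```

Passing the clock as a parameter keeps the steering policy testable without waiting on wall-clock time.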

The three distinct tools operate as follows:

| Tool | Functionality | Access Granularity |
|---|---|---|
| search | BM25 keyword search (k₁=25, b=1); top 1,000 cached, returns top 5 excerpts | Document ranking/excerpts |
| read_search_results | Paginated browsing of cached ranking; default offset=6, limit=10 | Snippet pagination |
| read_document | Line-based chunked document access by ID; default offset=1, limit=200 | Doc partial content |

This agentic loop, enforced at the tool API boundary, strictly separates retrieval, preview (browsing), and deep reading, allowing precise measurement and control of evidence emergence and consumption within the session.
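A minimal sketch of how such a controller might expose the three tools is given below. The class name, the `backend_search` and `fetch_document` callables, and the treatment of offsets as 1-based are all assumptions made for illustration; only the tool names, the cache depth of 1,000, the top-5 excerpts, and the default offsets and limits are taken from the text.

```python
class RetrievalController:
    """Hypothetical controller: caches one ranking per query and pages through it."""

    def __init__(self, backend_search, fetch_document, depth=1000):
        # backend_search(query, k) -> list of (doc_id, excerpt), best first
        # fetch_document(doc_id)   -> full document text
        self.backend_search = backend_search
        self.fetch_document = fetch_document
        self.depth = depth
        self.cache = {}

    def search(self, query, top=5):
        """Run the retriever, cache up to `depth` hits, return only the top few."""
        ranking = self.backend_search(query, self.depth)[: self.depth]
        self.cache[query] = ranking
        return ranking[:top]

    def read_search_results(self, query, offset=6, limit=10):
        """Page through the cached ranking (offset assumed to be a 1-based rank)."""
        ranking = self.cache[query]
        return ranking[offset - 1 : offset - 1 + limit]

    def read_document(self, doc_id, offset=1, limit=200):
        """Return a window of lines from one document (offset assumed 1-based)."""
        lines = self.fetch_document(doc_id).splitlines()
        return lines[offset - 1 : offset - 1 + limit]
```

Under these assumptions the default `read_search_results` call starts at rank 6, directly after the five excerpts that `search` already returned, so the agent never receives the whole cache at once.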

2. Lexical Retrieval with Tuned BM25

Pi-Serini exclusively employs the BM25 ranking function:

$$\mathrm{BM25}(q,d) = \sum_{t\in q} \mathrm{IDF}(t) \, \frac{f(t,d)\,(k_1+1)}{f(t,d) + k_1\big[1 - b + b \, |d|/\mathrm{avgdl}\big]}$$

where $f(t,d)$ is the frequency of term $t$ in document $d$, $|d|$ the document length in tokens, $\mathrm{avgdl}$ the average document length, $k_1$ controls term-frequency saturation, $b$ controls document-length normalization, and $\mathrm{IDF}(t)$ is the term's inverse document frequency.
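The scoring function can be written out directly, as in the sketch below. The text does not specify the IDF variant, so the smoothed form ln(1 + (N - n + 0.5)/(n + 0.5)) used here is an assumption, as are the function name and argument layout; the defaults k1=0.9 and b=0.4 are the Anserini defaults cited in this article.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avgdl, k1=0.9, b=0.4):
    """Score one document against a query with the BM25 formula.

    doc_freq maps a term to the number of documents that contain it.
    The IDF form below is an assumed, commonly used smoothed variant.
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    # Length normalization: equals 1 when the document has average length.
    norm = 1.0 - b + b * doc_len / avgdl
    score = 0.0
    for term in query_terms:
        f = tf.get(term, 0)
        if f == 0:
            continue  # terms absent from the document contribute nothing
        n = doc_freq.get(term, 0)
        idf = math.log(1.0 + (num_docs - n + 0.5) / (n + 0.5))
        score += idf * f * (k1 + 1.0) / (f + k1 * norm)
    return score
```

With a large k1 the term-frequency component saturates much more slowly, so repeated query terms keep adding to the score; with b=1 the score is fully normalized by document length.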

Contrary to previous practices, Pi-Serini applies aggressive parameter tuning:

  • Default Anserini BM25 (k₁=0.9, b=0.4) is replaced via grid search—on 100 validation queries—by k₁=25, b=1.0, especially to boost recall for long and noisy documents.
  • Final configuration demonstrably improves both answer accuracy (+18.0%) and surfaced evidence recall (+11.1%) over default BM25 on BrowseComp-Plus.

These results separate the capacity of a retriever architecture from the effect of a suboptimal configuration: under-tuned BM25, not lexical retrieval per se, accounts for the historical performance deficits in agentic search.
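The grid search over (k₁, b) mentioned above can be sketched as an exhaustive sweep. The function name and the `evaluate` callable are hypothetical: in practice `evaluate` would run the validation queries with the given parameters and return a quality metric, and the candidate grid itself is not given in the text.

```python
from itertools import product


def grid_search(k1_values, b_values, evaluate):
    """Return (best_k1, best_b, best_score) over the Cartesian grid.

    evaluate(k1, b) is a hypothetical callable returning a validation
    metric where higher is better (for example, evidence recall).
    Ties keep the first configuration encountered.
    """
    best = None
    for k1, b in product(k1_values, b_values):
        score = evaluate(k1, b)
        if best is None or score > best[2]:
            best = (k1, b, score)
    return best
```

Because the validation set is small (100 queries in the text), an exhaustive sweep of this kind stays cheap compared with running the agent itself.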

3. Retrieval Depth and Ablation Findings

Pi-Serini defaults to retrieval depth k=1,000, contrasting sharply with the shallow k=5 typical of baseline agents. This deep cut ensures that the majority of supporting evidence is available to the agent after first-stage retrieval.

Ablation studies reveal:

  • surfaced recall of evidence rises sharply with retrieval depth (e.g., from 70.5% at k=5 to 95.8% at k=1,000),
  • tuning BM25 (from default to k₁=25, b=1.0) independently adds both answer accuracy and recall,
  • previewed recall saturates under browsing limits imposed by the agent’s inspection budget (e.g., 70.9% at k=1,000, reflecting bounded browse capacity).

These findings establish that retrieval model performance must be evaluated in combination with both retrieval depth and inspection strategy, not retrieval architecture alone.
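The two recall tiers discussed above can be computed per query as in the following sketch. The function names are illustrative, and averaging the per-query values across the benchmark is an assumption; the text defines surfaced recall only as evidence present in the top-k cache and previewed recall as evidence the agent actually paged through.

```python
def evidence_recall(evidence_ids, seen_ids):
    """Fraction of gold evidence documents that appear in seen_ids."""
    evidence = set(evidence_ids)
    if not evidence:
        return 0.0
    return len(evidence & set(seen_ids)) / len(evidence)


def surfaced_recall_at_k(evidence_ids, ranking, k):
    """Recall over the top-k cached ranking, whether or not the agent read it."""
    return evidence_recall(evidence_ids, ranking[:k])


def previewed_recall(evidence_ids, previewed_ids):
    """Recall over only the documents the agent actually browsed."""
    return evidence_recall(evidence_ids, previewed_ids)
```

Previewed recall can never exceed surfaced recall when the agent browses only cached results, which is why a bounded inspection budget caps it even at large k.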

4. Empirical Performance on BrowseComp-Plus

Evaluated on BrowseComp-Plus—a benchmark comprising 830 multi-step research queries, ~100,000 long-form documents, a median of ~2,000 tokens per doc, and dense gold/evidence annotations—Pi-Serini achieves:

  • 83.1% answer accuracy (GPT-5.5 model)
  • 94.7% surfaced evidence recall (evidence present in top-k cache)
  • 73.6% previewed recall (evidence paged via agent browsing)
  • Cost efficiency: \$291.6 total for Pi-Serini versus \$400.4 (GPT-5+BM25) and \$360.7 (GPT-5+Qwen3)
  • Substantial outperformance of dense retriever agent baselines (e.g., GPT-5+Qwen3 at 73.0% accuracy, AgentIR at 68.1%)

Metrics are measured via LLM judge for semantic answer equivalence, with explicit tracking of evidence recalls at each interaction tier.
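For scale, the relative cost reduction implied by the totals listed above works out as follows. This is simple arithmetic on the reported figures, not an additional result:

$$\frac{400.4 - 291.6}{400.4} \approx 0.272 \qquad \text{and} \qquad \frac{360.7 - 291.6}{360.7} \approx 0.192$$

That is, roughly 27% lower total cost than GPT-5+BM25 and roughly 19% lower than GPT-5+Qwen3.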

5. Implications and Research Insights

Empirical results robustly support the thesis that, in agentic deep research, the primary bottleneck is no longer the raw capacity of dense or neural retrievers. Instead, with sufficient tuning and retrieval depth, lexical retrieval attains or eclipses dense models at lower cost.

Key insights include:

  • The weak BM25 baselines reported in earlier work reflect under-tuning and insufficient retrieval depth, not inherent limits of lexical retrieval.
  • The principal challenge in modern agentic loops becomes context management: recognition, prioritization, and allocation of surfaced evidence within the agent’s constrained context window.
  • Architectural focus should shift from retriever sophistication to agentic strategies for evidence navigation, selective reading, and efficient tool orchestration.

6. Design Principles and Methodological Recommendations

On the basis of Pi-Serini’s findings, several best practices emerge for agentic research systems:

  1. Always tune BM25 hyperparameters (k₁, b) to match target document length and domain noise.
  2. Exploit deep retrieval (large k) to maximize evidence recall—especially critical for high-recall tasks.
  3. Implement incrementally granular tool interfaces (search → browse → read) to enable LLMs to manage deep candidate pools effectively.
  4. Use session prefix caching and time-budget policies to control empirical costs and maintain tractable API usage under real-world latency constraints.
  5. Direct research attention to the agent’s internal navigation algorithms, not solely to first-stage document retrievers.

A plausible implication is that further improvements in deep research agents may be most efficiently achieved by advancing agentic inspection heuristics, adaptive context pruning, and evidence prioritization mechanisms.

7. Broader Context and Future Directions

Pi-Serini reframes the research agent design space by showing that effective orchestration of simple, well-tuned retrieval tools suffices for high-level performance in LLM-centered search, challenging the perceived necessity of complex dense retriever integration. This shift redirects innovation toward optimizing multi-tool dialogue, evidence triage, and agentic context compression, while reinforcing the foundational importance of retrieval configuration as a precondition for downstream research effectiveness (Hsu et al., 11 May 2026).
