PI-SERINI: Minimal BM25 Research Agent

Updated 14 May 2026
  • Pi-Serini is a minimal agentic search system that uses a well-tuned BM25 backend to achieve near-oracle evidence recall on deep-research benchmarks.
  • It decouples retrieval, browsing, and reading through a controlled agent-tool loop, enabling explicit and efficient evidence management.
  • Empirical evaluations on BrowseComp-Plus demonstrate that optimized BM25 tuning significantly improves answer accuracy while reducing computational costs.

Pi-Serini is a minimal agentic search system designed to systematically evaluate whether properly configured lexical retrieval, specifically BM25, can suffice in “deep-research” pipelines powered by modern LLMs (Hsu et al., 11 May 2026). In contrast to prevailing trends toward dense and reasoning-aware retrievers, Pi-Serini leverages a well-tuned BM25 backend and an explicit agent-tool loop to match or exceed the performance of dense, more complex systems on demanding research benchmarks. By decoupling retrieval, browsing, and reading actions within a controlled agentic loop, Pi-Serini tests the sufficiency of lexical baselines and tunes the agent–retriever interface for high evidence recall and answer accuracy.

1. Motivation and Foundational Question

The field of deep research systems—often cast as multi-step Retrieval-Augmented Generation (RAG) or ReAct-style agents—has traditionally positioned retriever quality as an upper bound (“hard ceiling”) on answer performance. As such, research commonly advances toward sophisticated retriever architectures: dense embedding retrievers, zero-shot semantic matchers, and even retrievers capable of explicit reasoning. However, as frontier LLMs become increasingly proficient at planning, tool use, and iterative reflection, a critical question arises: is continual innovation in retriever design necessary, or does a well-configured lexical retriever suffice in the context of an LLM-driven agentic loop? Pi-Serini is proposed to disentangle these factors, testing whether previous BM25 baselines were artificially limited due to shallow recall or sub-optimal parameterization rather than true lexical retrieval constraints.

2. Architectural Structure and Agentic Loop

At its core, Pi-Serini implements a ReAct-style agentic loop, wherein the LLM alternates between “thinking” (producing reasoning traces) and “acting” (issuing tool calls) until a conclusive answer is produced. The agent interfaces with a Retrieval Controller that exposes three instrumented tools:

  • search(reason, query): Submits a BM25 query, caching the ranking (up to 1,000 hits) with a unique search_id. Returns the top 5 excerpts initially.
  • read_search_results(reason, search_id, offset, limit): Enables paginated browsing of cached search results without re-querying the backend.
  • read_document(reason, docid, offset, limit): Facilitates streaming reads of individual documents in a line-based fashion.
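The three tools above can be sketched as a small controller class. This is a minimal illustration under stated assumptions, not the system's actual code: the `RetrievalController` class, the `Backend` callable type, the `s1`-style search ids, and the toy corpus and backend are all hypothetical; only the tool names, their arguments, the 1,000-hit cache, and the top-5 initial return come from the description.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical backend type: (query, depth) -> ranked list of (docid, excerpt).
Backend = Callable[[str, int], list[tuple[str, str]]]


@dataclass
class RetrievalController:
    backend: Backend
    corpus: dict[str, str]                      # docid -> full text
    depth: int = 1000                           # hits cached per search
    cache: dict[str, list[tuple[str, str]]] = field(default_factory=dict)
    log: list[tuple[str, str]] = field(default_factory=list)  # (tool, reason)

    def search(self, reason: str, query: str) -> dict:
        """Run one backend query, cache the full ranking, return the top 5."""
        hits = self.backend(query, self.depth)
        search_id = f"s{len(self.cache) + 1}"
        self.cache[search_id] = hits
        self.log.append(("search", reason))
        return {"search_id": search_id, "total": len(hits), "results": hits[:5]}

    def read_search_results(self, reason: str, search_id: str, offset: int, limit: int):
        """Page through a cached ranking without calling the backend again."""
        self.log.append(("read_search_results", reason))
        return self.cache[search_id][offset:offset + limit]

    def read_document(self, reason: str, docid: str, offset: int, limit: int):
        """Return up to `limit` lines of a document, starting at line `offset`."""
        self.log.append(("read_document", reason))
        return self.corpus[docid].splitlines()[offset:offset + limit]


# Toy stand-ins so the sketch runs: 12 tiny documents and a backend that
# returns them in a fixed order regardless of the query.
corpus = {f"d{i}": f"title {i}\nline two\nline three" for i in range(12)}
calls = []


def toy_backend(query: str, depth: int) -> list[tuple[str, str]]:
    calls.append(query)
    return [(docid, text.splitlines()[0]) for docid, text in corpus.items()][:depth]


controller = RetrievalController(toy_backend, corpus)
first = controller.search("locate candidates", "title")
page = controller.read_search_results("browse deeper", first["search_id"], 5, 5)
lines = controller.read_document("read evidence", "d3", 1, 2)
```

In this sketch the second page of results is served from the cache, so the backend is queried only once per `search` call, which is the property the paginated tool is described as providing.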

The Retrieval Controller maintains logs of four document sets: surfaced ($D_\text{surfaced}$), previewed, opened, and cited. These logs enable granular measurement of retrieval effectiveness at each stage of evidence access. Pi-Serini operates under a two-stage time-budget regime, defaulting to $T = 300$ seconds per query, with a “submit-now” steer issued at $t = 0.7T$ (210 seconds at the default) to prompt answer generation and curtail further tool use. This architecture lets the LLM control retrieval depth and what enters its context window, moving beyond simplistic “top-$k$ stuffing” strategies and facilitating explicit evidence management.
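The two-stage time budget can be illustrated with a short loop. This is a sketch under assumptions: `run_with_budget`, `make_fake_clock`, and `toy_step` are hypothetical names, and the hard stop once the full budget has elapsed is an assumed behavior that the description does not spell out. Only the 300-second default and the steer at 0.7 of the budget come from the text.

```python
import time
from typing import Callable, Optional


def run_with_budget(
    step: Callable[[bool], Optional[str]],
    budget_s: float = 300.0,
    steer_fraction: float = 0.7,
    clock: Callable[[], float] = time.monotonic,
):
    """Call `step` until it returns an answer or the budget is exhausted.

    Returns (answer, steered_at), where steered_at is the elapsed time at which
    the submit-now steer was first passed to `step` (None if it never was).
    """
    start = clock()
    steered_at = None
    while True:
        elapsed = clock() - start
        if elapsed >= budget_s:
            return None, steered_at  # assumed hard stop at the full budget
        steer = elapsed >= steer_fraction * budget_s
        if steer and steered_at is None:
            steered_at = elapsed
        answer = step(steer)
        if answer is not None:
            return answer, steered_at


def make_fake_clock(tick_s: float) -> Callable[[], float]:
    """Deterministic clock for the demo: returns 0, tick_s, 2 * tick_s, ..."""
    now = -tick_s

    def clock() -> float:
        nonlocal now
        now += tick_s
        return now

    return clock


def toy_step(steer: bool) -> Optional[str]:
    # Hypothetical agent: keeps using tools until steered, then answers.
    return "final answer" if steer else None


result = run_with_budget(toy_step, clock=make_fake_clock(50.0))
never = run_with_budget(lambda steer: None, clock=make_fake_clock(50.0))
```

With the fake clock advancing 50 seconds per iteration, the first check past the 210-second threshold happens at 250 seconds, which is when the toy agent is steered and answers.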

3. BM25 Retrieval Formalism and Tuning

Pi-Serini utilizes Anserini’s BM25 implementation. The BM25 scoring function for document $D$ and query $Q$ is:

$$\text{Score}(D, Q) = \sum_{t \in Q} \text{IDF}(t) \cdot \frac{f(t, D) \cdot (k_1 + 1)}{f(t, D) + k_1 \cdot (1 - b + b|D|/\text{avgdl})}$$

where $f(t, D)$ is the frequency of term $t$ in $D$, $|D|$ the document length, $\text{avgdl}$ the average document length, $k_1$ controls term-frequency saturation, $b$ controls document-length normalization, and $\text{IDF}(t)$ is the inverse document frequency of $t$.
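The scoring function can be written out directly. The sketch below is illustrative rather than Anserini's code: the `bm25_score` helper and the toy documents are hypothetical, and the IDF variant (a Lucene-style `log(1 + (N - n_t + 0.5) / (n_t + 0.5))`) is an assumption, since the formula above leaves IDF unspecified.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avgdl, k1=0.9, b=0.4):
    """Score one tokenized document against a tokenized query with BM25.

    doc_freq maps a term to the number of documents containing it.
    """
    tf = Counter(doc_terms)
    # Length normalization: 1 when |D| == avgdl, larger for longer documents.
    norm = 1.0 - b + b * len(doc_terms) / avgdl
    score = 0.0
    for term in query_terms:  # sum over query terms, as in the formula
        f = tf[term]
        if f == 0:
            continue
        n_t = doc_freq.get(term, 0)
        idf = math.log(1.0 + (n_docs - n_t + 0.5) / (n_t + 0.5))  # assumed IDF
        score += idf * f * (k1 + 1.0) / (f + k1 * norm)
    return score


# Two toy documents with the same term frequency but different lengths.
short_doc = ["bm25"] + ["pad"] * 9      # 10 tokens
long_doc = ["bm25"] + ["pad"] * 99      # 100 tokens
stats = dict(doc_freq={"bm25": 1}, n_docs=10, avgdl=10.0)

baseline = bm25_score(["bm25"], short_doc, **stats)
short_b1 = bm25_score(["bm25"], short_doc, k1=0.9, b=1.0, **stats)
long_b1 = bm25_score(["bm25"], long_doc, k1=0.9, b=1.0, **stats)
short_b0 = bm25_score(["bm25"], short_doc, k1=0.9, b=0.0, **stats)
long_b0 = bm25_score(["bm25"], long_doc, k1=0.9, b=0.0, **stats)
```

With `b = 1` the longer document is penalized relative to the shorter one, while with `b = 0` document length has no effect, which is the lever that matters when the corpus contains very long documents.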

Experiments on BrowseComp-Plus (documents averaging ~2,000 tokens; 90th percentile ~14,000 tokens) demonstrated that vanilla BM25 defaults ($k_1 = 0.9$, $b = 0.4$) were inadequate for long-document ranking. Grid search across a 100-query subset established that high values of both parameters are optimal. Pi-Serini adopts $k_1 = 25$, $b = 1$ throughout, yielding substantial improvements in recall and downstream answer performance.
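A grid search of this kind reduces to scoring every parameter pair and keeping the best. The sketch below assumes a caller-supplied `evaluate(k1, b)` function, which in practice would run retrieval on the tuning subset and return a metric such as recall. The `grid_search` helper, the grid values, and the synthetic objective are hypothetical and do not reproduce the actual tuning run.

```python
from itertools import product
from typing import Callable, Iterable


def grid_search(
    evaluate: Callable[[float, float], float],
    k1_grid: Iterable[float],
    b_grid: Iterable[float],
):
    """Return ((k1, b), score) for the best-scoring pair on the grid."""
    scored = [((k1, b), evaluate(k1, b)) for k1, b in product(k1_grid, b_grid)]
    return max(scored, key=lambda item: item[1])


def toy_evaluate(k1: float, b: float) -> float:
    # Synthetic objective with a known peak at (10, 0.7); not a real measurement.
    return -abs(k1 - 10.0) - abs(b - 0.7)


best_params, best_score = grid_search(
    toy_evaluate, k1_grid=[0.9, 5.0, 10.0, 25.0], b_grid=[0.4, 0.7, 1.0]
)
```

Tuning on a small query subset keeps the cost of the sweep low, at the price of some risk that the selected pair overfits that subset.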

4. Retrieval Depth, Evidence Recall, and Agent-Interaction

Pi-Serini systematically explores the impact of retrieval depth, the number of search hits ($k$) cached by the initial search tool. Results indicate that:

  • At k = […] (a common shallow default), surfaced recall (the fraction of evidence documents in $D_\text{surfaced}$) is approximately 70%.
  • Increasing to k = […] lifts surfaced recall to ~86%; at k = […], recall plateaus near this level.
  • Maximum depth ($k = 1000$) achieves surfaced recall of 95.8%, nearly matching the oracle level (i.e., BM25 is exposed to almost all required evidence).

However, peak previewed recall (the fraction of evidence the agent actually browses) is […] at k = […], indicating that greater retrieval depth helps only if the agent actually explores the available results. Boosting from default shallow settings (k = […]) to deep settings (k = […]) yields a […] surfaced-recall gain.
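The dependence of surfaced recall on depth can be made concrete with a small helper. This is a sketch with hypothetical names and toy data. It treats the surfaced set as the top-k prefix of a single ranking, which simplifies a setting where an agent may issue several searches per query.

```python
def surfaced_recall_at_depth(ranking, evidence, k):
    """Fraction of evidence documents that appear in the top-k of a ranking."""
    evidence = set(evidence)
    if not evidence:
        raise ValueError("evidence set must be non-empty")
    surfaced = set(ranking[:k])
    return len(surfaced & evidence) / len(evidence)


# Toy ranking of ten documents; one evidence document is never retrieved.
ranking = [f"d{i}" for i in range(1, 11)]
evidence = {"d2", "d7", "d40"}

recall_by_depth = {k: surfaced_recall_at_depth(ranking, evidence, k) for k in (5, 10)}
```

In the toy data, deepening from 5 to 10 hits surfaces a second evidence document, while the third stays out of reach at any depth because the ranking never contains it.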

5. Benchmarking on BrowseComp-Plus

Evaluation occurs on BrowseComp-Plus, a fixed-corpus deep-research benchmark (830 queries; ~100,000 documents). Each query averages […] evidence documents and […] gold documents. The evaluation protocol incorporates:

  • Accuracy: Judged by gpt-5.3-codex, assessing final exact-answer match with ground truth, allowing trivial rephrasings.
  • Surfaced Recall: Recall over $D_\text{surfaced}$.
  • Previewed Recall: Recall over $D_\text{previewed}$.
  • Behavior Recall: Recall over […].
  • Time Budget and Cost: 300s per query, “submit-now” cut-off at $t = 0.7T$, with per-query cost in USD using standard token pricing.
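The stage-wise recall metrics can be computed from the logged document sets. The sketch below is an assumption-laden illustration: `stage_recall` and `macro_average` are hypothetical helpers, the per-query logs are toy data, and averaging per-query recall across queries (macro-averaging) is an assumed aggregation that the protocol above does not specify.

```python
def stage_recall(logged, evidence):
    """Map each logged stage (e.g. 'surfaced') to its recall over the evidence set."""
    evidence = set(evidence)
    return {
        stage: len(set(docids) & evidence) / len(evidence)
        for stage, docids in logged.items()
    }


def macro_average(per_query):
    """Average each stage's recall across queries, weighting queries equally."""
    stages = per_query[0].keys()
    return {s: sum(q[s] for q in per_query) / len(per_query) for s in stages}


# Toy logs for two queries (hypothetical docids).
q1 = stage_recall(
    {"surfaced": {"a", "b", "c"}, "previewed": {"a"}}, evidence={"a", "b"}
)
q2 = stage_recall(
    {"surfaced": {"x", "y"}, "previewed": {"x"}}, evidence={"x", "y", "z", "w"}
)
summary = macro_average([q1, q2])
```

The gap between the surfaced and previewed values in `summary` corresponds to evidence that was retrieved but never browsed by the agent.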

6. Empirical Performance and Ablation Results

Pi-Serini is evaluated across several LLMs, including DeepSeek Flash/Pro, Claude Haiku/Opus, and OpenAI’s GPT-5, 5.2, 5.4, and 5.5. All experiments employ the tuned BM25 at maximum retrieval depth ($k = 1000$). Key results with GPT-5.5:

| Metric | Value |
| --- | --- |
| Answer accuracy | 83.1% |
| Surfaced evidence recall | 94.7% |
| Previewed recall | 73.6% |
| Behavior recall | 58.9% |
| Total cost (USD) | \$291.6 |

Comparative benchmarks show that Pi-Serini outperforms prior dense-retriever agents (e.g., GPT-5 + qwen3-embed-8b achieves 73.0% accuracy and 79.0% surfaced recall at a cost of \$[…]). […] $k_1 = 0.9$, $b = 0.4$ […] $k_1 = 25$, $b = 1$ […] $k$ […] $\sim$ […] $\sim$23 […]

7. Contributions, Implications, and Practical Takeaways

Pi-Serini makes three primary contributions:

  • Definition of a minimal search-agent framework that cleanly dissociates retrieval, browsing, and reading processes, with instrumented, paginated tools for expressive evidence management.
  • A comprehensive reassessment of BM25’s efficacy on deep-research benchmarks, revealing that prior weaknesses are more attributable to sub-optimal tuning and shallow retrieval than to intrinsic lexical limitations.
  • Empirical validation that BM25-based agents can equal or surpass dense-retriever systems on BrowseComp-Plus, while substantially reducing computational cost and highlighting clear optimization levers (parameterization, retrieval depth, tool design).

A plausible implication is that future advances in deep-research systems may derive more from optimizing agent-recognized evidence management and navigation than from further incremental sophistication in retriever architectures themselves. Ensuring proper BM25 tuning and sufficient retrieval depth should be a foundational step before pursuing more complex retrieval solutions.
