PI-SERINI: Minimal BM25 Research Agent
- Pi-Serini is a minimal agentic search system that uses a well-tuned BM25 backend to achieve near-oracle evidence recall on deep-research benchmarks.
- It decouples retrieval, browsing, and reading through a controlled agent-tool loop, enabling explicit and efficient evidence management.
- Empirical evaluations on BrowseComp-Plus demonstrate that optimized BM25 tuning significantly improves answer accuracy while reducing computational costs.
Pi-Serini is a minimal agentic search system designed to systematically evaluate whether properly configured lexical retrieval, specifically BM25, can suffice in “deep-research” pipelines powered by modern LLMs (Hsu et al., 11 May 2026). In contrast to prevailing trends toward dense and reasoning-aware retrievers, Pi-Serini leverages a well-tuned BM25 backend and an explicit agent-tool loop to match or exceed the performance of denser, more complex systems on demanding research benchmarks. By decoupling retrieval, browsing, and reading actions within a controlled agentic loop, Pi-Serini provides insight into the sufficiency of lexical baselines and optimizes the agent–retriever interface for high evidence recall and answer accuracy.
1. Motivation and Foundational Question
The field of deep research systems—often cast as multi-step Retrieval-Augmented Generation (RAG) or ReAct-style agents—has traditionally positioned retriever quality as an upper bound (“hard ceiling”) on answer performance. As such, research commonly advances toward sophisticated retriever architectures: dense embedding retrievers, zero-shot semantic matchers, and even retrievers capable of explicit reasoning. However, as frontier LLMs become increasingly proficient at planning, tool use, and iterative reflection, a critical question arises: is continual innovation in retriever design necessary, or does a well-configured lexical retriever suffice in the context of an LLM-driven agentic loop? Pi-Serini is proposed to disentangle these factors, testing whether previous BM25 baselines were artificially limited due to shallow recall or sub-optimal parameterization rather than true lexical retrieval constraints.
2. Architectural Structure and Agentic Loop
At its core, Pi-Serini implements a ReAct-style agentic loop, wherein the LLM alternates between “thinking” (producing reasoning traces) and “acting” (issuing tool calls) until a conclusive answer is produced. The agent interfaces with a Retrieval Controller that exposes three instrumented tools:
- search(reason, query): Submits a BM25 query, caching the ranking (up to 1,000 hits) with a unique search_id. Returns the top 5 excerpts initially.
- read_search_results(reason, search_id, offset, limit): Enables paginated browsing of cached search results without re-querying the backend.
- read_document(reason, docid, offset, limit): Facilitates streaming reads of individual documents in a line-based fashion.
The Retrieval Controller logs four document sets (surfaced, previewed, opened, and cited), enabling granular measurement of retrieval effectiveness at each stage of evidence access. Pi-Serini operates under a two-stage time-budget regime, defaulting to 300 seconds per query, with a “submit-now” steer issued partway through the budget to prompt answer generation and curtail further tool use. This architecture lets the LLM control retrieval depth and what enters the context window, moving beyond simplistic “top-$k$ stuffing” strategies and facilitating explicit evidence management.
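The tool interface and set logging described above can be sketched as a small in-memory controller. This is a minimal illustration under stated assumptions, not Pi-Serini's implementation: the `backend` callable, the `toy_backend` ranker, the 80-character excerpt, and the rule that a document counts as previewed once its excerpt is returned are all hypothetical. The cited set is omitted because it depends on the final answer.

```python
import itertools
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Set, Tuple


@dataclass
class RetrievalController:
    """Toy controller: caches rankings and logs what the agent has seen."""
    backend: Callable[[str, int], List[str]]  # hypothetical: (query, depth) -> ranked docids
    corpus: Dict[str, str]                    # docid -> full text
    depth: int = 1000                         # hits cached per search
    first_page: int = 5                       # excerpts returned by search()
    surfaced: Set[str] = field(default_factory=set)
    previewed: Set[str] = field(default_factory=set)
    opened: Set[str] = field(default_factory=set)
    cache: Dict[str, List[str]] = field(default_factory=dict)
    ids: Iterator[int] = field(default_factory=itertools.count)

    def search(self, reason: str, query: str) -> Tuple[str, List[Tuple[str, str]]]:
        """Run one backend query, cache the full ranking, return the first page."""
        ranking = self.backend(query, self.depth)
        search_id = f"s{next(self.ids)}"
        self.cache[search_id] = ranking
        self.surfaced.update(ranking)
        return search_id, self.read_search_results(reason, search_id, 0, self.first_page)

    def read_search_results(self, reason: str, search_id: str,
                            offset: int, limit: int) -> List[Tuple[str, str]]:
        """Page through a cached ranking without re-querying the backend."""
        page = self.cache[search_id][offset:offset + limit]
        self.previewed.update(page)  # assumption: returning an excerpt counts as a preview
        return [(docid, self.corpus[docid][:80]) for docid in page]

    def read_document(self, reason: str, docid: str, offset: int, limit: int) -> List[str]:
        """Return a window of lines from one document."""
        self.opened.add(docid)
        return self.corpus[docid].splitlines()[offset:offset + limit]


def toy_backend(corpus: Dict[str, str]) -> Callable[[str, int], List[str]]:
    """Stand-in ranker (term-count overlap), used here in place of BM25."""
    def run(query: str, depth: int) -> List[str]:
        terms = query.lower().split()
        scored = [(sum(text.lower().count(t) for t in terms), docid)
                  for docid, text in corpus.items()]
        ranked = sorted((pair for pair in scored if pair[0] > 0),
                        key=lambda pair: (-pair[0], pair[1]))
        return [docid for _, docid in ranked][:depth]
    return run


# Tiny demo corpus: document d{i} mentions "bm25" i + 1 times.
corpus = {f"d{i}": f"line one of d{i}\n" + "bm25 " * (i + 1) for i in range(8)}
controller = RetrievalController(backend=toy_backend(corpus), corpus=corpus)
search_id, first_page = controller.search("find tuning notes", "bm25")
second_page = controller.read_search_results("browse deeper", search_id, 5, 5)
first_line = controller.read_document("inspect top hit", first_page[0][0], 0, 1)
```

Because the full ranking is cached under `search_id`, the second page is served from memory, and the gap between `surfaced` and `previewed` reflects only how far the agent chose to browse.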
3. BM25 Retrieval Formalism and Tuning
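Pi-Serini utilizes Anserini’s BM25 implementation. The BM25 scoring function for document $D$ and query $Q$ is:

$$\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

where $f(t, D)$ is the frequency of term $t$ in $D$, $|D|$ the document length, $\mathrm{avgdl}$ the average document length, $k_1$ controls term-frequency saturation, $b$ controls document-length normalization, and $\mathrm{IDF}(t)$ is the inverse document frequency of $t$.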
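To make the roles of the two parameters concrete, the following sketch scores a tokenized document with the textbook BM25 form. It is an illustration rather than Anserini's code: the IDF variant used here (a smoothed logarithm) is an assumption, and the parameter values and toy documents in the demo are arbitrary.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_score(query: List[str], doc: List[str], df: Dict[str, int],
               n_docs: int, avgdl: float, k1: float, b: float) -> float:
    """Textbook BM25 score of one tokenized document for one tokenized query."""
    tf = Counter(doc)
    # Length normalization: equals 1 for an average-length document.
    norm = 1.0 - b + b * len(doc) / avgdl
    score = 0.0
    for term in query:
        f = tf.get(term, 0)
        if f == 0:
            continue
        n_t = df.get(term, 0)
        # Assumed IDF variant; the source does not specify which one is used.
        idf = math.log(1.0 + (n_docs - n_t + 0.5) / (n_t + 0.5))
        score += idf * f * (k1 + 1.0) / (f + k1 * norm)
    return score


df = {"bm25": 10}                        # toy document frequencies
long_doc = ["bm25"] * 2 + ["pad"] * 38   # four times the average length
avg_doc = ["bm25"] * 2 + ["pad"] * 8     # exactly the average length
weak_norm = bm25_score(["bm25"], long_doc, df, n_docs=100, avgdl=10.0, k1=1.2, b=0.2)
strong_norm = bm25_score(["bm25"], long_doc, df, n_docs=100, avgdl=10.0, k1=1.2, b=0.9)
```

For a document longer than average, raising `b` lowers the score (`strong_norm` is smaller than `weak_norm`), while a document of exactly average length is unaffected by `b`.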
Experiments on BrowseComp-Plus (documents averaging 52,000 tokens; 90th percentile 614,000 tokens) demonstrated that the vanilla BM25 defaults for the term-frequency saturation parameter $k_1$ and the length-normalization parameter $b$ were inadequate for long-document ranking. Grid search across a 100-query subset established that high values of both parameters are optimal. Pi-Serini adopts the tuned $k_1$ and $b$ throughout, yielding substantial improvements in recall and downstream answer performance.
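A grid search of this kind can be sketched in a few lines. Everything below is hypothetical scaffolding: `evaluate` stands for whatever per-query metric is being optimized, `placeholder_metric` is a synthetic function with an arbitrary peak, and the grid values are not those used in the study.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def grid_search(k1_values: Sequence[float], b_values: Sequence[float],
                query_ids: Sequence[int],
                evaluate: Callable[[float, float, int], float]) -> Tuple[float, float, float]:
    """Return (mean metric, k1, b) for the best cell of the grid."""
    best = None
    for k1, b in product(k1_values, b_values):
        # Average the per-query metric over the tuning subset.
        mean = sum(evaluate(k1, b, qid) for qid in query_ids) / len(query_ids)
        if best is None or mean > best[0]:
            best = (mean, k1, b)
    if best is None:
        raise ValueError("empty grid")
    return best


def placeholder_metric(k1: float, b: float, qid: int) -> float:
    """Synthetic stand-in with an arbitrary peak; carries no empirical meaning."""
    return -((k1 - 2.0) ** 2) - (b - 0.5) ** 2


best_mean, best_k1, best_b = grid_search(
    k1_values=[1.0, 2.0, 3.0], b_values=[0.25, 0.5, 0.75],
    query_ids=range(100), evaluate=placeholder_metric)
```

In practice `evaluate` would re-run retrieval with the given parameters and score the ranking for one query, which is why tuning on a small query subset keeps the search affordable.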
4. Retrieval Depth, Evidence Recall, and Agent-Interaction
Pi-Serini systematically explores the impact of retrieval depth, the number of search hits ($k$) cached by the initial search tool. Results indicate that:
- At a common shallow default depth, surfaced recall (the fraction of evidence documents in the surfaced set) is approximately 70%.
- Increasing the depth lifts surfaced recall to ~86%, and recall plateaus near this level at the next depth setting.
- Maximum depth ($k = 1000$) achieves surfaced recall of 95.8%, nearly matching the oracle level (i.e., BM25 exposes almost all required evidence to the agent).
However, previewed recall (the fraction of evidence the agent actually browses) peaks at a lower level, indicating that increased retrieval depth helps only if the agent actually explores the available results. Moving from the shallow default to the deep setting raises surfaced recall substantially.
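The depth effect can be illustrated with a small recall helper. This is a sketch under stated assumptions: the ranking, the evidence identifiers, and the three depths are toy values chosen for the example, not data from the study.

```python
from typing import Dict, Iterable, Sequence


def surfaced_recall(ranking: Sequence[str], evidence: Iterable[str], k: int) -> float:
    """Fraction of evidence documents that appear in the top-k of a ranking."""
    evidence_set = set(evidence)
    if not evidence_set:
        return 0.0
    return len(evidence_set & set(ranking[:k])) / len(evidence_set)


# Toy ranking of 1,000 hits with evidence at ranks 2, 40, and 700.
ranking = [f"d{i}" for i in range(1000)]
evidence = ["d1", "d39", "d699"]
curve: Dict[int, float] = {k: surfaced_recall(ranking, evidence, k) for k in (5, 100, 1000)}
```

Evidence ranked far down the list only counts as surfaced once the cached depth reaches it, which is why a shallow cut-off caps recall regardless of how well the agent browses.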
5. Benchmarking on BrowseComp-Plus
Evaluation occurs on BrowseComp-Plus, a fixed-corpus deep-research benchmark (830 queries; ~100,000 documents). Each query is annotated with evidence documents and gold documents. The evaluation protocol incorporates:
- Accuracy: Judged by gpt-5.3-codex, assessing final exact-answer match with ground truth, allowing trivial rephrasings.
- Surfaced Recall: Recall over the surfaced document set.
- Previewed Recall: Recall over the previewed document set.
- Behavior Recall: Recall over 8.
- Time Budget and Cost: 300s per query with a “submit-now” cut-off, and per-query cost in USD using standard token pricing.
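The two-stage time budget can be pictured as a small phase function that the agent loop consults before each tool call. This is a hedged sketch: only the 300-second budget comes from the description above, while the `steer_at_s` parameter, the phase names, the hard stop at the budget, and the value 240.0 in the demo are hypothetical.

```python
from typing import List


def budget_phase(elapsed_s: float, steer_at_s: float, budget_s: float = 300.0) -> str:
    """Classify the current point in a two-stage time budget."""
    if not 0.0 <= steer_at_s <= budget_s:
        raise ValueError("steer_at_s must lie within the budget")
    if elapsed_s >= budget_s:
        return "stop"          # assumed hard stop once the budget is spent
    if elapsed_s >= steer_at_s:
        return "submit_now"    # steer the model toward answering
    return "explore"           # tool calls still allowed


# Arbitrary steer point chosen only for illustration.
phases: List[str] = [budget_phase(t, steer_at_s=240.0) for t in (10.0, 250.0, 300.0)]
```

Separating the steer from the hard limit gives the model time to compose an answer from evidence it has already gathered instead of being cut off mid-search.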
6. Empirical Performance and Ablation Results
Pi-Serini is evaluated across several LLMs, including DeepSeek Flash/Pro, Claude Haiku/Opus, and OpenAI’s GPT-5, 5.2, 5.4, and 5.5. All experiments employ the tuned BM25 at maximum retrieval depth ($k = 1000$). Key results with GPT-5.5:
| Metric | Value |
|---|---|
| Answer accuracy | 83.1% |
| Surfaced evidence recall | 94.7% |
| Previewed recall | 73.6% |
| Behavior recall | 58.9% |
| Total cost (USD) | \$291.6 |
Comparative benchmarks show that Pi-Serini outperforms prior dense-retriever agents (e.g., GPT-5 + qwen3-embed-8b achieves 73.0% accuracy and 79.0% surfaced recall).
7. Contributions, Implications, and Practical Takeaways
Pi-Serini makes three primary contributions:
- Definition of a minimal search-agent framework that cleanly dissociates retrieval, browsing, and reading processes, with instrumented, paginated tools for expressive evidence management.
- A comprehensive reassessment of BM25’s efficacy on deep-research benchmarks, revealing that prior weaknesses are more attributable to sub-optimal tuning and shallow retrieval than to intrinsic lexical limitations.
- Empirical validation that BM25-based agents can equal or surpass dense-retriever systems on BrowseComp-Plus, while substantially reducing computational cost and highlighting clear optimization levers (parameterization, retrieval depth, tool design).
A plausible implication is that future advances in deep-research systems may derive more from optimizing agent-driven evidence management and navigation than from further incremental sophistication in retriever architectures themselves. Ensuring proper BM25 tuning and sufficient retrieval depth should be a foundational step before pursuing more complex retrieval solutions.