PI-SERINI: Minimal BM25 Research Agent

Updated 14 May 2026

PI-SERINI is a minimal agentic search system that uses a well-tuned BM25 backend to achieve near-oracle evidence recall on deep-research benchmarks.
It decouples retrieval, browsing, and reading through a controlled agent-tool loop, enabling explicit and efficient evidence management.
Empirical evaluations on BrowseComp-Plus demonstrate that optimized BM25 tuning significantly improves answer accuracy while reducing computational costs.

Pi-Serini is a minimal agentic search system designed to systematically evaluate whether properly configured lexical retrieval, specifically BM25, can suffice in “deep-research” pipelines powered by modern LLMs (Hsu et al., 11 May 2026). In contrast to prevailing trends toward dense and reasoning-aware retrievers, Pi-Serini leverages a well-tuned BM25 backend and an explicit agent-tool loop to match or exceed the performance of more complex dense-retriever systems on demanding research benchmarks. By decoupling retrieval, browsing, and reading actions within a controlled agentic loop, Pi-Serini provides insight into the sufficiency of lexical baselines and optimizes the agent–retriever interface for high evidence recall and answer accuracy.

1. Motivation and Foundational Question

The field of deep research systems—often cast as multi-step Retrieval-Augmented Generation (RAG) or ReAct-style agents—has traditionally positioned retriever quality as an upper bound (“hard ceiling”) on answer performance. As such, research commonly advances toward sophisticated retriever architectures: dense embedding retrievers, zero-shot semantic matchers, and even retrievers capable of explicit reasoning. However, as frontier LLMs become increasingly proficient at planning, tool use, and iterative reflection, a critical question arises: is continual innovation in retriever design necessary, or does a well-configured lexical retriever suffice in the context of an LLM-driven agentic loop? Pi-Serini is proposed to disentangle these factors, testing whether previous BM25 baselines were artificially limited due to shallow recall or sub-optimal parameterization rather than true lexical retrieval constraints.

2. Architectural Structure and Agentic Loop

At its core, Pi-Serini implements a ReAct-style agentic loop, wherein the LLM alternates between “thinking” (producing reasoning traces) and “acting” (issuing tool calls) until a conclusive answer is produced. The agent interfaces with a Retrieval Controller that exposes three instrumented tools:

search(reason, query): Submits a BM25 query, caching the ranking (up to 1,000 hits) with a unique search_id. Returns the top 5 excerpts initially.
read_search_results(reason, search_id, offset, limit): Enables paginated browsing of cached search results without re-querying the backend.
read_document(reason, docid, offset, limit): Facilitates streaming reads of individual documents in a line-based fashion.
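
The three tools can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than Pi-Serini's actual code: the class name, the `backend` callable standing in for the BM25 searcher, the in-memory `corpus`, and the choice to return bare docids instead of excerpts.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List
import itertools


@dataclass
class RetrievalController:
    """Hypothetical controller; `backend` stands in for the BM25 searcher."""
    backend: Callable[[str, int], List[str]]  # (query, k) -> ranked docids
    corpus: Dict[str, str]                    # docid -> full text
    max_hits: int = 1000                      # ranking depth cached per search
    first_page: int = 5                       # hits returned by the initial search
    _cache: Dict[str, List[str]] = field(default_factory=dict)
    _ids: itertools.count = field(default_factory=itertools.count)

    def search(self, reason: str, query: str) -> dict:
        # Query the backend once and cache the full ranking under a new search_id.
        search_id = f"s{next(self._ids)}"
        self._cache[search_id] = self.backend(query, self.max_hits)
        return {"search_id": search_id,
                "hits": self._cache[search_id][: self.first_page]}

    def read_search_results(self, reason: str, search_id: str,
                            offset: int, limit: int) -> List[str]:
        # Page through the cached ranking; the backend is not queried again.
        return self._cache[search_id][offset: offset + limit]

    def read_document(self, reason: str, docid: str,
                      offset: int, limit: int) -> List[str]:
        # Return a window of lines so long documents can be read incrementally.
        return self.corpus[docid].splitlines()[offset: offset + limit]


# Toy data: 20 three-line documents and a backend that records each call.
corpus = {f"d{i}": f"line one of d{i}\nline two\nline three" for i in range(20)}
backend_calls = []

def toy_backend(query, k):
    backend_calls.append(query)
    return sorted(corpus, key=lambda d: int(d[1:]))[:k]

controller = RetrievalController(toy_backend, corpus)
first = controller.search("find candidates", "some query")
page = controller.read_search_results("browse deeper", first["search_id"], 5, 3)
lines = controller.read_document("check details", "d3", 1, 2)
```

In this sketch, paging through results touches only the cached ranking, so `backend_calls` still holds a single entry after the agent has browsed beyond the first five hits.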

The Retrieval Controller maintains logs of four document sets (surfaced, $D_\text{surfaced}$; previewed; opened; and cited), enabling granular measurement of retrieval effectiveness at each stage of evidence access. Pi-Serini operates under a two-stage time-budget regime, defaulting to $T = 300$ seconds per query, with a “submit-now” steer issued at $t = 0.7T$ to prompt answer generation and curtail further tool use. This architecture lets the LLM control retrieval depth and what enters its context window, moving beyond simplistic “top-$k$ stuffing” strategies and facilitating explicit evidence management.
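
The two-stage time budget can be sketched as a small driver loop. This is a sketch under stated assumptions: `run_with_budget` and the `step` callback are hypothetical names, and returning `None` when the full budget elapses is an assumed behavior that the description above does not specify.

```python
import time
from typing import Callable, Optional


def run_with_budget(step: Callable[[bool], Optional[str]],
                    budget_s: float = 300.0,
                    steer_fraction: float = 0.7,
                    clock: Callable[[], float] = time.monotonic) -> Optional[str]:
    """Call `step` until it returns an answer or the budget runs out.

    `step(steer)` is a hypothetical callback for one think/act turn. It receives
    steer=True once elapsed time reaches steer_fraction * budget_s.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if elapsed >= budget_s:
            return None  # assumed hard stop: no answer within the budget
        answer = step(elapsed >= steer_fraction * budget_s)
        if answer is not None:
            return answer


# Fake clock for illustration: every call advances time by 60 seconds.
ticks = iter(range(0, 10_000, 60))
steer_flags = []

def toy_step(steer):
    # A toy agent that keeps using tools until it is steered to submit.
    steer_flags.append(steer)
    return "final answer" if steer else None

result = run_with_budget(toy_step, clock=lambda: next(ticks))
```

With $T = 300$ and a steer at $0.7T = 210$ seconds, the toy agent takes three unsteered turns (at 60, 120, and 180 seconds) and submits on the fourth turn at 240 seconds.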

3. BM25 Retrieval Formalism and Tuning

Pi-Serini utilizes Anserini’s BM25 implementation. The BM25 scoring function for document $D$ and query $Q$ is:

$\text{Score}(D, Q) = \sum_{t \in Q} \text{IDF}(t) \cdot \frac{f(t, D) \cdot (k_1 + 1)}{f(t, D) + k_1 \cdot (1 - b + b|D|/\text{avgdl})}$

where $f(t, D)$ is the frequency of term $t$ in $D$, $|D|$ the document length, $\text{avgdl}$ the average document length, $k_1$ controls term-frequency saturation, $b$ controls document-length normalization, and $\text{IDF}(t)$ is the inverse document frequency of term $t$.
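
The scoring function can be written out directly to see how $k_1$ and $b$ act. The sketch below is a plain-Python illustration, not Anserini's implementation; the IDF variant, the toy documents, and the function name `bm25_score` are assumptions made for the example.

```python
import math
from collections import Counter


def bm25_score(query, doc, docs, k1, b):
    """Score one tokenized document against a tokenized query."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in query:
        f = tf[term]
        if f == 0:
            continue  # terms absent from the document contribute nothing
        df = sum(1 for d in docs if term in d)
        # Assumed IDF variant (always positive); other variants exist.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Length-normalized saturation term from the formula above.
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * f * (k1 + 1) / (f + norm)
    return score


# Toy corpus: a short on-topic document, a long document that repeats the
# query term among filler tokens, and an unrelated document.
short_doc = ["bm25", "tuning"]
long_doc = ["bm25"] * 3 + ["filler"] * 17
other_doc = ["dense", "retriever", "model", "training"]
docs = [short_doc, long_doc, other_doc]
query = ["bm25"]

low = {name: bm25_score(query, d, docs, k1=0.9, b=0.4)
       for name, d in [("short", short_doc), ("long", long_doc), ("other", other_doc)]}
high = {name: bm25_score(query, d, docs, k1=25.0, b=1.0)
        for name, d in [("short", short_doc), ("long", long_doc), ("other", other_doc)]}
```

On this toy corpus the long document outranks the short one at $k_1 = 0.9$, $b = 0.4$, while at $k_1 = 25$, $b = 1$ full length normalization reverses the order. The example only shows that the parameters can change rankings; it makes no claim about which setting is better on a real collection.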

Experiments on BrowseComp-Plus (documents averaging $\sim$2,000 tokens; 90th percentile $\sim$14,000 tokens) demonstrated that the vanilla BM25 defaults ($k_1 = 0.9$, $b = 0.4$) were inadequate for long-document ranking. Grid search across a 100-query subset established that much higher parameter values are optimal, and Pi-Serini adopts $k_1 = 25$, $b = 1$ throughout, yielding substantial improvements in recall and downstream answer performance.

4. Retrieval Depth, Evidence Recall, and Agent-Interaction

Pi-Serini systematically explores the impact of retrieval depth, the number of search hits ($k$) cached by the initial search tool. Results indicate that:

At a shallow depth (a common default), surfaced recall, the fraction of evidence documents in $D_\text{surfaced}$, is approximately 70%.
Increasing $k$ to an intermediate depth lifts surfaced recall to ~86%, and a further intermediate increase leaves recall near this level.
Maximum depth ($k = 1000$) achieves surfaced recall of 95.8%, nearly matching the oracle level (i.e., the agent is exposed to almost all required evidence).

However, peak previewed recall (the fraction of evidence the agent actually browses) remains below surfaced recall, indicating that increased retrieval depth helps only if the agent appropriately explores the available results. Moving from the shallow default to the deepest setting ($k = 1000$) raises surfaced recall from roughly 70% to 95.8%, a gain of roughly 26 percentage points.
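
The dependence of surfaced recall on depth can be made concrete with a short sketch. It simplifies to a single cached ranking per query, and the docids, evidence set, and depth values are invented for illustration (only the 5-hit first page and the 1,000-hit cache correspond to numbers stated above).

```python
def surfaced_recall(ranking, evidence, k):
    """Fraction of evidence docids that appear in the top-k of a ranking."""
    if not evidence:
        return 0.0
    surfaced = set(ranking[:k])
    return len(surfaced & set(evidence)) / len(evidence)


# Toy ranking of 1,000 docids; one evidence document is never retrieved.
ranking = [f"d{i}" for i in range(1000)]
evidence = {"d2", "d40", "d700", "d5000"}

by_depth = {k: surfaced_recall(ranking, evidence, k) for k in (5, 100, 1000)}
```

Here recall rises from 0.25 at depth 5 to 0.5 at depth 100 and 0.75 at depth 1,000; the evidence document that the ranking never contains is out of reach at any depth.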

5. Benchmarking on BrowseComp-Plus

Evaluation occurs on BrowseComp-Plus, a fixed-corpus deep-research benchmark (830 queries; ~100,000 documents). Each query is annotated with evidence documents and gold documents. The evaluation protocol incorporates:

Accuracy: Judged by gpt-5.3-codex, assessing final exact-answer match with ground truth, allowing trivial rephrasings.
Surfaced Recall: Recall over $D_\text{surfaced}$.
Previewed Recall: Recall over $D_\text{previewed}$.
Behavior Recall: Recall over the documents tied to the agent's observed behavior.
Time Budget and Cost: 300s per query, “submit-now” cut-off at $t = 0.7T$, with per-query cost in USD using standard token pricing.
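
The recall metrics above can be computed from per-query logs of the document sets the controller tracks. The sketch below is an assumption-laden illustration: the log layout, the stage names used as dictionary keys, and macro-averaging over queries are choices made for the example, not details confirmed by the source.

```python
from statistics import mean


def recall(found, evidence):
    """Fraction of evidence docids contained in `found`."""
    return len(set(found) & set(evidence)) / len(evidence) if evidence else 0.0


def stage_recalls(runs, stages=("surfaced", "previewed", "opened", "cited")):
    """Macro-average recall per stage over a list of per-query run logs."""
    return {s: mean(recall(r[s], r["evidence"]) for r in runs) for s in stages}


# Two invented query logs; each stage is a subset of the one before it.
runs = [
    {"evidence": {"a", "b", "c", "d"},
     "surfaced": {"a", "b", "c", "x"}, "previewed": {"a", "b"},
     "opened": {"a"}, "cited": {"a"}},
    {"evidence": {"e", "f"},
     "surfaced": {"e", "f", "y"}, "previewed": {"e", "f"},
     "opened": {"e", "f"}, "cited": {"e"}},
]

funnel = stage_recalls(runs)
```

Because each stage draws on the documents of the stage before it, recall can only stay level or fall along the funnel, which is what makes the gap between surfaced and previewed recall a useful diagnostic of how well the agent explores its results.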

6. Empirical Performance and Ablation Results

Pi-Serini is evaluated across several LLMs, including DeepSeek Flash/Pro, Claude Haiku/Opus, and OpenAI’s GPT-5, 5.2, 5.4, and 5.5. All experiments employ the tuned BM25 at maximum retrieval depth ($k = 1000$). Key results with GPT-5.5:

Metric	Value
Answer accuracy	83.1%
Surfaced evidence recall	94.7%
Previewed recall	73.6%
Behavior recall	58.9%
Total cost (USD)	\$291.6

Comparative benchmarks show that Pi-Serini outperforms prior dense-retriever agents (e.g., GPT-5 + qwen3-embed-8b achieves 73.0% accuracy and 79.0% surfaced recall).

7. Contributions, Implications, and Practical Takeaways

Pi-Serini makes three primary contributions:

Definition of a minimal search-agent framework that cleanly dissociates retrieval, browsing, and reading processes, with instrumented, paginated tools for expressive evidence management.
A comprehensive reassessment of BM25’s efficacy on deep-research benchmarks, revealing that prior weaknesses are more attributable to sub-optimal tuning and shallow retrieval than to intrinsic lexical limitations.
Empirical validation that BM25-based agents can equal or surpass dense-retriever systems on BrowseComp-Plus, while substantially reducing computational cost and highlighting clear optimization levers (parameterization, retrieval depth, tool design).

A plausible implication is that future advances in deep-research systems may derive more from optimizing how agents manage and navigate evidence than from further incremental sophistication in retriever architectures themselves. Ensuring proper BM25 tuning and sufficient retrieval depth should be a foundational step before pursuing more complex retrieval solutions.

References (1)

Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient? (2026)
