Pi-Serini: Lexical Retrieval for Deep Research
- Pi-Serini is a research search agent that uses a minimal tool-based interface with BM25-driven lexical retrieval to enable deep, multi-step investigation.
- It employs a ReAct-style agent loop and a dedicated retrieval controller to separate search, browsing, and reading tasks effectively.
- Empirical results show that deep retrieval depth and tuned BM25 parameters boost accuracy and evidence recall while reducing operational costs.
Pi-Serini is a research search agent that operationalizes a minimal tool-based interface for retrieval, browsing, and reading documents, designed to investigate whether advanced LLMs in an agentic loop can achieve state-of-the-art deep research performance when paired solely with a well-tuned lexical retriever. Developed and evaluated in "Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?" (Hsu et al., 11 May 2026), Pi-Serini provides comprehensive evidence that, under proper configuration and sufficient retrieval depth, classical lexical retrieval (BM25) suffices to power deep multi-step research tasks in settings previously dominated by dense or neural retriever architectures.
1. System Architecture and Agentic Workflow
Pi-Serini is built on a ReAct-style agentic framework in which an LLM such as GPT-5.5 orchestrates multi-step tool use. At each invocation:
- The LLM reasons based on the interaction history and selects among three atomic tools: search, read_search_results, and read_document.
- A dedicated Retrieval Controller mediates between the LLM and a BM25 backend (Anserini). This controller caches up to 1,000 retrieved documents per query and exposes a minimal API, enforcing strictly iterative (“no dump”) information acquisition.
- Control flow executes as: (1) agent prompt and query, (2) LLM emission of a tool call, (3) Retrieval Controller action, (4) information returned as partial snippets or document chunks, (5) looped reasoning until the agent emits Explanation, Exact Answer, and Confidence.
- A time-budget steering policy imposes a 300 s latency cap with an auto-submit trigger at 70% of this window, balancing empirical cost against real-world research constraints.
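The time-budget policy in the last bullet can be illustrated with a small sketch. The 300 s cap and the 70% trigger come from the text; the class name `TimeBudget`, its methods, and the idea of passing an explicit `now` value for testing are illustrative assumptions, not the authors' implementation.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TimeBudget:
    """Hypothetical helper: tracks elapsed time against a cap and a soft trigger."""
    cap_seconds: float = 300.0     # latency cap stated in the text
    submit_fraction: float = 0.70  # auto-submit trigger stated in the text
    start: float = field(default_factory=time.monotonic)

    def elapsed(self, now=None):
        # Allow an injected clock value so the policy can be checked deterministically.
        now = time.monotonic() if now is None else now
        return now - self.start

    def should_auto_submit(self, now=None):
        # True once the soft threshold (70% of the cap, i.e. about 210 s) is reached.
        return self.elapsed(now) >= self.cap_seconds * self.submit_fraction

    def expired(self, now=None):
        # True once the hard cap is reached.
        return self.elapsed(now) >= self.cap_seconds
```

In an agent loop, one plausible use is to check `should_auto_submit()` before each tool call and, once it returns true, steer the model to emit its final answer instead of issuing another search.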
The three distinct tools operate as follows:
| Tool | Functionality | Access Granularity |
|---|---|---|
| search | BM25 keyword search (k₁=25, b=1), top-1,000 cached, returns top 5 excerpts | Document ranking/excerpts |
| read_search_results | Paginated browsing of cached ranking, default offset=6, limit=10 | Snippet pagination |
| read_document | Line-based chunked document access by ID, default offset=1, limit=200 | Doc partial content |
This agentic loop, enforced at the tool API boundary, strictly separates retrieval, preview (browsing), and deep reading, allowing precise measurement and control of evidence emergence and consumption within the session.
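The separation between search, browsing, and reading can be sketched as a small controller class. This is a minimal illustration under stated assumptions: the `Backend` callable, the class and attribute names, the 80-character excerpt length, and the 1-based interpretation of `offset` are all hypothetical; only the tool names, the cache depth of 1,000, and the default offsets and limits come from the table above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical backend signature: (query, k) -> ranked list of (doc_id, text).
Backend = Callable[[str, int], List[Tuple[str, str]]]


@dataclass
class RetrievalController:
    backend: Backend
    depth: int = 1000  # cached ranking depth from the text
    ranking: List[Tuple[str, str]] = field(default_factory=list, init=False)
    docs: Dict[str, str] = field(default_factory=dict, init=False)

    def search(self, query: str, top: int = 5):
        # Cache the deep ranking, but hand back only the first few excerpts.
        self.ranking = self.backend(query, self.depth)[: self.depth]
        self.docs.update(dict(self.ranking))
        return self.read_search_results(offset=1, limit=top)

    def read_search_results(self, offset: int = 6, limit: int = 10):
        # Page through the cached ranking; offset is assumed to be a 1-based rank.
        window = self.ranking[offset - 1 : offset - 1 + limit]
        return [(offset + i, doc_id, text[:80])
                for i, (doc_id, text) in enumerate(window)]

    def read_document(self, doc_id: str, offset: int = 1, limit: int = 200):
        # Return a line-based chunk; offset is assumed to be a 1-based line number.
        lines = self.docs[doc_id].splitlines()
        return lines[offset - 1 : offset - 1 + limit]


def toy_backend(query: str, k: int):
    # Stand-in for a real BM25 index: a fixed 30-document ranking.
    corpus = [(f"d{i}", f"line one of doc {i}\nline two of doc {i}")
              for i in range(1, 31)]
    return corpus[:k]


controller = RetrievalController(toy_backend)
first_hits = controller.search("example query")    # ranks 1 to 5
next_page = controller.read_search_results()       # ranks 6 to 15
chunk = controller.read_document("d6", offset=2, limit=1)
```

Because the agent never receives the full cached ranking at once, every piece of evidence it sees can be attributed to a specific tool call, which is what makes tiered recall measurement possible.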
2. Lexical Retrieval with Tuned BM25
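Pi-Serini exclusively employs the BM25 ranking function:

$$\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

where $f(t, d)$ is the term frequency of term $t$ in document $d$, $|d|$ the token length of the document, $\mathrm{avgdl}$ the average document length, $k_1$ controls term-frequency saturation, $b$ controls document-length normalization, and $\mathrm{IDF}(t)$ is the term’s inverse document frequency.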
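To make the roles of the two parameters concrete, the following sketch scores a single document with a standard BM25 formulation. The function name, argument names, and the particular IDF variant (the smoothed logarithmic form commonly used in Lucene-based systems) are assumptions made for illustration; the source does not specify which IDF variant is used.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len, k1=0.9, b=0.4):
    """Score one tokenized document against a tokenized query.

    doc_freq maps a term to the number of documents containing it.
    """
    tf = Counter(doc_terms)
    # Length normalization: equals 1.0 when the document has average length.
    norm = 1.0 - b + b * len(doc_terms) / avg_len
    score = 0.0
    for term in set(query_terms):
        f = tf.get(term, 0)
        if f == 0:
            continue
        df = doc_freq.get(term, 0)
        # Assumed IDF variant; always positive for df <= n_docs.
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        # Saturating term-frequency component controlled by k1.
        score += idf * f * (k1 + 1.0) / (f + k1 * norm)
    return score
```

With a small `k1`, repeated occurrences of a term quickly stop adding to the score; with a large `k1`, the term-frequency component stays close to linear for longer. Setting `b` to 1 applies full length normalization, so long documents are penalized in proportion to their length.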
Contrary to previous practices, Pi-Serini applies aggressive parameter tuning:
- Default Anserini BM25 (k₁=0.9, b=0.4) is replaced via grid search—on 100 validation queries—by k₁=25, b=1.0, especially to boost recall for long and noisy documents.
- Final configuration demonstrably improves both answer accuracy (+18.0%) and surfaced evidence recall (+11.1%) over default BM25 on BrowseComp-Plus.
These observations separate the question of retriever architecture from that of retriever configuration: under-tuned BM25, not lexical retrieval per se, accounts for the historical performance deficits of lexical baselines in agentic search.
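A grid search over the two parameters can be sketched as follows. The function name, the `evaluate` callback, and the grid values are hypothetical, since the source does not list the grid that was searched. The toy objective is a synthetic placeholder constructed only to exercise the loop; it is not real evaluation data.

```python
from itertools import product


def grid_search_bm25(evaluate, k1_grid, b_grid):
    """Return (best_k1, best_b, best_metric) over the Cartesian grid.

    evaluate(k1, b) is assumed to return a validation metric where higher
    is better, for example accuracy on a held-out query set.
    """
    best = None
    for k1, b in product(k1_grid, b_grid):
        metric = evaluate(k1, b)
        if best is None or metric > best[2]:
            best = (k1, b, metric)
    return best


def toy_objective(k1, b):
    # Synthetic placeholder, NOT measured data: peaks at an arbitrary point.
    return -abs(k1 - 25) - abs(b - 1.0)


best_k1, best_b, best_metric = grid_search_bm25(
    toy_objective, k1_grid=[0.9, 5, 25], b_grid=[0.4, 0.7, 1.0]
)
```

In practice `evaluate` would rebuild or reconfigure the index scorer for each setting and run the validation queries, so the cost of the search is dominated by that callback rather than by the loop itself.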
3. Retrieval Depth and Ablation Findings
Pi-Serini defaults to retrieval depth k=1,000, contrasting sharply with the shallow k=5 typical of baseline agents. This deep cut ensures that the majority of supporting evidence is available to the agent after first-stage retrieval.
Ablation studies reveal:
- surfaced recall of evidence rises sharply with retrieval depth (e.g., from 70.5% at k=5 to 95.8% at k=1,000),
- tuning BM25 (from default to k₁=25, b=1.0) independently adds both answer accuracy and recall,
- previewed recall saturates under browsing limits imposed by the agent’s inspection budget (e.g., 70.9% at k=1,000, reflecting bounded browse capacity).
These findings establish that retrieval model performance must be evaluated in combination with both retrieval depth and inspection strategy, not retrieval architecture alone.
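The two recall tiers discussed above can be computed per query with a few lines of code. This is a sketch under assumptions: the function names are hypothetical, "surfaced" is taken to mean gold evidence present in the top-k cached ranking, and "previewed" is taken to mean gold evidence that appeared in results the agent actually paged through.

```python
def surfaced_recall(ranking, gold_ids, k):
    """Fraction of gold evidence documents present in the top-k ranking."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(set(ranking[:k]) & gold) / len(gold)


def previewed_recall(previewed_ids, gold_ids):
    """Fraction of gold evidence documents the agent actually browsed."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(set(previewed_ids) & gold) / len(gold)


# Toy example with made-up document IDs.
ranking = ["d3", "d7", "d1", "d9", "d2", "d8"]
gold = ["d1", "d8", "d5"]
shallow = surfaced_recall(ranking, gold, k=5)   # only d1 is in the top 5
deep = surfaced_recall(ranking, gold, k=6)      # d1 and d8 are in the top 6
seen = previewed_recall(["d3", "d7", "d1"], gold)
```

Previewed recall can never exceed surfaced recall for the same query, because the agent can only browse what the cache contains; the gap between the two reflects the agent's inspection budget rather than the retriever.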
4. Empirical Performance on BrowseComp-Plus
Evaluated on BrowseComp-Plus—a benchmark comprising 830 multi-step research queries, ~100,000 long-form documents, a median of ~2,000 tokens per doc, and dense gold/evidence annotations—Pi-Serini achieves:
- 83.1% answer accuracy (GPT-5.5 model)
- 94.7% surfaced evidence recall (evidence present in top-k cache)
- 73.6% previewed recall (evidence paged via agent browsing)
- Cost efficiency: \$291.6 total for Pi-Serini versus \$400.4 (GPT-5+BM25) and \$360.7 (GPT-5+Qwen3)
- Substantial outperformance of dense retriever agent baselines (e.g., GPT-5+Qwen3 at 73.0% accuracy, AgentIR at 68.1%)
Metrics are measured via LLM judge for semantic answer equivalence, with explicit tracking of evidence recalls at each interaction tier.
5. Implications and Research Insights
The empirical results support the thesis that, in agentic deep research, the primary bottleneck is no longer the raw capacity of dense or neural retrievers. Instead, with sufficient tuning and retrieval depth, lexical retrieval matches or exceeds dense retrievers at lower cost.
Key insights include:
- The weak BM25 baselines reported in earlier work reflect under-tuning and insufficient retrieval depth, not inherent limits of lexical retrieval.
- The principal challenge in modern agentic loops becomes context management: recognition, prioritization, and allocation of surfaced evidence within the agent’s constrained context window.
- Architectural focus should shift from retriever sophistication to agentic strategies for evidence navigation, selective reading, and efficient tool orchestration.
6. Design Principles and Methodological Recommendations
On the basis of Pi-Serini’s findings, several best practices emerge for agentic research systems:
- Always tune BM25 hyperparameters (k₁, b) to match target document length and domain noise.
- Exploit deep retrieval (large k) to maximize evidence recall—especially critical for high-recall tasks.
- Implement incrementally granular tool interfaces (search → browse → read) to enable LLMs to manage deep candidate pools effectively.
- Use session prefix caching and time-budget policies to control empirical costs and maintain tractable API usage under real-world latency constraints.
- Direct research attention to the agent’s internal navigation algorithms, not solely to first-stage document retrievers.
A plausible implication is that further improvements in deep research agents may be most efficiently achieved by advancing agentic inspection heuristics, adaptive context pruning, and evidence prioritization mechanisms.
7. Broader Context and Future Directions
Pi-Serini reframes the research agent design space by showing that effective orchestration of simple, well-tuned retrieval tools suffices for high-level performance in LLM-centered search, challenging the perceived necessity of complex dense retriever integration. This shift redirects innovation toward optimizing multi-tool dialogue, evidence triage, and agentic context compression, while reinforcing the foundational importance of retrieval configuration as a precondition for downstream research effectiveness (Hsu et al., 11 May 2026).