
GrepSeek: Shell-Based Corpus Search

Updated 2 June 2026
  • GrepSeek is a direct corpus interaction search agent that uses Unix-style shell commands for precise, multi-hop evidence retrieval in large text corpora.
  • It leverages a two-stage training pipeline with supervised fine-tuning and group relative policy optimization to generate interpretable, stepwise search behaviors.
  • Its sharded-parallel execution engine processes millions of text lines with byte-exact accuracy, achieving up to 7.6× speedup for full-corpus scans.

GrepSeek is a direct corpus interaction (DCI) framework for training and operating LLM-based search agents that issue shell-style commands over raw text corpora for evidence identification and multi-hop reasoning. Unlike retrieval-augmented generation (RAG) approaches, which depend on precomputed document representations and fixed chunking, GrepSeek lets agents interact with large corpora by composing fine-grained, executable pipelines of standard Unix tools. Its architecture, training, and execution environment offer interpretable multi-step search primitives and demonstrate competitive empirical performance across a range of open-domain question answering (QA) tasks (Salemi et al., 28 May 2026).

1. Direct Corpus Interaction Agent Architecture

GrepSeek formulates the search task as interaction with a raw text corpus $C$ (e.g., 21 million Wikipedia lines in JSONL format), treating it as an environment. At each step $t$, the agent's policy $\pi_\theta$ receives the question $q$ along with the history of previous reasoning traces ($t_1,\ldots,t_n$), shell command actions ($a_1,\ldots,a_n$), and their textual observations ($o_1,\ldots,o_n$), then outputs:

  • A free-text reasoning trace ($t_{n+1}$)
  • A shell pipeline command embedded in Hermes XML (e.g., <tool_call>{"name":"shell","arguments":{"command":"..."}}</tool_call>)
  • An optional terminal answer action (<answer>…</answer>)
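
The output format above can be split into its three parts with a small parser. This is a minimal sketch, not the system's own code: the function name `parse_turn`, the regular expressions, and the file name `corpus.jsonl` in the example are assumptions made for illustration.

```python
import json
import re
from typing import Optional, Tuple

# Patterns for the two tagged parts of a turn (illustrative, not from the source).
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_turn(text: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split one model turn into (reasoning, shell_command, answer)."""
    command = None
    answer = None
    reasoning_end = len(text)

    call = TOOL_CALL_RE.search(text)
    if call:
        payload = json.loads(call.group(1).strip())
        if payload.get("name") == "shell":
            command = payload.get("arguments", {}).get("command")
        reasoning_end = min(reasoning_end, call.start())

    final = ANSWER_RE.search(text)
    if final:
        answer = final.group(1).strip()
        reasoning_end = min(reasoning_end, final.start())

    # Everything before the first tag is treated as the free-text reasoning trace.
    return text[:reasoning_end].strip(), command, answer


example = (
    "Find the film first.\n"
    '<tool_call>{"name":"shell","arguments":'
    '{"command":"grep -F Inception corpus.jsonl | head -n 5"}}</tool_call>'
)
print(parse_turn(example))
```

Treating all text before the first tag as the reasoning trace is a simplification; a stricter parser would also reject turns that contain both a tool call and an answer.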

Each allowed command is a single pipeline that combines fixed-string grep/rg with awk, sed, head, tail, wc, cut, sort, uniq, and tr. The execution engine runs the command over $C$ (or over corpus shards) and returns the raw text output to the agent. Each action is exactly one acyclic pipeline; shell features such as command chaining or redirection are disallowed.
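
A guard of this kind can be approximated with Python's `shlex` module. The sketch below is an assumption about how such a check might look, not the engine's actual validator: `validate_pipeline` is a hypothetical name, and the check is deliberately conservative, so it may reject some commands that a real shell would accept.

```python
import shlex
from typing import List

# Tool whitelist taken from the list in the text.
ALLOWED_TOOLS = {"grep", "rg", "awk", "sed", "head", "tail",
                 "wc", "cut", "sort", "uniq", "tr"}


def validate_pipeline(command: str) -> List[str]:
    """Return the tool name of each stage, or raise ValueError."""
    if "`" in command or "$(" in command:
        raise ValueError("command substitution is not allowed")

    # Non-POSIX mode keeps quotes on tokens, so a quoted '|' is never
    # mistaken for the pipe operator.
    lexer = shlex.shlex(command, posix=False, punctuation_chars=True)
    lexer.whitespace_split = True
    lexer.commenters = ""  # '#' may legitimately appear in patterns

    stages: List[List[str]] = []
    current: List[str] = []
    for token in lexer:
        if token == "|":
            if not current:
                raise ValueError("empty pipeline stage")
            stages.append(current)
            current = []
        elif token[0] in "();<>|&":
            # Covers ';', '&&', '||', redirection, subshells, backgrounding.
            raise ValueError(f"shell operator not allowed: {token}")
        else:
            current.append(token)
    if not current:
        raise ValueError("empty pipeline stage")
    stages.append(current)

    tools = [stage[0] for stage in stages]
    for tool in tools:
        if tool not in ALLOWED_TOOLS:
            raise ValueError(f"tool not allowed: {tool}")
    return tools


print(validate_pipeline('grep -F "Christopher Nolan" corpus.jsonl | head -n 5'))
```

The function only reports tool names; it does not produce arguments suitable for execution, since tokens keep their surrounding quotes.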

This framework enables precise, arbitrarily granular evidence retrieval via exact matching, boolean filters, regular expressions, and compositional filtering, instead of being restricted to recall-centric top-k document retrieval. It is particularly effective for tasks requiring symbolic accuracy, entity bridge construction, or retrieval of rare/structurally unique surface forms.
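
To make the idea of an entity bridge concrete, the sketch below emulates a two-hop search in pure Python. The three corpus lines are invented for illustration, and `grep_fixed` and `head` are stand-ins for `grep -F` and `head -n`, not part of GrepSeek.

```python
from typing import List

# Invented, illustrative corpus lines (not real Wikipedia content).
corpus = [
    '{"title": "The Glass Harbor", "text": "The Glass Harbor is a 1998 film directed by Mira Voss."}',
    '{"title": "Mira Voss", "text": "Mira Voss is a film director born in Tallinn."}',
    '{"title": "Harbor", "text": "A harbor is a sheltered body of water."}',
]


def grep_fixed(lines: List[str], needle: str) -> List[str]:
    """Stand-in for `grep -F needle`: keep lines containing the exact string."""
    return [line for line in lines if needle in line]


def head(lines: List[str], n: int) -> List[str]:
    """Stand-in for `head -n N`."""
    return lines[:n]


# Hop 1: cascade two exact filters to find who directed the film.
hop1 = head(grep_fixed(grep_fixed(corpus, "The Glass Harbor"), "directed by"), 1)

# The agent would read hop1 and pick the bridge entity; it is fixed here.
bridge = "Mira Voss"

# Hop 2: use the bridge entity to find the birthplace evidence.
hop2 = grep_fixed(grep_fixed(corpus, bridge), "born in")

print(hop1)
print(hop2)
```

Each hop returns the exact lines that matched, so the evidence behind the final answer stays traceable to specific corpus lines.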

2. Two-Stage Training Pipeline

The GrepSeek agent is trained using a two-phase pipeline optimized for both causal search structure and end-to-end answer quality.

Stage 1: Cold-Start Dataset Construction

Given QA pairs $D = \{(q_i, y_i)\}$, a two-agent framework is constructed:

  • Tutor: answer-aware, able to decompose the question $q$ and answer $y$ into a sequence of search-oriented sub-queries and shell commands via backward chaining.
  • Planner: answer-blind, infers reasoning and actions in forward chronological order.

The Tutor executes a backward phase that decomposes the question, starting from the final answer, using iterative masked shell commands and bridge-entity extraction. The Planner then reconstructs a chronological trajectory with causally grounded rationales, which the Tutor validates for fidelity and which is then filtered for quality. Only trajectories that lead causally to a correct answer without future leakage are retained.

Stage 2: Supervised Fine-Tuning

The compiled trajectories are used for supervised fine-tuning by maximizing their log-likelihood under the policy $\pi_\theta$. This phase teaches the policy structured use of shell pipelines that is causally aligned with evidence.
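
One way to write such a log-likelihood objective in the notation of Section 1 is sketched here. It assumes that each training target is the next reasoning trace and action given the question and the interaction history; the symbols $\mathcal{D}_{\text{SFT}}$ (the set of compiled trajectories) and $N_\tau$ (the number of steps in trajectory $\tau$) are introduced only for illustration.

```latex
\[
\mathcal{L}_{\text{SFT}}(\theta)
  = \sum_{\tau \in \mathcal{D}_{\text{SFT}}} \; \sum_{n=0}^{N_\tau - 1}
    \log \pi_\theta\!\left( t_{n+1}, a_{n+1} \,\middle|\, q,\; t_{1:n},\; a_{1:n},\; o_{1:n} \right)
\]
```

Under this reading the observations $o_{1:n}$ appear only as conditioning context, so the policy is never trained to generate tool output.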

3. Group Relative Policy Optimization (GRPO)

Post SFT, the agent undergoes reinforcement learning via group relative policy optimization (GRPO). For each query:

  • Sample a group of trajectories from the policy $\pi_\theta$.
  • Each trajectory yields a format indicator and an answer reward.
  • Compute group reward statistics and a per-trajectory advantage relative to the group.
  • Update $\pi_\theta$ using a PPO-style surrogate loss averaged over the group.
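
In the standard GRPO formulation, which this sketch assumes, the update can be written as follows. The symbols $G$ (group size), $\tau_i$ (the $i$-th sampled trajectory), $r_i$ (its scalar reward), $\epsilon$ (clipping range), and $\theta_{\text{old}}$ (sampling policy parameters) are assumptions. How the format indicator and answer reward combine into $r_i$ is left open, and the KL regularizer and token-level ratios used in many implementations are omitted.

```latex
\[
\begin{aligned}
\hat{A}_i &= \frac{r_i - \operatorname{mean}(r_1,\ldots,r_G)}{\operatorname{std}(r_1,\ldots,r_G)},
\qquad
\rho_i(\theta) = \frac{\pi_\theta(\tau_i \mid q)}{\pi_{\theta_{\text{old}}}(\tau_i \mid q)}, \\[4pt]
\mathcal{J}(\theta) &= \frac{1}{G} \sum_{i=1}^{G}
  \min\!\Big( \rho_i(\theta)\,\hat{A}_i,\;
  \operatorname{clip}\big(\rho_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i \Big)
\end{aligned}
\]
```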

GRPO stabilizes credit assignment by favoring trajectories with stepwise outperformance relative to their peers for the same query, leading to concise, interpretable search behaviors and improved sample efficiency.
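
A small numeric sketch can show what "relative to their peers" means in practice. It assumes the common GRPO recipe of normalizing rewards by the group mean and standard deviation; the reward values, the clipping range of 0.2, and both function names are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the other trajectories for the same query."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratios: List[float], advantages: List[float],
                      clip: float = 0.2) -> float:
    """PPO-style clipped objective averaged over the group."""
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped = min(max(rho, 1.0 - clip), 1.0 + clip)
        terms.append(min(rho * adv, clipped * adv))
    return fmean(terms)


# Hypothetical rewards for four trajectories sampled for one query.
rewards = [0.0, 0.5, 1.0, 0.5]
advantages = group_advantages(rewards)
print(advantages)
print(clipped_surrogate([1.0, 1.0, 1.0, 1.0], advantages))
```

Because advantages are centered within the group, a trajectory is rewarded only for doing better than its siblings, so a query on which every sample succeeds or fails contributes almost no gradient.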

4. Semantics-Preserving Sharded-Parallel Execution Engine

To achieve practical inference speeds on corpora exceeding ten million lines, GrepSeek uses a parallel, semantics-preserving execution architecture:

  • Corpus Sharding: The corpus $C$ is split into contiguous, line-aligned shards for embarrassingly parallel execution.
  • Pipeline Classification: Each command pipeline is classified as one of 5 reduction types ({CONCAT, HEAD(N), COUNT, SORTHEAD(N), SEQUENTIAL}). Stateless pipelines (e.g., without awk context) are executed in parallel and reduced deterministically; unsafe or context-dependent pipelines fall back to sequential execution.
  • In all cases, byte-exact output equivalence with the non-sharded, sequential baseline is guaranteed.
  • Optimizations: Data is stored in a RAM filesystem; tools use mmap; environment is fixed (LC_ALL=C); a persistent daemon avoids repeated Python startup. I/O telemetry identifies fallback cases.
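
The shard-then-reduce idea can be sketched in pure Python for three of the reduction types named above. The reducer semantics shown here are an assumption inferred from the type names, the fixed-string filter stands in for a real `grep -F` stage, and threads are used only to show the structure, not to reproduce the reported speedups.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Union


def split_shards(lines: List[str], k: int) -> List[List[str]]:
    """Split the corpus into at most k contiguous, line-aligned shards."""
    size = max(1, -(-len(lines) // k))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def grep_fixed(lines: List[str], needle: str) -> List[str]:
    """Stand-in for a stateless `grep -F needle` stage."""
    return [line for line in lines if needle in line]


def run_sharded(shards: List[List[str]], needle: str,
                reduction: str, n: int = 0) -> Union[List[str], int]:
    """Run the filter on every shard in parallel, then reduce deterministically."""
    if reduction not in {"CONCAT", "HEAD", "COUNT"}:
        raise ValueError("unsupported here: fall back to sequential execution")

    def work(shard: List[str]):
        hits = grep_fixed(shard, needle)
        if reduction == "COUNT":
            return len(hits)
        if reduction == "HEAD":
            return hits[:n]  # no shard can contribute more than n lines
        return hits

    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(work, shards))  # map preserves shard order

    if reduction == "COUNT":
        return sum(partials)
    merged = [line for part in partials for line in part]
    return merged[:n] if reduction == "HEAD" else merged


corpus = [f"doc {i} {'alpha' if i % 3 == 0 else 'beta'}" for i in range(20)]
shards = split_shards(corpus, 4)
print(run_sharded(shards, "alpha", "COUNT"))
```

Because shards are contiguous and results are merged in shard order, concatenating per-shard output reproduces the sequential output for these stateless filters, which is the property the equivalence guarantee relies on.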

Empirically, sharding delivers up to a $7.6\times$ speedup, with latency for full-corpus scans dropping nearly linearly as the number of shards grows. This architecture is critical for enabling interactive, retrieval-augmented LLM workflows at web scale.

5. Experimental Results and Comparative Analysis

GrepSeek was evaluated on seven QA benchmarks spanning single-hop and multi-hop settings: NaturalQuestions, TriviaQA, PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle. The corpus is the 2018 Wikipedia dump (21M passages, 14 GB). Models were compared using micro-averaged token-level $F_1$ as the primary metric, with exact match (EM) reported in the appendix.
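
For readers unfamiliar with the metric, token-level $F_1$ and its micro-average can be computed roughly as follows. This sketch assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and articles) and assumes that micro-averaging pools all examples across benchmarks; the evaluation's exact normalization is not specified here.

```python
import re
import string
from collections import Counter
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against one gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def micro_average(scores_by_benchmark: Dict[str, List[float]]) -> float:
    """Average over all examples pooled together, so larger benchmarks weigh more."""
    pooled = [s for scores in scores_by_benchmark.values() for s in scores]
    return sum(pooled) / len(pooled)


print(token_f1("Paris France", "Paris"))
```

Pooling examples means the micro-average generally differs from the plain mean of the per-benchmark scores.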

Model                   NQ     TriviaQA  PopQA  HotpotQA  2Wiki  MuSiQue  Bamboogle  Micro-avg
Direct                  0.273  0.557     0.236  0.284     0.335  0.115    0.165      0.334
BM25-RAG                0.333  0.666     0.324  0.443     0.347  0.131    0.284      0.413
E5-RAG                  0.507  0.707     0.447  0.421     0.323  0.149    0.338      0.460
Qwen3-RAG               0.500  0.721     0.505  0.455     0.350  0.161    0.348      0.491
Search-R1 (best dense)  0.507  0.773     0.475  0.559     0.430  0.288    0.699      0.544
GrepSeek                0.522  0.767     0.486  0.623     0.518  0.301    0.621      0.569

  • GrepSeek achieved the highest micro-average $F_1$ (0.569) and the best score on 4 of 7 benchmarks (NQ, HotpotQA, 2Wiki, MuSiQue); the differences are reported as statistically significant. Gains concentrate on multi-hop scenarios and bridge-entity questions.
  • Per-query latency was dominated by LLM decoding rather than tool execution. The memory footprint covers only the raw text and is lower than that of E5 and Qwen3 because no full retrieval index is needed. Offline preprocessing was minimal (on the order of 1 min).
  • Ablations indicate that both cold-start SFT and GRPO are indispensable: removing either one degrades performance relative to the full system. Qualitative analysis shows that typically 70–80% of commands used fixed-string matching, with frequent use of cascaded grep-style filtering or pipelines ending in | head. RL refinement increased the average number of lines retrieved per command.

6. Context, Limitations, and Future Directions

GrepSeek highlights DCI as a viable complement to retriever-centered RAG approaches, prioritizing lexical precision and evidence traceability at the potential expense of surface-form flexibility and ranking. On tasks with brittle formats or extensive linguistic variation (e.g., PopQA, Bamboogle), performance may decline, as lexical pipelines can be sensitive to diacritics, spelling variants, and file-order dependencies.

Potential future enhancements include:

  • Fuzzy matching in pipelines (approximate grep), addressing brittle surface forms
  • Integration of semantic ranking at the execution level
  • Compressed, on-disk corpus shards for larger datasets
  • More compact representation of trajectory histories to further reduce inference time
  • Hybrid models combining neural retrievers for coarse filtering with DCI agents for surgical evidence extraction
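
As an illustration of the first item, approximate matching could combine diacritic folding with a similarity threshold. The sketch below is one hypothetical design built on the Python standard library, not a proposal from the paper; `fold`, `fuzzy_grep`, and the 0.85 threshold are assumptions.

```python
import unicodedata
from difflib import SequenceMatcher
from typing import List


def fold(text: str) -> str:
    """Strip diacritics and case so 'Beyoncé' and 'beyonce' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def fuzzy_grep(lines: List[str], needle: str, threshold: float = 0.85) -> List[str]:
    """Keep lines that contain the needle exactly (after folding) or nearly."""
    target = fold(needle)
    width = len(target)
    hits = []
    for line in lines:
        folded = fold(line)
        if target in folded:
            hits.append(line)
            continue
        # Slide a needle-sized window over the line and keep the best similarity.
        best = max(
            (SequenceMatcher(None, target, folded[i:i + width]).ratio()
             for i in range(max(1, len(folded) - width + 1))),
            default=0.0,
        )
        if best >= threshold:
            hits.append(line)
    return hits


lines = [
    "Beyonc\u00e9 released an album.",
    "Beyonse is a common misspelling.",
    "Unrelated line about rivers.",
]
print(fuzzy_grep(lines, "Beyonce"))
```

The sliding-window comparison is quadratic in line length and far slower than an exact scan, so a design like this would more plausibly serve as a fallback when exact matching returns nothing.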

Applications beyond QA include large-scale code search, log mining, spoken language understanding corpus analysis, and interactive exploratory data discovery (Salemi et al., 28 May 2026).

A plausible implication is that as corpora and precision requirements increase, DCI-based agents like GrepSeek may become integral to hybrid retrieval-generation pipelines in academic, industrial, and exploratory research workflows.
