Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

Published 14 May 2026 in cs.CL | (2605.15184v1)

Abstract: Recent advances in LLM agents have enabled complex agentic workflows where models autonomously retrieve information, call tools, and reason over large corpora to complete tasks on behalf of users. Despite the growing adoption of retrieval-augmented generation (RAG) in agentic search systems, existing literature lacks a systematic comparison of how retrieval strategy choice interacts with agent architecture and tool-calling paradigm. Important practical dimensions, including how tool outputs are presented to the model and how performance changes when searches must cope with more irrelevant surrounding text, remain under-explored in agent loops. This paper reports an empirical study organized into two experiments. Experiment 1 compares grep and vector retrieval on a 116-question sample from LongMemEval, using a custom agent harness (Chronos) and provider-native CLI harnesses (Claude Code, Codex, and Gemini CLI), for both inline tool results and file-based tool results that the model reads separately. Experiment 2 compares grep-only and vector-only retrieval while progressively mixing in additional unrelated conversation history, so that each query is embedded in more distracting material alongside the passages that matter. Across Chronos and the provider CLIs, grep generally yields higher accuracy than vector retrieval in our comparisons in experiment 1; at the same time, overall scores still depend strongly on which harness and tool-calling style is used, even when the underlying conversation data are the same.

Authors (5)

Summary

The paper reports that, on a 116-question LongMemEval sample, grep-based lexical retrieval generally outperforms vector retrieval when tool results are delivered inline, across both a custom harness and provider-native CLI harnesses.
It employs a full-factorial experimental design on the LongMemEval benchmark to systematically evaluate retrieval strategies, harness configurations, and tool-calling modes.
The study underscores that the interplay between retrieval methods and harness architectures is critical, challenging assumptions about semantic retrieval dominance in complex workflows.

Agent Harnesses and Retrieval in Agentic Search: Empirical Insights

Introduction

The growing autonomy of LLM agents has shifted the retrieval paradigm from static RAG pipelines to dynamic, agentic search: agents now generate their own search queries, invoke tools iteratively, and consume retrieval outputs while orchestrating complex workflows. Despite widespread adoption of both semantic and lexical retrieval in RAG systems, the interaction between retrieval strategy and agent harness architecture has not been evaluated comprehensively, particularly with respect to how tool calls are mediated and how much noise the corpus contains. The paper "Is Grep All You Need? How Agent Harnesses Reshape Agentic Search" (2605.15184) addresses these gaps with a multifactorial empirical analysis based on controlled experiments on the LongMemEval benchmark. By systematically varying retrieval mode, harness architecture, and tool-calling mechanism, the study characterizes agentic retrieval over realistic, noise-prone corpora.

Retrieval Strategies and Harness Architectures

The study distinguishes between three principal retrieval strategies: lexical (e.g., grep, BM25), semantic (vector/dense retrieval), and hybrids, although the experiments themselves compare only grep-only and vector-only conditions. Lexical systems perform exact or pattern-based matching with minimal overhead and excel when the relevant evidence survives as literal spans. Semantic retrieval instead relies on approximate nearest-neighbor (ANN) similarity in embedding space, which handles paraphrase and vocabulary mismatch more gracefully but requires more infrastructure and is more susceptible to topic-drift false positives.

Agent harness architecture, a frequently underappreciated variable, is shown to exert as much influence on retrieval efficacy as the choice of retriever itself. The authors evaluate both custom harnesses (e.g., Chronos, with category-conditioned prompting, fine-grained tool control, and explicit management over context window utilization) and provider-native CLI harnesses (e.g., Claude Code, Codex, Gemini CLI), which afford less explicit process management but integrate more directly with shell utilities. The harness’s impact is evident in substantial differences in accuracy when swapping harness for a fixed model and retrieval method.

Of particular note is tool-calling presentation: results surfaced inline (directly in context) versus programmatic (via file artifacts). Inline delivery, though context-limited, enables immediate agent action; programmatic delivery alleviates context-bandwidth constraints but adds complexity to agent workflows (requiring explicit file access and integration), introducing new opportunities for failure or degradation.
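The two delivery paths can be made concrete with a small sketch. This is an illustration under assumptions, not the paper's harness code: the function names (`grep_sessions`, `deliver_inline`, `deliver_to_file`), the result file name, and the toy sessions are all hypothetical. It shows why file-based delivery adds a step: the tool result is only a pointer, and the agent has to issue a second read to see the evidence.

```python
import re
import tempfile
from pathlib import Path


def grep_sessions(sessions, pattern):
    """Return (session_id, line) pairs whose line matches the regex pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for session_id, text in sessions.items():
        for line in text.splitlines():
            if rx.search(line):
                hits.append((session_id, line.strip()))
    return hits


def format_hits(hits):
    """Render hits one per line as 'session_id: matching line'."""
    return "\n".join(f"{sid}: {line}" for sid, line in hits)


def deliver_inline(hits, max_chars=2000):
    """Inline delivery: the tool result carries the hits, cut to a character budget."""
    body = format_hits(hits)
    if len(body) > max_chars:
        body = body[:max_chars] + "\n[truncated]"
    return body


def deliver_to_file(hits, out_dir):
    """File-based delivery: write hits to disk and return only a pointer.

    The agent must make a further tool call to read the file contents.
    """
    path = Path(out_dir) / "grep_results.txt"  # hypothetical file name
    path.write_text(format_hits(hits), encoding="utf-8")
    return f"{len(hits)} matches written to {path}"


# Toy conversation history (made up for illustration).
sessions = {
    "s1": "User: I moved to Lisbon on 3 March.\nAssistant: Noted.",
    "s2": "User: My favourite tea is oolong.\nAssistant: Got it.",
}
hits = grep_sessions(sessions, r"lisbon")
inline_result = deliver_inline(hits)
with tempfile.TemporaryDirectory() as tmp:
    pointer = deliver_to_file(hits, tmp)
    file_body = (Path(tmp) / "grep_results.txt").read_text(encoding="utf-8")
```

In this sketch `inline_result` already contains the matching line, whereas `pointer` contains only a count and a path. Any failure in the follow-up read (wrong path, partial read, skipped step) loses evidence that the retriever did find.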

Experimental Methodology

Two experiments structure the empirical analysis. Experiment 1 executes a full-factorial comparison of grep versus vector retrieval across harnesses and tool-calling modes, using a 116-question LongMemEval subset. Experiment 2 stresses the system by progressively increasing irrelevant distractor sessions, simulating deployments over growing, noisy corpora and measuring degradation profiles.
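The noise manipulation in Experiment 2 can be pictured as building a haystack per question at each session limit. The sketch below is an assumption about how such a construction might look, not the authors' procedure: `build_haystack`, the session identifiers, and the session limits are hypothetical. It keeps every evidence session and fills the remaining slots with sampled distractors, using a fixed seed so a given limit can be reproduced.

```python
import random


def build_haystack(evidence_sessions, distractor_pool, session_limit, seed=0):
    """Return a shuffled haystack holding all evidence plus sampled distractors.

    session_limit caps the total number of sessions. Evidence sessions are
    always kept; remaining slots are filled from the distractor pool.
    """
    if session_limit < len(evidence_sessions):
        raise ValueError("session_limit is smaller than the evidence set")
    rng = random.Random(seed)  # fixed seed so the same haystack can be rebuilt
    n_distractors = min(session_limit - len(evidence_sessions), len(distractor_pool))
    haystack = list(evidence_sessions) + rng.sample(distractor_pool, n_distractors)
    rng.shuffle(haystack)
    return haystack


# Hypothetical identifiers; real sessions would be conversation transcripts.
evidence = ["ev-1", "ev-2"]
pool = [f"noise-{i}" for i in range(50)]
haystacks = {
    limit: build_haystack(evidence, pool, limit, seed=7)
    for limit in (5, 10, 20, 40)
}
```

Running the same question over `haystacks[5]` through `haystacks[40]` would give one degradation curve per retriever. Repeating with several seeds would separate the effect of noise from the variance introduced by which distractors happened to be sampled.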

For all experiments, retrieval runs over both unstructured conversation turns and normalized temporal event records (via Chronos preprocessing), which decouples retrieval capability from downstream temporal reasoning. Answers are scored by GPT-4o acting as an automated grader under fixed prompts and scoring rubrics, so the grading procedure is the same across all conditions.

Empirical Findings

Retrieval–Harness Interactions

The principal result is that inline lexical retrieval (grep) surpasses vector retrieval across all evaluated harness–model pairs in the standard (inline) configuration. The advantage is particularly pronounced for span-centric queries such as temporal reasoning and explicit preference recall. For instance, Chronos with Claude Opus 4.6 achieves 93.1% with inline grep versus 83.6% with inline vector; Codex with GPT-5.4 also reaches 93.1% with grep but drops sharply to 75.9% with vector retrieval. These results run counter to the expectation that dense retrieval is strictly necessary for large or noisy corpora.

However, this superiority is not universal: programmatic (file-based) delivery frequently reshuffles the rankings, and in multiple harness–backbone pairs vector retrieval outperforms grep when agents must explicitly read back file contents. The worst regression observed (Codex with programmatic grep, dropping to 55.2%) underscores that a retriever's advantage under one delivery mode does not reliably carry over to another interaction or orchestration layer.

Corpus Noise and Retrieval Robustness

Experiment 2 only partially confirms practitioner intuitions: as irrelevant context increases, both grep and vector search remain resilient up to moderate noise levels, but their relative ranking is highly variable and depends on the backbone and harness. Vector retrieval is often more robust at modest scale (small session limits), but grep can close or reverse the gap as the session limit increases and span-centric evidence becomes more critical. Crucially, these crossovers are neither monotonic in corpus size nor solely a function of retrieval quality. They emerge from whole-system dynamics, including how agents schedule queries, decide to stop searching, or recover from ambiguous intermediate outputs.

Additionally, stable tendencies are observed per vendor and tool stack that cannot be explained by corpus identity alone. For example, provider-native CLIs with Gemini models frequently exhibit a persistent vector advantage, possibly because their default prompting, shell output formatting, or internal tool integration differs from that of Chronos or Claude-based stacks.

Practical and Theoretical Implications

The study presents strong evidence that retrieval mechanics cannot be analyzed or benchmarked in isolation from their orchestration and delivery mechanisms. Several conclusions emerge:

Lexical retrieval remains highly competitive, often superior, for agentic QA tasks involving literal spans, even at noise levels previously thought to favor semantic retrieval.
Harness architecture (including prompt conditioning, context management, and tool-call formatting) can erase or dominate improvements attributed to retrieval algorithm changes.
Programmatic result delivery introduces workflow complexity that can degrade or invert retrieval benefits, especially if the agent fails to robustly close the file-access/search integration loop. Nonlinearity in performance curves is the norm, not the exception.
No universal scaling law exists for corpus noise: performance is mediated by agent backbone, harness, and stochastic session sampling; stable patterns in one stack do not necessarily transfer to others.

These observations call for a shift in evaluation protocols for agentic systems: reporting the retrieval method alone is inadequate. Precise harness characterization and delivery-path specification are necessary to interpret end-to-end agent performance. Furthermore, blanket recommendations to "default to vector search at scale" are not universally supported; they must be conditioned on task domain, harness class, and agent capability.

Limitations and Future Directions

The study’s scope is limited to conversational QA over long dialogue transcripts with significant verbatim evidence. The demonstrated lexical advantage may attenuate (or reverse) in domains with heavy paraphrase or compositional semantics, and in non-chat or non-text content (e.g., code, scientific literature, multimodal artifacts), where semantic or hybrid retrieval will likely be more useful.

Potential avenues for future work include:

Comprehensive evaluation across additional vendor stacks and retrieval hybrids.
Extension to non-chat domains and more abstract evidence types.
Analysis of agent query/trace logs to isolate sources of harness-induced drift and interface brittleness.
Robust integration and evaluation of agent-driven retrieval-policy selection (meta-retrieval).

Conclusion

This work provides a critical empirical foundation for understanding retrieval–harness–agent interactions in agentic LLM workflows. The evidence demonstrates that retrieval strategy cannot be decoupled from agent orchestration and tool-calling architecture. Lexical retrieval remains a highly effective, sometimes dominant baseline, but its real-world efficacy is contingent upon the complex interplay of harness, backbone, tool presentation, and task structure. Future research evaluating and reporting agent retrieval must treat these dimensions as an inseparable system, driving the community toward more realistic, reproducible, and interpretable benchmarks and deployments.

Explain it Like I'm 14

What is this paper about?

This paper looks at how AI “agents” (smart programs that can use tools and look things up for you) search for information while they work. It asks a simple question with a playful title: is a very basic keyword search (“grep,” like using Ctrl+F to find exact words) good enough, or do we always need fancy “find by meaning” searches (vector/semantic search)? The twist: the authors show that the agent’s setup—how it’s built and how it gets tool results—can change which search method works best.

What questions did the researchers ask?

They focused on three easy-to-understand questions:

Which search method works better for agents in practice: keyword matching (grep) or meaning matching (vector/semantic)?
Does the way search results are handed to the AI matter (pasted right into the chat vs. saved to a file the AI has to open)?
Do different “agent shells” (the software environment the agent lives in) change the outcome, even when the data is the same?

They also asked: what happens when the agent has to search through more and more irrelevant text (noise)?

How did they study it?

Think of the agent’s task as answering questions about long, messy conversations, like finding facts or dates scattered across many chat sessions.

The dataset: 116 questions from LongMemEval, a benchmark that tests long-term memory in chats (e.g., recalling preferences, dates, or details across many sessions).
The searches:
- Grep (keyword/regex search): like Ctrl+F—find exact words or patterns.
- Vector (semantic) search: like asking “find things with the same meaning,” even if the exact words differ.
The agent “harnesses” (their working environments):
- A custom-built setup called Chronos (tuned for time-related info).
- Provider-native command-line shells (CLI) like Claude Code (Anthropic), Codex (OpenAI), and Gemini CLI (Google). These let the AI run shell commands (like grep) directly.
Two ways of giving results to the model:
- Inline (standard): results are pasted into the conversation, right in front of the AI.
- Programmatic (file-based): results are saved to a file; the AI must open/read the file with commands like cat or grep.
Grading answers: A separate model (GPT-4o) checked whether each answer was correct, using the same rules across all tests.

They ran two experiments:

Compare grep vs. vector across different harnesses and delivery styles (inline vs. file-based).
Add more irrelevant text around the useful parts to see how each approach holds up as “noise” increases.

What did they find?

Here are the headline results, in plain language:

Experiment 1 (search method × agent setup × result delivery):
- When results were pasted inline, grep (keyword search) usually beat vector (semantic) search across all agent setups. This matched the idea that many answers in these chats depend on exact strings like dates, names, or counts.
- However, the agent harness (the agent’s “workspace and rules”) mattered a lot. The same model scored very differently in different harnesses. Changing the harness could shift accuracy as much as switching from grep to vector.
- When results were file-based (the AI had to open files), the ranking sometimes flipped: in several cases, vector search did better than grep. Why? Because file-based workflows add extra steps—if the agent struggles to open, read, and re-search files, accuracy can drop, even if the retrieval itself was good.
Experiment 2 (adding noise):
- As more irrelevant conversation history was mixed in, performance didn’t just steadily go down in the same way for both methods. Sometimes vector did better when there was less noise; sometimes grep caught up or overtook as noise grew.
- Which one led depended on the harness and model. In some provider CLIs, grep was consistently stronger; in others, vector was more stable. In short: there’s no one-size-fits-all winner as noise rises.

Why this is important:

Many real problems involve finding exact facts (dates, numbers, specific phrases). Simple keyword tools can be very effective in those cases—especially when results are shown inline.
But the “plumbing” of the agent—its harness and how results are delivered—can make a big difference. A setup that makes the agent jump through hoops (like opening files) can erase the edge of a good retriever.
You can’t judge retrieval methods in isolation; you have to look at the whole system working together.

What do these results mean?

Don’t always default to semantic/vector search. Simple lexical methods (like grep) can win when answers rely on exact phrases and when results are shown inline.
The agent harness is not just background—it shapes prompts, tools, and how results are displayed. Changing the harness can change outcomes as much as changing the search method.
How you deliver results matters. File-based delivery can help avoid crowding the model’s “short-term memory” (context window), but only if the agent reliably handles the extra steps.
Plan for noisy, messy data. Different methods degrade differently as noise grows. Test under realistic noise levels.
Hybrid approaches (using both keyword and semantic search) may be best, letting the agent choose based on the question.

Limitations to remember:

This benchmark is about long chat memories with lots of exact facts and dates. In fields where answers are more paraphrased or conceptual (e.g., scientific summaries), semantic search or hybrids may do better.
The paper doesn’t claim “grep is always best”—only that in these tasks and setups, it often performed better, especially with inline results.

The big takeaway

For agents that need to look things up in long conversations, sometimes the simplest tool (grep) is enough—or even better—especially when results are shown right in the chat. But the agent’s “workspace” and how it receives the results can change everything. If you’re building or judging an AI agent, don’t just pick a retriever in isolation: evaluate the whole loop—retrieval method, harness, and result delivery—together, and test under realistic levels of noise.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains missing, uncertain, or unexplored in the paper, framed to guide concrete follow-on studies.

Generalization beyond long-memory chat: The study is confined to long-term conversational QA with many literal spans; it remains unclear how results transfer to domains where evidence is paraphrased or compositional (e.g., scientific synthesis, code semantics, multimodal documents, or enterprise document QA).
Hybrid retrieval policies untested: Only grep-only and vector-only conditions are compared. There is no evaluation of hybrid approaches (e.g., BM25/grep + dense via RRF, sparse-dense late interaction, or agent policies that select or fuse retrievers per-query).
Lexical baseline limited to grep: Stronger lexical baselines (BM25, SPLADE, learned sparse models) are not included, leaving it unclear whether the observed “grep advantage” persists under competitive lexical methods.
Dense retrieval configuration sensitivity: The paper does not detail or ablate embedding model choice, index type/parameters (e.g., HNSW/IVF), dimensionality, chunking granularity, or reranking models/settings. It is unknown how sensitive outcomes are to these design choices.
Confounding initial context in Chronos: Chronos seeds each episode with a top-15 vector-results block even in “grep-only” runs, creating a potential hybrid/priming confound. An ablation removing or symmetrizing this seed is needed to isolate pure retriever effects.
Harness effects not decomposed: Large performance shifts across harnesses are observed, but the causal factors (system prompts, tool schemas/descriptions, result formatting, iteration policy, error handling) are not isolated. Controlled ablations are needed to pinpoint which harness components drive the differences.
Programmatic (file-based) failure modes uninstrumented: The study shows programmatic delivery can invert outcomes but does not instrument step-wise tool-usage success (file creation, path resolution, reads, partial reads, re-searches) or error types. Fine-grained telemetry is needed to attribute accuracy drops to specific sub-steps.
No measurement of token budgets and context pressure: Although “context pressure” is discussed, the study does not report token counts per turn, truncation rates, or the fraction of tokens consumed by tool outputs across conditions. Quantifying these would clarify when file-based delivery helps.
Latency and cost left unreported: There is no accounting of runtime, number of tool calls, token usage, embedding/indexing costs, or wall-clock latency per question. Practical deployability remains unassessed.
Statistical robustness and significance: The paper reports accuracy percentages without confidence intervals, statistical tests, or paired bootstraps. It is unknown whether observed differences are statistically reliable given the task size and grading noise.
Grader dependence and validity: A single LLM grader (GPT-4o) is used with fixed prompts. The sensitivity of results to grader choice, prompt variations, or human adjudication is not evaluated.
Dataset scope and selection: Experiments use a 116-question subset of LongMemEval; selection criteria and representativeness are not detailed. It’s unknown whether results hold on the full benchmark or other long-memory datasets.
Per-question indices reduce realism: Vector indices are built per-question rather than over a single large, shared corpus. This deviates from realistic deployments and may understate cross-query interference and scaling issues.
Noise scaling via resampled distractors: In Experiment 2, distractors are resampled at each session limit, adding stochasticity. The study does not quantify variance across resamples or provide fixed-seed replications to separate noise effects from sampling variance.
Incomplete experimental grid: Some rows (e.g., Codex vector intermediates in Experiment 2) are missing, preventing a complete cross-harness picture of noise scaling.
Lack of action-level behavioral analysis: The paper does not analyze agent traces (query strings, grep flags, retry counts, stop criteria) to categorize failure modes (e.g., grep vocabulary mismatch vs. dense topical drift). Such a taxonomy would make findings actionable.
No adaptive retriever-choice agent: Agents are not allowed to choose between grep and vector at runtime. A key open question is whether an agent can learn reliable policies for retriever selection conditioned on query type and harness/tool-calling mode.
Category-level effects only partially explored: Per-category breakdown is provided for a single condition (Chronos, grep-only, inline). It remains unresolved which categories systematically benefit from lexical vs. dense retrieval across harnesses and delivery modes.
Baseline without retrieval missing: The study lacks a “no retrieval” control to estimate the absolute value added by each retrieval strategy and harness configuration.
Robustness to paraphrase and query phrasing: There is no targeted evaluation under paraphrase-heavy queries or synonym substitutions to probe grep brittleness vs. dense robustness.
Multilingual and cross-domain generalization: Only (presumably) English chat data are evaluated. It is unknown how retrieval choices and harness effects change in multilingual settings or non-chat corpora.
Context window and model-size dependencies: The impact of model context length, memory mechanisms, and backbone size on the retrieval-mode ranking is not systematically varied or analyzed.
Provider CLI opacity and reproducibility: Provider-native CLIs are treated as black boxes; internal prompts, tool heuristics, and buffering policies are unknown. Reproducible open-source harness facsimiles or released transcripts/prompts would help isolate provider-specific effects.
Security and reliability of tool use: The study does not evaluate robustness to tool execution errors, malformed outputs, or adversarial file contents—factors that can differentially affect inline vs. file-based delivery.
Scaling to substantially larger corpora: The “full haystack” comprises 39–66 sessions per item—small relative to real deployments. It remains open how lexical vs. dense retrieval (and programmatic vs. inline delivery) behave at orders-of-magnitude larger scales with shared indices.
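The statistical-robustness item above can be illustrated with a short calculation. This is a sketch under stated assumptions rather than an analysis from the paper: it assumes all 116 questions were graded in each condition, in which case 108 and 97 are the only integer counts of correct answers that round to the reported 93.1% and 83.6% for Chronos with inline grep and inline vector. The Wilson score interval used here is a standard choice for binomial proportions; the function name is hypothetical.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


N = 116  # questions in the LongMemEval sample
# Assumed counts, inferred from the reported percentages (not stated in the text).
grep_correct, vector_correct = 108, 97

grep_ci = wilson_interval(grep_correct, N)
vector_ci = wilson_interval(vector_correct, N)

print(f"grep:   {grep_correct / N:.1%}  CI ≈ ({grep_ci[0]:.3f}, {grep_ci[1]:.3f})")
print(f"vector: {vector_correct / N:.1%}  CI ≈ ({vector_ci[0]:.3f}, {vector_ci[1]:.3f})")
```

Under these assumptions the two marginal intervals overlap slightly, which shows how wide the uncertainty is at n = 116. Because both conditions answer the same questions, a paired comparison (for example McNemar's test or a paired bootstrap over per-question outcomes) would be the more appropriate check, and it requires per-question results that are not reproduced here.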

Practical Applications

Immediate Applications

The paper’s results suggest several deployable practices and workflows that can improve agentic RAG systems today, especially for long-memory, chat-centric tasks where answers are licensed by literal spans (dates, counts, preferences).

Software and DevOps (software)
- Grep-first agent workflows for logs and chat transcripts
- Use inline lexical search (grep/BM25/regex) as the default retriever for incident response, SRE postmortems, and chatops assistants where exact strings (timestamps, error codes, commit SHAs) matter.
- Tools/products/workflows: “grep tool” exposed to the agent; provider CLI agents with native shell access; custom harnesses that render grep hits inline.
- Assumptions/Dependencies: Evidence appears verbatim; secure sandboxing for shell tools; large result sets fit in context or are summarized.
- Harness-aware RAG configuration
- Evaluate retrieval choices per agent harness (custom vs provider CLI); adopt the harness where your backbone model and tool-calling style perform best for your corpus.
- Tools/products/workflows: A/B testing across harnesses (Chronos-style vs provider CLIs); CI checks that validate retrieval accuracy per harness.
- Assumptions/Dependencies: Same corpus across harnesses; consistent grading/evaluation; provider CLI stability.
Customer Support and CRM (software, customer operations)
- Conversation memory assistants using grep for literal preference/state recall
- Deploy lexical retrieval over long chat histories to answer “what did the user say earlier?” or “what date did they agree to?” reliably.
- Tools/products/workflows: Inline grep over session JSON; category-conditioned prompting (e.g., preference vs knowledge-update) to guide tool use.
- Assumptions/Dependencies: Chat logs are accessible and normalized; privacy controls on PII; LongMemEval-like tasks where literal spans suffice.
Finance, Audit, and Compliance (finance, legal)
- Evidence retrieval from audit trails and communications
- Use inline lexical retrieval to surface exact mentions in email/chat/ticket systems for audits, KYC/AML checks, or regulatory inquiries.
- Tools/products/workflows: Agent with grep + simple regex patterns for IDs and dates; file-system sandbox with per-matter corpora; inline result rendering for quick adjudication.
- Assumptions/Dependencies: Data custodianship and redaction in place; chain-of-custody logging; text is not heavily paraphrased.
Legal and eDiscovery (legal)
- High-precision cull and review
- Employ grep and learned sparse methods for first-pass culling where exact phrases or names are authoritative; escalate to vector search for paraphrase-heavy items as a secondary tool.
- Tools/products/workflows: Dual-tool panel; reciprocal rank fusion (RRF) optional for tie-breaking; inline display for reviewer triage.
- Assumptions/Dependencies: Index quality; predictable naming/terminology; governance for shell tool execution on case data.
Healthcare Operations (healthcare)
- Chart review for exact dosages/dates in clinical notes
- Use lexical retrieval to extract explicit temporal and dosage spans for scheduling, refills, or quality reporting; pair with structured temporal event extraction (Chronos-like) to normalize time mentions.
- Tools/products/workflows: Preprocessing pipeline to extract/normalize temporal events; inline display of grep hits with surrounding context.
- Assumptions/Dependencies: PHI controls; domain vocab varies—lexical may miss synonyms; clinical paraphrase tasks may still need dense retrieval.
Education and Student Support (education)
- Recall of deadlines, policies, and prior feedback
- Student or advisor assistants that retrieve literal course dates, policy clauses, or past comments from LMS chats and syllabi using grep.
- Tools/products/workflows: On-device/local grep over exported histories; category prompts for temporal reasoning vs preference recall.
- Assumptions/Dependencies: Access to LMS exports; performance degrades if evidence is paraphrased or buried in images/PDFs without OCR.
Cost and Infrastructure Optimization (software, finance/ops)
- Reduce embedding spend where lexical suffices
- Default to grep for long-memory tasks with literal cues; only pay for embeddings when paraphrase coverage is needed or programmatic routing favors vector.
- Tools/products/workflows: Policy that routes “literal-span” queries to lexical; budget dashboards comparing embedding costs vs accuracy.
- Assumptions/Dependencies: Clear query taxonomy; monitoring to detect missed paraphrases.
Observability and QA for Agents (software, MLOps)
- Harness + delivery-path benchmarking as a release gate
- Incorporate the paper’s insight that harness and delivery (inline vs file) materially change outcomes; test both before deployment.
- Tools/products/workflows: “Harness Auditor” that runs LongMemEval-like suites across harness/delivery permutations; logs decision traces.
- Assumptions/Dependencies: Stable evaluation rubric; reproducible corpora; access to multiple provider CLIs.
Safety, Privacy, and Governance (policy, software)
- Programmatic/file-based delivery to control context exposure
- For privacy-sensitive corpora, route large results to files and have the agent read targeted slices, limiting token spillage into prompts.
- Tools/products/workflows: File-based result buffers with access control; agent policies to “read only what you need” via grep-on-files.
- Assumptions/Dependencies: Reliable multi-step tool use; added latency; failure modes if the agent cannot complete read-integrate cycles.
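Several of the workflows above mention combining lexical and vector results, with reciprocal rank fusion (RRF) as one option. The sketch below shows the standard RRF scoring rule, where each document receives the sum of 1 / (k + rank) over the lists it appears in, with k = 60 as the commonly used constant. The paper does not evaluate any hybrid, so this is purely illustrative; the session IDs are made up, and it assumes the grep hits have already been put in some ranked order (grep itself does not rank).

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical ranked session IDs from each retriever.
grep_ranking = ["s12", "s03", "s27"]
vector_ranking = ["s03", "s41", "s12"]

fused = reciprocal_rank_fusion([grep_ranking, vector_ranking])
print(fused)  # ['s03', 's12', 's41', 's27']
```

Sessions found by both retrievers (`s03`, `s12`) rise to the top, while sessions found by only one stay in the list at lower rank. That behaviour is the reason RRF is a common low-effort baseline for hybrid retrieval.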

Long-Term Applications

Building on the paper’s findings, several directions require further research, scaling, or productization before broad deployment.

Adaptive Hybrid Retrieval Policies (software, research)
- Learned switches between lexical and dense retrieval conditioned on harness, backbone, delivery mode, and corpus noise.
- Potential products: “Retrieval Switchboard” that chooses grep vs vector (or combines via RRF/ColBERT) per query and environment.
- Assumptions/Dependencies: Online telemetry on retrieval success; reinforcement learning or bandit frameworks; explainability for audit.
Harness-Optimizing Agent Orchestration (software platforms)
- Meta-layer that tunes prompts, tool descriptions, and delivery paths for each provider CLI or custom harness to maximize end-to-end accuracy.
- Potential products: “Harness Optimizer” that auto-generates harness-specific tool ergonomics and prompt templates.
- Assumptions/Dependencies: Access to provider internals is limited; needs data-driven search over prompt + tool-calling patterns.
Robustness-to-Noise Toolchains (software, academia)
- Retrieval and orchestration that remain stable as irrelevant content scales; noise-aware index maintenance and query rewriting.
- Potential products: “Noise Robustness Evaluator” and remedial strategies (e.g., multi-hop lexical patterns, filter chains before vector search).
- Assumptions/Dependencies: Synthetic and real noise models; cross-domain benchmarks beyond LongMemEval.
Domain-General Structured Event Extraction (healthcare, finance, legal, software)
- Chronos-like pipelines for temporal and event normalization across domains (e.g., medication timelines, trade events, contract clauses).
- Potential products: Event indexers integrated with agents; temporal-aware retrieval layers for EHRs, ledgers, and DMS repositories.
- Assumptions/Dependencies: High-accuracy extraction models; domain ontologies; validation against gold standards for safety-critical use.
Vendor-Agnostic CLI Abstractions (software tooling)
- Standard interface that exposes grep-like primitives safely across providers, reducing portability gaps observed between CLIs.
- Potential products: “CLI Agent SDK” with normalized tool semantics and transcript formatting for consistent retrieval behavior.
- Assumptions/Dependencies: Provider cooperation; security review for shell command exposure.
Evaluation and Reporting Standards (policy, academia, industry consortia)
- Require reporting of retrieval mechanics, harness type, and delivery path in benchmarks and system cards to avoid misleading comparisons.
- Potential outputs: Standards in AI system documentation; regulatory guidance for public-sector procurement of RAG/agent systems.
- Assumptions/Dependencies: Community buy-in; alignment with existing AI transparency frameworks.
Privacy-Preserving Programmatic Retrieval (policy, healthcare, finance)
- Differential privacy or secure enclaves for file-based result routing to minimize data leakage in long-context agents.
- Potential products: Secure buffer managers and encrypted temporary stores; audit trails for file access by agents.
- Assumptions/Dependencies: Secure execution environments; performance overhead tolerance; compliance approvals.
Training-Time Co-Design of Retriever and Harness (research)
- Jointly optimize dense retrievers with harness-specific prompts and delivery styles, or train models to be better “file navigators.”
- Potential products: “Harness-conditioned retrievers” and curricula that teach read/open/parse cycles reliably.
- Assumptions/Dependencies: Access to fine-tuning data; risk of overfitting to a harness; transfer learning challenges.
Sector-Specific Agent Products
- Legal: eDiscovery agents combining event extraction and hybrid retrieval for clause evolution across drafts.
- Finance: Audit agents that reconcile ledger events with email approvals using temporal indices plus lexical confirmation.
- Healthcare: Care coordination agents that align orders, labs, and appointments via structured timelines and targeted retrieval.
- Assumptions/Dependencies: Strong governance and validation; integration with existing systems (DMS, EHR, ERP); domain adaptation.
Human-in-the-Loop Retrieval Debuggers (software, UX)
- Interfaces that expose search traces, show grep/vector hits, and let SMEs steer query reformulations or choose delivery modes mid-run.
- Potential products: Agent “flight recorder” + interactive reranker; teaching LLMs high-precision regex patterns from expert feedback.
- Assumptions/Dependencies: Usable provenance displays; latency budgets; logging policies.

Cross-Cutting Assumptions and Dependencies

Task fit: The strongest gains are for long-memory conversational QA with literal evidence; domains requiring heavy paraphrase understanding may benefit more from dense or hybrid retrieval.
Tool competence: Programmatic/file-based gains depend on reliable multi-step tool use; weak models may falter despite good retrieval.
Provider idiosyncrasies: Differences across CLI harnesses materially affect outcomes; portability requires testing and adaptation.
Data governance: Exposing grep/shell tools to agents must be sandboxed and auditable, especially for regulated data.
Cost-performance trade-offs: Lexical retrieval avoids embedding costs but may miss paraphrases; dense retrieval adds infrastructure and latency.

These applications translate the paper’s core findings (“lexical often wins with inline delivery, but harness and delivery path can invert outcomes”) into concrete deployment choices, product ideas, and research agendas across sectors.
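The difference between inline and file-based delivery can be made concrete with a small sketch. Everything here is hypothetical: the function names, the message dictionary shape, and the `search_results.txt` filename are illustrative assumptions, not the interface of Chronos or of any provider CLI.

```python
from pathlib import Path


def deliver_inline(hits, max_chars=2000):
    """Inline mode: results are placed directly in the tool message."""
    body = "\n".join(hits)
    if len(body) > max_chars:
        # Assumption: long outputs are truncated to limit context pressure.
        body = body[:max_chars] + "\n[truncated]"
    return {"role": "tool", "content": body}


def deliver_as_file(hits, out_dir):
    """File-based mode: results go to disk; the model sees only a pointer."""
    out_path = Path(out_dir) / "search_results.txt"
    out_path.write_text("\n".join(hits), encoding="utf-8")
    return {"role": "tool", "content": f"{len(hits)} hits written to {out_path}"}


def read_file_tool(path, start=0, limit=20):
    """Follow-up tool the agent must call to actually see the results."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[start:start + limit])
```

In the file-based path the evidence only reaches the model if it issues the extra `read_file_tool` call, which is the multi-step tool competence the "Tool competence" item above depends on.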


Glossary

Agent harness: The environment layer that orchestrates an agent’s tools, prompts, and iteration to solve tasks. "The agent harness is the environment layer that manages the tool-calling loop: it constructs the prompt, dispatches tool calls, receives results, and decides whether to continue iterating or produce a final answer."
Agentic retrieval: Retrieval conducted by an agent that iteratively decides what and how to search based on intermediate results. "agentic retrieval is iterative and agent-directed: the model decides what to search for, how many queries to issue, and whether the retrieved results are sufficient or require refinement"
Agentic search: Search performed within agent-driven workflows that interleave tool use and reasoning. "Despite growing adoption of agentic search [1, 7, 24]"
Approximate nearest-neighbor (ANN) search: A fast technique to find vectors closest to a query vector in high-dimensional spaces. "the most common way being approximate nearest-neighbor (ANN) search [8]."
BEIR benchmark: A widely used benchmark suite for evaluating information retrieval models across diverse tasks. "The BEIR benchmark demonstrated that BM25 remains a competitive baseline across diverse retrieval tasks, often outperforming early dense retrieval models in zero-shot settings [23]."
BM25: A classical term-matching scoring function for lexical retrieval based on term frequency and inverse document frequency. "Classical approaches such as BM25 [11] score documents by term frequency and inverse document frequency"
Category-conditioned instructions: Evaluation or prompting guidance tailored to a question’s category. "under category-conditioned instructions (e.g., tolerance for off-by-one temporal counts, rubric-style scoring for preference items, and abstention handling for _abs variants)."
Chronos: A custom agent harness and preprocessing pipeline emphasizing structured temporal event retrieval. "Our custom harness, Chronos, implements an agent using LangChain with access to four search tools (grep and vector search over turns and events)."
ColBERT: A late-interaction retrieval model computing fine-grained token-level similarities for efficient and effective search. "Late interaction models such as ColBERT [9] compute fine-grained token-level similarity between query and document representations, achieving a middle ground between the efficiency of single-vector retrieval and the expressiveness of cross-encoder reranking."
Context engineering: Design choices that shape how context is constructed and fed to the model for better performance. "leverage the provider's optimized context engineering"
Context pressure: Competition for limited context window space among prompts, history, and results, which can degrade performance. "creating context pressure - a phenomenon sometimes called context rot - that can degrade performance on long-horizon tasks [12]."
Context rot: Degradation in model performance as relevant information is crowded out of the context window over long interactions. "creating context pressure - a phenomenon sometimes called context rot - that can degrade performance on long-horizon tasks [12]."
Context window: The maximum amount of text a model can attend to at once. "enabling them to reason over corpora that far exceed their context windows."
Cross-encoder reranking: A reranking method where a cross-encoder scores query-document pairs for higher-quality ordering after initial retrieval. "the expressiveness of cross-encoder reranking."
Custom harnesses: Developer-built agent environments offering fine-grained control over prompts, tools, and iteration. "Custom harnesses are built by developers using agent frameworks, provider open SDKs, or custom code [22, 29]."
Dense Passage Retrieval (DPR): A dual-encoder retrieval approach that retrieves by semantic meaning rather than exact terms. "Dense passage retrieval (DPR) established this paradigm by training dual encoders on question-passage pairs, enabling retrieval based on meaning rather than surface-level term overlap [8]."
Dense retrieval: Retrieval that uses dense vector embeddings to match queries and documents semantically. "The information retrieval community has extensively benchmarked lexical and dense retrieval methods [4, 13, 23]"
Distractor sessions: Irrelevant sessions mixed with oracle evidence to increase noise in the retrieval corpus. "and a variable number of distractor sessions that are irrelevant to the query."
Dual encoders: A model architecture with separate encoders for queries and documents used in dense retrieval. "by training dual encoders on question-passage pairs"
Dynamic prompting: Adapting the system prompt and guidance based on the detected task or question category. "we use dynamic prompting: the system instructions, search hints, and tool-use guidance depend on the detected question category"
Grep: A lexical search tool that matches text via substrings or regular expressions. "grep search uses regular expressions or substring matching to locate passages containing specific keywords [14]."
Hybrid retrieval: Combining lexical and semantic retrieval signals or results to improve effectiveness. "Hybrid retrieval combines lexical and semantic signals to leverage the strengths of both paradigms."
LangChain: A framework for building LLM-powered agents and tool-using workflows. "Our custom harness, Chronos, implements an agent using LangChain"
Late interaction models: Retrieval models that compare query and document at token-level granularity after independent encoding. "Late interaction models such as ColBERT [9] compute fine-grained token-level similarity"
Latent space: The vector space where embeddings place semantically similar items close together. "into a shared latent space for approximate nearest-neighbor matching [8]"
Lexical search: Retrieval based on exact or pattern-based term matching over raw text. "lexical search (e.g., grep, BM25, regex)"
LongMemEval: A benchmark for evaluating assistants on long-term memory across multi-session conversations. "We evaluate on a 116-question representative subset of the LongMemEval benchmark [27]"
Oracle sessions: Sessions that contain the necessary information to answer a question correctly. "one or more oracle sessions containing the information needed to answer correctly"
Programmatic (File-Based): A tool-calling mode that writes results to files, which the agent must explicitly read or process. "In programmatic tool-calling architectures, search results are written to disk and the model receives only a file path or summary pointer [14, 16]."
Provider-native CLI harnesses: Vendor-supplied command-line agent environments with built-in tool execution capabilities. "Provider-native CLI harnesses embed tool calling into a shell-based interface where the model has direct access to system utilities [2, 28]."
Reciprocal Rank Fusion (RRF): A method to combine ranked lists from different retrievers without calibrating scores. "Reciprocal rank fusion (RRF) [3, 6] merges ranked lists from independent lexical and dense retrievers without requiring score calibration."
Retrieval-Augmented Generation (RAG): Systems that retrieve external knowledge to condition and improve generation at inference time. "retrieval-augmented generation (RAG) in agentic systems"
ReAct paradigm: An agent prompting strategy interleaving reasoning steps with tool actions. "The ReAct paradigm [29], which interleaves reasoning traces with tool actions, is the most widely adopted pattern for custom harnesses."
Reranking: Reordering initially retrieved candidates using a secondary scoring model to improve final result quality. "with optional post-retrieval reranking to refine the initial candidate set [5, 26]."
Semantic vector search: Embedding-based retrieval matching queries and documents by meaning rather than exact terms. "semantic vector search, which embeds queries and documents into a shared latent space for approximate nearest-neighbor matching [8]"
SPLADE: A learned sparse lexical model that expands terms to bridge exact matching and semantic understanding. "Learned sparse representations such as SPLADE [4] extend lexical matching by expanding query and document terms through a learned vocabulary"
Standard (Inline): A tool-calling mode where results are injected directly into the model’s conversation context. "In standard tool-calling architectures, search results are returned directly as tool response messages appended to the conversation context [17, 19, 20]."
Tool-calling loop: The iterative process by which an agent issues tool calls, consumes results, and decides whether to continue. "manages the tool-calling loop: it constructs the prompt, dispatches tool calls, receives results, and decides whether to continue iterating or produce a final answer."
Vector index: A data structure that stores embeddings to support efficient nearest-neighbor retrieval. "it introduces dependencies on embedding model quality, vector index infrastructure, and indexing latency that lexical methods avoid."
Vector retrieval: Retrieval that uses vector embeddings (dense representations) to find semantically similar passages. "Experiment 1 compares grep and vector retrieval on a 116-question sample from LongMemEval"
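
As a worked illustration of the Reciprocal Rank Fusion entry above, the sketch below fuses a lexical and a dense ranking using ranks only. It is a minimal sketch rather than the paper's implementation: the function name and document IDs are hypothetical, and `k=60` is the constant commonly used in the RRF literature, assumed here rather than taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs using rank positions only.

    Each list contributes 1 / (k + rank) per document, so raw scores from
    different retrievers never need to be calibrated against each other.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical turn IDs returned by a grep tool and a vector tool.
grep_hits = ["t3", "t1", "t7"]
vector_hits = ["t1", "t4", "t3"]
fused = reciprocal_rank_fusion([grep_hits, vector_hits])
```

Documents that appear in both lists (`t1` and `t3` here) rise to the top of `fused`, which is the intended effect of combining lexical and dense signals.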


Open Problems

Vendor-complete characterization of CLI grep vs. vector performance under increasing distraction

