Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction
Abstract: Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but for agentic search, it becomes a bottleneck: exact lexical constraints, sparse clue conjunctions, local context checks, and multi-step hypothesis refinement are difficult to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning. Agentic tasks further exacerbate this limitation because they require agents to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence. To address this limitation, we study direct corpus interaction (DCI), where an agent searches the raw corpus directly with general-purpose terminal tools (e.g., grep, file reads, shell commands, lightweight scripts), without any embedding model, vector index, or retrieval API. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop QA without relying on any conventional semantic retriever. Our results indicate that as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus; DCI opens a broader interface-design space for agentic search.
Explain it Like I'm 14
Overview
This paper is about a new way for AI “researcher” agents to look things up in big collections of documents. Instead of using a traditional search tool that returns a small list of “top results,” the authors let the agent dig directly through the files using simple command-line tools (like super-powered “find” and “search” commands). They call this approach Direct Corpus Interaction (DCI). The big idea: when an agent can reason in many steps, giving it a higher‑resolution, hands‑on way to search and verify facts can work better than a fixed “top‑k” search interface.
What questions did the researchers ask?
They focused on a few simple questions:
- Can an AI agent do better research if it can search a document collection directly, instead of always going through a traditional search engine that only shows the top few hits?
- Where do any improvements come from? Is it because the agent finds more relevant documents, or because it uses the documents it finds more precisely?
- How well does this direct-search approach work across different tasks (like answering multi-step questions, ranking documents, and running long research sessions)?
- What are the trade-offs in cost, speed, and performance, especially as the document collection grows bigger?
How did they study it?
To make this easy to understand, think of two ways to find information in a huge library:
- Traditional retrieval: You ask a librarian, “Give me the top 5 books about X.” You only see those. If the clue you need isn’t in them, you might miss it. In tech terms, this is a “top‑k” interface, often powered by a “retriever” that measures “semantic similarity” (how related things are) using embeddings and indexes.
- Direct Corpus Interaction (DCI): You go into the stacks yourself with a flashlight and sticky notes. You skim shelves, search for exact words, chain your searches (like “must mention both ‘report’ and ‘2024’”), peek at a paragraph, then refine your search and keep going. In tech terms, the AI agent uses simple tools like:
  - grep/rg: like “Ctrl+F across all files” or searching with patterns
  - find/ls/glob: to list and locate files and folders
  - head/tail/sed/cat: to peek at or read parts of files
  - tiny scripts: to combine or filter results
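To make the "chain your searches, then peek at a paragraph" loop concrete, here is a minimal Python sketch of the same idea. It is not the paper's agent: the helper names (`files_matching_all`, `peek`), the `*.txt` layout, and the demo corpus are assumptions made for illustration, standing in for `grep -l` chaining and `grep -n -C`.

```python
import re
import tempfile
from pathlib import Path


def files_matching_all(root: Path, patterns: list[str]) -> list[Path]:
    """Return files whose text matches every regex (similar to chaining `grep -l`)."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    hits = []
    for path in sorted(root.rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if all(rx.search(text) for rx in compiled):
            hits.append(path)
    return hits


def peek(path: Path, pattern: str, context: int = 1) -> list[str]:
    """Return matching lines plus `context` neighbouring lines (similar to `grep -n -C`)."""
    rx = re.compile(pattern, re.IGNORECASE)
    lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if rx.search(line):
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    return [f"{i + 1}: {lines[i]}" for i in sorted(keep)]


def build_demo_corpus(root: Path) -> None:
    """Write three tiny hypothetical documents used only for this demo."""
    (root / "a.txt").write_text("Annual report\nPublished in 2024\nRevenue grew.\n", encoding="utf-8")
    (root / "b.txt").write_text("Annual report\nPublished in 2019\n", encoding="utf-8")
    (root / "c.txt").write_text("Meeting notes from 2024\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp)
    build_demo_corpus(corpus)
    # Step 1: conjunction of clues ("must mention both 'report' and '2024'").
    candidates = files_matching_all(corpus, [r"report", r"\b2024\b"])
    candidate_names = [p.name for p in candidates]
    # Step 2: local context check around the hit before trusting it.
    snippet = peek(candidates[0], r"2024", context=1)
```

Only `a.txt` satisfies both clues, and the snippet shows the hit with one line of context on each side. A real agent would issue the equivalent shell commands and decide the next query from what it reads.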
They built two versions of a DCI agent:
- DCI-Agent-Lite: a minimal setup that only uses bash and file reads. It keeps things simple to show the core idea.
- DCI-Agent-CC: a stronger version using a more capable coding agent to better manage long searches.
Because long sessions can produce tons of text, they added basic “context management” so the AI doesn’t get overwhelmed by its own notes:
- Truncation: cut long tool outputs to a safe size
- Compaction: replace old, bulky tool outputs with small placeholders
- Summarization: occasionally summarize older history to free space
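The three policies above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Turn` record, the placeholder strings, and the `summarizer` callback (which in practice would be a model call) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    role: str      # "assistant" or "tool"
    content: str


def truncate(text: str, max_chars: int) -> str:
    """Truncation: cap a single tool output at max_chars characters."""
    if len(text) <= max_chars:
        return text
    dropped = len(text) - max_chars
    return text[:max_chars] + f"\n[... {dropped} characters truncated ...]"


def compact(history: list[Turn], keep_last: int) -> list[Turn]:
    """Compaction: replace all but the newest keep_last tool outputs with a placeholder."""
    tool_positions = [i for i, t in enumerate(history) if t.role == "tool"]
    stale = set(tool_positions[:-keep_last]) if keep_last > 0 else set(tool_positions)
    return [
        Turn(t.role, "[tool output elided]") if i in stale else t
        for i, t in enumerate(history)
    ]


def summarize_older(
    history: list[Turn], keep_last: int, summarizer: Callable[[list[Turn]], str]
) -> list[Turn]:
    """Summarization: fold everything except the newest keep_last turns into one note."""
    split = max(0, len(history) - keep_last)
    older, recent = history[:split], history[split:]
    if not older:
        return list(recent)
    return [Turn("assistant", "[summary] " + summarizer(older))] + recent


history = [
    Turn("assistant", "grep report"),
    Turn("tool", "x" * 50),
    Turn("assistant", "grep 2024"),
    Turn("tool", "hit: a.txt"),
]
short = truncate("abcdefghij", 4)
compacted = compact(history, keep_last=1)
summarized = summarize_older(history, 2, lambda turns: f"{len(turns)} earlier turns")
```

Compaction keeps the sequence of turns intact and only shrinks old tool outputs, while summarization shortens the history itself.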
They then tested DCI against strong traditional retrieval systems on:
- Agentic deep research (BrowseComp-Plus): long, multi-step problems
- Multi-hop question answering: questions that need combining clues from several documents
- Information retrieval ranking benchmarks: ordering documents by relevance
They also created two simple process metrics:
- Coverage: Did the agent surface the right documents at all?
- Localization: Once it found a useful document, did it quickly zoom into the small section that actually contains the answer?
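The paper's exact formulas are not reproduced in this summary, so the following Python sketch is only a rough illustration of the two ideas. `coverage` is the share of gold documents the agent surfaced at all. `localization_proxy` is a hypothetical stand-in, not the authors' metric: it rewards trajectories whose smallest view into each surfaced gold document is a small fraction of that document.

```python
def coverage(gold_ids: set[str], surfaced_ids: set[str]) -> float:
    """Fraction of gold documents that appeared anywhere in the trajectory."""
    if not gold_ids:
        return 0.0
    return len(gold_ids & surfaced_ids) / len(gold_ids)


def localization_proxy(
    doc_lengths: dict[str, int],
    smallest_view: dict[str, int],
    gold_ids: set[str],
) -> float:
    """Hypothetical proxy: average share of each surfaced gold document that the
    agent did NOT need to read (1.0 means a very tight span, 0.0 the whole file).

    doc_lengths: characters per document.
    smallest_view: characters in the smallest span the agent read per document.
    """
    scores = []
    for doc_id in gold_ids:
        if doc_id in smallest_view:
            span = min(smallest_view[doc_id], doc_lengths[doc_id])
            scores.append(1.0 - span / doc_lengths[doc_id])
    return sum(scores) / len(scores) if scores else 0.0


gold = {"d1", "d2"}
surfaced = {"d1", "d3"}
cov = coverage(gold, surfaced)
loc = localization_proxy({"d1": 1000, "d2": 500}, {"d1": 100}, gold)
```

In this toy trajectory the agent surfaced one of two gold documents and, within it, narrowed its reading to a tenth of the file.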
What did they find?
Here are the main results and why they matter:
- Better accuracy at lower or similar cost for complex research
  - On a tough research benchmark (BrowseComp-Plus), switching from a standard retriever to DCI with the same base AI model boosted accuracy from about 69% to 80% and reduced API cost by roughly 30%.
  - A minimal, cheaper DCI setup still stayed very competitive while saving a lot of money.
- Big gains on multi-step QA and ranking
  - On multi-hop question-answering (where you must chain clues), the DCI agents outperformed strong retrieval-based agents by large margins (average accuracy around 83% vs ~52% for the best baseline reported).
  - On ranking tests, the DCI approach also scored much higher on standard measures (like NDCG@10).
- Why it works: higher “resolution” access to evidence
  - The agent didn’t just find more relevant documents; it used the ones it found more effectively. DCI made it easy to:
    - Combine exact clues (e.g., “must include A and B”).
    - Check the local context around a hit (e.g., read the nearby lines to confirm meaning).
    - Chain searches step by step (e.g., grep → peek at lines → grep again with refined terms).
  - In short, DCI improved “localization” (finding the exact passages that matter), so the agent could make progress even if it had only some of the relevant documents.
- A small toolset already goes far
  - Even with just “read + grep” (reading files and exact/pattern search), the agent beat strong retriever baselines by a wide margin. Adding full bash tools helped more, but the core benefit appeared early.
- Clear trade-offs as collections grow
  - DCI shines once the agent reaches at least one useful document. But if the corpus gets much larger, the initial “finding that first anchor” can get slower and costlier. Accuracy dropped and tool calls increased as they doubled and quadrupled the number of documents. So DCI is powerful, but breadth-first exploration becomes expensive at very large scales without extra help.
- Right-sized memory strategies matter
  - The best results came from context policies that “selectively forget” or summarize older details while preserving structure, showing that good housekeeping helps long searches stay focused.
Why this matters
- Rethinking the “search box” for smart agents
  - As AI agents grow better at planning and reasoning, a fixed “top‑k results” pipe can hold them back. This paper argues that we should design the interface to the documents, not just the search model. Giving agents high‑resolution, flexible tools to touch the raw text (DCI) unlocks better step‑by‑step reasoning, verification, and clue-chaining.
- Practical benefits
  - DCI avoids building special indexes or embeddings, which is handy for local, changing collections of files (like company docs or codebases). It’s also cost‑effective and easy to set up with standard tools.
- Balanced view
  - Traditional sparse/dense retrieval is still great for huge, stable datasets when you need fast global search. But for agentic tasks that require careful, multi-step thinking, DCI’s finer‑grained interface can deliver better results.
In short, the paper shows that for AI agents that think in steps, the way they “touch” the information matters a lot. Letting them interact directly with the raw corpus—like a person browsing the shelves with a highlighter—often beats only seeing the top few search results. This shift from “which retriever is best?” to “what interface helps the agent reason best?” could change how we build future research and question‑answering systems.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of what remains missing, uncertain, or unexplored in the paper, phrased to guide actionable follow-up research.
- Scalability to large corpora: DCI accuracy and efficiency degrade sharply as corpus size grows (e.g., −13.6 points at 200K docs and ~37.5% at 400K, with >3× tool calls); methods to retain high “interface resolution” while locating initial anchors efficiently at million–billion document scales remain open.
- Hybrid architectures: How to combine coarse, scalable indexing (sparse/dense) to find first anchors with DCI for fine-grained follow-up, including policies for interface switching, uncertainty-aware handoffs, and learned selection of resolution levels.
- First-anchor problem: Strategies for reliably finding an initial relevant document without exhaustive grep (e.g., lexical expansion, PRF, stratified sampling, progressive narrowing) are not developed or evaluated.
- Generalizable context management: Truncation/compaction/summarization yields non-monotonic effects; principled, adaptive memory policies (e.g., learned compression, external memory, structured state) that generalize across tasks and models are unstudied.
- Model capability dependence: DCI relies on agent planning/execution strength; a systematic study of performance vs. model size/architecture/training (and the minimal capability needed for DCI to outperform retriever pipelines) is missing.
- Efficiency accounting: Reported “cost” emphasizes API charges, not local compute/IO, energy, or throughput; end-to-end resource profiling on commodity vs. server-class hardware is needed to compare DCI and indexing-based pipelines fairly.
- Ranking methodology transparency: Details on how DCI generates ranked lists for NDCG@10 (e.g., scoring, tie-breaking, rerun consistency) are sparse; standardized, reproducible ranking protocols for agentic DCI are needed.
- Interface fairness in comparisons: Baselines use conventional top‑k APIs; comparisons to high-resolution retrieval APIs (boolean/regex filters, fielded queries, span highlighting) are absent, leaving unclear how much gain stems from interface granularity vs. bypassing indexes.
- Localization metric validity: The new “localization” metric depends on implementation choices (snippet extraction, segment size Cseg); its correlation with human-judged evidence quality and robustness to parameterization require validation.
- Heterogeneous formats: DCI is demonstrated on text-like corpora; handling PDFs (layout noise), tables, images, code binaries, and multi-modal documents (OCR/parsing, structural indexing) remains unexplored.
- Multilingual and Unicode robustness: Grep-style matching across scripts, normalization (e.g., diacritics, case-folding), and tokenization differences in multilingual corpora is unaddressed.
- Robustness to noise/adversaries: Sensitivity of DCI to duplicated or misleading lexical patterns, adversarial obfuscation (e.g., spacing, homoglyphs), and extremely frequent terms has not been analyzed; defenses and detection are open.
- Tool expressivity vs. performance frontier: While “read+grep” already helps and bash adds more gains at higher cost, the minimal toolset that achieves near-peak performance (and how to learn or tailor it by domain) is not characterized.
- Long-horizon latency and UX: Trajectories up to 300 turns imply high latency; anytime behaviors, early-exit criteria, caching/replay, and human-in-the-loop controls for practical deployments need design and evaluation.
- Safety and sandboxing: Executing shell commands over local corpora raises risks (resource exhaustion, unintended writes, data exfiltration, PII exposure); standardized sandboxes, quotas, and audit trails for DCI are not specified.
- Distributed/remote storage: Adapting DCI to networked or distributed corpora (latency, bandwidth, partial reads, caching layers) and evolving datasets (change detection, incremental scans) is an open systems question.
- Reproducibility across environments: Sensitivity to OS/shell versions, grep/rg differences, locale, and file encodings is not documented; a canonical toolchain and determinism controls are needed for reproducible results.
- Corpus pre-processing assumptions: Although “no indexing” is claimed, practical success may depend on text normalization (JSONL layout, line breaks, deduplication); how pre-processing choices affect DCI efficacy is not studied.
- Benchmark breadth and realism: Evaluations focus on BrowseComp‑Plus, 2018 Wikipedia, and selected BRIGHT/BEIR tasks; generality to real web browsing, enterprise corpora, and larger/updated datasets remains untested.
- Data contamination controls: Using strong general LLMs raises memorization concerns (especially on Wikipedia); decontamination and leakage checks comparable across DCI and retriever baselines are not reported.
- Learning to use tools: The work relies on prompting; training (e.g., RL, imitation learning) to optimize DCI tool policies, improve sample efficiency, and reduce tool calls is a promising but unexplored direction.
- Adaptive interface resolution: Formal mechanisms for agents to modulate evidence granularity (document vs. span) and reason about cost/benefit of higher resolution at each step are not developed.
- Error taxonomy: Beyond coverage vs. localization, a fine-grained taxonomy of DCI failure modes (missed anchors, regex misfires, over-narrowing, context drift) and targeted mitigations is missing.
- Privacy and compliance: Direct raw-corpus access complicates governance (GDPR/CCPA, access controls, redaction); frameworks for policy enforcement within DCI workflows are unspecified.
- Caching and reuse: Opportunities for result caching (e.g., grep indices, fingerprinted spans) that preserve “index-free” semantics while accelerating repeated queries are not explored; trade-offs versus building lightweight indices need study.
- Interface generalization: Whether the “retrieval interface resolution” concept extends to other environments (databases with SQL, APIs with filterable endpoints) and how to measure resolution consistently across media is open.
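As one concrete illustration of the multilingual and Unicode robustness gap listed above, the sketch below shows a common way to make literal matching insensitive to case and diacritics before searching. It uses only Python's standard `unicodedata` module; the helper names `fold` and `folded_contains` are assumptions for this example, and the paper does not describe doing this.

```python
import unicodedata


def fold(text: str) -> str:
    """Normalize text for matching: decompose, drop combining marks, casefold."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def folded_contains(haystack: str, needle: str) -> bool:
    """True if needle occurs in haystack after both are folded."""
    return fold(needle) in fold(haystack)


found = folded_contains("Visited the CAFÉ in Zürich", "zurich")
```

Folding can change string length (for example, `ß` becomes `ss`), so character offsets in the folded text do not map one-to-one onto the original. It also does nothing about homoglyphs or script-specific tokenization, which remain open issues.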
Practical Applications
Immediate Applications
The following applications can be deployed today using the paper’s Direct Corpus Interaction (DCI) paradigm and its minimal toolset (bash, grep/rg, find/glob, simple scripts), often without prebuilt indices or embedding models. They are most effective on local, heterogeneous, evolving corpora where exact lexical constraints and span-level verification matter.
- Enterprise document search copilots for compliance and audit
  - Sectors: finance, legal, enterprise IT, government
  - What: An agentic search assistant that scans internal file shares, policies, contracts, and emails exported as text/PDF-to-text to enforce exact clauses, dates, and entity conjunctions (e.g., “Section 7.2” AND “indemnify”) and verify in-file context.
  - Tools/workflow: grep/rg pipelines, find + head/tail/sed for local context; runtime context management (truncation/compaction/summarization) to keep trails concise; export PDFs via pdftotext/OCR when needed.
  - Assumptions/dependencies: Unix-like shell or WSL; secure sandboxing and least-privilege FS access; corpora small-to-mid scale or partitioned; conversion for PDFs/scanned docs.
- E‑discovery and FOIA response triage
  - Sectors: legal, public sector
  - What: Rapid filtering and evidence localization across custodial data to surface responsive documents and pinpoint quotations/spans for review.
  - Tools/workflow: chained grep with exact constraints and regex; hit localization logs to support explainability and chain-of-custody.
  - Assumptions/dependencies: Text normalization of mixed formats; policy-compliant logging; human-in-the-loop validation.
- Software engineering: code localization and patch planning
  - Sectors: software, DevOps
  - What: Agents that find functions, configs, or API uses via exact/regex search; perform local context peeks to craft minimal patches; triage issues.
  - Tools/workflow: rg/grep over repos, file reads, sed/head/tail; integration with tools like OpenHands/Aider or editors; optional test execution harness.
  - Assumptions/dependencies: Read/write sandbox with repo access; test runner or CI for verification; guardrails for code changes.
- DevSecOps scanning for secrets/CVEs/misconfigurations
  - Sectors: cybersecurity, cloud/infra
  - What: Pattern-driven scanning and verification for secrets, vulnerable dependency strings, and policy violations in code and IaC.
  - Tools/workflow: regex libraries + grep pipelines; localized evidence extraction for tickets; aggregation (wc) for counts.
  - Assumptions/dependencies: Up-to-date rule/pattern sets; scope control to manage breadth; integration with SIEM/ticketing.
- PII/PHI discovery and data governance
  - Sectors: healthcare, finance, retail
  - What: On-prem scanning to identify and verify PII/PHI occurrences in data lakes, shared drives, or exports to enforce data minimization.
  - Tools/workflow: regex catalogs for identifiers; PDF-to-text/OCR; span-level logging for remediation.
  - Assumptions/dependencies: Robust pattern sets to minimize false positives; privacy-preserving execution; format conversions.
- Healthcare: on-prem clinical knowledge search under strict privacy
  - Sectors: healthcare IT
  - What: Agentic assistants searching local guidelines, SOPs, and logs to answer multi-hop clinical operations questions without uploading PHI.
  - Tools/workflow: grep/find across local corpora; context management to fit within EHR-side compute limits.
  - Assumptions/dependencies: Air‑gapped or on‑prem deployment; medical compliance; text access (no external API for sensitive data).
- Academic research assistants for literature and notes
  - Sectors: academia, R&D
  - What: Multi-hop literature curation over local PDFs/notes; verify claims/citations by exact string matches and span localization.
  - Tools/workflow: pdftotext + grep/rg; chaining (grep | grep) to combine weak clues; export evidence snippets to notes.
  - Assumptions/dependencies: Text extraction quality; bibliography formats vary; researcher oversight.
- Journalism/OSINT document digging
  - Sectors: media, NGOs
  - What: Rapid, precise digging through leaks/archives with exact names, dates, and entity conjunctions; verification of quotes in context.
  - Tools/workflow: grep pipelines with regex; head/tail for quick peeks; audit logs of queries and hits.
  - Assumptions/dependencies: Mixed-format normalization; legal/ethical review; security hygiene.
- Customer support knowledge-base copilot
  - Sectors: SaaS, IT support
  - What: Agent that searches internal wikis/runbooks with exact product/version constraints and confirms answers from localized snippets.
  - Tools/workflow: grep over Markdown/HTML exports; tool-result truncation to keep costs low; answer templates with evidence spans.
  - Assumptions/dependencies: Up-to-date knowledge base exports; access control to internal docs.
- Edge/offline assistants (field operations)
  - Sectors: energy, manufacturing, defense
  - What: On-device agents that query local manuals/logs in constrained or disconnected environments (ships, rigs, plants).
  - Tools/workflow: preloaded corpora; grep/find; lightweight summarization to manage small contexts.
  - Assumptions/dependencies: Compute-constrained deployment; text-available manuals (OCR if needed); safety sandboxing.
- Cost optimization for RAG deployments
  - Sectors: software platforms
  - What: Replace or complement embedding-based retrieval for local corpora to avoid index build costs and reduce API spend while improving accuracy on multi-hop tasks.
  - Tools/workflow: DCI-Agent-Lite as a tool within existing agent stacks; policy-based switch to DCI when operating on local repositories.
  - Assumptions/dependencies: Workload routing logic; monitoring of latency vs breadth trade-offs.
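Several of the applications above (PII/PHI discovery, e-discovery, secrets scanning) share one pattern: run a catalog of regexes and log each hit at span level so a reviewer can verify it. The Python sketch below illustrates that pattern. The two rules and the in-memory `docs` dictionary are deliberately simplistic assumptions for the demo; production pattern sets need far more care to limit false positives.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    rule: str    # which pattern fired
    doc: str     # document name
    line: int    # 1-based line number
    start: int   # character offset of the match within the line
    end: int     # end offset (exclusive)
    text: str    # the matched span, kept as evidence


# Illustrative rules only; real catalogs are larger and validated.
RULES = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan(docs: dict[str, str]) -> list[Hit]:
    """Apply every rule to every line and record span-level evidence."""
    hits = []
    for name, text in docs.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            for rule, rx in RULES.items():
                for m in rx.finditer(line):
                    hits.append(Hit(rule, name, line_no, m.start(), m.end(), m.group()))
    return hits


docs = {
    "notes.txt": "Contact: jane@example.com\nNo identifiers here.",
    "export.csv": "id,ssn\n7,123-45-6789",
}
hits = scan(docs)
```

Each `Hit` carries enough location detail to reopen the file at the exact span, which is what makes the log useful for remediation or audit.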
Long-Term Applications
These opportunities require further research, scaling strategies, or productization beyond the paper’s current scope (e.g., addressing breadth scaling, safety tooling, and enterprise integrations).
- Hybrid retriever+DCI pipelines for large corpora
  - Sectors: enterprise search, cloud platforms
  - What: Use sparse/dense retrievers for broad recall, then hand off to DCI for high-resolution localization, verification, and multi-hop chaining.
  - Tools/workflow: Budget-aware router selecting retriever vs DCI; progressive narrowing with pass-off points; coverage+localization metrics to monitor.
  - Assumptions/dependencies: Robust orchestration; deduplication of hits; latency/cost SLAs across modalities.
- DCI-aware enterprise knowledge browsers
  - Sectors: enterprise IT, productivity software
  - What: GUIs that visualize matched spans, chain-of-evidence, and bash pipelines; click-to-refine searches; exportable audit trails.
  - Tools/workflow: Instrumented DCI engine; UI for span localization and pipeline editing; evidence notebooks.
  - Assumptions/dependencies: Cross-format rendering; RBAC integration; usability testing for non-technical users.
- Standardized “retrieval interface resolution” evaluation in procurement
  - Sectors: public sector, regulated industries
  - What: Benchmarks and metrics (coverage, localization) used in vendor evaluations to ensure agents can localize and verify evidence, not just retrieve documents.
  - Tools/workflow: Public test suites; logging requirements; compliance checklists for evidence localization.
  - Assumptions/dependencies: Consensus on metrics; reproducible test corpora with gold spans.
- Secure agent sandboxes for shell-based search
  - Sectors: platform engineering, security
  - What: Policy-controlled execution environments that allow grep/rg/find safely (read-only FS, syscall filtering, egress control), with tamper-evident logs.
  - Tools/workflow: OCI containers, seccomp, vfs overlays, read-only mounts; command whitelists; rate limiting and budgets per session.
  - Assumptions/dependencies: Platform investment; alignment with enterprise security teams.
- Scalable “DCI caches” and lightweight indices
  - Sectors: data platforms
  - What: Build trigram- or regex-accelerated inverted structures (e.g., trigram indexes in the style of Google Code Search, n-gram Bloom filters) to keep DCI semantics while improving breadth scaling to tens or hundreds of millions of files.
  - Tools/workflow: Periodic background indexing compatible with exact/regex semantics; file-change watchers; budgeted expansion.
  - Assumptions/dependencies: Storage/compute overhead; update latency vs query freshness trade-offs.
- On-device personal knowledge management (PKM) agents
  - Sectors: consumer productivity, education
  - What: Agents that search and connect notes, PDFs, and course materials entirely on-device for privacy; local span verification for study and research.
  - Tools/workflow: Mobile/desktop apps embedding DCI with OCR and context compression; hybrid summarization for long notes.
  - Assumptions/dependencies: Efficient local LLMs or edge inference; battery/CPU constraints; robust text extraction.
- Autonomous literature-review and evidence-synthesis products
  - Sectors: pharma/biomed, policy think tanks
  - What: Long-horizon agents that ingest local corpora (papers, SOPs, protocols) to build evidence maps, with exact-quote verification and span-linked provenance.
  - Tools/workflow: DCI for span extraction; structured evidence stores; human approval flows; bias/coverage monitoring.
  - Assumptions/dependencies: High-quality PDF-to-text; domain ontologies for entity normalization; expert oversight.
- Compliance/audit copilot with explainable search trails
  - Sectors: finance, healthcare, critical infrastructure
  - What: End-to-end pipeline that produces audit-ready reports linking every claim to localized spans in source documents, with reproducible command traces.
  - Tools/workflow: DCI with signed logs; immutable evidence bundles; “localization score” thresholds for acceptance.
  - Assumptions/dependencies: Legal acceptance of digital chains of evidence; integration with GRC tools.
- OS-level “agentic filesystem” abstractions
  - Sectors: operating systems, developer tools
  - What: Expose DCI as a first-class OS service (FUSE-like) giving agents high-resolution read/query capabilities with quotas and observability.
  - Tools/workflow: Virtual FS that supports query pipelines; kernel/user-space guards; telemetry for coverage/localization.
  - Assumptions/dependencies: OS vendor collaboration; performance engineering; security certification.
- Multimodal DCI (text+OCR+tables+code)
  - Sectors: enterprise data, scientific R&D
  - What: Extend DCI with robust preprocessing to unify text from scans (OCR), tables (CSV/Parquet), and code, enabling cross-format multi-hop reasoning and span verification.
  - Tools/workflow: OCR pipelines, table extractors, code parsers; adapters to present unified text spans to DCI.
  - Assumptions/dependencies: Quality OCR for noisy scans; normalization of encodings; evaluation datasets with span-level golds.
- Procurement and policy guidelines for privacy-first search
  - Sectors: public policy, compliance
  - What: Guidance that prioritizes high-resolution, on-prem retrieval interfaces (like DCI) for sensitive corpora to minimize data exfiltration risks.
  - Tools/workflow: Policy templates; risk assessments comparing index-based external services vs local DCI; auditing standards for span localization.
  - Assumptions/dependencies: Stakeholder buy-in; alignment with data protection regulations.
- Agent toolchain standards (context management, budgets, observability)
  - Sectors: AI platforms
  - What: Libraries that implement the paper’s context-management strategies (truncation, compaction, summarization) and budget-aware tool-calling for long-horizon search.
  - Tools/workflow: Open-source “DCI-Guard” modules; adapters for leading agent frameworks; telemetry dashboards (coverage/localization vs cost/latency).
  - Assumptions/dependencies: Interop across LLM providers; stable APIs for tool-calling.
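The "DCI caches and lightweight indices" item above mentions trigram-accelerated inverted structures. As a rough sketch of how such a structure can keep exact-match semantics, the Python below builds a trigram-to-document map, uses it only to shrink the candidate set, and then verifies every candidate with a real substring check. The function names and the in-memory dictionaries are assumptions for illustration; the paper does not implement this.

```python
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """All overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search_literal(query: str, docs: dict[str, str], index: dict[str, set[str]]) -> list[str]:
    """Exact substring search: filter by trigram postings, then verify."""
    grams = trigrams(query)
    if not grams:
        # Queries shorter than 3 characters cannot use the index: scan everything.
        candidates = set(docs)
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # The index may admit false positives, so always confirm on the raw text.
    return sorted(d for d in candidates if query in docs[d])


docs = {
    "a": "indemnify the supplier",
    "b": "indemnity clause",
    "c": "supply chain",
}
index = build_index(docs)
exact = search_literal("indemnif", docs, index)
prefix = search_literal("suppl", docs, index)
short = search_literal("ty", docs, index)
```

Because the final check runs against the raw text, results match what a plain scan would return; the index only reduces how many documents must be read. This sketch is case-sensitive and handles literals only.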
Notes on feasibility and dependencies (cross-cutting):
- Best-fit regimes: small-to-mid-sized, evolving local corpora; tasks needing exact lexical constraints, conjunctions of weak clues, and span-level verification.
- Scaling constraints: DCI breadth search costs rise with corpus size; hybrid approaches or lightweight indices mitigate this.
- Environment: Requires secure shell access and robust sandboxing; Windows may need WSL or native equivalents (findstr, PowerShell + GNU tools).
- Data prep: Non-text formats need extraction (pdftotext/OCR); multilingual corpora may need encoding normalization and language-aware regex.
- Model capability: Agents need reliable command composition and context handling; smaller models benefit from strict tool-result truncation/compaction to manage context windows.
- Governance: Logging and reproducibility are critical for regulated use; adopt evidence trail standards early.
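The scaling note above points to hybrid approaches: a coarse stage narrows the corpus, and a DCI-style stage verifies exact spans. A minimal Python sketch of that handoff follows. The coarse stage here is a simple term-count ranking used as a stand-in for a sparse or dense retriever, and all names and the toy corpus are assumptions, not the paper's design.

```python
import re
from collections import Counter


def coarse_top_k(query_terms: list[str], docs: dict[str, str], k: int) -> list[str]:
    """Stand-in for a retriever: rank documents by summed query-term counts."""
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(re.findall(r"\w+", text.lower()))
        scores[doc_id] = sum(counts[t.lower()] for t in query_terms)
    ranked = sorted(docs, key=lambda d: (-scores[d], d))
    return [d for d in ranked[:k] if scores[d] > 0]


def verify_spans(doc_ids: list[str], docs: dict[str, str], pattern: str) -> list[tuple]:
    """DCI-style follow-up: exact regex check with span offsets, candidates only."""
    rx = re.compile(pattern)
    found = []
    for doc_id in doc_ids:
        for m in rx.finditer(docs[doc_id]):
            found.append((doc_id, m.start(), m.end(), m.group()))
    return found


docs = {
    "d1": "Section 7.2: The supplier shall indemnify the buyer.",
    "d2": "The buyer may indemnify third parties. See Section 7.3.",
    "d3": "Lunch menu for Tuesday.",
}
candidates = coarse_top_k(["section", "indemnify"], docs, k=2)
evidence = verify_spans(candidates, docs, r"Section 7\.2")
```

The coarse stage cannot tell `7.2` from `7.3`, so both contract excerpts survive it; the exact-match stage then isolates the one span that satisfies the constraint.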
Glossary
- Agentic search: A paradigm where language agents conduct multi-step, plan-and-revise information seeking and reasoning. "To overcome the bottleneck, in this paper, we position direct corpus interaction (DCI) as a new retrieval interface for agentic search."
- BEIR benchmark: A standardized suite of datasets for evaluating information retrieval systems. "two datasets (ArguAna and SciFact) from the BEIR benchmark (Thakur et al., 2021) for ranking evaluation."
- BM25: A classical probabilistic ranking function used in sparse retrieval based on term frequency and document statistics. "Sparse retrieval relies on lexical matching such as BM25 (Robertson et al., 1994)"
- BRIGHT benchmark: A collection of domain-specific IR datasets designed to evaluate ranking performance. "four datasets (Biology, Earth Science, Economics, and Robotics) from the BRIGHT benchmark (Su et al., 2025)"
- BrowseComp-Plus: A benchmark for evaluating deep-research agents over a controlled corpus. "On BrowseComp-Plus, replacing the Qwen3-Embedding-8B retrieval tool with DCI under the same Claude Sonnet 4.6 backbone improves accuracy from 69.0% to 80.0% (+11.0 points) while reducing cost from $1,440 to $1,016 (-29.4%)."
- Command-line interface (CLI): A text-based interface through which agents can invoke system tools (e.g., bash, grep) to manipulate and search files. "Modern agents equipped with command-line interfaces (CLI) have demonstrated a strong ability to resolve complex software engineering tasks"
- Coverage (trajectory-level metric): A process metric indicating whether relevant gold documents were surfaced during a search trajectory. "Coverage measures whether a trajectory surfaces the relevant (gold) documents at all, reflecting broad evidence access."
- Dense retrieval: Retrieval that represents texts as learned dense vectors and uses vector similarity to find relevant items. "dense retrieval performs nearest-neighbor search over learned vectors"
- Direct corpus interaction (DCI): A retrieval interface where the agent searches the raw corpus using general-purpose terminal tools rather than a retriever or index. "In direct corpus interaction (DCI), the agent bypasses any embedding model, vector index, or retrieval API"
- Embedding model: A model that converts text into vector representations for similarity search or downstream tasks. "the agent bypasses any embedding model, vector index, or retrieval API"
- FAISS index: An efficient vector similarity search index (library) commonly used to store and query embeddings. "we additionally use the released FAISS index built from Qwen3-Embedding-8B embeddings (Zhang et al., 2025) as the offline search engine."
- Gold documents: Ground-truth documents identified as relevant/evidence for a question in evaluation. "let D*(q) denote the gold documents for question q"
- Localization (trajectory-level metric): A metric assessing how precisely a trajectory isolates small, usable evidence spans within surfaced gold documents. "Localization measures how efficiently the trajectory narrows to a small, usable evidence span within each surfaced gold document"
- Multi-hop QA: Question answering that requires reasoning across multiple documents or steps to derive an answer. "On multi-hop QA, combining DCI with Claude Code as the command-line interface agent achieves 83.0 average accuracy"
- NDCG@10: Normalized Discounted Cumulative Gain at cutoff 10; a ranking metric evaluating the quality and order of top-10 results. "the same setup reaches 68.5 average NDCG@10, outperforming the best retrieval baseline (Liu et al., 2025) by 21.5 points."
- Nearest-neighbor search: A vector-space search procedure that retrieves items whose embeddings are closest to a query vector. "dense retrieval performs nearest-neighbor search over learned vectors"
- Open-domain question answering: Answering questions using broad, unstructured corpora rather than a closed set of documents. "open-domain question answering (Trivedi et al., 2022; Press et al., 2023)"
- Pareto frontier: The set of solutions that optimally trade off two objectives (e.g., cost and performance) where no point can improve one without worsening the other. "Figure 1: Pareto frontier of performance vs. cost on BrowseComp-Plus, comparing two paradigms:"
- RAG (Retrieval-Augmented Generation): A pipeline where documents are retrieved first and then used to condition the generation of an LLM. "RAG (Guu et al., 2020; Lewis et al., 2020; Borgeaud et al., 2022; Asai et al., 2023; Ram et al., 2023; Gao et al., 2023; Shi et al., 2024) augments LLMs with external knowledge via a retrieve-then-generate pipeline"
- Regular expression (regex): A formal pattern language for matching strings, enabling precise text search operations. "the agent issues tool calls such as grep and rg for exact or regular-expression matches"
- Reranking: Re-ordering an initial set of retrieved candidates using a secondary, often stronger, model to improve final ranking quality. "Subsequent advances, including LLM-based reranking (Sun et al., 2023; Zhuang et al., 2025; Weller et al., 2025) and adaptive RAG (Jeong et al., 2024), improve individual stages but retain the same underlying retrieval structure."
- Sparse retrieval: Retrieval based on lexical overlap and term statistics rather than learned dense vectors. "Sparse retrieval relies on lexical matching such as BM25 (Robertson et al., 1994), while dense retrieval performs nearest-neighbor search over learned vectors"
- Top-k retrieval: A retrieval interface that returns only the k highest-scoring items per query. "exposes a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning."
- Vector index: A data structure enabling efficient similarity search over embedding vectors. "without any embedding model, vector index, or retrieval API"
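For readers unfamiliar with the NDCG@10 entry above, here is a small Python sketch of one common formulation: linear gain with a log2 rank discount, normalized by the best achievable ordering of the judged documents. Benchmarks sometimes use an exponential gain instead, and this summary does not state which variant the paper's evaluation uses, so treat the choice here as an assumption.

```python
import math


def dcg(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_ids: list[str], qrels: dict[str, float], k: int = 10) -> float:
    """NDCG@k for one query; unjudged documents count as relevance 0."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


qrels = {"a": 1, "b": 1}                      # two relevant documents
perfect = ndcg_at_k(["a", "b", "x"], qrels)   # both relevant docs ranked first
partial = ndcg_at_k(["a", "x", "b"], qrels)   # one relevant doc pushed to rank 3
```

Pushing a relevant document from rank 2 to rank 3 lowers the score because its gain is divided by log2(4) instead of log2(3).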