RAG Architecture Search
- RAG Architecture Search is the systematic optimization of modular retrieval-augmented generation systems, integrating processes like query rewriting, chunking, retrieval, and generation.
- It encompasses diverse optimization techniques—including black-box, genetic, and declarative approaches—to jointly balance accuracy, latency, token budgets, and reproducibility.
- The topic addresses challenges such as adaptive retrieval strategies, structured pipeline design, and production constraints, highlighting actionable insights for scalable QA systems.
RAG architecture search is the systematic study of how a retrieval-augmented generation system should be composed, parameterized, and optimized across retrieval, reranking, context construction, and generation, rather than treated as a fixed retrieve-then-generate template. In the recent literature, this problem is formulated variously as black-box optimization over pipeline configurations, declarative composition of typed IR transformations, redesign of retrieval into staged or hierarchical procedures, and extension of the search space to include governance, data-layer, and deployment constraints. The common premise is that end-to-end RAG quality depends on interacting architectural choices, so optimizing modules in isolation is brittle and often unreproducible (Chen et al., 28 May 2026, Kartal et al., 3 Nov 2025, Macdonald et al., 12 Jun 2025).
1. Problem formulation and conceptual scope
A central shift in the field is the explicit treatment of RAG design as an architecture search problem. RAISE defines a modular search space and optimizes a configuration by maximizing the average evaluation score of the full pipeline over a dataset. In this formulation, query rewriting, chunking, retrieval depth, reranking, pruning, and generation are all first-class search variables rather than fixed implementation details (Chen et al., 28 May 2026). RAGSmith makes the same point from a different angle: it defines a scalar fitness in which retrieval and generation quality are weighted equally, thereby rejecting greedy per-module selection in favor of joint end-to-end optimization (Kartal et al., 3 Nov 2025).
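The formulation above can be made concrete with a small sketch. The module names follow the text, but every option value, the `toy_pipeline` scorer, and the function names are assumptions made for illustration; they are not taken from RAISE or RAGSmith.

```python
import itertools
from statistics import mean

# Hypothetical search space: module names follow the text, option values are invented.
SEARCH_SPACE = {
    "rewrite": [False, True],
    "chunk_size": [256, 512, 1024],
    "retrieve_k": [5, 10, 20],
    "rerank": [False, True],
    "prune": [False, True],
    "generator": ["small", "large"],
}

def all_configs(space):
    """Enumerate every configuration in the Cartesian product of module options."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def objective(config, dataset, run_pipeline):
    """Average end-to-end score of one configuration over the dataset."""
    return mean(run_pipeline(config, example) for example in dataset)

def equal_weight_fitness(retrieval_score, generation_score):
    """Scalar fitness that weights retrieval and generation quality equally."""
    return 0.5 * retrieval_score + 0.5 * generation_score

def toy_pipeline(config, example):
    """Stand-in scorer with an invented interaction: reranking only helps when k >= 10."""
    retrieval = 0.5 + (0.2 if config["rerank"] and config["retrieve_k"] >= 10 else 0.0)
    generation = 0.6 if config["generator"] == "large" else 0.4
    return equal_weight_fitness(retrieval, generation)

dataset = range(4)  # placeholder examples; the toy scorer ignores their content
best = max(all_configs(SEARCH_SPACE), key=lambda c: objective(c, dataset, toy_pipeline))
```

Because the toy scorer couples `rerank` with `retrieve_k`, tuning either module in isolation would miss the best configuration, which is the kind of interaction that motivates joint search.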
This reframing broadens the meaning of “search” in RAG. In the review literature, the architecture space is organized not only by retriever class, but also by query expansion, data organization, retriever and reranker choice, generation strategy, tool use, and multimodal extensions. Under this view, architecture search includes decisions such as whether to rewrite the query, whether to retrieve sparsely, densely, or hybridly, how to structure the corpus as chunks, hierarchies, graphs, tables, or HTML, and whether retrieval should be single-shot, iterative, or conditionally invoked at all (Wang et al., 12 Oct 2025).
A plausible implication is that RAG architecture search should be understood less as a narrow hyperparameter problem and more as a structured design problem over interacting subsystems. The cited work is consistent in arguing that the architecture determines not only answer quality, but also token budget, latency, reproducibility, observability, and safety.
2. Search frameworks, declarative composition, and experimental substrates
Three lines of work make the search problem operational. PyTerrier-RAG casts RAG as a declarative IR pipeline over typed relations. Its main RAG flow is formalized as a retriever, an optional reranker, a concatenation or context-building step, and a reader. The operator notation >>, +, |, and % makes architectural variation explicit: sequential composition, score fusion, set union, and rank cutoff can be changed by modifying a single pipeline expression rather than imperative orchestration code (Macdonald et al., 12 Jun 2025).
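To illustrate how operator composition turns an architectural change into a one-line edit, here is a toy sketch in plain Python. It imitates the four operators named above with a hypothetical `Stage` class; it is not PyTerrier's implementation, and the fusion and union semantics are simplified assumptions.

```python
class Stage:
    """Toy pipeline stage: maps a query (and optional upstream results) to a list
    of (doc_id, score) pairs. Illustrative only; this is not PyTerrier itself."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, query, results=None):
        return self.fn(query, results)

    def __rshift__(self, other):   # a >> b : feed a's output into b
        return Stage(lambda q, r=None: other(q, self(q, r)))

    def __add__(self, other):      # a + b : sum the scores each document receives
        def fused(q, r=None):
            totals = {}
            for doc, score in self(q, r) + other(q, r):
                totals[doc] = totals.get(doc, 0.0) + score
            return sorted(totals.items(), key=lambda x: -x[1])
        return Stage(fused)

    def __or__(self, other):       # a | b : set union, keeping the first-seen score
        def union(q, r=None):
            seen = dict(self(q, r))
            for doc, score in other(q, r):
                seen.setdefault(doc, score)
            return list(seen.items())
        return Stage(union)

    def __mod__(self, k):          # a % k : keep the k highest-scoring results
        return Stage(lambda q, r=None: sorted(self(q, r), key=lambda x: -x[1])[:k])


# Hypothetical components returning fixed results.
sparse = Stage(lambda q, r=None: [("d1", 2.0), ("d2", 1.0)])
dense = Stage(lambda q, r=None: [("d2", 1.5), ("d3", 0.5)])
rerank = Stage(lambda q, r: [(d, s * 2 if d == "d3" else s) for d, s in r])

# Score fusion, rank cutoff at 2, then reranking, in one expression.
pipeline = (sparse + dense) % 2 >> rerank
fused_top2 = pipeline("example query")
union_all = (sparse | dense)("example query")
```

Swapping `+` for `|`, or changing the cutoff, alters the architecture without touching any orchestration code.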
RAISE turns this modularity into a controlled benchmark. It implements 13 search algorithms—Random Search, Greedy Search, Coordinate Descent, Simulated Annealing, Iterative Local Search, TPE, Cross-Entropy Method, Regularized Evolution, Thompson Sampling, UCB, GRPO, Dr. GRPO, and Reinforce++—and evaluates them on seven public text and multimodal datasets with three random seeds, using a budget of 30 configuration evaluations per run (Chen et al., 28 May 2026). RAGSmith, by contrast, searches a larger but explicitly enumerated modular space: nine technique families and 46,080 feasible pipeline configurations, explored with a genetic algorithm that typically evaluates about 100 unique candidates, or about 0.2% of the space (Kartal et al., 3 Nov 2025).
| Framework | Search mechanism | Architectural scope |
|---|---|---|
| RAISE | 13 optimizers under fixed budgets | rewrite, chunk, retrieve, rerank, prune, generate |
| RAGSmith | genetic search over 46,080 feasible configurations | nine technique families, joint retrieval and generation fitness |
| PyTerrier-RAG | declarative operator composition | retrieval, reranking, context construction, generation |
These systems differ in emphasis. PyTerrier-RAG is primarily a compositional substrate for experimentation; RAISE is a benchmark for optimizer comparison under standardized environments; RAGSmith is an end-to-end search framework aimed at selecting a strong pipeline per domain. Together they establish the methodological core of the area: a search space must be explicit, the evaluation budget must be controlled, and pipeline components must be swappable without changing the underlying data model.
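A controlled evaluation budget can be sketched as follows. This is a minimal random-search baseline that stops after a fixed number of unique configuration evaluations; the search space, the `toy_score` function, and the rule that repeated samples do not consume budget are assumptions for illustration, not the RAISE or RAGSmith protocol.

```python
import random

def budgeted_random_search(space, score_fn, budget=30, seed=0):
    """Sample configurations until `budget` unique ones have been evaluated."""
    rng = random.Random(seed)
    keys = sorted(space)
    space_size = 1
    for k in keys:
        space_size *= len(space[k])
    budget = min(budget, space_size)      # cannot exceed the number of configurations
    evaluated = {}
    while len(evaluated) < budget:
        config = tuple(rng.choice(space[k]) for k in keys)
        if config not in evaluated:       # repeated samples do not consume budget
            evaluated[config] = score_fn(dict(zip(keys, config)))
    best = max(evaluated, key=evaluated.get)
    return dict(zip(keys, best)), evaluated[best], len(evaluated)

# Hypothetical space with 3 * 4 * 2 * 3 = 72 configurations.
space = {
    "retriever": ["bm25", "dense", "hybrid"],
    "k": [5, 10, 20, 50],
    "rerank": [False, True],
    "chunk": [256, 512, 1024],
}

def toy_score(cfg):
    """Invented scoring rule standing in for a full pipeline evaluation."""
    score = 0.3 if cfg["retriever"] == "hybrid" else 0.0
    score += 0.2 if cfg["rerank"] else 0.0
    return score - abs(cfg["k"] - 10) / 100

best_config, best_score, n_evaluated = budgeted_random_search(space, toy_score)

# Fraction of the space covered by about 100 evaluations out of 46,080 configurations.
coverage = 100 / 46_080
```

Fixing the seed and the budget makes runs comparable across optimizers, which is the property a benchmark of this kind depends on.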
3. Retrieval-layer architecture as a search object
A substantial body of work argues that retrieval itself should be architecturally redesigned rather than merely accelerated. “Progressive Searching for Retrieval in RAG” replaces brute-force full-dimensional nearest-neighbor retrieval with a multi-stage cascade. Starting from a low-dimensional search over the full 1M-document corpus, it repeatedly doubles the embedding dimensionality while shrinking the candidate set, and finishes with a final 1-NN search at the maximum dimension. On dbpedia-openai-1M-1536-angular, using 2,470 clean query-document pairs and brute-force NearestNeighbors with Euclidean distance, the method matches full-dimensional top-1 accuracy at much lower runtime. For the Alibaba embedding, truncated exact retrieval at 3584 dimensions reached 95.02% top-1 accuracy in 99.36 seconds, whereas progressive retrieval achieved the same 95.02% in 20.63 seconds; at 2048 dimensions, both reached 94.82%, but progressive retrieval took 12.17 seconds versus 57.49 seconds (Jeong et al., 7 Feb 2026).
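The cascade described above can be sketched in plain Python. The schedule here (double the dimensionality and halve the candidate set at each stage) and all parameter values are assumptions for illustration; the paper's exact shrink schedule is not reproduced, and the sketch presumes embeddings whose leading coordinates are informative on their own.

```python
import random

def sq_dist(a, b, dims):
    """Squared Euclidean distance using only the first `dims` coordinates."""
    return sum((x - y) ** 2 for x, y in zip(a[:dims], b[:dims]))

def progressive_search(query, corpus, start_dim=4, max_dim=32, start_keep=64):
    """Return the index of the nearest document via a dimension-doubling cascade.

    Assumed schedule: rank candidates on a prefix of the embedding, keep the best
    `keep`, double the prefix length, halve `keep`, and finish with 1-NN at max_dim.
    """
    candidates = list(range(len(corpus)))
    dims, keep = start_dim, start_keep
    while dims < max_dim and len(candidates) > 1:
        candidates.sort(key=lambda i: sq_dist(query, corpus[i], dims))
        candidates = candidates[:keep]
        dims, keep = min(dims * 2, max_dim), max(keep // 2, 1)
    return min(candidates, key=lambda i: sq_dist(query, corpus[i], max_dim))

# Synthetic data: the query is a lightly perturbed copy of document 17.
rng = random.Random(0)
corpus = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(500)]
query = [x + rng.gauss(0, 0.01) for x in corpus[17]]

hit = progressive_search(query, corpus)
exact = min(range(len(corpus)), key=lambda i: sq_dist(query, corpus[i], 32))
```

Only the first stage touches the whole corpus, and it does so at the lowest dimensionality, so the full-dimensional comparison is paid for a handful of candidates rather than for every document.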
SINR, or Search-Is-Not-Retrieve, attacks a different failure mode: the conflation of semantic matching and contextual assembly. It introduces a dual-layer architecture with fine-grained search chunks, typically around 100–200 tokens, and coarse-grained retrieve chunks, typically around 600–1000 tokens, connected by a deterministic parent map from each search chunk to its enclosing retrieve chunk. The search layer is optimized for semantic specificity, the retrieve layer for contextual sufficiency. The paper reports negligible storage overhead, with the parent mapping adding roughly 2%, and claims prototype reductions of 40–60% in search index size and 20–30% in average query latency relative to flat RAG, with higher context quality and better deduplication (Nainwani et al., 7 Nov 2025).
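A minimal sketch of the dual-layer idea follows. The chunk sizes are toy values far smaller than those quoted above, the `overlap` score stands in for embedding similarity, and all function names are hypothetical rather than taken from the SINR paper.

```python
def build_dual_layer(documents, search_size=3, retrieve_size=9):
    """Split each document into coarse retrieve chunks and fine search chunks.

    Sizes are in whitespace tokens and are toy values chosen for readability.
    """
    retrieve_chunks, search_chunks, parent = [], [], {}
    for doc in documents:
        tokens = doc.split()
        for start in range(0, len(tokens), retrieve_size):
            coarse = tokens[start:start + retrieve_size]
            parent_id = len(retrieve_chunks)
            retrieve_chunks.append(" ".join(coarse))
            for s in range(0, len(coarse), search_size):
                parent[len(search_chunks)] = parent_id   # deterministic parent map
                search_chunks.append(" ".join(coarse[s:s + search_size]))
    return search_chunks, retrieve_chunks, parent

def overlap(query, text):
    """Toy relevance score: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def search_then_retrieve(query, search_chunks, retrieve_chunks, parent, top_k=3):
    """Match on fine chunks, then return their deduplicated parent chunks."""
    ranked = sorted(range(len(search_chunks)),
                    key=lambda i: -overlap(query, search_chunks[i]))[:top_k]
    parent_ids = list(dict.fromkeys(parent[i] for i in ranked))  # order-preserving dedup
    return [retrieve_chunks[p] for p in parent_ids]

docs = ["alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu"]
search_chunks, retrieve_chunks, parent = build_dual_layer(docs)
context = search_then_retrieve("beta epsilon", search_chunks, retrieve_chunks, parent)
```

Three fine-grained hits collapse into a single parent chunk here, which shows how the parent map yields deduplicated context without a separate post-processing step.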
A third line of evidence suggests that exact search quality is often overvalued relative to downstream answer quality. “Toward Optimal Search and Retrieval for RAG” studies dense and multi-vector retrieval with approximate search and shows that lowering ANN search recall has minor implications for QA performance when gold evidence remains present in the prompt. The paper reports that performance typically improves rapidly when moving from 0 retrieved documents to a small number of documents, plateaus around 5–10 documents for Mistral-like readers, and can degrade when too many documents are included. It also shows that reducing search recall@10 from exact to 0.7 only modestly reduces document recall, supporting the practical conclusion that approximate search is often a good engineering trade-off for latency and memory efficiency (Leto et al., 2024).
Taken together, these results shift architecture search toward retrieval staging, granularity separation, and evidence sufficiency, rather than toward exact nearest-neighbor fidelity alone.
4. Agentic, hierarchical, and graph-native search strategies
Agentic RAG expands architecture search from static pipelines to policies over retrieval actions. A-RAG exposes three retrieval tools—keyword search, semantic search, and chunk read—through a hierarchical interface that allows the model to choose among exact lexical matching, sentence-level semantic matching, and full chunk inspection. Its hierarchical index uses chunks of about 1,000 tokens and preserves sentence-to-chunk mappings for progressive evidence gathering. On multiple open-domain QA benchmarks, A-RAG consistently outperforms existing approaches with comparable or lower retrieved tokens; with GPT-5-mini, A-RAG (Full) reports 94.5 / 88.0 on HotpotQA, 89.7 / 88.9 on 2WikiMultiHopQA, 74.1 / 65.3 on MuSiQue, and 85.3 on GraphRAG-Bench (Du et al., 3 Feb 2026).
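The three-tool interface can be pictured with a toy hierarchical index. This is a sketch under simplifying assumptions: token overlap replaces embedding similarity, chunks are two sentences long instead of about 1,000 tokens, and the class and method names are illustrative rather than A-RAG's actual API.

```python
import re

class HierarchicalIndex:
    """Toy two-level index: each sentence points back to the chunk that contains it."""

    def __init__(self, chunks):
        self.chunks = chunks
        self.sentences = []                      # list of (sentence_text, chunk_id)
        for chunk_id, chunk in enumerate(chunks):
            for sentence in re.split(r"(?<=[.!?])\s+", chunk.strip()):
                if sentence:
                    self.sentences.append((sentence, chunk_id))

    def keyword_search(self, keyword):
        """Exact lexical match: ids of chunks whose text contains the keyword."""
        return [i for i, c in enumerate(self.chunks) if keyword.lower() in c.lower()]

    def semantic_search(self, query, top_k=2):
        """Sentence-level match; token overlap stands in for embedding similarity."""
        q = set(query.lower().split())
        scored = [(len(q & set(re.findall(r"\w+", s.lower()))), s, cid)
                  for s, cid in self.sentences]
        scored.sort(key=lambda t: -t[0])
        return [(s, cid) for score, s, cid in scored[:top_k] if score > 0]

    def chunk_read(self, chunk_id):
        """Full inspection of one chunk."""
        return self.chunks[chunk_id]

index = HierarchicalIndex([
    "Paris is the capital of France. It lies on the Seine.",
    "Berlin is the capital of Germany. It lies on the Spree.",
])

# A possible agent trajectory: sentence-level search first, then read the parent chunk.
hits = index.semantic_search("capital of Germany")
evidence = index.chunk_read(hits[0][1])
```

The sentence-to-chunk mapping is what lets a cheap sentence-level hit be escalated to a full chunk read only when the agent decides it needs more context.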
RAG-Gym generalizes this agentic perspective by treating agentic RAG as a nested MDP and optimizing three dimensions: prompt engineering, actor tuning, and critic training. Its ReSearch architecture summarizes retrieved evidence, generates candidate answer reasoning, identifies unsupported claims, and converts those missing claims into targeted retrieval queries. The paper compares Direct, CoT, RAG, ReAct, Search-o1, and ReSearch under SFT, DPO, and PRM supervision, and reports that ReSearch with PRM reaches average EM 54.31 and average F1 62.41, while a trained critic can improve inference by selecting higher-quality intermediate reasoning steps (Xiong et al., 19 Feb 2025).
SHRAG occupies a related but distinct point in the design space. It uses an LLM as a Query Strategist to extract multilingual keywords, generate OR-based Boolean queries, retrieve the top-10 documents for each query, deduplicate them, and rerank them with a multilingual embedding model before structured answer generation. On MIRACL, it reports a Query Success Rate of 94 overall, with 100 for English and 88 for Korean, and finds that OR-only query generation outperforms mixtures with AND (Ryu et al., 30 Nov 2025). “Keyword search is all you need” pushes the simplification further by comparing vector RAG with a ReAct-style agent that uses rga and pdfgrep over raw documents. Across several LlamaHub datasets, the keyword-search agent attains 94.52% faithfulness, 88.05% context recall, and 91.48% answer correctness relative to traditional RAG, without maintaining a standing vector database (Subramanian et al., 19 Dec 2025).
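The SHRAG query-construction and merging steps can be sketched in a few lines. The query syntax, the phrase-quoting rule, and both function names are assumptions for illustration; keyword extraction by the LLM and the reranking stage are omitted.

```python
def build_or_query(keywords):
    """Join keywords into an OR-only Boolean query, quoting multi-word phrases."""
    unique = dict.fromkeys(keywords)                 # drop repeats, keep order
    terms = [f'"{k}"' if " " in k else k for k in unique]
    return " OR ".join(terms)

def merge_results(result_lists, top_k=10):
    """Take the top_k hits of each query's result list and deduplicate by doc id."""
    merged = {}
    for results in result_lists:
        for doc_id, text in results[:top_k]:
            merged.setdefault(doc_id, text)          # first occurrence wins
    return list(merged.items())

query = build_or_query(["retrieval", "검색", "vector database", "retrieval"])
merged = merge_results([
    [("d1", "first"), ("d2", "second")],
    [("d2", "second"), ("d3", "third")],
])
```

An OR-only query trades precision for recall, which leaves the downstream deduplication and reranking steps responsible for narrowing the candidate set.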
Graph-native systems reinterpret architecture search at the representation level. ArchRAG builds attributed communities over a knowledge graph, organizes them with a hierarchical C-HNSW index, and performs adaptive filtering-based generation. It reports best-in-table specific-QA results on Multihop-RAG, HotpotQA, and NarrativeQA, and reduces token usage on HotpotQA from 1,394M tokens for GraphRAG-Global to 5.1M tokens, summarized as about 250× less token usage (Wang et al., 14 Feb 2025). “Graphs RAG at Scale” compares RDF-based and LPG-based Graph RAG for complex semi-structured corpora. On 200 questions over 1104 Capital Group fund records, it reports overall scores of 116 for Agentic RAG, 172.5 for RDF Graph RAG, and 185.5 for LPG Graph RAG, while the LPG text-to-Cypher framework is stated to achieve over 90% accuracy in real-time translation (Tadayon et al., 21 Mar 2026).
These systems show that architecture search can target not only the order of modules, but also the action space available to the model and the representation used for the underlying knowledge.
5. Production, governance, and deployment constraints
In production settings, architecture search extends beyond retrieval effectiveness. “Beyond Similarity Search” argues that the conventional split data layer—vector DB, relational metadata store, cache, and application glue—produces staleness, tenant leakage, and query composition explosion. Its proposed unified PostgreSQL data layer with pgvector and HNSW stores documents, embeddings, metadata, and access policies in one system and executes retrieval with metadata predicates in a single SQL query. On 50,000 documents, the paper reports 92% latency reduction for date-filtered queries, 74% for tenant-scoped queries, zero synchronization inconsistency, 0% cross-tenant leakage versus 0.2% in the split setup, and about 93% less synchronization code (Budigi et al., 5 May 2026).
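A single-query retrieval with metadata predicates might look like the sketch below. The `documents` table and its columns are a hypothetical schema, not the paper's; `<=>` is pgvector's cosine-distance operator, and the `%(name)s` placeholders follow the psycopg parameter style. The sketch only builds the statement and its parameters and does not connect to a database.

```python
# Hypothetical schema: documents(id, tenant_id, created_at, body, embedding vector).
FILTERED_SEARCH_SQL = """
SELECT id, body
FROM documents
WHERE tenant_id = %(tenant_id)s
  AND created_at >= %(since)s
ORDER BY embedding <=> %(query_vec)s::vector
LIMIT %(k)s
"""

def to_pgvector_literal(vector):
    """Format a Python sequence as pgvector text input, e.g. '[0.1,2.0]'."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"

def search_params(query_vec, tenant_id, since, k=5):
    """Bind parameters for FILTERED_SEARCH_SQL."""
    return {
        "query_vec": to_pgvector_literal(query_vec),
        "tenant_id": tenant_id,
        "since": since,
        "k": k,
    }

params = search_params([0.1, 2], tenant_id="acme", since="2026-01-01")
```

With a driver such as psycopg, the pair would be passed as `cursor.execute(FILTERED_SEARCH_SQL, params)`, so tenant scoping, date filtering, and similarity ranking are evaluated by one engine over one consistent snapshot.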
Policy-governed RAG introduces another dimension: ex-ante compliance and auditability. Its architecture is a triptych of Contracts/Control, Manifests/Trails, and Receipts/Verification. A governed route returns one of PROMOTE_FULL, PROMOTE_LITE, or ABSTAIN, while evidence is anchored in Merkle-rooted manifests and final outputs are bound to portable COSE/JOSE receipts. The design specifies targets for the relative reduction in confident-error@t, for latency, and for serving cost, together with proof SLOs including proof_size_p90 < 64 KB and proof_verify_p95 ≤ 200 ms (Ray, 22 Oct 2025). In this setting, architecture search includes policy enforcement, provenance, and replayability, not merely answer quality.
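The routing and manifest ideas can be sketched with the standard library. The three route names come from the text; the confidence thresholds, the rule for pairing an odd node with itself, and the function names are assumptions for illustration, and COSE/JOSE receipt signing is omitted.

```python
import hashlib
from enum import Enum

class Route(Enum):
    PROMOTE_FULL = "PROMOTE_FULL"
    PROMOTE_LITE = "PROMOTE_LITE"
    ABSTAIN = "ABSTAIN"

def merkle_root(leaves):
    """Hex Merkle root over evidence strings; an odd node is paired with itself."""
    if not leaves:
        return hashlib.sha256(b"").hexdigest()
    level = [hashlib.sha256(leaf.encode("utf-8")).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()

def route(confidence, evidence, full_at=0.9, lite_at=0.6):
    """Hypothetical gate: the thresholds are invented, not taken from the paper."""
    if not evidence or confidence < lite_at:
        return Route.ABSTAIN, None
    decision = Route.PROMOTE_FULL if confidence >= full_at else Route.PROMOTE_LITE
    return decision, merkle_root(evidence)

decision, manifest_root = route(0.95, ["passage A", "passage B"])
```

Anchoring the evidence set in a single root means any later change to a passage, or to the order of passages, produces a different root, which is what makes a served answer replayable and auditable.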
Resource constraints similarly induce distinctive architectures. MobileRAG redesigns the classic two-stage RAG pipeline for on-device use with EcoVector, a partitioned disk-backed vector index, and Selective Content Reduction (SCR), which filters irrelevant text before generation. SCR reduces context size from 155 to 90 tokens on SQuAD, from 309 to 287 on HotpotQA, and from 287 to 198 on TriviaQA, while the full system reports TTFT improvements and substantial power savings relative to Naive-RAG, EdgeRAG, and Advanced RAG (Park et al., 1 Jul 2025). Driving-RAG addresses autonomous-driving scenario retrieval with aligned scenario embeddings, HNSW-TSD retrieval, and graph-knowledge reorganization; it reports that 64 dimensions is the best balance of search efficiency and accuracy, and that HNSW-TSD is about an order of magnitude faster with little accuracy loss (Chang et al., 6 Apr 2025). A separate locally deployed Ukrainian QA system uses document-level and page-level hybrid retrieval, cross-encoder reranking, and a 4-bit GGUF-compressed generator, achieving 2nd place in the UNLP 2026 Shared Task with a private test score of 0.942 under a single P100 GPU and a 9-hour runtime limit (Trokhymovych et al., 23 Apr 2026).
This suggests that practical RAG architecture search is inseparable from workload semantics: enterprise isolation, auditability, battery and thermal budgets, and domain-specific grounding requirements all alter what “optimal” means.
6. Evaluation findings, robustness, and unresolved questions
A consistent empirical conclusion is that there is no universal best RAG architecture. RAISE shows that optimizer performance is highly task-dependent: Greedy Search wins HotpotQA and TriviaQA, CEM wins MS MARCO, Random Search wins ScienceQA, GRPO wins SQuAD v2, Regularized Evolution wins LongBench-Qasper, and Simulated Annealing wins LongBench-Multifield. It also reports that Coordinate Descent has the best average rank while not winning any individual dataset, which is used to caution against interpreting aggregate rankings as evidence of universally superior strategies (Chen et al., 28 May 2026).
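The gap between average rank and per-dataset wins is easy to reproduce with invented numbers. The ranks below are hypothetical and are not the RAISE results; they only show how an optimizer can lead on average without winning anywhere.

```python
from statistics import mean

# Hypothetical ranks of four optimizers on three datasets (1 = best).
ranks = {
    "optimizer_a": [1, 4, 3],
    "optimizer_b": [3, 1, 4],
    "optimizer_c": [4, 3, 1],
    "steady":      [2, 2, 2],
}

average_rank = {name: mean(r) for name, r in ranks.items()}
wins = {name: sum(1 for x in r if x == 1) for name, r in ranks.items()}
best_on_average = min(average_rank, key=average_rank.get)
```

Here `steady` has the best average rank with zero wins, so an aggregate ranking alone says little about which optimizer to choose for a given task.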
RAGSmith reaches a similar conclusion at the pipeline level. Across six Wikipedia-derived domains, it reports an average +3.8% improvement over a naive RAG baseline, with domain gains ranging from +1.2% to +6.9%. It identifies a robust backbone—vector retrieval plus reflection/revising post-generation—that appears in the best configuration for all six domains, while passage compression is never selected. Improvement magnitude is also question-type-dependent: interpretation-heavy datasets show smaller gains than factual or long-answer mixes (Kartal et al., 3 Nov 2025).
Robustness under adversarial evidence introduces a further criterion for architecture selection. “Architecture Matters” evaluates vanilla RAG, agentic RAG, MADAM-RAG, and Recursive LLMs under single-document knowledge-base poisoning on 921 Natural Questions pairs. Under CorruptRAG-AK, clean-conditioned attack success rates range from 81.9% for vanilla RAG to 24.4% for RLM, with agentic RAG at 43.8% and MADAM-RAG at 45.5%. The paper further decomposes the CorruptRAG-AK advantage and concludes that, for three of four architectures, adversarial framing at the content-reasoning stage rather than retrieval optimization drives most of the attack’s advantage (Korn, 7 May 2026).
The present literature therefore defines several unresolved questions. Search spaces are now explicit, but their most appropriate granularity remains unsettled. Agentic systems improve adaptability, but they shift failure modes toward reasoning-chain errors. Graph-native systems improve structured retrieval, but depend on graph construction or schema design. Production architectures reduce sync lag and leakage, but concentrate responsibility in a unified substrate. Policy-governed systems add auditability, but at the price of abstention logic and operational complexity. A plausible implication is that future RAG architecture search will be increasingly multi-objective: accuracy, latency, token budget, freshness, access control, contradiction handling, provenance, and energy are all becoming optimization targets rather than secondary constraints.
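One common way to treat several targets at once is to keep only the non-dominated configurations. The sketch below uses three of the objectives named above with invented values; the candidate names and numbers are hypothetical, and Pareto filtering is one possible treatment rather than a method attributed to the cited papers.

```python
def dominates(a, b, maximize=("accuracy",), minimize=("latency_ms", "tokens")):
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    at_least = (all(a[m] >= b[m] for m in maximize)
                and all(a[m] <= b[m] for m in minimize))
    strictly = (any(a[m] > b[m] for m in maximize)
                or any(a[m] < b[m] for m in minimize))
    return at_least and strictly

def pareto_front(candidates):
    """Configurations that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]

# Hypothetical pipeline configurations with invented measurements.
candidates = [
    {"name": "hybrid+rerank", "accuracy": 0.80, "latency_ms": 200, "tokens": 1500},
    {"name": "dense-only",    "accuracy": 0.78, "latency_ms": 120, "tokens": 900},
    {"name": "long-context",  "accuracy": 0.75, "latency_ms": 300, "tokens": 2000},
    {"name": "slow-hybrid",   "accuracy": 0.80, "latency_ms": 250, "tokens": 1500},
]

front = [c["name"] for c in pareto_front(candidates)]
```

The result is a set of trade-offs rather than a single winner, and the final choice depends on which constraint the deployment treats as binding.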