RAGSmith: Modular RAG Pipeline Search

Updated 3 February 2026
  • RAGSmith is a modular optimization framework that formulates RAG pipeline composition as a global architecture search across nine technique families.
  • It employs a steady-state genetic algorithm to navigate 46,080 configurations, capturing non-linear interdependencies among retrieval, ranking, augmentation, prompting, and generation modules.
  • Empirical evaluations on six Wikipedia-derived domains demonstrate significant, domain-aware performance gains and offer practical guidelines for effective RAG deployment.

RAGSmith is a modular optimization framework for Retrieval-Augmented Generation (RAG) that formulates end-to-end RAG pipeline composition as a global architecture search problem across nine distinct technique families, yielding 46,080 possible pipeline configurations. Unlike approaches that optimize RAG modules—retrieval, ranking, augmentation, prompting, and generation—in isolation, RAGSmith performs holistic optimization, allowing discovery of non-linear interdependencies between modules. A steady-state genetic algorithm governs this search, optimizing a unified scalar objective that jointly aggregates retrieval and generation evaluation metrics. Empirical assessment across six Wikipedia-derived domains (Mathematics, Law, Finance, Medicine, Defense Industry, and Computer Science) demonstrates consistent, domain-aware pipeline improvements over naive RAG baselines, with both a robust, cross-domain backbone and domain-specific module selection. The framework offers practical, empirically validated design guidelines for deploying effective RAG systems in varied knowledge domains (Kartal et al., 3 Nov 2025).

1. End-to-End Optimization Paradigm and Motivation

RAGSmith treats the design of Retrieval-Augmented Generation pipelines as an end-to-end, globally optimized problem rather than a mere assemblage of independently tuned components. Traditional RAG efforts focus sequentially on retrieval, reranking, augmentation, prompting, and answer generation, often heuristically combining "best" techniques from each stage. This approach is brittle, as it ignores cross-module interactions: modules that individually perform well may interact suboptimally when composed. RAGSmith mitigates this by defining a modular design space of feasible pipelines, evaluating candidate pipelines end-to-end, and selecting near-optimal configurations via evolutionary search, which captures synergistic module interactions without exhaustively evaluating the full space. This holistic approach is vital because RAG pipeline effectiveness is highly sensitive to both domain properties (e.g., chunk density, passage informativeness) and question typology (factual, interpretation, long-answer).

2. Modular Design Space: Technique Families and Pipeline Enumeration

The configurable design space of RAGSmith comprises nine orthogonal "technique families," each instantiated by choosing a single option (or a no-operation) per pipeline, subject to inter-family compatibility constraints. The families are:

| Technique Family | Example Members (not exhaustive) | Functional Role |
| --- | --- | --- |
| Pre-Embedding | Contextual Chunk Headers, Hypothetical Prompt Emb | Alters query or passage representation |
| Query Expansion | Multi-Query Retrieval, HyDE, Decomposition | Diversifies/reforms retrieval queries |
| Retrieval | Vector, BM25, Hybrid, Graph, Complete Hybrid | Selects candidate chunks from corpus |
| Reranking | Cross-Encoder, LLM Reranker, Hybrid Rerank | Refines initial retrieval pool |
| Passage Filtering | Top-K, Similarity Threshold | Prunes passage set |
| Passage Augmentation | Prev-Next Augmenter, Relevant Segment Extraction | Enriches passage context |
| Passage Compression | Tree-Summarize, LLM-Refining | Reduces content length |
| Prompt Maker | Concatenation, Long-Context Reordering | Constructs LLM input |
| Post-Generation | Self-RAG Reflection & Revision | Post-processes LLM outputs |

Disallowed combinations—such as reranking with fewer than k passages—are filtered, resulting in a feasible search space of exactly 46,080 candidate pipelines.
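The enumeration of feasible pipelines can be sketched as a Cartesian product over family options with a compatibility filter. The family names, option lists, cardinalities, and the sample constraint below are all illustrative assumptions for a toy space of 288 raw combinations; the paper's actual per-family options and constraints yield the reported 46,080.

```python
from itertools import product

# Hypothetical option counts per technique family -- illustrative only;
# the paper's actual per-family cardinalities are not reproduced here.
families = {
    "pre_embedding": ["none", "chunk_headers", "hyp_prompt_emb"],
    "query_expansion": ["none", "multi_query", "hyde", "decomposition"],
    "retrieval": ["vector", "bm25", "hybrid"],
    "reranking": ["none", "cross_encoder", "llm", "hybrid_rerank"],
    "passage_filter": ["top_k", "sim_threshold"],
}

def is_feasible(pipeline):
    """Example compatibility constraint (hypothetical): forbid reranking
    together with a similarity-threshold filter, which may prune the
    candidate pool below the size a reranker needs."""
    return not (pipeline["reranking"] != "none"
                and pipeline["passage_filter"] == "sim_threshold")

names = list(families)
feasible = [
    dict(zip(names, combo))
    for combo in product(*families.values())
    if is_feasible(dict(zip(names, combo)))
]
print(len(feasible))  # 180 of the 3*4*3*4*2 = 288 raw combinations survive
```

The same product-then-filter construction, applied to the paper's nine families and its constraint set, defines the 46,080-pipeline search space.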

3. Genetic Search Methodology

Optimizing over this vast combinatorial space is computationally intractable via brute-force. RAGSmith employs a steady-state genetic algorithm configured as follows:

  • Population: P = 16 pipeline candidates per generation, run for T = 20 generations.
  • Genetic operators: uniform crossover (probability p_c = 0.6) and adaptive mutation (p_m ∈ [0.01, 0.2]) based on population diversity, implemented as gene-index flips.
  • Selection: elitist selection retains the top k = 5 candidates per generation.
  • Termination: the search halts if 100 consecutive pipeline evaluations yield no improvement or an ideal fitness score (F(x) = 1.0) is reached.

Each candidate pipeline x is built by selecting one technique from each family and evaluated in its entirety. Empirically, the algorithm converges in approximately 100 unique pipeline evaluations—about 0.2% of the full configuration space—demonstrating highly efficient exploration.
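A minimal sketch of this steady-state GA is shown below, under stated assumptions: the family cardinalities are hypothetical, the fitness function is a cheap stand-in (the real F(x) runs and scores a full RAG pipeline), and the 100-evaluation stagnation criterion is omitted for brevity. Only the operator structure (uniform crossover at p_c = 0.6, diversity-adaptive mutation in [0.01, 0.2], top-5 elitism, population 16, 20 generations) mirrors the paper's configuration.

```python
import random

random.seed(0)

# One gene per technique family; sizes are illustrative, not the paper's.
FAMILY_SIZES = [3, 4, 3, 4, 2, 3, 2, 2, 2]

def random_genome():
    return [random.randrange(n) for n in FAMILY_SIZES]

def fitness(genome):
    # Stand-in objective; the real F(x) evaluates the assembled pipeline.
    target = [n - 1 for n in FAMILY_SIZES]
    return sum(g == t for g, t in zip(genome, target)) / len(genome)

def uniform_crossover(a, b, p_c=0.6):
    if random.random() > p_c:
        return a[:]
    return [random.choice(pair) for pair in zip(a, b)]

def diversity(pop):
    # Fraction of gene positions not unanimous across the population.
    n = len(FAMILY_SIZES)
    return sum(len({g[i] for g in pop}) > 1 for i in range(n)) / n

def mutate(genome, p_m):
    # Gene-index flip: resample a gene with probability p_m.
    return [random.randrange(FAMILY_SIZES[i]) if random.random() < p_m else g
            for i, g in enumerate(genome)]

def search(pop_size=16, generations=20, elite_k=5):
    pop = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        if fitness(pop[0]) == 1.0:  # ideal-fitness termination
            break
        # Adaptive mutation: low diversity raises the mutation rate.
        p_m = 0.01 + (0.2 - 0.01) * (1.0 - diversity(pop))
        elites = pop[:elite_k]
        children = [mutate(uniform_crossover(*random.sample(elites, 2)), p_m)
                    for _ in range(pop_size - elite_k)]
        pop = elites + children
    return max(pop, key=fitness)

best = search()
print(fitness(best))
```

In the real system each `fitness` call is an expensive end-to-end pipeline evaluation, which is why elitism and early termination matter: they keep the number of unique evaluations near the reported ~100.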

4. Scalar Objective and Evaluation Metrics

Candidate pipelines are evaluated via a scalar objective F(x) ∈ [0, 1] composed of both retrieval and generation metrics, ensuring end-to-end optimization fidelity. Let

  • Recall@k(x): fraction of answer-containing chunks in the top-k retrievals,
  • mAP(x): mean average precision,
  • nDCG@k(x): normalized discounted cumulative gain at k,
  • MRR(x): mean reciprocal rank,
  • LLM-Judge(x): LLM-based output quality score,
  • Semantic(x): cosine similarity between generated-output and ground-truth embeddings.
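The two rank-sensitive retrieval metrics above can be computed as follows; the chunk IDs and relevance sets are made-up illustrations, and ties/edge cases are handled minimally.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the answer-containing chunks found in the top-k retrievals."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant chunk (0 if none is retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical retrieval result: c1 and c9 actually contain the answer.
ranked = ["c3", "c7", "c1", "c9"]
relevant = {"c1", "c9"}
print(recall_at_k(ranked, relevant, k=3))  # 0.5: one of two relevant chunks in top-3
print(mrr(ranked, relevant))               # 1/3: first relevant chunk at rank 3
```

mAP and nDCG@k follow the standard definitions and, like these two, are already normalized to [0, 1] before aggregation.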

Scores are aggregated as:

RetrievalScore(x) = (1/4) (Recall@k(x) + mAP(x) + nDCG@k(x) + MRR(x))

GenerationScore(x) = (1/2) (LLM-Judge(x) + Semantic(x))

F(x) = (1/2) (RetrievalScore(x) + GenerationScore(x))

All metrics are normalized to [0, 1], and F(x) serves as the fitness function guiding the genetic search.
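The aggregation above is a direct two-level average; as a sketch (with made-up metric values):

```python
def scalar_objective(metrics):
    """Aggregate normalized [0, 1] metrics into the scalar objective F(x)."""
    retrieval = (metrics["recall_at_k"] + metrics["map"]
                 + metrics["ndcg_at_k"] + metrics["mrr"]) / 4
    generation = (metrics["llm_judge"] + metrics["semantic"]) / 2
    return (retrieval + generation) / 2

# Illustrative metric values for one candidate pipeline.
example = {"recall_at_k": 0.8, "map": 0.6, "ndcg_at_k": 0.7, "mrr": 0.9,
           "llm_judge": 0.75, "semantic": 0.85}
print(scalar_objective(example))  # 0.775 = (0.75 + 0.8) / 2
```

Because retrieval and generation each contribute half of F(x), the search cannot trade one entirely for the other: a pipeline that retrieves well but generates poorly (or vice versa) is penalized symmetrically.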

5. Empirical Evaluation and Domain Coverage

RAGSmith was assessed on six Wikipedia-derived domains: Mathematics, Law, Finance, Medicine, Defense Industry, and Computer Science. Each domain contained 100 expert-authored questions, manually annotated as factual (24% overall), interpretation (45%), or long-answer (31%), with domain-specific distributions. The document corpora were chunked into approximately 35–51 overlapping 200–512-token passages per article, totaling 2,619 chunks with 490,610 tokens overall.
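Overlapping token-window chunking of this kind can be sketched as below. The stride/overlap value is an assumption for illustration (the paper reports 200–512-token passages but the exact stride is not stated here), and real systems operate on tokenizer output rather than the placeholder token list used in the example.

```python
def chunk_tokens(tokens, size=512, overlap=100):
    """Split a token sequence into overlapping passages of up to `size` tokens.

    `overlap` (and hence the stride) is an illustrative assumption,
    not the paper's exact chunking parameter.
    """
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # final (possibly short) chunk emitted
            break
    return chunks

# Placeholder "tokens" for a 1,200-token article.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print(len(chunks))     # 3 chunks at size 512 with 100-token overlap
print(len(chunks[0]))  # 512
```

The overlap ensures that an answer span straddling a chunk boundary still appears intact in at least one passage, at the cost of some index redundancy.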

Per-pipeline evaluation automatically aggregates retrieval and generation scores for all questions. This framework ensures methodology consistency across diverse domains and question types, supporting robust cross-domain analysis.

6. Quantitative Performance Gains

Relative to a naive RAG baseline (vector retrieval, simple concatenation, no advanced modules), RAGSmith-optimized pipelines yielded positive and interpretable improvements:

  • Overall: Average +3.8% (range +1.2% to +6.9%) improvement across domains.
  • Maximal gains: +12.5% retrieval (Computer Science), +7.5% generation (Mathematics).
  • Per-domain aggregate:

| Domain | Retrieval Δ | Generation Δ | Overall Δ |
| --- | --- | --- | --- |
| Computer Science | +12.5% | +1.8% | +6.9% |
| Mathematics | +5.4% | +4.4% | +5.1% |
| Finance | +7.8% | +1.1% | +4.4% |
| Law | +5.4% | +2.4% | +3.5% |
| Medicine | +1.0% | +4.2% | +1.9% |
| Defense Industry | +1.3% | +1.2% | +1.2% |

Improvement magnitude is inversely correlated with the prevalence of interpretation questions: domains with higher proportions of factual and long-answer questions benefit more, while interpretation-heavy domains (Finance, Law, Medicine, Defense Industry) see smaller gains. This suggests that current RAGSmith-optimized compositions exploit factual recall and long-form synthesis, while interpretive reasoning could require complementary modules.

7. Discovered Backbone and Practical Design Guidance

Across all domains, the genetic search consistently converged to a robust backbone: vector retrieval as the core retrieval method, and post-generation reflection/revision (Self-RAG) for output refinement. Module selection in remaining families showed domain-dependence:

  • Query Expansion: Multi-query for moderate chunk density (< 50 chunks); not selected for high-density domains (Medicine, Defense Industry).
  • Reranking: Cross-encoder for small retrieved sets, hybrid reranking for large candidate pools.
  • Passage Augmentation: Prev-next augmentation for uniform density; adaptive relevant-segment extraction for variable density domains (Law, Defense).
  • Prompt Maker: Long-context reordering used when prompt length exceeded ~1,000 tokens or chunk importance varied.
  • Passage Compression: Never selected—summary or compression steps consistently degraded end-to-end accuracy.

Empirically grounded practical guidelines follow: always instantiate pipelines with the vector retrieval plus reflection/revision core; tailor expansion, reranking, augmentation, and reordering modules to dataset-specific properties (chunk count, token distribution, question types). Passage compression should be avoided except under extreme context length constraints, as it discards necessary context.
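These guidelines can be encoded as a simple selection heuristic. The function below is a sketch: the thresholds (50 chunks, pool size 20, 1,000 tokens) and module names are assumptions drawn loosely from the guidelines above, not rules stated exactly in the paper.

```python
def recommend_modules(chunk_count, density_uniform, avg_prompt_tokens, pool_size):
    """Heuristic module selection encoding the reported design guidelines.

    Thresholds and identifiers are illustrative assumptions.
    """
    pipeline = {
        # Cross-domain backbone found in every domain:
        "retrieval": "vector",
        "post_generation": "self_rag_reflection",
        # Passage compression was never selected in the study:
        "passage_compression": None,
    }
    # Multi-query expansion only for moderate chunk density.
    pipeline["query_expansion"] = "multi_query" if chunk_count < 50 else None
    # Hybrid reranking for large candidate pools, cross-encoder otherwise
    # (pool-size threshold of 20 is an assumed cutoff).
    pipeline["reranking"] = "hybrid" if pool_size > 20 else "cross_encoder"
    # Prev-next augmentation for uniform density, segment extraction otherwise.
    pipeline["passage_augmentation"] = ("prev_next" if density_uniform
                                        else "relevant_segment_extraction")
    # Long-context reordering for long prompts.
    pipeline["prompt_maker"] = ("long_context_reorder"
                                if avg_prompt_tokens > 1000 else "concatenation")
    return pipeline

print(recommend_modules(chunk_count=40, density_uniform=True,
                        avg_prompt_tokens=1500, pool_size=8))
```

A deployment would replace these fixed cutoffs with measured corpus statistics, but the structure (fixed backbone plus data-dependent peripheral modules) is the paper's central practical takeaway.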

In summary, RAGSmith demonstrates that evolutionary full-pipeline optimization, rather than isolated module tuning, produces consistent and interpretable gains in RAG system performance across varying knowledge domains and question types (Kartal et al., 3 Nov 2025).
