
RAG-Stack: Retrieval-Augmented Generation Optimization

Updated 24 October 2025
  • RAG-Stack is a comprehensive framework that defines a layered architecture using RAG-IR to decouple quality from performance.
  • It leverages RAG-CM with analytical, ML, and profiling approaches to accurately predict system performance and resource needs.
  • RAG-PE employs iterative, sample-efficient exploration to automatically discover Pareto-optimal configurations for scalable deployments.

Retrieval-Augmented Generation Stack (RAG-Stack) is a comprehensive design and optimization framework for end-to-end RAG systems, aiming to jointly optimize both generation quality and system performance. It addresses the complexity of algorithm- and system-level configuration in large-scale question answering, search, and LLM applications, particularly when driven by vector databases and contemporary LLM backends. The RAG-Stack design is structured around three foundational pillars: an Intermediate Representation (RAG-IR) that abstracts the interplay between algorithmic and systems choices, a Cost Model (RAG-CM) that predicts system behavior and performance from this abstraction, and a Plan Exploration (RAG-PE) algorithm for discovering Pareto-optimal RAG configurations. This three-pillar blueprint enables rigorous quality–performance co-optimization, facilitating systematic, reproducible, and hardware-adaptive deployment of RAG architectures (Jiang, 23 Oct 2025).

1. RAG-IR: Intermediate Representation Layer

The RAG-IR (Intermediate Representation) serves as an abstraction layer that decouples quality and performance aspects within a modern RAG system. RAG-IR represents a system’s configuration as dataflow graphs, in which each node corresponds to a model (e.g., LLMs, query rewriters, rerankers) or a retrieval/database operator. Each node is annotated only with attributes essential for performance modeling:

  • Database nodes: Number of documents/rows, vector dimensionality, Top-K, retrieval metrics such as recall, and index configuration.
  • Model nodes: Model architecture, number of parameters, input/output sequence lengths, and features such as KV-cache reuse.

Edges encode data movement between computational components and specify the volume of information, such as token counts or vector transfer, crucial for estimating system throughput and latency.
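
To make the abstraction concrete, the following is a minimal sketch of how such an IR graph might be represented in Python; the node/edge schema and all field names are illustrative assumptions, as the paper does not prescribe a concrete data structure:

```python
from dataclasses import dataclass, field

# Hypothetical RAG-IR node/edge types; the fields mirror the annotations
# described above, but the concrete schema is an illustrative assumption.

@dataclass
class DatabaseNode:
    name: str
    num_rows: int                 # number of documents/vectors
    dim: int                      # vector dimensionality
    top_k: int                    # neighbors retrieved per query
    recall: float                 # configured retrieval recall
    index: str                    # index configuration, e.g. "HNSW"

@dataclass
class ModelNode:
    name: str
    num_params: float             # parameter count
    seq_in: int                   # input sequence length (tokens)
    seq_out: int                  # output sequence length (tokens)
    arch: str = "decoder-only"    # model architecture family
    kv_cache_reuse: bool = False

@dataclass
class Edge:
    src: str
    dst: str
    tokens: int = 0               # token volume moved along this edge
    vectors: int = 0              # vector volume moved along this edge

@dataclass
class RagIR:
    nodes: dict = field(default_factory=dict)   # name -> node
    edges: list = field(default_factory=list)

# Example: query rewriter -> vector DB -> generator LLM.
ir = RagIR()
ir.nodes["rewriter"] = ModelNode("rewriter", num_params=1e9, seq_in=128, seq_out=32)
ir.nodes["vdb"] = DatabaseNode("vdb", num_rows=10_000_000, dim=768,
                               top_k=20, recall=0.95, index="HNSW")
ir.nodes["llm"] = ModelNode("llm", num_params=7e9, seq_in=4096, seq_out=256,
                            kv_cache_reuse=True)
ir.edges += [Edge("rewriter", "vdb", tokens=32), Edge("vdb", "llm", vectors=20)]
```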

Notably, RAG-IR intentionally omits properties that influence only model quality (like training data identities or model version strings), focusing on those relevant for performance. This design permits isolation of quality-tuning from hardware- and configuration-based optimizations. Mathematically, an algorithm configuration $a$ is mapped to an abstraction $\operatorname{IR}(a)$, which in turn is provided to the cost model for performance estimation:

$$\hat{p} = \text{perf}(\operatorname{IR}(a), H)$$

where $H$ denotes the available hardware/software stack.

The RAG-IR abstraction is pivotal because it allows performance modeling, resource allocation, and further system-level decision making to proceed independently of the details of the LLMs or tokenization schemes, as long as the relevant computational resource features (e.g., sequence length, model size) are captured.

2. RAG-CM: Cost Model for Performance Estimation

RAG-CM is the system's cost model, tasked with estimating system performance for a given RAG-IR. It predicts exact or approximate values for latency, throughput, and other hardware-level metrics, such as Time-To-First-Token (TTFT), Time-Per-Output-Token (TPOT), or requests per second.

RAG-CM can be constructed via several methods:

  • Analytical modeling: Extensions of the roofline model or operator-level cost decomposition provide first-principles estimates of compute and memory bottlenecks, such as

$$\text{Runtime} = \max\left(\frac{\text{FLOP count}}{\text{Peak FLOPS}},\ \frac{\text{bytes transferred}}{\text{Memory BW}}\right)$$

together with more detailed per-pipeline-component breakdowns that incorporate batch sizes and hardware-specific parameters (a minimal roofline sketch follows this list).

  • ML-based models: Supervised regressors or neural predictors are trained on profiling data collected from the deployment platform, offering higher accuracy for complex, black-box model+database compositions at the expense of interpretability.
  • Profiling-based models: Empirically measured performance curves from the target stack are mapped back onto IR configurations; feasible for moderately sized configuration or system design spaces.
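
As a concrete illustration of the analytical route, the sketch below applies the roofline bound above to a single model node, estimating TTFT and TPOT from first principles; the hardware figures and the ~2·params FLOPs-per-token approximation are assumptions for illustration, not values from the paper:

```python
# Minimal roofline-style cost sketch for one model node of the IR.
# Hardware figures and the 2 * params FLOPs-per-token rule of thumb
# are illustrative assumptions, not numbers from the RAG-Stack paper.

PEAK_FLOPS = 300e12     # assumed ~300 TFLOP/s accelerator peak
MEM_BW = 2.0e12         # assumed ~2 TB/s memory bandwidth
BYTES_PER_PARAM = 2     # fp16 weights

def roofline_runtime(flops: float, bytes_moved: float) -> float:
    """Runtime = max(compute-bound time, memory-bound time)."""
    return max(flops / PEAK_FLOPS, bytes_moved / MEM_BW)

def estimate_ttft_tpot(num_params: float, seq_in: int, batch: int = 1):
    """Crude TTFT/TPOT estimates for a decoder-only model node."""
    # Prefill: ~2 * params FLOPs per input token; weights read once.
    prefill_flops = 2 * num_params * seq_in * batch
    weight_bytes = num_params * BYTES_PER_PARAM
    ttft = roofline_runtime(prefill_flops, weight_bytes)
    # Decode: one token per step; weight traffic dominates the memory side.
    decode_flops = 2 * num_params * batch
    tpot = roofline_runtime(decode_flops, weight_bytes)
    return ttft, tpot

ttft, tpot = estimate_ttft_tpot(7e9, seq_in=4096)
print(f"TTFT ~ {ttft * 1e3:.1f} ms, TPOT ~ {tpot * 1e3:.2f} ms/token")
```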

RAG-CM thus enables efficient performance prediction without requiring full-system deployment for each configuration under consideration, a property critical for rapid architecture search and system tuning.
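
For the ML-based route, this prediction-without-deployment property can be illustrated with a small regressor trained on profiling data; both the feature set and the training data below are synthetic stand-ins (the paper does not fix a particular learner), assuming scikit-learn is available:

```python
# Illustrative ML-based cost model: regress latency from IR features.
# The synthetic data below stands in for real profiling measurements.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Features per profiled configuration: [params, seq_in, seq_out, top_k, batch].
X = rng.uniform([1e9, 128, 16, 1, 1], [70e9, 8192, 1024, 100, 64], size=(500, 5))
# Synthetic "measured" latency with noise, standing in for real profiles.
y = 2 * X[:, 0] * X[:, 1] / 3e14 + 0.007 * X[:, 2] + 1e-4 * X[:, 3]
y += rng.normal(0, 0.01, size=len(y))

cost_model = GradientBoostingRegressor().fit(X, y)

# Predict latency for an unseen configuration without deploying it.
candidate = np.array([[7e9, 4096, 256, 20, 8]])
print(f"predicted latency ~ {cost_model.predict(candidate)[0]:.3f} s")
```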

3. RAG-PE: Plan Exploration Algorithm

RAG-PE, the plan exploration algorithm, automates search over the vast RAG algorithm–system design space to find configurations that offer optimal trade-offs between generation quality and system performance (i.e., Pareto optimality). It is built as an iterative, sample-efficient planner that:

  1. Initializes with a plausible starting configuration $a_0$, possibly informed by expert knowledge or simple heuristics.
  2. For each candidate configuration $a_i$:
    • Measures or predicts the generation quality $q_i$ (e.g., accuracy, pass@k, F1) by running or simulating the system.
    • Predicts performance $\hat{p}_i$ via the RAG-CM and IR.
    • Adds $(q_i, \hat{p}_i)$ to the Pareto set if it is not dominated.
  3. Proposes a new configuration $a_{i+1}$ by leveraging the history $\{(a_j, q_j, \hat{p}_j) : 0 \leq j \leq i\}$, using grid search, Bayesian optimization, reinforcement learning, or other policy search techniques, with a view toward minimizing expensive reconfigurations (such as index re-builds or large-scale vector re-encoding).

This loop ensures that exploration quickly homes in on promising (quality, performance) regions without exhaustively enumerating all possible systems. Plan exploration is facilitated by the RAG-IR abstraction, which ensures that diverse configurations can be evaluated efficiently in a consistent format.
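
A minimal sketch of this exploration loop in Python follows; the quality evaluator, cost model, and proposal policy are hypothetical stand-ins (the paper permits grid search, Bayesian optimization, RL, etc.), and performance is treated here as a higher-is-better score such as throughput:

```python
# Skeleton of a RAG-PE-style exploration loop. `eval_quality`,
# `cost_model`, and `propose` are illustrative stand-ins, not
# interfaces prescribed by the paper.
import random

def dominates(a, b):
    """True if point a = (quality, perf) dominates point b (higher is better)."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b

def pareto_update(front, point):
    """Add point to the front unless dominated; drop points it dominates."""
    if any(dominates(p, point) for p in front):
        return front
    return [p for p in front if not dominates(point, p)] + [point]

def explore(init_cfg, eval_quality, cost_model, propose, budget=50):
    history, front = [], []
    cfg = init_cfg
    for _ in range(budget):
        q = eval_quality(cfg)      # run or simulate the system for quality q_i
        p = cost_model(cfg)        # RAG-CM prediction p_i via the IR
        history.append((cfg, q, p))
        front = pareto_update(front, (q, p))
        cfg = propose(history)     # next candidate a_{i+1} from the history
    return front

# Toy usage over a single knob (Top-K) with synthetic quality/perf curves.
random.seed(0)
front = explore(
    init_cfg={"top_k": 10},
    eval_quality=lambda c: 1 - 1 / (1 + c["top_k"]),      # quality saturates with K
    cost_model=lambda c: 1 / (1 + 0.05 * c["top_k"]),     # perf degrades with K
    propose=lambda h: {"top_k": random.randint(1, 100)},  # random search proposal
)
print(sorted(front))
```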

4. Quality–Performance Co-Optimization Challenges

RAG systems are characterized by a multiplicity of "knobs" at both the algorithmic and systems levels. Crucial configuration decisions (such as Top-K in nearest-neighbor search, chunk size, use of speculative retrieval or hybrid indices, model size, and KV-cache strategies) often exert complex, non-monotonic effects on both performance and generation quality.

Specific scenarios highlighted include:

  • Two retrieval strategies may achieve similar recall (quality) yet incur vastly different computation or memory costs.
  • Lowering Top-K may accelerate vector search but increase the risk of missing critical context.
  • Speculative retrieval and KV-cache reuse strongly accelerate LLM inference but may degrade answer quality through reduced evidence integration or exposure to stale context.
  • Hardware–software co-design factors, such as GPU–CPU batch scheduling interaction, affect not only absolute system throughput but also queueing and token output variance.

RAG-Stack's separation-of-concerns via RAG-IR decouples these cross-cutting effects and enables efficient tradeoff search via RAG-PE and accurate performance estimation (RAG-CM). This modularization facilitates systematic adaptation to changes in application, deployment hardware, or retrieval algorithm.

5. Broader Implications and Research Directions

The RAG-Stack blueprint is described as likely to become the de facto paradigm for quality–performance co-optimization in production and research RAG deployments. As vector databases, LLMs, and distributed computation infrastructure advance rapidly, challenges in scaling, generalization, and adaptability require blueprints that:

  • Support diverse, evolving IR abstractions to model emerging index types (dense, sparse, hybrid), specialized hardware (CPU/GPU/FPGAs), and new co-processing strategies.
  • Offer cost models (RAG-CM) extensible to distributed, multi-tier deployments (e.g., edge/cloud mixes, as discussed in related work).
  • Incorporate plan exploration (RAG-PE) methods that minimize the cost of disruptive system-level actions (such as index rebuilding) while maintaining a continuously updated Pareto optimal configuration set.
  • Account for complex interactions between quality-centric configurations (e.g., prompt templates, rerankers, hybrid retrieval) and hardware–software optimizations.

Open research challenges include modeling database index configuration within RAG-IR, efficient cost-model learning under nonstationary hardware and application contexts, and integrating dynamic runtime adaptation mechanisms into the plan exploration process.

Future RAG-Stack evolution will revolve around automating end-to-end RAG design—including quality profiling, system deployment, and operational feedback loops—in a way that enables principled, hardware- and domain-agnostic optimization.

6. Summary Table of RAG-Stack Pillars

| Pillar | Role | Key Mechanisms/Concepts |
| --- | --- | --- |
| RAG-IR | Abstraction & decoupling | Dataflow graphs; node annotations (compute/retriever ops, shape, chunking, Top-K, etc.) |
| RAG-CM | System performance estimation | Analytical, ML-based, or profiling-based models referencing IR and hardware/software properties |
| RAG-PE | Configuration search/optimization | Iterative plan exploration; Pareto frontier inference using IR and CM evaluations |

This triad enables systematic, scalable, and adaptive quality-performance optimization for retrieval-augmented language generation pipelines (Jiang, 23 Oct 2025).

