RAG-CM Cost Model
- RAG-CM is a cost estimation model that quantitatively predicts key metrics for RAG pipelines based on a high-level configuration description.
- It utilizes analytical, ML-based, and empirical profiling methods to balance accuracy, generalization, and practicality in performance prediction.
- By integrating with RAG-Stack, RAG-CM enables systematic quality–performance co-optimization and efficient exploration of diverse configuration spaces.
Retrieval-Augmented Generation Cost Model (RAG-CM) is the cost estimation pillar within the RAG-Stack architecture, designed to quantitatively predict performance metrics for complete RAG pipelines given a high-level configuration description. RAG-CM abstracts the complex interplay between retrieval algorithms, LLM components, hardware back-ends, and their configuration “knobs,” providing an essential foundation for quality–performance co-optimization as proposed in the RAG-Stack blueprint (Jiang, 23 Oct 2025).
1. Role and Purpose of RAG-CM
RAG-CM functions as the system performance estimator for a RAG workload configuration, which is specified through the RAG-IR—the intermediate representation encoding the complete dataflow and all key algorithmic and systems parameters. For any supplied RAG-IR and resource description, RAG-CM predicts key metrics such as:
- Time to First Token (TTFT)
- Time per Output Token (TPOT)
- Requests per Second (RPS)
- Cost per Request
These predictions bridge the gap between algorithm engineering (choice of retrieval strategy, LLM selection, data chunking, Top-K value, etc.) and the concrete system-level resource consumption, enabling rapid, systematic exploration of the trade-off surface.
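To make the inputs and outputs concrete, the following is a minimal sketch of a RAG-IR-style configuration record and the metrics a cost model would return for it. The field names here are illustrative assumptions, not the paper's actual RAG-IR schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RagConfig:
    """Hypothetical high-level RAG configuration (stand-in for a RAG-IR)."""
    retriever: str     # e.g. "ivf_flat", "hnsw"
    top_k: int         # number of retrieved chunks
    chunk_size: int    # tokens per chunk
    llm: str           # generator model identifier
    batch_size: int
    device: str        # hardware back-end, e.g. "a100"

@dataclass(frozen=True)
class CostEstimate:
    """Metrics a cost model like RAG-CM predicts for one configuration."""
    ttft_ms: float           # time to first token
    tpot_ms: float           # time per output token
    rps: float               # sustained requests per second
    cost_per_request: float  # monetary cost, e.g. USD

# One candidate plan and an (illustrative) predicted cost for it.
cfg = RagConfig("hnsw", top_k=8, chunk_size=512, llm="llama-3-8b",
                batch_size=16, device="a100")
est = CostEstimate(ttft_ms=120.0, tpot_ms=15.0, rps=3.5,
                   cost_per_request=0.002)
```

A plan explorer can then sweep over many `RagConfig` values and compare the resulting `CostEstimate` records without touching hardware.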
2. Modeling Methodologies
RAG-CM can be instantiated with several modeling paradigms, each providing different accuracy–generalization–practicality trade-offs:
| Method Class | Description | Typical Use Case |
|---|---|---|
| Analytical Models | Hand-coded, often roofline-style or queueing models | Quick, interpretable estimations |
| ML-Based Predictors | Machine learning fitted to empirical profiling data | Higher accuracy, needs more data |
| Empirical Profiling | Execution on target hardware with real measurements | High-fidelity, limited scalability |
Analytical Cost Models
A canonical analytical approach, inspired by the roofline model, predicts the total execution time as the maximum of the compute-bound and memory-bound costs:

$$T_{\text{exec}} = \max\!\left(\frac{F}{P_{\text{peak}}},\; \frac{D}{B_{\text{mem}}}\right)$$

where
- $F$: total floating point operations (estimated per node from RAG-IR),
- $P_{\text{peak}}$: hardware peak computational throughput,
- $D$, $B_{\text{mem}}$: data movement metrics (bytes transferred and memory bandwidth) determined from the dataflow.
More elaborate analytical models may use operator-level tables or hybrid formulations to capture hardware- and workload-specific nuances.
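The roofline-style formula above can be sketched in a few lines; the numeric inputs below are made-up per-node estimates, not values from the paper.

```python
def roofline_time(flops, peak_flops, bytes_moved, mem_bandwidth):
    """Roofline-style cost: execution time is bounded by the larger of the
    compute-bound and memory-bound terms (F / P_peak vs. D / B_mem)."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / mem_bandwidth
    return max(compute_time, memory_time)

# Example: a node performing 2e12 FLOPs on a 2e14 FLOP/s device while moving
# 4e10 bytes over a 1e12 B/s memory system is memory-bound:
t = roofline_time(2e12, 2e14, 4e10, 1e12)  # 0.01 s compute vs. 0.04 s memory
```

Summing such per-node estimates over the RAG-IR dataflow graph yields a quick, interpretable end-to-end prediction.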
ML and Empirical Models
- ML-based predictors learn non-linear mappings from configuration features (extracted from RAG-IR) to performance, trained using measured or simulated data.
- Empirical profiling runs the workload configuration on the deployment environment, using the observed value as the cost.
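A minimal sketch of the ML-based route: fit a linear model from configuration features to a measured metric, then query it for unseen configurations. The feature choice (`top_k`, `chunk_size`, `batch_size`) and the training data are invented for illustration; a real predictor would use richer features and non-linear models.

```python
import numpy as np

# Illustrative profiling data: (top_k, chunk_size, batch_size) -> TTFT (ms).
X = np.array([[4, 256, 1], [8, 256, 4], [8, 512, 8], [16, 512, 16]], float)
y = np.array([120.0, 150.0, 210.0, 340.0])

# Append a bias column and solve the least-squares fit.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_ttft(top_k, chunk_size, batch_size):
    """Predicted TTFT (ms) for a configuration not yet profiled."""
    return float(np.array([top_k, chunk_size, batch_size, 1.0]) @ coef)
```

The same pattern generalizes: swap the linear solve for a gradient-boosted or neural regressor as more profiling data accumulates.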
3. Integration with RAG-Stack: IR and PE Interoperability
RAG-CM interfaces bidirectionally within the RAG-Stack paradigm:
- Input: A RAG-IR describing the workload as an annotated dataflow graph, with nodes for each retriever, re-ranker, generator, etc., and all necessary attributes (e.g. chunk sizes, Top-K, sequence lengths, device mappings).
- Output: Performance statistics—either direct measurements or accurate estimates—for that configuration–hardware pair.
These metrics are then consumed by the Plan Exploration (RAG-PE) module, which iteratively probes the configuration space, measuring generation quality and selecting candidate plans along (or near) the empirical Pareto frontier of quality and performance.
This clean abstraction allows RAG system developers to iterate on retrieval and augmentation algorithms without low-level systems tuning, while infrastructure engineers can analyze the impact of hardware or system choices independently.
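The Pareto-selection step that RAG-PE performs over the cost model's outputs can be sketched generically as follows; the plan tuples are hypothetical, and this is a plain dominance filter rather than the paper's actual exploration algorithm.

```python
def pareto_front(plans):
    """Keep plans not dominated on (quality: higher is better,
    cost: lower is better). `plans` is a list of (name, quality, cost)."""
    front = []
    for name, q, c in plans:
        dominated = any(q2 >= q and c2 <= c and (q2 > q or c2 < c)
                        for _, q2, c2 in plans)
        if not dominated:
            front.append((name, q, c))
    return front

candidates = [("A", 0.82, 1.0), ("B", 0.85, 1.4), ("C", 0.80, 1.2)]
# "C" is dominated by "A" (lower quality at higher cost), so only
# "A" and "B" survive as quality-cost trade-off candidates.
```

In RAG-Stack, the `cost` coordinate would come from RAG-CM's predictions and the `quality` coordinate from evaluation on a benchmark set.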
4. Model Application: Practical Scenarios and Case Studies
Prototype case studies within RAG-Stack demonstrate RAG-CM’s applied utility:
- For two database indexing strategies (meeting identical quality/recall levels), RAG-CM accurately reflected divergent execution times due to different compute/memory ratios, as captured by the model’s parametric structure.
- In plan exploration using RAG-PE, RAG-CM’s predictions allow the system to converge on Pareto-optimal quality-performance plans with dramatically fewer real hardware evaluations versus exhaustive search.
Such integration supports system design decisions (e.g., selecting Top-K, index types, batch sizes) and algorithm–hardware co-optimization, which are otherwise infeasible to evaluate exhaustively.
5. Quality–Performance Co-Optimization and Significance
As the cost estimation pillar, RAG-CM decouples generation quality from system performance in RAG pipeline design. By formalizing the mapping from high-level RAG-IR to concrete performance (latency, cost, throughput), it enables:
- Efficient, rapid exploration of large configuration spaces,
- Systematic navigation of quality–cost trade-offs,
- Flexible adaptation to heterogeneous hardware and scaling regimes.
As substantiated by the paper, combining RAG-IR, RAG-CM, and RAG-PE as the three pillars establishes a foundation for repeatable, scalable, and hardware-aware optimization in complex vector database–driven RAG systems.
6. Future Perspectives and Potential Limitations
While RAG-CM is integral to real-world RAG deployment engineering, several challenges and avenues for future work remain:
- Scaling analytical and ML-based models to novel hardware or algorithmic primitives may require ongoing profiling and model updates.
- As configuration spaces grow (e.g., for distributed, hybrid, or multi-modal RAG systems), collecting training data for ML models or maintaining accurate analytical parameterizations can become expensive.
- Extending RAG-CM to robustly support speculative execution, cross-node caching, or highly dynamic retrieval-generation couplings will further generalize its applicability.
Nevertheless, by providing a general-purpose methodology for converting RAG-IR configurations to performance predictions, RAG-CM establishes a core paradigm for RAG system engineering and quality-performance co-optimization (Jiang, 23 Oct 2025).