EdgeRAG: Efficient RAG on Edge Devices
- EdgeRAG is a retrieval-augmented generation system that minimizes stored embeddings through aggressive pruning and on-demand generation.
- It leverages adaptive caching and latency-aware pre-computation to meet strict memory and compute constraints on edge devices.
- Empirical evaluations show EdgeRAG maintaining retrieval and generation quality within a 5% margin compared to traditional methods while cutting latency substantially.
EdgeRAG is a retrieval-augmented generation (RAG) system designed for deployment on edge devices with strict limits on memory and computation. It addresses the prohibitive storage and latency overheads of traditional RAG, particularly under the two-level IVF (inverted file) index paradigm, by aggressively pruning stored document embeddings and combining on-demand embedding generation, selective caching, and latency-aware cluster-level pre-computation. EdgeRAG demonstrates substantial reductions in tail latency and memory footprint while keeping retrieval and generation quality within a 5% margin of standard IVF and flat-index systems, as shown on BEIR datasets running on resource-constrained hardware (Seemakhupt et al., 2024).
1. System Architecture and Data Flow
EdgeRAG adopts a multi-stage pipeline that combines offline and online resource-aware index management with a tightly integrated retrieval and generation workflow:
- Indexer (offline): The system parses the corpus into data chunks (≈2 kB per chunk), computes embeddings via a neural model, and clusters them into $n_c$ centroids with $k$-means. For each cluster $c_i$, the total inference latency to generate all of its document embeddings is estimated as
  $$T_{\text{gen}}(c_i) = \frac{\sum_{x \in c_i} \operatorname{tokens}(x)}{R_{\text{emb}}},$$
  where the embedding-model throughput $R_{\text{emb}}$ is measured in tokens/sec. If $T_{\text{gen}}(c_i)$ exceeds the SLO threshold $T_{\text{SLO}}$ (e.g., 1–1.5 s), all embeddings for $c_i$ are pre-computed and stored; otherwise, they are pruned from storage and generated on demand.
- Retriever (online, per query): Queries are embedded, then a first-level centroid search identifies the top-$k$ candidate clusters. For each cluster, document embeddings are fetched (from flash or cache), or generated if absent. A cost-aware least-frequently-used (LFU) policy governs the adaptive cache. The selected embeddings are then ranked by similarity against the query, and the top-$n$ chunks are passed to the generation model.
- Cache Manager: Maintains and adapts an in-memory store of cluster embeddings. Caching and eviction decisions are based on cluster embedding generation cost and access frequency, dynamically tuned via an algorithm that raises or lowers the cache admission threshold $T_{\min}$ to meet latency service-level objectives.
This architecture enables all key BEIR datasets to operate within an 8 GB memory cap on edge hardware (e.g., NVIDIA Jetson Orin Nano), with the full index partitioned into pre-computed, cache-resident, and on-demand cluster groups.
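A minimal sketch of the offline precompute-or-prune decision is given below, assuming scikit-learn's KMeans, per-chunk token counts, and a measured embedding throughput; the constants, helper names, and data layout are illustrative and not the authors' implementation.

```python
# Sketch of EdgeRAG-style offline indexing (illustrative, not the authors' code).
import numpy as np
from sklearn.cluster import KMeans

R_EMB_TOKENS_PER_SEC = 2_000.0   # assumed embedding-model throughput (tokens/sec)
T_SLO_SECONDS = 1.0              # per-cluster generation-latency budget (1-1.5 s in the text)

def build_index(chunk_token_counts, chunk_embeddings, n_clusters=512):
    """Cluster chunk embeddings, then decide per cluster whether to
    pre-compute (store) or prune (regenerate on demand) its embeddings."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(chunk_embeddings)
    index = {"centroids": km.cluster_centers_, "clusters": []}
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # Estimated latency to regenerate every embedding in this cluster.
        t_gen = float(sum(chunk_token_counts[i] for i in members)) / R_EMB_TOKENS_PER_SEC
        precompute = t_gen > T_SLO_SECONDS
        index["clusters"].append({
            "members": members,
            "t_gen": t_gen,
            "precompute": precompute,
            # Store embeddings only for "heavy" clusters; prune the rest.
            "embeddings": chunk_embeddings[members] if precompute else None,
        })
    return index
```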
2. Index Pruning, Embedding Generation, and Pre-Computation Strategy
The core design innovation is the minimization of stored embeddings:
- Embedding Clustering and Pruning: Document embeddings are grouped into $n_c$ clusters via the $k$-means objective
  $$\min_{\mu_1,\dots,\mu_{n_c}} \; \sum_{i=1}^{n_c} \sum_{x \in c_i} \lVert x - \mu_i \rVert^2 .$$
  For each cluster $c_i$, if $T_{\text{gen}}(c_i) > T_{\text{SLO}}$, the cluster's embeddings are pre-computed in the index; otherwise, they are discarded and recreated on-the-fly during queries.
- Latency Modelling: The per-cluster generation latency is explicitly modeled to inform storage vs. recomputation trade-offs. Pre-computed clusters typically account for only 15–25% of the total embedding storage in large-scale benchmarks, reducing the static memory overhead proportionally.
- Adaptive Caching: Remaining cluster embeddings are eligible for caching in DRAM, with cache occupancy typically set to about 7% of system RAM (≈0.5 GB on an 8 GB device). Weighted LFU caching, with weights combining generation cost and access frequency, evicts the least cost-effective clusters as new ones are generated.
3. Retrieval, Caching, and Query Processing Algorithms
The retrieval pipeline is governed by three primary algorithms:
- Offline Selective Index Storage: Clusters exceeding the SLO threshold are flagged for embedding pre-computation. Metadata (centroids, counts, flags) is persistently stored.
- Online Cost-Aware LFU Replacement: When a cluster is accessed, the system checks for its presence in pre-computed storage or the cache; otherwise, it computes the embeddings and optionally inserts them into the cache, evicting the lowest-weighted LFU cluster if necessary (see the cache sketch after this list).
- Dynamic Minimum-Latency Threshold Adaptation: After each query, if a cache miss still leads to sub-threshold latency, $T_{\min}$ increases to encourage caching more expensive clusters; if not, $T_{\min}$ is reduced. A moving average tracks SLO attainment.
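The following is a compact sketch of such a cost-aware LFU cache, under the assumption that each entry's eviction weight is the product of its access frequency and generation cost; the class name, method names, and admission rule are hypothetical.

```python
# Illustrative cost-aware LFU cache for cluster embeddings (hypothetical names).
class ClusterCache:
    def __init__(self, capacity_bytes, t_min=0.2):
        self.capacity = capacity_bytes
        self.used = 0
        self.t_min = t_min          # admission threshold on generation cost (seconds)
        self.entries = {}           # cluster_id -> (embeddings, size, gen_cost, freq)

    def get(self, cluster_id):
        entry = self.entries.get(cluster_id)
        if entry is None:
            return None
        emb, size, cost, freq = entry
        self.entries[cluster_id] = (emb, size, cost, freq + 1)  # bump access frequency
        return emb

    def maybe_insert(self, cluster_id, embeddings, size, gen_cost):
        # Admit only clusters whose regeneration cost exceeds the threshold.
        if gen_cost < self.t_min:
            return
        while self.used + size > self.capacity and self.entries:
            # Evict the entry with the lowest (frequency x generation cost) weight.
            victim = min(self.entries, key=lambda k: self.entries[k][3] * self.entries[k][2])
            self.used -= self.entries[victim][1]
            del self.entries[victim]
        if self.used + size <= self.capacity:
            self.entries[cluster_id] = (embeddings, size, gen_cost, 1)
            self.used += size
```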
Core Algorithmic Steps
| Component | Input/Trigger | Action |
|---|---|---|
| Indexer | Corpus partitions | Compute clusters, pre-compute heavy clusters |
| Retriever | Query | Embed query, search centroids, load/generate cluster embeddings, rank, cache |
| Cache Manager | Cache access/miss | Update counters, cost-aware LFU eviction/insertion |
| Update | Query latency feedback | Adjust cache threshold in adaptive loop |
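To make the online data flow in the table concrete, the sketch below strings together query embedding, first-level centroid search, fetch-or-generate, and second-level ranking; `embed_fn`, the cosine-similarity ranking, and the reuse of the index/cache objects from the earlier sketches are assumptions for illustration.

```python
# Illustrative online retrieval path (builds on the index/cache sketches above).
import numpy as np

def retrieve(query_text, embed_fn, index, cache, chunks, top_k_clusters=3, top_n=10):
    q = embed_fn(query_text)                                     # embed the query
    centroids = index["centroids"]
    sims = centroids @ q / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(q))
    best_clusters = np.argsort(-sims)[:top_k_clusters]           # first-level centroid search

    cand_ids, cand_embs = [], []
    for c in best_clusters:
        cl = index["clusters"][c]
        emb = cl["embeddings"]                                    # pre-computed on flash?
        if emb is None:
            emb = cache.get(c)                                    # cached in DRAM?
            if emb is None:                                       # regenerate on demand
                emb = np.stack([embed_fn(chunks[i]) for i in cl["members"]])
                cache.maybe_insert(c, emb, emb.nbytes, cl["t_gen"])
        cand_ids.extend(cl["members"])
        cand_embs.append(emb)

    embs = np.vstack(cand_embs)                                   # second-level ranking
    scores = embs @ q / (np.linalg.norm(embs, axis=1) * np.linalg.norm(q))
    order = np.argsort(-scores)[:top_n]
    return [chunks[cand_ids[i]] for i in order]
```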
4. Memory Footprint, Latency, and Empirical Evaluation
The memory usage under EdgeRAG is analytically decomposed as:
- Centroids + Cluster Metadata: $M_{\text{meta}} \approx n_c \cdot d \cdot 4$ bytes (plus small per-cluster metadata such as counts and flags) for typical $n_c$ ($512$–$1,024$) and float32 embeddings of dimension $d$.
- Precomputed Clusters: $M_{\text{pre}} \approx p \cdot N \cdot d \cdot 4$ bytes, where $p$ is the precompute fraction and $N$ the total number of chunks.
- Cache: $M_{\text{cache}} \approx f_{\text{cache}} \cdot M_{\text{DRAM}}$, with $f_{\text{cache}}$ around 7% of system RAM.
On BEIR's largest datasets (nq, fever, hotpotqa; millions of chunks and multi-gigabyte embedding stores), EdgeRAG achieves a precompute fraction $p \approx 0.15$–$0.25$ and a cache budget of roughly 0.5 GB, allowing all experiments to run under 8 GB, whereas the baseline IVF index would not fit without thrashing.
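As a back-of-envelope check of this decomposition, the short script below plugs in assumed values for the embedding dimension, chunk count, and precompute fraction; none of these specific numbers are reported figures.

```python
# Back-of-envelope memory estimate (illustrative parameter values, not reported ones).
n_c  = 1_024           # number of centroids
d    = 768             # assumed embedding dimension (float32)
N    = 2_000_000       # assumed number of document chunks
p    = 0.20            # precompute fraction (15-25% in the text)
dram = 8 * 1024**3     # 8 GB device

m_meta  = n_c * d * 4                   # centroids (plus small per-cluster metadata)
m_pre   = p * N * d * 4                 # pre-computed cluster embeddings
m_cache = 0.07 * dram                   # ~7% of system RAM for the adaptive cache

print(f"metadata   {m_meta / 2**20:7.1f} MiB")
print(f"precompute {m_pre / 2**30:7.2f} GiB")
print(f"cache      {m_cache / 2**30:7.2f} GiB")
```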
Performance Metrics:
- Latency: EdgeRAG reduces average retrieval-plus-first-token latency (TTFT) from 850 ms (IVF baseline) to 475 ms, with 95th-percentile latency cut from 2,200 ms to 350 ms. This corresponds to an average speedup of roughly $1.8\times$ on the largest datasets.
- Quality: Recall@10 with EdgeRAG matches the IVF baseline within 1% and is close to the flat index ($0.70$ vs. $0.72$). Generation F1 (judged with GPT-4o) is within 5% of the flat index.
- Memory: Small corpora such as scidocs (3.6 k chunks, 113 MB) fit fully in memory; the multi-gigabyte nq, hotpotqa, and fever corpora require aggressive pruning.
5. Design Trade-Offs and Hyperparameter Sensitivity
Key design parameters control the storage/latency/quality trade-offs:
- Cluster count $n_c$: More clusters mean smaller per-cluster generation latency and fewer pre-computed clusters, but higher centroid-level memory for metadata. Practical regimes are $512$–$1,024$ centroids.
- SLO threshold $T_{\text{SLO}}$: A tighter SLO means more clusters are pre-computed, increasing storage cost. A larger $T_{\text{SLO}}$ pushes more computation on demand, risking latency spikes when many queries hit heavy clusters.
- Cache capacity $M_{\text{cache}}$: Higher cache percentages reduce recomputation and improve latency, but even modest settings (5–10% of DRAM) retain over 80% of reusable clusters in practice and substantially reduce the cache miss rate.
- Dynamic adaptation: It is critical to tune $T_{\min}$ such that expensive clusters are prioritized for caching. Too permissive a setting fills the cache with cheap clusters; too restrictive a setting loses amortization on expensive clusters. Adaptive feedback-loop tuning maintains hit rates with minimal latency increase (a sketch of the update loop follows this list).
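A minimal sketch of this feedback loop, assuming a multiplicative update to $T_{\min}$ and an exponential moving average of SLO attainment; the update rule, class name, and constants are illustrative rather than the paper's exact algorithm.

```python
# Illustrative T_min adaptation loop (update rule and constants are assumptions).
class ThresholdAdapter:
    def __init__(self, t_min=0.2, slo=1.0, alpha=0.1, step=1.1):
        self.t_min, self.slo = t_min, slo
        self.alpha, self.step = alpha, step   # moving-average weight, update factor
        self.slo_attainment = 1.0             # fraction of recent queries meeting the SLO

    def update(self, query_latency, had_cache_miss):
        met = query_latency <= self.slo
        self.slo_attainment = (1 - self.alpha) * self.slo_attainment + self.alpha * met
        if had_cache_miss and met:
            self.t_min *= self.step           # SLO met despite a miss: reserve cache for costlier clusters
        else:
            self.t_min /= self.step           # otherwise admit cheaper clusters to cut misses
        return self.t_min
```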
In practice, EdgeRAG achieves SLO-compliant operation, sub-8 GB total memory, and robust retrieval/generation metrics by combining these trade-offs (Seemakhupt et al., 2024).
6. Comparative Context and Impact
EdgeRAG is distinctly optimized for edge devices with limited DRAM and compute budgets. Unlike offline-pruned IVF indexes or dense full-embedding approaches, EdgeRAG’s three-mode management (pre-compute, prune, adaptive cache) is directly informed by per-cluster generation latency and memory limits, ensuring scalable operation across a wide range of corpus sizes and query patterns on commodity edge hardware.
By achieving a roughly $1.8\times$ speedup in TTFT, cutting 95th-percentile tail latency from 2,200 ms to 350 ms, and keeping accuracy within 5% of full indices, EdgeRAG allows robust retrieval-augmented generation on devices with only several gigabytes of RAM and without relying on continuous cloud access. This enables new deployment scenarios for RAG, such as fully local QA, on-device summarization, or privacy-sensitive knowledge workflows, where mobile or IoT computation is essential and cloud connectivity is unpredictable or prohibitive (Seemakhupt et al., 2024).
A plausible implication is that EdgeRAG’s adaptive index surface—the blend of pre-compute, dynamic cache, and on-demand generation—can serve as a template for future resource-aware document retrieval systems in decentralized or federated settings, especially where storage and latency budgets are non-negotiable.