EdgeRAG: Efficient RAG on Edge Devices

Updated 7 January 2026
  • EdgeRAG is a retrieval-augmented generation system that minimizes stored embeddings through aggressive pruning and on-demand generation.
  • It leverages adaptive caching and latency-aware pre-computation to meet strict memory and compute constraints on edge devices.
  • Empirical evaluations show that EdgeRAG maintains retrieval and generation quality within 5% of traditional methods while substantially cutting latency.

EdgeRAG is a retrieval-augmented generation (RAG) system designed for deployment on edge devices with strict limits on memory and computation. It addresses the prohibitive storage and latency costs of traditional RAG, particularly under the two-level IVF (inverted file) index paradigm, by aggressively pruning stored document embeddings and combining on-demand embedding generation, selective caching, and latency-aware cluster-level pre-computation. EdgeRAG demonstrates substantial reductions in tail latency and memory footprint while maintaining retrieval and generation quality within a 5% margin of standard IVF and flat-index systems, as shown on BEIR datasets running on resource-constrained hardware (Seemakhupt et al., 2024).

1. System Architecture and Data Flow

EdgeRAG adopts a multi-stage pipeline that combines offline and online resource-aware index management with a tightly integrated retrieval and generation workflow:

  • Indexer (offline): The system parses the corpus into data chunks (≈2 kB per chunk), computes embeddings via a neural model, and clusters these into $K$ centroids with $k$-means. For each cluster $C_j$, the total inference latency to generate all document embeddings is estimated using

$$L_{\mathrm{gen}}(C_j) = \frac{\sum_{x\in C_j} \mathrm{len}(x)}{R_{\mathrm{gen}}}$$

where $R_{\mathrm{gen}}$ is the embedding-generation rate measured in tokens/sec. If $L_{\mathrm{gen}}(C_j)$ exceeds the SLO threshold $\tau$ (e.g., 1–1.5 s), all embeddings for $C_j$ are pre-computed and stored; otherwise, they are pruned from storage and generated on demand.

  • Retriever (online, per query): Queries are embedded, then a first-level centroid search identifies the top-$p$ candidate clusters. For each cluster, document embeddings are fetched (from flash or cache), or generated if absent. A cost-aware least-frequently-used (LFU) policy governs the adaptive cache. The selected embeddings are then scored for similarity against the query, and the top-$k$ are presented to the generation model.
  • Cache Manager: Maintains and adapts an in-memory store of cluster embeddings. Caching and eviction decisions are based on cluster embedding generation cost and access frequency, dynamically tuned via an algorithm that raises or lowers the cache admission threshold, $\tau_{\min}$, to meet latency service-level objectives.

This architecture enables all key BEIR datasets to operate within an 8 GB memory cap on edge hardware (e.g., an Nvidia Jetson Orin Nano), with the full index partitioned into pre-computed, cache-held, and on-demand clusters.
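To make the online data flow concrete, the following Python sketch mirrors the retriever and cache interaction described above. The interfaces are assumptions for illustration only: `index`, `cache`, and `embed` stand in for a centroid store, a cost-aware cache, and an embedding model; they are not part of the paper's released code.

```python
import time
import numpy as np

def retrieve(query_text, index, cache, embed, top_p=4, top_k=5):
    """Illustrative EdgeRAG-style retrieval path (interfaces are assumptions):
    index.centroids       -- (K, d) array of first-level centroids
    index.is_precomputed / index.load -- access to clusters stored at index time
    index.chunks(j)       -- raw text chunks of cluster j
    cache                 -- cost-aware LFU cache keyed by cluster id
    embed                 -- callable mapping text to a (d,) embedding
    """
    q = embed(query_text)

    # First level: rank centroids by cosine similarity and keep the top-p clusters.
    c = index.centroids
    sims = (c @ q) / (np.linalg.norm(c, axis=1) * np.linalg.norm(q) + 1e-9)
    candidates = np.argsort(-sims)[:top_p]

    scored = []
    for j in candidates:
        # Second level: reuse stored/cached embeddings, or regenerate them on demand.
        if index.is_precomputed(j):
            emb = index.load(j)
        elif j in cache:
            emb = cache.get(j)
        else:
            t0 = time.time()
            emb = np.stack([embed(chunk) for chunk in index.chunks(j)])
            cache.maybe_insert(j, emb, gen_cost_s=time.time() - t0)

        doc_sims = (emb @ q) / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q) + 1e-9)
        scored.extend((s, int(j), i) for i, s in enumerate(doc_sims))

    # Hand the top-k chunk references to the generation model.
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(j, i) for _, j, i in scored[:top_k]]
```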

2. Index Pruning, Embedding Generation, and Pre-Computation Strategy

The core design innovation is the minimization of stored embeddings:

  • Embedding Clustering and Pruning: Document embeddings $\{x_i\}$ are grouped into $K$ clusters via the $k$-means objective:

$$\{\mu_j\}_{j=1}^K = \arg\min_{\mu_1,\dots,\mu_K} \sum_{i=1}^N \min_{1\leq j\leq K} \|x_i - \mu_j\|^2$$

For each cluster $C_j$, if $L_{\mathrm{gen}}(C_j) > \tau$, the cluster's embeddings are pre-computed and stored in the index; otherwise, they are discarded and recreated on the fly during queries (a code sketch of this decision follows this list).

  • Latency Modeling: The per-cluster generation latency is explicitly modeled to inform storage vs. recomputation trade-offs. Pre-computed clusters typically account for only 15–25% of total embedding storage in large-scale benchmarks, reducing static memory overhead proportionally.
  • Adaptive Caching: Remaining cluster embeddings are eligible for caching in DRAM, with cache capacity typically set to about 7% of system RAM (roughly 0.3–0.5 GB on an 8 GB device). A weighted LFU policy, with cost defined as $L_{\mathrm{gen}} \times$ access frequency, evicts the least cost-effective clusters as new ones are generated.
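A minimal sketch of the offline pruning decision, assuming scikit-learn's k-means, a whitespace token count as a rough proxy for $\mathrm{len}(x)$, and illustrative default parameters; none of these choices are prescribed by the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_index(chunks, embed, r_gen_tok_per_s, tau_slo_s=1.0, K=512):
    """Cluster chunk embeddings, then keep (pre-compute) only clusters whose
    on-demand generation latency L_gen would exceed the SLO threshold tau."""
    X = np.stack([embed(c) for c in chunks])        # (N, d) chunk embeddings
    km = KMeans(n_clusters=K, n_init=10).fit(X)     # k-means centroids

    precomputed, pruned = {}, []
    for j in range(K):
        members = np.where(km.labels_ == j)[0]
        # L_gen(C_j) = sum of chunk lengths (tokens) / generation rate (tokens/s).
        tokens = sum(len(chunks[i].split()) for i in members)
        l_gen = tokens / r_gen_tok_per_s
        if l_gen > tau_slo_s:
            precomputed[j] = X[members]             # too slow to regenerate: store
        else:
            pruned.append(j)                        # cheap to regenerate: drop embeddings

    return km.cluster_centers_, precomputed, pruned
```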

3. Retrieval, Caching, and Query Processing Algorithms

The retrieval pipeline is governed by three primary algorithms:

  • Offline Selective Index Storage: Clusters exceeding the SLO threshold are flagged for embedding pre-computation. Metadata (centroids, counts, flags) is persistently stored.
  • Online Cost-Aware LFU Replacement: When a cluster is accessed, the system checks for its presence in pre-computed storage or the cache; otherwise, it generates the embeddings and optionally inserts them into the cache, evicting the lowest-weighted LFU cluster if necessary.
  • Dynamic Minimum-Latency Threshold Adaptation: After each query, if a cache miss still leaves latency under the SLO, $\tau_{\min}$ is raised so the cache is reserved for more expensive clusters; otherwise, $\tau_{\min}$ is lowered to admit cheaper clusters. A moving average of query latency tracks SLO attainment (the cache and threshold logic are sketched below).
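The cache-side logic can be sketched as follows. The class name, byte-based capacity accounting, and fixed adaptation step are assumptions made for illustration; the cost-times-frequency eviction weight and the threshold feedback follow the description above.

```python
class CostAwareLFUCache:
    """Illustrative cost-aware LFU cache with an adaptive admission threshold."""

    def __init__(self, capacity_bytes, tau_min=0.2):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = {}        # cluster id -> (embeddings, gen_cost_s, freq)
        self.tau_min = tau_min   # minimum generation cost for cache admission

    def __contains__(self, j):
        return j in self.entries

    def get(self, j):
        emb, cost, freq = self.entries[j]
        self.entries[j] = (emb, cost, freq + 1)     # count the access
        return emb

    def maybe_insert(self, j, emb, gen_cost_s):
        if gen_cost_s < self.tau_min:               # too cheap to be worth caching
            return
        size = emb.nbytes
        while self.used + size > self.capacity and self.entries:
            # Evict the entry with the lowest (generation cost x access frequency) weight.
            victim = min(self.entries, key=lambda k: self.entries[k][1] * self.entries[k][2])
            self.used -= self.entries[victim][0].nbytes
            del self.entries[victim]
        if self.used + size <= self.capacity:
            self.entries[j] = (emb, gen_cost_s, 1)
            self.used += size

    def adapt(self, query_latency_s, slo_s, step=0.05):
        # SLO met: reserve the cache for more expensive clusters.
        # SLO missed: admit cheaper clusters as well.
        if query_latency_s < slo_s:
            self.tau_min += step
        else:
            self.tau_min = max(0.0, self.tau_min - step)
```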

Core Algorithmic Steps

| Component | Input/Trigger | Action |
| --- | --- | --- |
| Indexer | Corpus partitions | Compute clusters, pre-compute heavy clusters |
| Retriever | Query | Embed query, cluster search, load/generate embeddings, cache |
| Cache Manager | Cache access/miss | Update counters, cost-aware LFU eviction/insertion |
| $\tau_{\min}$ update | Query latency feedback | Adjust cache admission threshold in adaptive loop |

4. Memory Footprint, Latency, and Empirical Evaluation

The memory usage under EdgeRAG is analytically decomposed as:

  • Centroids + Cluster Metadata: $K \times d \times b + K \times$ (metadata) $\approx K \times 3{,}072$ bytes for typical $K$ (512–1,024), embedding dimension $d = 768$, and $b = 4$ bytes (float32).
  • Precomputed Clusters: $M_{\mathrm{pre}} = \rho N d b$, where $\rho$ is the precompute fraction and $N$ the number of chunks.
  • Cache: $M_{\mathrm{cache}} \approx c \cdot (1-\rho) N d b$, where $c$ is the cached fraction of the remaining embeddings.

On BEIR's largest datasets (nq, fever, hotpotqa; $N > 5$ million, $> 8$ GB), EdgeRAG achieves $\rho \approx 0.15$–$0.25$ and $c \approx 0.07$, allowing all experiments to run under 8 GB, while the baseline IVF index would not fit without thrashing.
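Plugging representative numbers into this decomposition illustrates why the pruned index fits where the full one does not. The values of $N$, $K$, $\rho$, and $c$ below are chosen from the ranges discussed above and are illustrative, not measured results.

```python
# Back-of-the-envelope footprint using the decomposition above.
N, d, b = 5_000_000, 768, 4          # chunks, embedding dim, bytes per float32
K = 1024                             # first-level centroids
rho, c = 0.20, 0.07                  # precompute fraction, cache fraction

full_index  = N * d * b                  # ~14.3 GiB: full embedding store (does not fit in 8 GB)
centroids   = K * d * b                  # ~3 MiB:   centroid table
precomputed = rho * N * d * b            # ~2.9 GiB: clusters stored at index time
cache       = c * (1 - rho) * N * d * b  # ~0.8 GiB: adaptive DRAM cache

for name, size in [("full index", full_index), ("centroids", centroids),
                   ("precomputed", precomputed), ("cache", cache)]:
    print(f"{name:12s} {size / 2**30:6.2f} GiB")
```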

Performance Metrics:

  • Latency: EdgeRAG reduces average retrieval-plus-first-token latency (TTFT) from 850 ms (IVF baseline) to 475 ms, with 95th-percentile latency cut from 2,200 ms to 350 ms. Speedups reach up to $3.8\times$ on large datasets.
  • Quality: Recall@10 with EdgeRAG matches the IVF baseline within 1% and remains close to the flat index (0.70 vs. 0.72). Generation F1 (GPT-4o) is within 5% of the flat index.
  • Memory: scidocs (3.6k, 113 MB) fits fully in memory; nq, hotpotqa, and fever ($>8$ GB) require aggressive pruning.

5. Design Trade-Offs and Hyperparameter Sensitivity

Key design parameters control the storage/latency/quality trade-offs:

  • Cluster count $K$: More clusters mean smaller per-cluster generation latency and fewer pre-computed clusters, but more memory for centroids and metadata. Practical regimes are 512–1,024 centroids.
  • SLO threshold $\tau$: A tighter SLO means more clusters are pre-computed, increasing storage cost; a larger $\tau$ pushes more computation on demand, risking latency spikes when many queries hit heavy clusters.
  • Cache capacity $c$: Higher cache percentages reduce recomputation and improve latency, but even modest settings (5–10% of DRAM) retain over 80% of reusable clusters in practice and reduce the cache miss rate by $4\times$.
  • Dynamic $\tau_{\min}$ adaptation: It is critical to tune $\tau_{\min}$ so that expensive clusters are prioritized for caching. Too permissive a setting fills the cache with cheap clusters; too restrictive a setting loses the amortization benefit on expensive clusters. Adaptive feedback-loop tuning maintains $>70\%$ hit rates with minimal latency increase (an illustrative configuration sketch follows this list).
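The knobs above can be gathered into a single configuration object. The field names and default values in this sketch are assumptions chosen from the ranges discussed in this section, not settings taken from the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class EdgeRAGConfig:
    """Illustrative hyperparameter bundle reflecting the trade-offs above."""
    num_clusters: int = 1024        # K: more clusters -> cheaper regeneration, more metadata
    slo_threshold_s: float = 1.0    # tau: clusters slower than this are pre-computed
    cache_fraction: float = 0.07    # c: share of DRAM reserved for the embedding cache
    tau_min_init_s: float = 0.2     # initial cache-admission threshold (tau_min)
    tau_min_step_s: float = 0.05    # per-query adjustment of tau_min
    top_p_clusters: int = 4         # first-level clusters probed per query
    top_k_chunks: int = 5           # chunks handed to the generation model
```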

In practice, EdgeRAG achieves SLO-compliant operation, sub-8 GB total memory, and robust retrieval/generation metrics by combining these trade-offs (Seemakhupt et al., 2024).

6. Comparative Context and Impact

EdgeRAG is distinctly optimized for edge devices with limited DRAM and compute budgets. Unlike offline-pruned IVF indexes or dense full-embedding approaches, EdgeRAG’s three-mode management (pre-compute, prune, adaptive cache) is directly informed by per-cluster generation latency and memory limits, ensuring scalable operation across a wide range of corpus sizes and query patterns on commodity edge hardware.

By achieving a $1.8$–$3.8\times$ speedup in TTFT, a $>6\times$ reduction in 95th-percentile tail latency, and accuracy within 5% of full indices, EdgeRAG allows robust retrieval-augmented generation on devices with only several gigabytes of RAM and without relying on continuous cloud access. This enables new deployment scenarios for RAG—such as fully local QA, on-device summarization, or privacy-sensitive knowledge workflows—where mobile or IoT computation is essential and cloud connectivity is unpredictable or prohibitive (Seemakhupt et al., 2024).

A plausible implication is that EdgeRAG’s adaptive index surface—the blend of pre-compute, dynamic cache, and on-demand generation—can serve as a template for future resource-aware document retrieval systems in decentralized or federated settings, especially where storage and latency budgets are non-negotiable.

References

Seemakhupt, K., Liu, S., and Khan, S. (2024). EdgeRAG: Online-Indexed RAG for Edge Devices. arXiv preprint.
