
CoEdge-RAG Hierarchical Edge Framework

Updated 15 November 2025
  • CoEdge-RAG is a hierarchical framework that enables real-time retrieval-augmented generation on resource-constrained edge devices while ensuring data privacy.
  • It leverages a PPO-driven online query identifier, dynamic inter-node scheduling, and intra-node convex optimization for efficient model and resource allocation.
  • Evaluations show significant improvements in quality and latency, with up to 91.39% gain in model allocation efficiency and minimal query latency.

CoEdge-RAG is a hierarchical computational framework enabling Retrieval-Augmented Generation (RAG) for LLMs operating collaboratively across multiple resource-constrained edge devices. The framework is motivated by the dual requirements for real-time responsiveness and data privacy, promoting localized inference while exploiting distributed knowledge and heterogeneous compute resources. Fundamental components include online query identification using Proximal Policy Optimization (PPO), dynamic inter-node scheduling, and intra-node query/model/memory allocation via online convex optimization.

1. Architectural Foundations

CoEdge-RAG comprises three core layers:

Global Coordinator:

The coordinator manages batches of user queries, encoding each query $q_i$ into an embedding $e_i^t$. An Online Query Identifier, realized as a lightweight PPO-based policy network, infers for each query a soft assignment vector $s_i^t = (s_{i1}^t, \dots, s_{iN}^t)$, representing the estimated semantic match of the query to each of the $N$ edge nodes' localized corpora. Given node-specific processing capacities, the coordinator invokes an inter-node scheduler to allocate queries proportionally: each query receives a node assignment $a_i^t$.

Edge Node Processing:

Each node receives assigned queries, performs private top-$k$ vector search retrieval over its local index, and constructs RAG prompts by concatenating retrieved content with the original query. Local model pools ($\mathcal{M}_n$) comprising various LLMs (e.g., LLaMA-1B/3B/8B, Qwen) are managed across available GPUs. An intra-node scheduler solves an online convex program to determine model selection ($p_{mnk}^t$), GPU memory allocation ($R_{mnk}^t$), and query batching to satisfy latency Service-Level Objectives (SLOs).
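As an illustrative sketch (not the paper's implementation), a node's private top-$k$ retrieval and prompt construction might look like the following, assuming cosine similarity over a local embedding matrix:

```python
import numpy as np

def top_k_retrieve(query_emb, index_embs, docs, k=3):
    """Cosine-similarity top-k search over a node's local vector index."""
    q = query_emb / np.linalg.norm(query_emb)
    X = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    sims = X @ q
    top = np.argsort(-sims)[:k]
    return [docs[i] for i in top]

def build_rag_prompt(query, passages):
    """Concatenate retrieved passages with the original query."""
    context = "\n".join(f"[{i+1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

The retrieved passages stay on the node; only the generated answer leaves it, which is what preserves data privacy in this design.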

Feedback Loop:

Generated answers are scored against references from a high-quality cloud LLM using a composite metric (ROUGE-L and BERTScore). These scores are used to update the PPO policy in the coordinator, completing the closed optimization loop.

Data flow progresses through four stages: query embedding → policy network ($s_i^t$) → inter-node scheduler ($a_i^t$) → edge node generation → scoring & feedback.

2. Online Query Identification (PPO-Driven Matching)

The query-to-node matching challenge is modeled as a contextual bandit: the state space comprises dense query embeddings ($e_i^t$), actions correspond to node assignments ($a_i^t$), and rewards reflect the quality of generated answers relative to reference completions.

The PPO policy network $\pi_\theta(a \mid e)$ operates without a critic, minimizing computational overhead. For each query, the composite reward

$$f_i^t = \alpha_1 f_{i,R}^t + \alpha_2 f_{i,B}^t,$$

where $f_{i,R}^t$ and $f_{i,B}^t$ are ROUGE-L and BERTScore, respectively, and $\alpha_1 + \alpha_2 = 1$, is standardized within each batch. The objective function employs clipped surrogate gradients and entropy regularization:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_{i,t}\!\left[\min\!\left(\rho_i^t \bar f_i^t,\ \mathrm{clip}(\rho_i^t, 1-\epsilon, 1+\epsilon)\,\bar f_i^t\right)\right] + \beta H(\pi_\theta),$$

with importance weight $\rho_i^t = \pi_\theta(a_i^t \mid e_i^t) / \pi_{\theta_{\mathrm{old}}}(a_i^t \mid e_i^t)$, entropy regularizer $H$, clipping parameter $\epsilon$, and exploration parameter $\beta$.
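A minimal NumPy sketch of this critic-free objective, with the batch-standardized rewards $\bar f_i^t$ serving directly as advantages (hyperparameter values are illustrative, not from the paper):

```python
import numpy as np

def ppo_clip_loss(log_probs_new, log_probs_old, rewards, eps=0.2, beta=0.01):
    """Critic-free clipped surrogate loss with an entropy bonus.

    Rewards are standardized within the batch and used directly as
    advantages (no value baseline). Hyperparameters are illustrative.
    """
    f_bar = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    rho = np.exp(log_probs_new - log_probs_old)        # importance weights
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps)
    surrogate = np.minimum(rho * f_bar, clipped * f_bar)
    # Entropy estimated from the log-probs of the chosen actions.
    entropy = -np.mean(np.exp(log_probs_new) * log_probs_new)
    # PPO maximizes surrogate + beta * entropy; return the negation
    # so a gradient-descent optimizer can minimize it.
    return -(surrogate.mean() + beta * entropy)
```

In practice the policy network would be trained with an autodiff framework; this sketch only makes the loss arithmetic concrete.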

Empirical results indicate that per-query inference latency at the matching step is approximately $0.02$ ms, and that PPO-based allocation delivers up to a 91.39% gain over bandit and random allocation schemes (Hong et al., 8 Nov 2025).

3. Dynamic Inter-Node Query Scheduling

Scheduling queries across heterogeneous edge nodes necessitates respecting latency and capacity constraints. Node capacity is profiled offline by fitting linear models:

$$C_n(L^t) = k_n L^t + b_n,$$

where $L^t$ is the latency SLO for slot $t$.
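For illustration, the per-node linear fit can be obtained from offline profiling data with a least-squares regression; the measurements below are hypothetical:

```python
import numpy as np

# Hypothetical offline profiling data for one node: the maximum number of
# queries it can serve per slot at each latency SLO (seconds).
slo_samples = np.array([1.0, 2.0, 3.0, 5.0])
cap_samples = np.array([4.0, 9.0, 13.0, 22.0])

# Least-squares linear fit: C_n(L) = k_n * L + b_n
k_n, b_n = np.polyfit(slo_samples, cap_samples, deg=1)

def capacity(L_t):
    """Modeled capacity of node n under latency SLO L_t."""
    return k_n * L_t + b_n
```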

The allocation algorithm sequentially samples node assignments for each query from $s_i^t$, reassigning queries when a preferred node is at capacity. In cases of global overload, node capacities are temporarily increased proportionally. The scheduler maintains per-node query proportions $p_n^t$, ensuring no node exceeds its modeled maximum under the current latency budget.
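A simplified sketch of this sample-then-reassign loop, with a proportional capacity bump under global overload (the 1.5× scaling factor is an assumption for illustration, not a value from the paper):

```python
import numpy as np

def allocate(S, capacities, rng=None):
    """Assign each query by sampling from its soft assignment vector s_i^t.

    If the sampled node is at capacity, fall back to the open node with
    the highest score; if every node is full, scale all capacities up
    proportionally (the 1.5x factor is an illustrative assumption).
    S: (num_queries, num_nodes), rows summing to 1.
    """
    rng = rng or np.random.default_rng(0)
    cap = np.asarray(capacities, dtype=float)
    load = np.zeros_like(cap)
    assignments = []
    for s in S:
        n = rng.choice(len(s), p=s)                # sample preferred node
        if load[n] >= cap[n]:                      # preferred node is full
            open_nodes = np.flatnonzero(load < cap)
            if open_nodes.size == 0:               # global overload
                cap *= 1.5
                open_nodes = np.flatnonzero(load < cap)
            n = open_nodes[np.argmax(s[open_nodes])]
        load[n] += 1
        assignments.append(int(n))
    return assignments, load
```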

This adaptive mechanism demonstrates less than 7% degradation in ROUGE-L under severe domain skew, compared to greater than 10% for non-balanced baselines (Hong et al., 8 Nov 2025).

4. Intra-Node Model and Resource Scheduling via Online Convex Optimization

At the intra-node level, the framework allocates query batches across local LLMs and GPUs, solving an online convex program that maximizes aggregate answer quality:

$$\max_{p, R} \sum_{m, k} p_{mnk}^t Q_{mn},$$

where $Q_{mn}$ are pre-measured model quality scores.

Latency for each model-GPU pair is approximated by a quadratic fit:

$$\widetilde{L}_{mnk}^t = \left(a\, p_{mnk}^t B^t - b\, R_{mnk}^t\right)^2 + c\, p_{mnk}^t B^t + d\, R_{mnk}^t + e + \Delta T,$$

with empirical parameters. Reconfiguration overhead due to model switching is accounted for with linear constraints.
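To make the fit concrete, the quadratic model can be evaluated to screen candidate allocations against the latency SLO; all coefficients below are hypothetical placeholders:

```python
def latency_estimate(p, R, B, a=2.0, b=1.0, c=0.5, d=-0.2, e=0.1, dT=0.05):
    """Quadratic latency fit for one model-GPU pair:
    (a*p*B - b*R)^2 + c*p*B + d*R + e + dT.
    p: query fraction, R: GPU-memory fraction, B: batch size.
    All coefficients here are hypothetical placeholders.
    """
    return (a * p * B - b * R) ** 2 + c * p * B + d * R + e + dT

def meets_slo(p, R, B, slo, **coeffs):
    """Check a candidate (p, R) allocation against the latency SLO."""
    return latency_estimate(p, R, B, **coeffs) <= slo
```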

The optimization problem is subject to constraints on SLO latency, total GPU memory, and model startup/shutdown actions, and is efficiently solved via primal–dual mirror-descent updates:

$$p^{t+1} = \Pi_{\{\sum p = 1,\ p \geq 0\}}\left\{p^t - \eta \nabla_p \mathcal{L}\right\}, \qquad R^{t+1} = \Pi_{\{0 \leq R \leq 1,\ \sum_m R \leq 1\}}\left\{R^t - \eta \nabla_R \mathcal{L}\right\},$$

where $\mathcal{L}$ is the Lagrangian with dual variables for each constraint.
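A sketch of one primal update step, using entropic mirror descent (multiplicative weights) for the simplex-constrained $p$ and a simple clip-and-rescale feasibility step for $R$ (the latter is a heuristic stand-in, not the exact Euclidean projection onto the constraint set):

```python
import numpy as np

def mirror_descent_step_p(p, grad_p, eta=0.1):
    """Entropic mirror descent on the probability simplex:
    multiplicative-weights update followed by renormalization."""
    w = p * np.exp(-eta * grad_p)
    return w / w.sum()

def project_R(R):
    """Feasibility step for memory fractions: clip to [0, 1] and rescale
    so the total stays <= 1 (a heuristic, not the exact projection)."""
    R = np.clip(R, 0.0, 1.0)
    total = R.sum()
    return R / total if total > 1.0 else R

def step(p, R, grad_p, grad_R, eta=0.1):
    """One primal update of the online convex program."""
    return mirror_descent_step_p(p, grad_p, eta), project_R(R - eta * grad_R)
```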

Dynamic intra-node scheduling readily accommodates query load surges and latency changes, maintaining DropRate below 3% at 5s SLO and increasing answer quality by 5–15% over static deployments (Hong et al., 8 Nov 2025).

5. Experimental Evaluation and Comparative Results

Benchmark evaluations were conducted on four edge nodes (RTX 4090 GPUs) across six domains—biology, finance, law, sports, tech, and travel—with additional personalized conversation datasets. Key metrics include ROUGE, BLEU-4, METEOR, BERTScore, and DropRate.

| Query Allocation | DomainQA Quality Gain | PPC Quality Gain | DropRate |
|---|---|---|---|
| PPO (CoEdge-RAG) | 4.23–59.84% | 3.93–91.39% | <3% |
| Random (Baseline) | — | — | Higher |
| MAB (LinUCB, Baseline) | — | — | — |

Adaptive scheduling under varying SLOs demonstrates near-optimal allocation:

  • Strict SLOs direct all queries to 1B parameter models.
  • Moderate SLOs distribute ~70% to 3B, ~30% to 8B.
  • Relaxed SLOs allocate the majority to 8B models.

A plausible implication is that CoEdge-RAG enables task-specific resource allocation without predefined workload characteristics, improving both throughput and answer quality.

6. Methodological Context, Trade-offs, and Future Directions

CoEdge-RAG unifies online RL-based query identification, capacity-aware inter-node balancing, and convex optimization for model/resource assignment.

Advantages:

  • Scalable matching adapts to private, unknown corpora without data leakage.
  • Inter-node balancing suppresses tail-latency under input or domain skew, reducing potential bottlenecks by up to 93%.
  • Fast intra-node OCO scheduling maintains optimal trade-offs between latency and generation quality.

Limitations and Extensions:

  • The linear node capacity model Cn(L)C_n(L) may be replaced by piecewise-linear or nonlinear fits for heterogeneous hardware.
  • Convex solver reliability presumes stable latency models; robust or stochastic approaches may be required under high network variability.
  • Scaling to large node counts may necessitate decentralized inter-node scheduling (e.g., tree aggregation).
  • Support for multi-document or streaming queries implies extension to multi-vector embeddings and hierarchical RL.
  • Integration of vision-augmented or multi-modal retrieval-augmented LLMs is feasible with offline latency/quality profiling.

This framework is directly suited for applications requiring privacy compliance, real-time responsiveness, and edge-local personalization, including offline QA, personalized assistants, and distributed API generation (Hong et al., 8 Nov 2025).
