CoEdge-RAG Hierarchical Edge Framework
- CoEdge-RAG is a hierarchical framework that enables real-time retrieval-augmented generation on resource-constrained edge devices while ensuring data privacy.
- It leverages a PPO-driven online query identifier, dynamic inter-node scheduling, and intra-node convex optimization for efficient model and resource allocation.
- Evaluations show significant improvements in quality and latency, with up to 91.39% gain in model allocation efficiency and minimal query latency.
CoEdge-RAG is a hierarchical computational framework enabling Retrieval-Augmented Generation (RAG) for LLMs operating collaboratively across multiple resource-constrained edge devices. The framework is motivated by the dual requirements for real-time responsiveness and data privacy, promoting localized inference while exploiting distributed knowledge and heterogeneous compute resources. Fundamental components include online query identification using Proximal Policy Optimization (PPO), dynamic inter-node scheduling, and intra-node query/model/memory allocation via online convex optimization.
1. Architectural Foundations
CoEdge-RAG comprises three core layers:
Global Coordinator:
The coordinator manages batches of user queries, encoding each query into a dense embedding. An Online Query Identifier, realized as a lightweight PPO-based policy network, infers for each query a soft assignment vector representing the estimated semantic match of the query to each edge node's localized corpus. Given node-specific processing capacities, the coordinator invokes an inter-node scheduler to allocate queries proportionally, so that each query receives a single node assignment.
Edge Node Processing:
Each node receives its assigned queries, performs private top-$k$ vector search retrieval over its local index, and constructs RAG prompts by concatenating retrieved content with the original query. Local model pools comprising various LLMs (e.g., LLaMA-1B/3B/8B, Qwen) are managed across available GPUs. An intra-node scheduler solves an online convex program to determine model selection, GPU memory allocation, and query batching so as to satisfy latency Service-Level Objectives (SLOs).
Feedback Loop:
Generated answers are scored against references from a high-quality cloud LLM using a composite metric (ROUGE-L and BERTScore). These scores are used to update the PPO policy in the coordinator, completing the closed optimization loop.
Data flow progresses through five stages: query embedding → policy network → inter-node scheduler → edge-node generation → scoring & feedback.
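The data flow above can be sketched as a minimal orchestration loop. Every name here (`embed`, `policy`, `scheduler`, `nodes`, `score`, `update_policy`) is a hypothetical stand-in for the corresponding component, not the paper's actual API:

```python
def coedge_rag_step(queries, embed, policy, scheduler, nodes, score, update_policy):
    """One round of the coordinator loop (illustrative stubs only)."""
    states = [embed(q) for q in queries]             # 1. query embedding
    probs = [policy(s) for s in states]              # 2. policy network -> soft assignments
    assignments = scheduler(probs)                   # 3. inter-node scheduler
    answers = [nodes[a].generate(q)                  # 4. edge-node retrieval + generation
               for q, a in zip(queries, assignments)]
    rewards = [score(ans) for ans in answers]        # 5. scoring & feedback
    update_policy(states, assignments, rewards)      # close the PPO loop
    return answers
```

In a real deployment each stage is batched and asynchronous; the sketch only fixes the order of hand-offs.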
2. Online Query Identification (PPO-Driven Matching)
The query-to-node matching challenge is modeled as a contextual bandit: the state space comprises dense query embeddings, actions correspond to node assignments, and rewards reflect the quality of generated answers relative to reference completions.
The PPO policy network operates without a critic, minimizing computational overhead. For each query $i$, the composite reward
$$r_i = \lambda\, r_i^{\mathrm{RL}} + (1-\lambda)\, r_i^{\mathrm{BS}},$$
where $r_i^{\mathrm{RL}}$ and $r_i^{\mathrm{BS}}$ are ROUGE-L and BERTScore, respectively, and $\lambda \in [0,1]$, is standardized within each batch to yield the advantage $A_i$. The objective function employs clipped surrogate gradients and entropy regularization:
$$L(\theta) = \mathbb{E}\big[\min\big(\rho_i A_i,\ \mathrm{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i\big)\big] + \beta\, \mathcal{H}(\pi_\theta),$$
with importance weight $\rho_i = \pi_\theta(a_i \mid s_i)/\pi_{\theta_{\mathrm{old}}}(a_i \mid s_i)$, entropy regularizer $\mathcal{H}$, clipping parameter $\epsilon$, and exploration parameter $\beta$.
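A minimal sketch of the critic-free update, assuming the batch-standardized composite reward serves directly as the advantage and using a hypothetical mixing weight `lam`; this is our reading of the described objective, not the paper's released code:

```python
import numpy as np

def composite_reward(rouge_l, bert_score, lam=0.5):
    """Blend ROUGE-L and BERTScore; lam is an assumed mixing weight."""
    return lam * rouge_l + (1.0 - lam) * bert_score

def ppo_loss(logp_new, logp_old, advantages, probs_new, eps=0.2, beta=0.01):
    """Critic-free PPO: clipped surrogate plus entropy bonus (to minimize)."""
    ratio = np.exp(logp_new - logp_old)                   # importance weights
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    entropy = -np.sum(probs_new * np.log(probs_new + 1e-12), axis=-1)
    return -(surrogate.mean() + beta * entropy.mean())

# Advantages: batch-standardized composite rewards (no value baseline).
rewards = np.array([composite_reward(0.4, 0.8),
                    composite_reward(0.2, 0.6),
                    composite_reward(0.5, 0.9)])
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

With identical old and new policies the ratio is 1, the surrogate reduces to the mean advantage, and the loss is dominated by the entropy bonus.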
Empirical results indicate per-query inference latency at the matching step is approximately $0.02$ ms, and PPO-based allocation delivers up to 91.39% gain over bandit and random allocation schemes (Hong et al., 8 Nov 2025).
3. Dynamic Inter-Node Query Scheduling
Scheduling queries across heterogeneous edge nodes necessitates respecting latency and capacity constraints. Node capacity is profiled offline by fitting a linear model $C_n(T_{\mathrm{SLO}}) = a_n\, T_{\mathrm{SLO}} + b_n$, where $T_{\mathrm{SLO}}$ is the latency SLO for the current scheduling slot and $a_n, b_n$ are per-node coefficients.
The allocation algorithm sequentially samples a node assignment for each query from its soft assignment vector, reassigning queries when the preferred node is at capacity. In cases of global overload, node capacities are temporarily increased proportionally. The scheduler maintains per-node query proportions, ensuring no node exceeds its modeled maximum under the current latency budget.
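The capacity-aware assignment can be sketched as follows; the reassignment rule (route overflow to the most underloaded feasible node) is our simplification of the described behavior, and the capacity numbers come from the offline linear fit:

```python
import numpy as np

def schedule_queries(assign_probs, capacities, rng=None):
    """Sample node assignments from soft assignment vectors while
    respecting per-node capacities; illustrative sketch."""
    if rng is None:
        rng = np.random.default_rng(0)
    cap = np.asarray(capacities, dtype=float).copy()
    load = np.zeros_like(cap)
    # Global overload: temporarily scale all capacities proportionally.
    total = len(assign_probs)
    if total > cap.sum():
        cap *= total / cap.sum()
    assignments = []
    for p in assign_probs:
        node = rng.choice(len(cap), p=p)          # preferred (semantic) node
        if load[node] >= cap[node]:               # node at capacity:
            free = np.where(load < cap)[0]        # reassign to the most
            node = free[np.argmax(cap[free] - load[free])]  # underloaded one
        load[node] += 1
        assignments.append(int(node))
    return assignments, load
```

When every query prefers one node, the overflow spills deterministically onto the remaining capacity.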
This adaptive mechanism demonstrates less than 7% degradation in ROUGE-L under severe domain skew, compared to greater than 10% for non-balanced baselines (Hong et al., 8 Nov 2025).
4. Intra-Node Model and Resource Scheduling via Online Convex Optimization
At the intra-node level, the framework allocates query batches across local LLMs and GPUs, solving an online convex program that maximizes aggregate answer quality, $\max_x \sum_m q_m x_m$, where $q_m$ are pre-measured model quality scores and $x_m$ denotes the share of queries routed to model $m$.
Latency for each model–GPU pair is approximated by a quadratic fit in the batch size $b$, $\ell_{m,g}(b) \approx c_2 b^2 + c_1 b + c_0$, with empirically fitted parameters $c_0, c_1, c_2$. Reconfiguration overhead due to model switching is accounted for with linear constraints.
The optimization problem is subject to constraints on SLO latency, total GPU memory, and model startup/shutdown actions, and is efficiently solved via primal–dual mirror-descent updates on the Lagrangian $\mathcal{L}(x, \lambda)$, where $\lambda_j$ are dual variables for each constraint.
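A stripped-down primal–dual mirror-descent loop under these assumptions: we keep only two constraints (average latency vs. the SLO, memory vs. a budget), treat $x$ as per-model query fractions, and omit reconfiguration terms; all coefficients are illustrative, not profiled values:

```python
import numpy as np

def primal_dual_schedule(q, lat, mem, slo, mem_budget, steps=2000, eta=0.05):
    """Maximize sum_m q[m]*x[m] s.t. x@lat <= slo and x@mem <= mem_budget.

    Entropic mirror descent keeps x on the probability simplex; the dual
    variables for the two constraints are projected to be nonnegative.
    """
    m = len(q)
    x = np.full(m, 1.0 / m)            # primal: query fractions per model
    lam = np.zeros(2)                  # duals: latency, memory
    for _ in range(steps):
        grad = q - lam[0] * lat - lam[1] * mem   # gradient of the Lagrangian
        x = x * np.exp(eta * grad)               # mirror (entropic) ascent step
        x /= x.sum()
        viol = np.array([x @ lat - slo, x @ mem - mem_budget])
        lam = np.maximum(0.0, lam + eta * viol)  # dual subgradient step
    return x, lam
```

With slack constraints the duals stay at zero and the iterate concentrates on the highest-quality model; with a tight latency SLO the latency dual grows and mass shifts toward faster models.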
Dynamic intra-node scheduling readily accommodates query load surges and latency changes, maintaining DropRate below 3% at 5s SLO and increasing answer quality by 5–15% over static deployments (Hong et al., 8 Nov 2025).
5. Experimental Evaluation and Comparative Results
Benchmark evaluations were conducted on four edge nodes (RTX 4090 GPUs) across six domains—biology, finance, law, sports, tech, and travel—with additional personalized conversation datasets. Key metrics include ROUGE, BLEU-4, METEOR, BERTScore, and DropRate.
Quality gains below are reported for PPO-based allocation relative to the Random and MAB baselines:
| Query Allocation | DomainQA Quality Gain | PPC Quality Gain | DropRate |
|---|---|---|---|
| PPO (CoEdge-RAG) | 4.23–59.84% | 3.93–91.39% | <3% |
| Random (Baseline) | — | — | Higher |
| MAB (LinUCB, Baseline) | — | — | — |
Adaptive scheduling under varying SLOs demonstrates near-optimal allocation:
- Strict SLOs direct all queries to 1B-parameter models.
- Moderate SLOs distribute ~70% to 3B, ~30% to 8B.
- Relaxed SLOs allocate the majority to 8B models.
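This behavior follows directly from the quadratic latency fits of Section 4: under a strict SLO only the smallest model is feasible, while relaxed SLOs admit larger ones. A toy feasibility check with made-up coefficients (not the paper's measured values):

```python
def feasible_models(slo_s, lat_fit, batch=8):
    """Return models whose fitted latency c2*b^2 + c1*b + c0 meets the SLO."""
    latency = {m: c2 * batch ** 2 + c1 * batch + c0
               for m, (c2, c1, c0) in lat_fit.items()}
    return [m for m, l in latency.items() if l <= slo_s]

# Illustrative (c2, c1, c0) coefficients in seconds; larger models are slower.
fits = {"llama-1b": (0.001, 0.05, 0.2),
        "llama-3b": (0.004, 0.15, 0.5),
        "llama-8b": (0.010, 0.40, 1.2)}
```

At `slo_s=1.0` only the 1B model qualifies; at `slo_s=10.0` all three do, and the intra-node program is then free to trade quality against latency.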
A plausible implication is that CoEdge-RAG enables task-specific resource allocation without predefined workload characteristics, improving both throughput and answer quality.
6. Methodological Context, Trade-offs, and Future Directions
CoEdge-RAG unifies online RL-based query identification, capacity-aware inter-node balancing, and convex optimization for model/resource assignment.
Advantages:
- Scalable matching adapts to private, unknown corpora without data leakage.
- Inter-node balancing suppresses tail latency under input or domain skew, reducing potential bottlenecks by up to 93%.
- Fast intra-node OCO scheduling maintains optimal trade-offs between latency and generation quality.
Limitations and Extensions:
- The linear node capacity model may be replaced by piecewise-linear or nonlinear fits for heterogeneous hardware.
- Convex solver reliability presumes stable latency models; robust or stochastic approaches may be required under high network variability.
- Scaling to large node counts may necessitate decentralized inter-node scheduling (e.g., tree aggregation).
- Support for multi-document or streaming queries implies extension to multi-vector embeddings and hierarchical RL.
- Integration of vision-augmented or multi-modal retrieval-augmented LLMs is feasible with offline latency/quality profiling.
This framework is directly suited for applications requiring privacy compliance, real-time responsiveness, and edge-local personalization, including offline QA, personalized assistants, and distributed API generation (Hong et al., 8 Nov 2025).