DisCEdge: Distributed Context for LLM
- DisCEdge is a distributed context management system for LLMs that tokenizes and replicates session context across geo-distributed edge nodes.
- It minimizes latency and bandwidth usage by appending only new tokens and ensuring strong session consistency through asynchronous, client-driven replication.
- Evaluations on commodity hardware reveal up to 14.46% latency reduction and 15% lower synchronization traffic, supporting privacy-aware, real-time LLM applications.
DisCEdge is a distributed context management system designed for efficiently serving LLM workloads at the edge. It addresses the challenges of maintaining conversational or session context for latency-sensitive and privacy-aware applications running on geo-distributed edge nodes, where standard LLM statelessness and naive client-side context management introduce excessive network latency and bandwidth overhead. DisCEdge implements tokenized, versioned storage and replication of user context, providing consistent, low-latency, and bandwidth-efficient LLM sessions across commodity hardware edge deployments (Malekabbasi et al., 27 Nov 2025).
1. System Architecture and Goals
DisCEdge is architected around a set of geo-distributed edge nodes, each comprising a Context Manager, a token-aware LLM inference engine (e.g., a modified [llama.cpp](https://www.emergentmind.com/topics/llama-cpp)), and a local replica of a geo-distributed key-value (KV) store such as FReD. Clients (mobile devices or edge compute devices) connect to their nearest edge node using geo-DNS or a registry, issuing completion requests that include user_id, session_id, and a monotonically increasing turn counter.
High-Level Goals
- Low-latency inference: Minimize client-perceived round-trip time for sessional LLM inference.
- Strong session consistency: Guarantee strict ordering and continuity of user context across roaming sessions.
- Bandwidth efficiency: Reduce network load on both client-to-edge and edge-to-edge communication.
- Transparent interface: Abstract distribution, presenting a single centralized LLM service facade to clients.
- Commodity deployment: Operate efficiently on edge hardware without specialized accelerators (demonstrated on Nvidia Jetson TX2 and Apple M2).
Component Structure
Each edge node consists of:
- Context Manager: Assigns and validates session parameters; maintains an in-memory, tokenized session state; enforces consistency semantics driven by the client turn counter.
- LLM Service: Receives concatenated token context and raw prompt; performs inference with minimal re-tokenization.
- KV Store Replica: Holds tokenized session context, supports peer-to-peer asynchronous replication, and enforces TTL-based eviction to reclaim resources.
2. Context Tokenization and Storage
DisCEdge transforms and stores user session context as integer token sequences rather than raw text, leading to efficiency gains in storage, transmission, and computation.
Tokenization Pipeline
- On session initiation (first turn), the Context Manager tokenizes the raw user input with the LLM's tokenizer and stores the result as the initial context $T_0$.
- On each subsequent turn $k$, only the new prompt is tokenized (yielding $p_k$), and the cumulative context is updated by simple concatenation: $T_k = T_{k-1} \parallel p_k$.
Data Structures
- KV Store: Keyed by the pair (user, session) and holding values $\{\, \text{version} \in \mathbb{N},\ \text{tokens} = [\text{int}] \,\}$ — a version counter and a sequence of token IDs.
- In-memory: The Context Manager caches the token sequence as a dynamic array keyed by session identifier.
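The versioned value and in-memory cache described above can be sketched as follows (a minimal illustration; the class and method names are assumptions, not taken from the DisCEdge codebase):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SessionContext:
    """KV-store value: a version counter plus a flat list of token IDs."""
    version: int = 0
    tokens: List[int] = field(default_factory=list)

class ContextCache:
    """In-memory cache keyed by (user_id, session_id)."""
    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], SessionContext] = {}

    def append_turn(self, user_id: str, session_id: str,
                    new_tokens: List[int]) -> SessionContext:
        # Append only the newly tokenized prompt: T_k = T_{k-1} || p_k
        ctx = self._store.setdefault((user_id, session_id), SessionContext())
        ctx.tokens.extend(new_tokens)
        ctx.version += 1
        return ctx
```

Because each turn only extends the token list and bumps the version, per-turn work is proportional to the new prompt, not the full history.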
Efficiency over Raw Text
- Compactness: Token sequences require approximately half the bytes of raw text, reducing inter-node synchronization traffic by up to 15%.
- Compute Reduction: Avoids repeated re-tokenization per turn, decreasing inference response latency by up to 14.46%.
- Append Efficiency: Context expansion per prompt is amortized, since only new tokens are appended.
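The compactness claim can be checked with a back-of-envelope calculation. Both figures below are assumptions for illustration: roughly 4 characters per token for English text, and 2-byte token IDs (a vocabulary under 65,536 entries):

```python
# Rough compactness estimate: tokenized context vs. raw UTF-8 text.
# Assumes ~4 ASCII characters per token and 2-byte token IDs.
chars = 1000                  # characters of raw ASCII text
bytes_raw = chars             # 1 byte per ASCII character
tokens = chars / 4            # ~4 characters per token
bytes_tok = tokens * 2        # 2 bytes per token ID
print(bytes_tok / bytes_raw)  # about half the bytes of raw text
```

Under these assumptions the token representation needs roughly half the bytes, which is consistent with the synchronization-traffic reduction reported in the evaluation.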
3. Distributed Context Replication Protocol
DisCEdge implements a distributed, client-driven replication protocol that guarantees session consistency semantics with minimal coordination.
Replication Overview
- Writes: Updates (new turns) are written to the local KV-replica and then asynchronously propagated to peer nodes across the same model’s keygroup.
- Reads: Upon request, the Context Manager ensures the local context version is at least as recent as the client’s expectation (turn counter). If stale, it backs off and retries, with a maximum retry limit.
- Consistency: Maintains read-your-writes and monotonic reads (via client-incremented turn counters).
Pseudocode and State Machine
The protocol at the Context Manager is as follows:
```
OnRequest(user, session, client_turn, raw_prompt):
    retries ← 0
    while local_version < client_turn and retries < R_max:
        sleep(backoff_ms)
        (v', T') ← KV.read((user, session))
        local_version ← v'
        T ← T'
        retries ← retries + 1
    if local_version < client_turn:
        return error("consistency failure")
    p ← tokenize(raw_prompt)
    tok_context ← T ∥ p
    response ← LLM.generate(tok_context)
    spawn AsyncUpdate(user, session, local_version + 1, p)
    return response

AsyncUpdate(user, session, new_ver, p):
    new_tokens ← read_local_context(user, session).T ∥ p
    KV.write((user, session) → (new_ver, new_tokens))
```
State transitions per session proceed from initialization, through serving of successive turns, to eventual eviction on TTL expiry.
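The retry-and-append protocol can be sketched in runnable Python. The KV store, tokenizer, and LLM below are toy stand-ins, and all names are illustrative rather than drawn from the DisCEdge implementation:

```python
import threading
import time
from typing import Dict, List, Tuple

class KVStore:
    """Toy stand-in for the geo-distributed KV replica."""
    def __init__(self) -> None:
        self._data: Dict[Tuple[str, str], Tuple[int, List[int]]] = {}

    def read(self, key: Tuple[str, str]) -> Tuple[int, List[int]]:
        return self._data.get(key, (0, []))

    def write(self, key: Tuple[str, str], value: Tuple[int, List[int]]) -> None:
        self._data[key] = value

class ContextManager:
    def __init__(self, kv: KVStore, r_max: int = 3, backoff_ms: int = 10):
        self.kv, self.r_max, self.backoff_ms = kv, r_max, backoff_ms

    def on_request(self, user: str, session: str,
                   client_turn: int, raw_prompt: str) -> str:
        key = (user, session)
        version, tokens = self.kv.read(key)
        retries = 0
        # Block until the local replica catches up to the client's turn.
        while version < client_turn and retries < self.r_max:
            time.sleep(self.backoff_ms / 1000)
            version, tokens = self.kv.read(key)
            retries += 1
        if version < client_turn:
            raise RuntimeError("consistency failure")
        prompt_tokens = self._tokenize(raw_prompt)
        context = tokens + prompt_tokens          # T || p
        response = self._generate(context)
        # Asynchronously persist the appended context under a new version.
        threading.Thread(target=self.kv.write,
                         args=(key, (version + 1, context))).start()
        return response

    def _tokenize(self, text: str) -> List[int]:
        return [ord(c) for c in text]             # toy tokenizer

    def _generate(self, context: List[int]) -> str:
        return f"<{len(context)} context tokens>"  # toy LLM
```

Note how the client-supplied turn counter alone drives the consistency check: the Context Manager never coordinates with peers on the read path, only retries against its local replica.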
Complexity Analysis
Let $n$ denote the total context length, $m$ the number of new tokens per turn, and $R$ the number of replicas:
- Write serialization: $O(m)$, since only the newly appended tokens are serialized.
- Network bandwidth per update: $O(m \cdot R)$.
- Replication latency: $O(\tau + \beta m)$, with $\tau$ the network RTT and $\beta$ a per-byte transfer factor.
Amortized per-turn costs grow only with number of new tokens, not total context length.
4. Context Operations: API, Caching, and Eviction
DisCEdge’s programming interface and memory management are optimized for interactive, large-scale, multi-session LLM use cases.
API Semantics
- StartSession: Optionally assigns and returns a new `user_id` and `session_id`.
- Query (turn $k$): Requires the identifiers, the prior `client_turn = k-1`, and a new prompt; the operation blocks until the local context is current.
- Session Update: Upon receiving the response, the client increments its turn counter.
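A minimal client-side sketch of this turn-counter discipline follows; the class and the `send` callback are hypothetical, standing in for the real network call to the nearest edge node:

```python
from typing import Callable

class DisCEdgeClient:
    """Illustrative client wrapper enforcing the monotonic turn counter."""
    def __init__(self, user_id: str, session_id: str) -> None:
        self.user_id, self.session_id = user_id, session_id
        self.turn = 0  # first Query is issued with client_turn = 0

    def query(self, send: Callable[[str, str, int, str], str],
              prompt: str) -> str:
        # `send` stands in for the completion request to the edge node.
        response = send(self.user_id, self.session_id, self.turn, prompt)
        self.turn += 1  # increment only after a successful response
        return response
```

Incrementing only after a successful response ensures the next request's `client_turn` matches the version the edge node committed, which is what gives read-your-writes semantics without server-side coordination.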
Caching and Session Eviction
- In-Memory Cache: Context Manager maintains a per-session hash-map of token sequences.
- KV Store TTL: Enforces expiration and eviction of stale session state.
- Anticipated Extensions: LRU cache policies and session idle timeouts are proposed for future versions, improving multi-tenant resource efficiency.
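The TTL eviction in place today and the LRU policy proposed as an extension can be combined in one sketch (names and structure are illustrative, not DisCEdge's code):

```python
import time
from collections import OrderedDict
from typing import List, Optional, Tuple

class SessionStore:
    """TTL expiry plus LRU capacity bound for per-session token state."""
    def __init__(self, ttl_s: float, max_sessions: int) -> None:
        self.ttl_s, self.max_sessions = ttl_s, max_sessions
        self._entries: "OrderedDict[str, Tuple[float, List[int]]]" = OrderedDict()

    def put(self, session_id: str, tokens: List[int]) -> None:
        self._entries[session_id] = (time.monotonic(), tokens)
        self._entries.move_to_end(session_id)
        while len(self._entries) > self.max_sessions:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, session_id: str) -> Optional[List[int]]:
        entry = self._entries.get(session_id)
        if entry is None:
            return None
        stamp, tokens = entry
        if time.monotonic() - stamp > self.ttl_s:
            del self._entries[session_id]      # TTL expired
            return None
        self._entries.move_to_end(session_id)  # refresh LRU position
        return tokens
```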
5. Quantitative Evaluation
DisCEdge was evaluated using an open-source prototype on heterogeneous, resource-constrained edge devices and clients.
Experimental Setup
- Edge Nodes: Nvidia Jetson TX2 (ARM), Apple M2 (ARM), both deploying llama.cpp.
- Client Device: Raspberry Pi 4 (ARM).
- Network: Local LAN, sub-1 ms RTT.
- Scenario: 9-turn dialogue using 4-bit quantized Qwen1.5-0.5B-Chat model.
- Consistency Configuration: `max_retries = 3`, `backoff = 10 ms`.
Performance Metrics
- Median end-to-end response time
- LLM tokens-per-second throughput
- Inter-node synchronization traffic per turn
- Client-to-server request size
Key Results
| Metric | TX2 (raw) | TX2 (tok) | M2 (raw) | M2 (tok) | Δ tok vs raw |
|---|---|---|---|---|---|
| Median latency (ms) | 254.2 | 217.6 | 112.3 | 102.5 | –14.46 % / –8.75 % |
| TPS (tokens/sec) | 176 | 181 | 412 | 418 | +2.85 % / +1.41 % |
| Sync bandwidth (KB/turn) | 15.0 | 12.8 | 9.2 | 8.0 | –15 % / –13.3 % |
| Metric | Client-side | DisCEdge | Δ |
|---|---|---|---|
| Median latency (ms) | 232.4 | 219.0 | –5.93 % |
| Client→server request size (KB) | ≈45 | ≈4 | –90 % |
Across all workloads and hardware, tokenized storage consistently improves both latency and bandwidth consumption relative to raw-text or client-side context management.
6. Limitations and Future Work
Scalability and Multi-Tenancy
- Scaling limitations: KV layer contention emerges with high multi-tenant concurrency. LLM inference throughput sets the upper bound.
- Replica set size: For $R$ replicas, synchronization overhead grows as $O(R^2)$ under gossip-style dissemination or $O(R)$ under primary fan-out.
Fault Tolerance
- KV partitioning: Context Manager may either block (strong consistency) or proceed with potentially stale state (weak consistency, configurable).
- Future direction: Integration of quorum-based consistency for tunable CP/CA trade-offs.
Security and Privacy
- Tokenization: While integer tokens eliminate direct raw PII, token streams still encode user data.
- Future work: End-to-end token stream encryption and differential privacy protections are under consideration.
Enhancement Directions
- Summarization/snapshotting: Periodically condense long context histories prior to token storage to control token sequence growth.
- KV-cache integration: Direct manipulation of the LLM’s internal attention KV cache is proposed to bypass external token sequence maintenance.
- Predictive handovers: Proactive context replication to likely next edge nodes, informed by user mobility.
- Eviction policies: Exploration of hybrid TTL and LRU for efficient memory management.
DisCEdge exemplifies an efficient approach to distributed LLM context management at the edge, leveraging tokenized storage, client-driven consistency, and decentralized KV replication to prioritize low latency, bandwidth efficiency, and seamless user experience (Malekabbasi et al., 27 Nov 2025).