DisCEdge: Distributed Context for LLM
- DisCEdge is a distributed context management system for LLMs that tokenizes and replicates session context across geo-distributed edge nodes.
- It minimizes latency and bandwidth usage by appending only new tokens and ensuring strong session consistency through asynchronous, client-driven replication.
- Evaluations on commodity hardware reveal up to 14.46% latency reduction and 15% lower synchronization traffic, supporting privacy-aware, real-time LLM applications.
DisCEdge is a distributed context management system designed for efficiently serving LLM workloads at the edge. It addresses the challenges of maintaining conversational or session context for latency-sensitive and privacy-aware applications running on geo-distributed edge nodes, where standard LLM statelessness and naive client-side context management introduce excessive network latency and bandwidth overhead. DisCEdge implements tokenized, versioned storage and replication of user context, providing consistent, low-latency, and bandwidth-efficient LLM sessions across commodity hardware edge deployments (Malekabbasi et al., 27 Nov 2025).
1. System Architecture and Goals
DisCEdge is architected around a set of geo-distributed edge nodes, each comprising a Context Manager, a token-aware LLM inference engine (e.g., a modified [llama.cpp](https://www.emergentmind.com/topics/llama-cpp)), and a local replica of a geo-distributed key-value (KV) store such as FReD. Clients (mobile devices or edge compute devices) connect to their nearest edge node using geo-DNS or a registry, issuing completion requests that include user_id, session_id, and a monotonically increasing turn counter.
High-Level Goals
- Low-latency inference: Minimize client-perceived round-trip time for sessional LLM inference.
- Strong session consistency: Guarantee strict ordering and continuity of user context across roaming sessions.
- Bandwidth efficiency: Reduce network load on both client-to-edge and edge-to-edge communication.
- Transparent interface: Abstract distribution, presenting a single centralized LLM service facade to clients.
- Commodity deployment: Operate efficiently on edge hardware without specialized accelerators (demonstrated on Nvidia Jetson TX2 and Apple M2).
Component Structure
Each edge node consists of:
- Context Manager: Assigns and validates session parameters; maintains an in-memory, tokenized session state; enforces consistency semantics driven by the client turn counter.
- LLM Service: Receives concatenated token context and raw prompt; performs inference with minimal re-tokenization.
- KV Store Replica: Holds tokenized session context, supports peer-to-peer asynchronous replication, and enforces TTL-based eviction to reclaim resources.
2. Context Tokenization and Storage
DisCEdge transforms and stores user session context as integer token sequences rather than raw text, leading to efficiency gains in storage, transmission, and computation.
Tokenization Pipeline
- On session initiation (first turn), the Context Manager tokenizes the raw user input with the LLM's tokenizer and stores the result as the initial context $T_0$.
- On each subsequent turn $k$, only the new prompt is tokenized (yielding $p_k$), and the cumulative context is updated by simple concatenation: $T_k = T_{k-1} \parallel p_k$.
Data Structures
- KV Store: Keyed by the pair (user, session) and holding values $\{\, \text{version} \in \mathbb{N},\ \text{tokens} = [\text{int}] \,\}$ — a version counter and a sequence of token IDs.
- In-memory: The Context Manager caches the token sequence as a dynamic array keyed by session identifier.
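The versioned value and in-memory cache described above can be sketched as follows (a minimal illustration; the class and method names are assumptions, not taken from the DisCEdge codebase):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SessionContext:
    """KV-store value: a version counter plus a flat list of token IDs."""
    version: int = 0
    tokens: List[int] = field(default_factory=list)

class ContextCache:
    """In-memory cache keyed by (user_id, session_id)."""
    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], SessionContext] = {}

    def append_turn(self, user_id: str, session_id: str,
                    new_tokens: List[int]) -> SessionContext:
        # Append only the newly tokenized prompt: T_k = T_{k-1} || p_k
        ctx = self._store.setdefault((user_id, session_id), SessionContext())
        ctx.tokens.extend(new_tokens)
        ctx.version += 1
        return ctx
```

Because each turn only extends the token list and bumps the version, per-turn work is proportional to the new prompt, not the full history.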
Efficiency over Raw Text
- Compactness: Token sequences require approximately half the bytes of raw text, reducing inter-node synchronization traffic by up to 15%.
- Compute Reduction: Avoids repeated re-tokenization per turn, decreasing inference response latency by up to 14.46%.
- Append Efficiency: Context expansion per prompt is amortized, since only new tokens are appended.
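The compactness claim can be checked with a back-of-envelope calculation. Both figures below are assumptions for illustration: roughly 4 characters per token for English text, and 2-byte token IDs (a vocabulary under 65,536 entries):

```python
# Rough compactness estimate: tokenized context vs. raw UTF-8 text.
# Assumes ~4 ASCII characters per token and 2-byte token IDs.
chars = 1000                  # characters of raw ASCII text
bytes_raw = chars             # 1 byte per ASCII character
tokens = chars / 4            # ~4 characters per token
bytes_tok = tokens * 2        # 2 bytes per token ID
print(bytes_tok / bytes_raw)  # about half the bytes of raw text
```

Under these assumptions the token representation needs roughly half the bytes, which is consistent with the synchronization-traffic reduction reported in the evaluation.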
3. Distributed Context Replication Protocol
DisCEdge implements a distributed, client-driven replication protocol that guarantees session consistency semantics with minimal coordination.
Replication Overview
- Writes: Updates (new turns) are written to the local KV-replica and then asynchronously propagated to peer nodes across the same model’s keygroup.
- Reads: Upon request, the Context Manager ensures the local context version is at least as recent as the client’s expectation (turn counter). If stale, it backs off and retries, with a maximum retry limit.
- Consistency: Maintains read-your-writes and monotonic reads (via client-incremented turn counters).
Pseudocode and State Machine
The protocol at the Context Manager is as follows:
```
OnRequest(user, session, client_turn, raw_prompt):
    retries ← 0
    while local_version < client_turn and retries < R_max:
        sleep(backoff_ms)
        (v', T') ← KV.read((user, session))
        local_version ← v'
        T ← T'
        retries ← retries + 1
    if local_version < client_turn:
        return error("consistency failure")
    p ← tokenize(raw_prompt)
    tok_context ← T ∥ p
    response ← LLM.generate(tok_context)
    spawn AsyncUpdate(user, session, local_version + 1, p)
    return response

AsyncUpdate(user, session, new_ver, p):
    new_tokens ← read_local_context(user, session).T ∥ p
    KV.write((user, session) → (new_ver, new_tokens))
```
State transitions per session proceed from initialization, through serving of successive turns, to eventual eviction on TTL expiry.
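The retry-and-append protocol can be sketched in runnable Python. The KV store, tokenizer, and LLM below are toy stand-ins, and all names are illustrative rather than drawn from the DisCEdge implementation:

```python
import threading
import time
from typing import Dict, List, Tuple

class KVStore:
    """Toy stand-in for the geo-distributed KV replica."""
    def __init__(self) -> None:
        self._data: Dict[Tuple[str, str], Tuple[int, List[int]]] = {}

    def read(self, key: Tuple[str, str]) -> Tuple[int, List[int]]:
        return self._data.get(key, (0, []))

    def write(self, key: Tuple[str, str], value: Tuple[int, List[int]]) -> None:
        self._data[key] = value

class ContextManager:
    def __init__(self, kv: KVStore, r_max: int = 3, backoff_ms: int = 10):
        self.kv, self.r_max, self.backoff_ms = kv, r_max, backoff_ms

    def on_request(self, user: str, session: str,
                   client_turn: int, raw_prompt: str) -> str:
        key = (user, session)
        version, tokens = self.kv.read(key)
        retries = 0
        # Block until the local replica catches up to the client's turn.
        while version < client_turn and retries < self.r_max:
            time.sleep(self.backoff_ms / 1000)
            version, tokens = self.kv.read(key)
            retries += 1
        if version < client_turn:
            raise RuntimeError("consistency failure")
        prompt_tokens = self._tokenize(raw_prompt)
        context = tokens + prompt_tokens          # T || p
        response = self._generate(context)
        # Asynchronously persist the appended context under a new version.
        threading.Thread(target=self.kv.write,
                         args=(key, (version + 1, context))).start()
        return response

    def _tokenize(self, text: str) -> List[int]:
        return [ord(c) for c in text]             # toy tokenizer

    def _generate(self, context: List[int]) -> str:
        return f"<{len(context)} context tokens>"  # toy LLM
```

Note how the client-supplied turn counter alone drives the consistency check: the Context Manager never coordinates with peers on the read path, only retries against its local replica.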
Complexity Analysis
Let $n$ denote the total context length, $m$ the number of new tokens per turn, and $R$ the number of replicas:
- Write serialization: $O(m)$, since only the newly appended tokens are serialized.
- Network bandwidth per update: $O(m \cdot R)$.
- Replication latency: $O(\tau + \beta m)$, with $\tau$ the network RTT and $\beta$ a per-byte transfer factor.
Amortized per-turn costs grow only with number of new tokens, not total context length.
4. Context Operations: API, Caching, and Eviction
DisCEdge’s programming interface and memory management are optimized for interactive, large-scale, multi-session LLM use cases.
API Semantics
- StartSession: Optionally assigns and returns a new `user_id` and `session_id`.
- Query (turn $k$): Requires the identifiers, the prior `client_turn = k-1`, and a new prompt; the operation blocks until the local context is current.
- Session Update: Upon receiving the response, the client increments its turn counter.
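A minimal client-side sketch of this turn-counter discipline follows; the class and the `send` callback are hypothetical, standing in for the real network call to the nearest edge node:

```python
from typing import Callable

class DisCEdgeClient:
    """Illustrative client wrapper enforcing the monotonic turn counter."""
    def __init__(self, user_id: str, session_id: str) -> None:
        self.user_id, self.session_id = user_id, session_id
        self.turn = 0  # first Query is issued with client_turn = 0

    def query(self, send: Callable[[str, str, int, str], str],
              prompt: str) -> str:
        # `send` stands in for the completion request to the edge node.
        response = send(self.user_id, self.session_id, self.turn, prompt)
        self.turn += 1  # increment only after a successful response
        return response
```

Incrementing only after a successful response ensures the next request's `client_turn` matches the version the edge node committed, which is what gives read-your-writes semantics without server-side coordination.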
Caching and Session Eviction
- In-Memory Cache: Context Manager maintains a per-session hash-map of token sequences.
- KV Store TTL: Enforces expiration and eviction of stale session state.
- Anticipated Extensions: LRU cache policies and session idle timeouts are proposed for future versions, improving multi-tenant resource efficiency.
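The TTL eviction in place today and the LRU policy proposed as an extension can be combined in one sketch (names and structure are illustrative, not DisCEdge's code):

```python
import time
from collections import OrderedDict
from typing import List, Optional, Tuple

class SessionStore:
    """TTL expiry plus LRU capacity bound for per-session token state."""
    def __init__(self, ttl_s: float, max_sessions: int) -> None:
        self.ttl_s, self.max_sessions = ttl_s, max_sessions
        self._entries: "OrderedDict[str, Tuple[float, List[int]]]" = OrderedDict()

    def put(self, session_id: str, tokens: List[int]) -> None:
        self._entries[session_id] = (time.monotonic(), tokens)
        self._entries.move_to_end(session_id)
        while len(self._entries) > self.max_sessions:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, session_id: str) -> Optional[List[int]]:
        entry = self._entries.get(session_id)
        if entry is None:
            return None
        stamp, tokens = entry
        if time.monotonic() - stamp > self.ttl_s:
            del self._entries[session_id]      # TTL expired
            return None
        self._entries.move_to_end(session_id)  # refresh LRU position
        return tokens
```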
5. Quantitative Evaluation
DisCEdge was evaluated using an open-source prototype on heterogeneous, resource-constrained edge devices and clients.
Experimental Setup
- Edge Nodes: Nvidia Jetson TX2 (ARM), Apple M2 (ARM), both deploying llama.cpp.
- Client Device: Raspberry Pi 4 (ARM).
- Network: Local LAN, sub-1 ms RTT.
- Scenario: 9-turn dialogue using 4-bit quantized Qwen1.5-0.5B-Chat model.
- Consistency Configuration: `max_retries = 3`, `backoff = 10 ms`.
Performance Metrics
- Median end-to-end response time
- LLM tokens-per-second throughput
- Inter-node synchronization traffic per turn
- Client-to-server request size
Key Results
| Metric | TX2 (raw) | TX2 (tok) | M2 (raw) | M2 (tok) | Δ tok vs raw |
|---|---|---|---|---|---|
| Median latency (ms) | 254.2 | 217.6 | 112.3 | 102.5 | –14.46 % / –8.75 % |
| TPS (tokens/sec) | 176 | 181 | 412 | 418 | +2.85 % / +1.41 % |
| Sync bandwidth (KB/turn) | 15.0 | 12.8 | 9.2 | 8.0 | –15 % / –13.3 % |
| Metric | Client-side | DisCEdge | Δ |
|---|---|---|---|
| Median latency (ms) | 232.4 | 219.0 | –5.93 % |
| Client→server request size (KB) | ≈45 | ≈4 | –90 % |
Across all workloads and hardware, tokenized storage consistently improves both latency and bandwidth consumption relative to raw-text or client-side context management.
6. Limitations and Future Work
Scalability and Multi-Tenancy
- Scaling limitations: KV layer contention emerges with high multi-tenant concurrency. LLM inference throughput sets the upper bound.
- Replica set size: For $R$ replicas, synchronization overhead grows as $O(R^2)$ under gossip-style dissemination or $O(R)$ under primary fan-out.
Fault Tolerance
- KV partitioning: Context Manager may either block (strong consistency) or proceed with potentially stale state (weak consistency, configurable).
- Future direction: Integration of quorum-based consistency for tunable CP/CA trade-offs.
Security and Privacy
- Tokenization: While integer tokens eliminate direct raw PII, token streams still encode user data.
- Future work: End-to-end token stream encryption and differential privacy protections are under consideration.
Enhancement Directions
- Summarization/snapshotting: Periodically condense long context histories prior to token storage to control token sequence growth.
- KV-cache integration: Direct manipulation of the LLM’s internal attention KV cache is proposed to bypass external token sequence maintenance.
- Predictive handovers: Proactive context replication to likely next edge nodes, informed by user mobility.
- Eviction policies: Exploration of hybrid TTL and LRU for efficient memory management.
DisCEdge exemplifies an efficient approach to distributed LLM context management at the edge, leveraging tokenized storage, client-driven consistency, and decentralized KV replication to prioritize low latency, bandwidth efficiency, and seamless user experience (Malekabbasi et al., 27 Nov 2025).