DisCEdge: Distributed Context for LLM

Updated 30 December 2025
  • DisCEdge is a distributed context management system for LLMs that tokenizes and replicates session context across geo-distributed edge nodes.
  • It minimizes latency and bandwidth usage by appending only new tokens and ensuring strong session consistency through asynchronous, client-driven replication.
  • Evaluations on commodity hardware reveal up to 14.46% latency reduction and 15% lower synchronization traffic, supporting privacy-aware, real-time LLM applications.

DisCEdge is a distributed context management system designed for efficiently serving LLM workloads at the edge. It addresses the challenges of maintaining conversational or session context for latency-sensitive and privacy-aware applications running on geo-distributed edge nodes, where standard LLM statelessness and naive client-side context management introduce excessive network latency and bandwidth overhead. DisCEdge implements tokenized, versioned storage and replication of user context, providing consistent, low-latency, and bandwidth-efficient LLM sessions across commodity hardware edge deployments (Malekabbasi et al., 27 Nov 2025).

1. System Architecture and Goals

DisCEdge is architected around a set of geo-distributed edge nodes, each comprising a Context Manager, a token-aware LLM inference engine (e.g., a modified [llama.cpp](https://www.emergentmind.com/topics/llama-cpp)), and a local replica of a geo-distributed key-value (KV) store such as FReD. Clients (mobile devices or edge compute devices) connect to their nearest edge node using geo-DNS or a registry, issuing completion requests that include user_id, session_id, and a monotonically increasing turn counter.
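The per-request metadata described above can be sketched as a simple record. The field names mirror the identifiers in the text; the class itself and the example values are illustrative assumptions, not part of the paper's API.

```python
from dataclasses import dataclass

@dataclass
class CompletionRequest:
    user_id: str
    session_id: str
    turn: int        # monotonically increasing per session
    raw_prompt: str  # only the new user input, not the full history

# A client at turn 3 sends just the new prompt plus its identifiers.
req = CompletionRequest("u-42", "s-7", turn=3, raw_prompt="And in winter?")
```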

High-Level Goals

  • Low-latency inference: Minimize client-perceived round-trip time for sessional LLM inference.
  • Strong session consistency: Guarantee strict ordering and continuity of user context across roaming sessions.
  • Bandwidth efficiency: Reduce network load on both client-to-edge and edge-to-edge communication.
  • Transparent interface: Abstract distribution, presenting a single centralized LLM service facade to clients.
  • Commodity deployment: Operate efficiently on edge hardware without specialized accelerators (demonstrated on Nvidia Jetson TX2 and Apple M2).

Component Structure

Each edge node consists of:

  • Context Manager: Assigns and validates session parameters; maintains an in-memory, tokenized session state; enforces consistency semantics driven by the client turn counter.
  • LLM Service: Receives concatenated token context and raw prompt; performs inference with minimal re-tokenization.
  • KV Store Replica: Holds tokenized session context, supports peer-to-peer asynchronous replication, and enforces TTL-based eviction to reclaim resources.

2. Context Tokenization and Storage

DisCEdge transforms and stores user session context as integer token sequences rather than raw text, leading to efficiency gains in storage, transmission, and computation.

Tokenization Pipeline

  1. On session initiation (first turn), the Context Manager tokenizes the raw user input using the LLM’s tokenizer, storing the result as initial context C_0.
  2. On subsequent turns, only the new prompt P_k is tokenized (yielding p_k), and the cumulative context is updated via simple concatenation: C_k = C_{k−1} ∥ p_k.
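The two-step pipeline above can be sketched as follows; the toy word-hashing tokenizer stands in for the LLM’s real tokenizer and is purely illustrative.

```python
def toy_tokenize(text: str) -> list[int]:
    # Stand-in for the LLM tokenizer: map each word to a fake integer id.
    return [hash(w) % 50000 for w in text.split()]

# Turn 0: tokenize the initial input to form C_0.
context = toy_tokenize("Tell me about edge computing")

# Turn k: tokenize only the new prompt p_k and append (C_k = C_{k-1} || p_k).
new_prompt = "What about its latency benefits?"
context = context + toy_tokenize(new_prompt)
```

Because each turn only appends p_k, the full history is never re-tokenized.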

Data Structures

  • KV Store: Keyed by (user_id, session_id) and holding values { version ∈ ℕ, tokens: [int] }—a version counter and a sequence of token ids.
  • In-memory: The Context Manager caches “tokens” as a dynamic array keyed by session identifier.
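The stored record can be sketched as a small dataclass; the in-memory dict standing in for the KV replica and the concrete token values are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class SessionRecord:
    version: int = 0                                  # turn/version counter
    tokens: list[int] = field(default_factory=list)   # tokenized context

# KV store keyed by (user_id, session_id); the Context Manager's
# in-memory cache mirrors the same structure per session.
kv: dict[tuple[str, str], SessionRecord] = {}
kv[("u-42", "s-7")] = SessionRecord(version=1, tokens=[1012, 2044, 87])
```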

Efficiency over Raw Text

  • Compactness: Token sequences require approximately half the bytes of raw text, reducing inter-node synchronization traffic by up to 15%.
  • Compute Reduction: Avoids repeated O(|C|) re-tokenization of the full context per turn, decreasing inference response latency by up to 14.46%.
  • Append Efficiency: Context expansion per prompt is O(1) amortized, since only new tokens are appended.
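A back-of-envelope illustration of the compactness claim. The 4-characters-per-token and 2-byte-token-id figures are assumed values (roughly typical for English subword tokenizers and small vocabularies), not numbers from the paper.

```python
avg_chars_per_token = 4   # assumed: typical English subword tokenizer ratio
bytes_per_token_id = 2    # assumed: ids fit in 16 bits for a small vocabulary

n_tokens = 1000
raw_text_bytes = n_tokens * avg_chars_per_token      # raw UTF-8 history
token_stream_bytes = n_tokens * bytes_per_token_id   # integer token stream

# Under these assumptions the token stream is half the size of raw text.
savings = 1 - token_stream_bytes / raw_text_bytes
```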

3. Distributed Context Replication Protocol

DisCEdge implements a distributed, client-driven replication protocol that guarantees session consistency semantics with minimal coordination.

Replication Overview

  • Writes: Updates (new turns) are written to the local KV-replica and then asynchronously propagated to peer nodes across the same model’s keygroup.
  • Reads: Upon request, the Context Manager ensures the local context version is at least as recent as the client’s expectation (turn counter). If stale, it backs off and retries, with a maximum retry limit.
  • Consistency: Maintains read-your-writes and monotonic reads (via client-incremented turn counters).

Pseudocode and State Machine

The protocol at the Context Manager is as follows:

OnRequest(user, session, client_turn, raw_prompt):
    retries ← 0
    while local_version < client_turn and retries < R_max:
        sleep(backoff_ms)
        (v, T') ← KV.read((user, session))
        local_version ← v
        T ← T'
        retries++
    if local_version < client_turn:
        return error("consistency failure")
    p ← tokenize(raw_prompt)
    tok_context ← T ∥ p
    response ← LLM.generate(tok_context)
    spawn AsyncUpdate(user, session, local_version + 1, p)
    return response

AsyncUpdate(user, session, new_ver, p):
    new_tokens ← read_local_context(user, session).T ∥ p
    KV.write((user, session) ↦ (new_ver, new_tokens))

State transitions per session proceed from initialization (S_INIT) to serving (S_SERVING), with eventual eviction (S_EVICTED) on TTL expiry.
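A minimal single-process Python rendering of the Context Manager protocol. The in-memory dict stands in for the FReD replica, the toy tokenizer and generator are placeholders, and the update is applied synchronously here for clarity where DisCEdge spawns it asynchronously.

```python
import time

R_MAX, BACKOFF_S = 3, 0.01  # retry limit and backoff (assumed values)

kv = {}  # (user, session) -> (version, tokens); stand-in for the KV replica

def tokenize(text):          # toy tokenizer (assumption)
    return [hash(w) % 50000 for w in text.split()]

def llm_generate(tokens):    # stand-in for the inference engine
    return f"<{len(tokens)} tokens in context>"

def on_request(user, session, client_turn, raw_prompt):
    key = (user, session)
    version, ctx = kv.get(key, (0, []))
    retries = 0
    # Block with backoff until the local replica catches up to the client.
    while version < client_turn and retries < R_MAX:
        time.sleep(BACKOFF_S)
        version, ctx = kv.get(key, (0, []))
        retries += 1
    if version < client_turn:
        raise RuntimeError("consistency failure")
    p = tokenize(raw_prompt)
    response = llm_generate(ctx + p)
    # Synchronous write here; DisCEdge performs this as AsyncUpdate.
    kv[key] = (version + 1, ctx + p)
    return response
```

A two-turn session then looks like `on_request("u", "s", 0, "hello there")` followed by `on_request("u", "s", 1, "again")`, with the client incrementing its turn after each reply.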

Complexity Analysis

Let L = |T| (context length), Δ = |p| (new tokens), R (number of replicas):

  • Write serialization: O(L + Δ).
  • Network bandwidth per update: B_sync = (R − 1) × s_token × Δ.
  • Replication latency: T_sync(L, Δ, R) ≈ α + β · s_token · (L + Δ) · log R (with α = network RTT, β = per-byte factor).

Amortized per-turn costs grow only with number of new tokens, not total context length.
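Plugging illustrative numbers into the bandwidth formula above (the replica count, token size, and per-turn delta are all assumed values):

```python
R = 3         # replicas in the keygroup (assumed)
s_token = 2   # bytes per token id (assumed)
delta = 50    # new tokens produced this turn (assumed)

# Only the delta is shipped to the R-1 peers, so per-update traffic
# is independent of the total context length L.
b_sync = (R - 1) * s_token * delta   # 200 bytes
```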

4. Context Operations: API, Caching, and Eviction

DisCEdge’s programming interface and memory management are optimized for interactive, large-scale, multi-session LLM use cases.

API Semantics

  • StartSession: Optionally assigns and returns new user_id and session_id.
  • Query (turn k): Requires identifiers, client_turn = k−1 from the prior turn, and a new prompt. The operation blocks until the local context is current.
  • Session Update: Upon response, the client increments its turn.
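A hypothetical client-side flow following the semantics above; the class, method names, and placeholder reply are illustrative assumptions, since the paper does not specify a concrete client SDK.

```python
class DisCEdgeClient:
    """Illustrative client wrapper; real transport details are not specified."""

    def __init__(self):
        self.user_id, self.session_id, self.turn = None, None, 0

    def start_session(self):
        # StartSession: the edge node would assign these identifiers;
        # they are hard-coded here purely for illustration.
        self.user_id, self.session_id, self.turn = "u-1", "s-1", 0

    def query(self, prompt: str) -> str:
        # Query sends identifiers plus the current client_turn and blocks
        # until the edge node's local context is current.
        reply = f"reply@turn{self.turn}"   # placeholder for the edge response
        self.turn += 1                     # increment only after a response
        return reply
```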

Caching and Session Eviction

  • In-Memory Cache: Context Manager maintains a per-session hash-map of token sequences.
  • KV Store TTL: Enforces expiration and eviction of stale session state.
  • Anticipated Extensions: LRU cache policies and session idle timeouts are proposed for future versions, improving multi-tenant resource efficiency.
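TTL-based eviction as described above can be sketched like so; the 30-minute TTL is an assumed value, and the dict-based cache is a stand-in for the Context Manager’s per-session map.

```python
import time

TTL_S = 30 * 60   # assumed session TTL in seconds

cache = {}        # session_id -> (last_access_ts, tokens)

def touch(session_id, tokens):
    # Record the session's tokens and refresh its last-access timestamp.
    cache[session_id] = (time.monotonic(), tokens)

def evict_expired(now=None):
    # Drop any session whose last access is older than the TTL.
    now = now if now is not None else time.monotonic()
    for sid, (ts, _) in list(cache.items()):
        if now - ts > TTL_S:
            del cache[sid]   # reclaim memory for the stale session
```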

5. Quantitative Evaluation

DisCEdge was evaluated using an open-source prototype on heterogeneous, resource-constrained edge devices and clients.

Experimental Setup

  • Edge Nodes: Nvidia Jetson TX2 (ARM), Apple M2 (ARM), both deploying llama.cpp.
  • Client Device: Raspberry Pi 4 (ARM).
  • Network: Local LAN, sub-1 ms RTT.
  • Scenario: 9-turn dialogue using 4-bit quantized Qwen1.5-0.5B-Chat model.
  • Consistency Configuration: max_retries = 3, backoff = 10 ms.

Performance Metrics

  • Median end-to-end response time
  • LLM tokens-per-second throughput
  • Inter-node synchronization traffic per turn
  • Client-to-server request size

Key Results

| Metric | TX2 (raw) | TX2 (tok) | M2 (raw) | M2 (tok) | Δ tok vs raw (TX2 / M2) |
|---|---|---|---|---|---|
| Median latency (ms) | 254.2 | 217.6 | 112.3 | 102.5 | –14.46 % / –8.75 % |
| TPS (tokens/sec) | 176 | 181 | 412 | 418 | +2.85 % / +1.41 % |
| Sync bandwidth (KB/turn) | 15.0 | 12.8 | 9.2 | 8.0 | –15 % / –13.3 % |

| Metric (client side) | Client-side | DisCEdge | Δ |
|---|---|---|---|
| Median latency (ms) | 232.4 | 219.0 | –5.93 % |
| Client→server request size (KB) | ~45 | ≈4 | –90 % |

Across all workloads and hardware, tokenized storage consistently improves both latency and bandwidth consumption relative to raw-text or client-side context management.

6. Limitations and Future Work

Scalability and Multi-Tenancy

  • Scaling limitations: KV layer contention emerges with high multi-tenant concurrency. LLM inference throughput sets the upper bound.
  • Replica set size: Synchronization overhead grows as O(log R) (gossip) or O(R) (primary-fanout).
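Contrasting the two propagation strategies with an assumed replica count:

```python
from math import ceil, log2

def gossip_rounds(r: int) -> int:
    # Epidemic dissemination reaches all replicas in ~log2(R) rounds.
    return ceil(log2(r))

def fanout_messages(r: int) -> int:
    # Primary-fanout: the writer sends R-1 direct messages per update.
    return r - 1

# For R = 16 replicas: 4 gossip rounds vs 15 direct messages.
rounds, msgs = gossip_rounds(16), fanout_messages(16)
```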

Fault Tolerance

  • KV partitioning: Context Manager may either block (strong consistency) or proceed with potentially stale state (weak consistency, configurable).
  • Future direction: Integration of quorum-based consistency for tunable CP/CA trade-offs.

Security and Privacy

  • Tokenization: While integer tokens eliminate direct raw PII, token streams still encode user data.
  • Future work: End-to-end token stream encryption and differential privacy protections are under consideration.

Enhancement Directions

  • Summarization/snapshotting: Periodically condense long context histories prior to token storage to control token sequence growth.
  • KV-cache integration: Direct manipulation of the LLM’s internal attention KV cache is proposed to bypass external token sequence maintenance.
  • Predictive handovers: Proactive context replication to likely next edge nodes, informed by user mobility.
  • Eviction policies: Exploration of hybrid TTL and LRU for efficient memory management.

DisCEdge exemplifies an efficient approach to distributed LLM context management at the edge, leveraging tokenized storage, client-driven consistency, and decentralized KV replication to prioritize low latency, bandwidth efficiency, and seamless user experience (Malekabbasi et al., 27 Nov 2025).
