AcceLLM: Distributed LLM Inference
- AcceLLM is a distributed inference framework for LLMs that exploits layerwise KV-cache duplication to enable dynamic load balancing and reduce latency.
- It organizes GPUs into tightly paired instances with prefill and decode roles, ensuring continuous utilization and efficient resource management.
- Empirical evaluations show up to 30% faster end-to-end job completion and near-zero idle time under mixed workloads compared to traditional systems.
AcceLLM is a distributed inference framework for LLMs that exploits redundancy, specifically layerwise duplication of KV-caches across pairs of accelerator instances, to enable dynamic load balancing, reduce latency, and raise resource utilization for high-throughput LLM serving on multi-GPU or multi-accelerator clusters. Its core idea is to organize compute instances into tightly coupled pairs that maintain redundant state and switch dynamically between "prefill" (prompt processing) and "decode" (autoregressive generation) roles, mitigating the bottlenecks of both monolithic and strictly disaggregated inference schemes (Bournias et al., 2024). The reported result is up to 30% faster end-to-end job completion and better per-instance efficiency than established baselines.
1. Architectural Organization and Operational Model
AcceLLM’s system consists of accelerator "instances" (typically GPUs) arranged in pairs, each running an uncompressed copy of the LLM weights. A central Scheduling Manager routes inference requests, tracks the memory available on each instance for active KV-caches (including redundant copies), and dynamically designates each instance as prefill or decode according to system load and pending requests.
In each pair, one instance operates in prefill mode (processing new input prompts layer by layer), while the other handles decoding (token generation for already-prefilled sequences). As prefill progresses, each layer’s KV-cache output is written locally and also streamed to the paired decode instance. If no new prompts are waiting, both instances in a pair switch to decode mode and split the outstanding workload evenly. Because each instance holds a full set of redundant cache shards, requests can be moved between the two without copying any cache, which enables flexible, near-perfect load balancing.
Throughout decoding, each new KV-cache fragment produced is incrementally streamed to the paired instance, thereby keeping redundant caches synchronized and ensuring that either member has up-to-date state to assume any role as required.
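The role switching described above can be sketched in a few lines. This is a minimal illustration, not the paper's scheduler: the `Role`, `Instance`, and `assign_pair_roles` names are hypothetical, and the "split in half" rule is an assumed stand-in for whatever balancing policy the real system uses.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    PREFILL = "prefill"
    DECODE = "decode"


@dataclass
class Instance:
    name: str
    role: Role = Role.DECODE
    decode_batch: list = field(default_factory=list)  # ids of requests being decoded


def assign_pair_roles(a: Instance, b: Instance, pending_prompts: int) -> None:
    """Set the roles of one pair (hypothetical policy).

    With prompts waiting, `a` prefills and `b` decodes every request.
    With none waiting, both decode and share the requests evenly, which
    needs no cache copy because each side already holds every KV-cache.
    """
    requests = a.decode_batch + b.decode_batch
    if pending_prompts > 0:
        a.role, b.role = Role.PREFILL, Role.DECODE
        a.decode_batch, b.decode_batch = [], requests
    else:
        a.role = b.role = Role.DECODE
        half = (len(requests) + 1) // 2
        a.decode_batch, b.decode_batch = requests[:half], requests[half:]
```

Moving a request between `decode_batch` lists is only a bookkeeping change here, which mirrors the point of the redundant caches: reassignment does not trigger a transfer.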
2. Mathematical Optimization and Redundancy Model
The fundamental optimization in AcceLLM is to minimize job completion time (JCT) and per-token delay (TBT) while respecting hardware memory limits. The scheduling and redundancy model is formalized as follows:
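- Let $i \in \{1, \dots, n\}$ index the accelerator instances.
- Each request $r$ at time $t$ has KV-cache size $s_r(t)$.
- $M_i$ denotes the high-bandwidth memory of instance $i$.
- Indicator variable $x_{i,r} \in \{0, 1\}$ specifies if instance $i$ holds a copy of $r$'s cache.
- Instance roles $\rho_i \in \{\text{prefill}, \text{decode}\}$.
Subject to:
$$\sum_{r} x_{i,r}\, s_r(t) \le M_i \quad \forall i, \qquad \sum_{i} x_{i,r} \ge 2 \quad \forall r.$$
The first constraint keeps each instance within its memory limit. The second enforces that, for each request, at least two copies of its cache are distributed across the system (primary and redundant).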
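As a concrete reading of the model, the sketch below checks a proposed cache placement against the two requirements stated in this section: each instance stays within its memory, and every request has at least two copies. The function name, argument layout, and units (GB) are assumptions made for illustration.

```python
def check_placement(cache_gb, hbm_gb, holds):
    """Return a list of violated constraints (empty if the placement is valid).

    cache_gb: {request: KV-cache size in GB}
    hbm_gb:   {instance: memory available for KV-caches in GB}
    holds:    set of (instance, request) pairs, playing the role of the
              indicator variable (a pair is present when a copy is held)
    """
    violations = []
    # Memory limit: the caches held by an instance must fit in its memory.
    for inst, capacity in hbm_gb.items():
        used = sum(cache_gb[r] for (i, r) in holds if i == inst)
        if used > capacity:
            violations.append(f"{inst}: {used} GB used > {capacity} GB")
    # Redundancy: every request needs a primary and a redundant copy.
    for req in cache_gb:
        copies = sum(1 for (i, r) in holds if r == req)
        if copies < 2:
            violations.append(f"{req}: only {copies} copy")
    return violations
```

For example, with two 6 GB instances and requests of 2 GB and 3 GB held on both, the function returns an empty list; dropping one copy of either request produces a redundancy violation.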
Job allocation prioritizes the instance pair with the most free memory. During prefill, layerwise KV fragments are sent to the decode peer, ensuring early decoding can commence. As decoding proceeds, incremental cache extensions are kept redundant, so either instance can continue or share the generation workload at any stage.
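The pair-selection rule can be illustrated as follows. The source only says that the pair with the most free memory is preferred; ranking a pair by its tighter member, and rejecting pairs where either member cannot hold a copy, are assumptions of this sketch, as are the names `pick_pair`, `free_gb`, and `needed_gb`.

```python
def pick_pair(pairs, free_gb, needed_gb):
    """Return the pair with the most free memory that can hold the cache twice.

    pairs:     list of (instance_a, instance_b) tuples
    free_gb:   {instance: free memory in GB}
    needed_gb: estimated KV-cache size of the new request in GB
    Returns None when no pair has room on both members.
    """
    def headroom(pair):
        # Both members store a copy, so the tighter one is the limit.
        return min(free_gb[pair[0]], free_gb[pair[1]])

    candidates = [p for p in pairs if headroom(p) >= needed_gb]
    if not candidates:
        return None
    return max(candidates, key=headroom)
```

Returning `None` leaves the decision (queueing, rejecting, or evicting) to the caller, since the source does not describe what happens when no pair has room.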
3. Scheduler, Load Balancing, and Algorithmic Control
The Scheduling Manager’s core loop keeps instances busy by continually flipping their roles to match system demand:
- On new request: select optimal pair, assign prefill/decode roles, and replicate data.
- During operation: regularly check for available compute, switch idle prefill-capable instances to decoding, and resize decode batches for even load between peers.
- All role transitions and KV-cache transfers are incremental (layerwise), avoiding monolithic whole-cache copying and the associated latency spikes.
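The "resize decode batches for even load" step can be sketched with a simple greedy partition. Measuring load by the number of tokens already in each request's KV-cache is an assumption of this example (the source does not say how load is measured), and `rebalance` is a hypothetical helper name.

```python
def rebalance(context_len):
    """Split requests between two peers so their total context lengths are close.

    context_len: {request: tokens currently in its KV-cache}
    Greedy heuristic: take requests from longest to shortest and give each
    one to the peer that currently carries less load.
    Returns (batches, load) for the two peers.
    """
    batches = ([], [])
    load = [0, 0]
    ordered = sorted(context_len.items(), key=lambda kv: kv[1], reverse=True)
    for req, tokens in ordered:
        side = 0 if load[0] <= load[1] else 1
        batches[side].append(req)
        load[side] += tokens
    return batches, load
```

The greedy rule is not optimal in general, but it is cheap enough to run on every scheduling tick, and with both peers holding all caches any resulting assignment is valid.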
Within each instance, inference progresses as:
- Prefill: For each layer, compute the layer's key/value tensors (KV_shard) and immediately write them to the local cache and stream them to the paired instance.
- Decode: For each generated token, run all layers in sequence, produce incremental KV_shards, and update the peer’s redundant copy after each attention block.
This cooperative, fine-grained redundancy and scheduling enables the system to exploit every available compute cycle, even under highly heterogeneous request regimes.
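A toy simulation makes the per-layer mirroring concrete. It models only the bookkeeping: `fake_kv` stands in for real key/value tensors, `NUM_LAYERS` is a toy value, and the peer update is a plain list append rather than a network transfer. All names are hypothetical.

```python
NUM_LAYERS = 4  # toy value; real models have many more layers


class KVCache:
    """Per-request cache: one list of per-token KV entries for each layer."""

    def __init__(self):
        self.layers = [[] for _ in range(NUM_LAYERS)]


def fake_kv(layer, token):
    # Stand-in for the key/value tensors of one token at one layer.
    return (layer, token)


def prefill(prompt, local, peer):
    """Process the prompt layer by layer, mirroring each layer's shard."""
    for layer in range(NUM_LAYERS):
        shard = [fake_kv(layer, tok) for tok in prompt]
        local.layers[layer].extend(shard)
        # In a real system this send would overlap with the next layer's compute.
        peer.layers[layer].extend(shard)


def decode_step(token, local, peer):
    """Run one generated token through all layers, mirroring each increment."""
    for layer in range(NUM_LAYERS):
        kv = fake_kv(layer, token)
        local.layers[layer].append(kv)
        peer.layers[layer].append(kv)
```

After `prefill(prompt, a, b)` followed by `decode_step(tok, b, a)`, the two caches are identical, so either instance could generate the next token.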
4. Empirical Performance and Metrics
Evaluations were performed on LLaMA-2 70B on simulated Nvidia H100 SXM5 and Huawei Ascend 910B2 platforms, with varying cluster sizes (4–16 instances) and request types (light, mixed, heavy).
Key performance metrics:
- TTFT (Time-To-First-Token): Latency until first generated token.
- TBT (Time-Between-Tokens): Mean token generation interval.
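- JCT (Job Completion Time): Total time from request arrival to the last generated token, roughly $\mathrm{TTFT} + N \cdot \mathrm{TBT}$, where $N$ is the request length.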
- Cost Efficiency: Tokens generated / (number of instances × runtime).
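The metrics above can be computed from per-token timestamps. The sketch below assumes timestamps in seconds and at least one generated token per request; the function names are illustrative and do not come from the paper's evaluation code.

```python
def request_metrics(arrival, token_times):
    """Compute (TTFT, mean TBT, JCT) for one request, all in seconds.

    arrival:     time the request entered the system
    token_times: emission time of each generated token, in order
    """
    ttft = token_times[0] - arrival
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    tbt = sum(gaps) / len(gaps) if gaps else 0.0
    jct = token_times[-1] - arrival
    return ttft, tbt, jct


def cost_efficiency(total_tokens, num_instances, runtime_s):
    """Tokens generated per instance per second."""
    return total_tokens / (num_instances * runtime_s)
```

With these definitions, JCT equals TTFT plus the mean TBT times the number of inter-token gaps, so a reduction in either term shows up directly in job completion time.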
Summary of results:
| Metric | Splitwise | vLLM | AcceLLM |
|---|---|---|---|
| Cost efficiency (mixed, 8 H100) | 0.22 tok/i/s | 0.24 tok/i/s | 0.30 tok/i/s (+30%) |
| TBT improvement | baseline | baseline | −15% to −20% |
| Prefill idle time (light workload) | up to 90% | variable | ≈0% |
| Additional memory | N/A | N/A | +1–5 GB / inst |
| Network overhead | similar | similar | ≈ Splitwise |
Under mixed and heavy workloads, AcceLLM consistently improves JCT by 25–30% compared to disaggregated and batched baselines. When prompt arrivals are sparse, prior systems leave prefill GPUs idle up to 90% of the time, while AcceLLM keeps idle time near zero. Network and memory overheads remain contained: redundant KV-caches cost only 1–5 GB extra per instance, and interconnect saturation does not increase, since only small per-layer updates are exchanged rather than full-cache copies (Bournias et al., 2024).
5. Comparison With Prior and Related Systems
Traditional batched systems, such as vLLM and FasterTransformer, combine prefill and decode on the same device, leading to significant latency spikes for ongoing decode requests when new prompts are introduced. Disaggregated models (e.g., Splitwise, TetriInfer, DistServe) statically allocate some devices to prefill, some to decode, and perform monolithic inter-GPU cache copies when transitions are needed, leading to idle periods and transfer bottlenecks.
AcceLLM’s distinctive approach is as follows:
- Redundant cache shards sidestep transfer latency spikes.
- Every GPU is kept either decoding or prefilling; idle time is avoided.
- Decode batches can be arbitrarily balanced between the paired instances, eliminating straggler effects and maximizing throughput.
A plausible implication is that AcceLLM’s design could provide a template for future extensions involving higher degrees of redundancy, where $k$-way cache copies offer not only load balancing but also improved fault tolerance or elasticity across heterogeneous clusters.
6. Limitations and Trade-Offs
The principal constraint of AcceLLM’s approach is the increased per-request memory usage—each request's cache must occupy space on two GPUs. This implies a possible scaling ceiling when serving extremely large models or ultra-long contexts, particularly if device memory is a hard bottleneck. The benefits also presuppose a high-speed interconnect (such as NVLink or CCIX); on slower PCIe fabrics, the performance gains may be attenuated by increased communication latency.
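A rough calculation shows how the doubled footprint scales with context length. The defaults below assume the publicly released LLaMA-2 70B shape (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) and 16-bit cache values; these figures are assumptions of this sketch, not numbers taken from the AcceLLM evaluation.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes of KV-cache for one request: keys and values at every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


single = kv_cache_bytes(4096)
print(single / GIB)      # one copy of a 4096-token cache: 1.25
print(2 * single / GIB)  # primary plus redundant copy: 2.5
```

Under these assumptions each token costs 320 KiB of cache, so the redundant copy of a single 4096-token request adds 1.25 GiB on the paired instance, and the overhead grows linearly with context length and batch size.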
Other trade-offs include more complex scheduler logic to handle dynamic role-flipping and data placement, as well as a modest bandwidth overhead from additional per-layer cache streaming. Instance pairing is currently static; clusters with an odd number of instances or with heterogeneous accelerators are not inherently supported.
7. Future Directions and Potential Extensions
Several directions are proposed for future work on AcceLLM:
- Extending to $k$-way redundancy, optimizing the trade-off between memory use and load balancing, as well as incorporating fault-tolerant or heterogeneous scheduling.
- Dynamic adaptation of redundancy based on memory pressure, potentially defaulting to single-copy cache with fallback to full-cache transfer only under tight constraints.
- Integration of workload predictors to exploit request patterns and pre-warm KV-caches.
- Deployment in multi-tenant or hybrid CPU/GPU data centers, where cache residency and migration strategies could be further optimized.
- Queue-aware admission control to enforce quality-of-service constraints under bursty or adversarial request patterns.
These future directions align with the overarching objective of maintaining near-ideal accelerator utilization and low-latency distributed LLM inference in increasingly diverse and large-scale deployment environments (Bournias et al., 2024).