MoSKA Architecture: Modular Streaming for LLMs
- MoSKA Architecture is a modular disaggregated LLM serving design that separates inference into prompt-processing and token-generation stages to mitigate bottlenecks.
- It employs microbatch KV-cache swapping and streaming state replication to optimize memory usage and enhance resilience against hardware failures.
- Empirical results show doubled throughput and reduced latency, evidencing practical benefits for large-scale, high-availability AI serving.
MoSKA ("Modular Streaming Key-Value Architecture" Editor's term) refers to a class of disaggregated LLM serving designs that address throughput and fault-tolerance bottlenecks in pipeline-parallel and cluster-scale inference. MoSKA architectures decompose Transformer model inference into modular prompt-processing (“prefill”) and autoregressive token-generation (“decode”) stages, enabling independent pipeline balancing, memory optimization, and resilience to hardware failures. The approach centers on explicit prompt-token disaggregation, fine-grained GPU memory management via microbatch KV cache swapping, and failover through state streaming and replication.
1. Architectural Decomposition and Data Flow
A prototypical MoSKA system adopts a controller–worker arrangement. Clients submit prompts to a centralized Controller, which dispatches requests to Prompt workers (“P-workers”), each responsible for executing the initial portion of the Transformer—typically up to the first output token. Upon completion, the P-worker generates a partial KV-cache for the prompt and streams it (via primitives such as stream_out, scatter, flush) over high-speed network or PCIe to a Token worker (“T-worker”). The T-worker receives the prompt-state (“KV-cache”), integrates it into local accelerator memory (stream_in, gather, fetch), and resumes the inference workload by generating subsequent tokens autoregressively.
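To make this handoff concrete, the following minimal Python sketch models the P-worker to T-worker KV-cache transfer over an in-process queue standing in for the network or PCIe link. The class names, the EOS flush marker, and the stubbed prefill are illustrative assumptions; only the primitive names (stream_out/scatter/flush, stream_in/gather/fetch) come from the description above.

```python
# Minimal sketch of the P-worker -> T-worker KV-cache handoff.
# All names (PWorker, TWorker, KVCache, the EOS marker) are illustrative
# stand-ins for the stream_out/scatter/flush and stream_in/gather/fetch
# primitives described above.
from dataclasses import dataclass
from queue import Queue

@dataclass
class KVCache:
    request_id: int
    layer_slices: list          # one (K, V) slice per Transformer layer
    prompt_len: int

class PWorker:
    """Runs the prefill pass and streams the resulting KV-cache downstream."""
    def __init__(self, link: Queue):
        self.link = link        # stands in for the NIC / PCIe stream channel

    def prefill(self, request_id: int, prompt_tokens: list) -> None:
        # Stubbed prefill: a real system runs the forward pass over the
        # prompt here, producing per-layer K/V tensors.
        kv = KVCache(request_id,
                     layer_slices=[f"kv_layer_{i}" for i in range(4)],
                     prompt_len=len(prompt_tokens))
        # stream_out / scatter: push per-layer slices as they finish so the
        # transfer overlaps with the remaining prefill compute.
        for slice_ in kv.layer_slices:
            self.link.put((request_id, slice_))
        self.link.put((request_id, "EOS", kv.prompt_len))   # flush marker

class TWorker:
    """Receives the prompt KV-cache and continues autoregressive decode."""
    def __init__(self, link: Queue):
        self.link = link
        self.kv_store = {}      # request_id -> list of received layer slices

    def stream_in(self) -> int:
        # gather / fetch: accumulate slices until the flush marker arrives.
        while True:
            msg = self.link.get()
            rid, payload = msg[0], msg[1]
            if payload == "EOS":
                return rid      # prompt state complete; decode can start
            self.kv_store.setdefault(rid, []).append(payload)

# Usage: one request flowing through the two stages.
link = Queue()
PWorker(link).prefill(request_id=7, prompt_tokens=[1, 2, 3])
print(f"request {TWorker(link).stream_in()} ready for decode")
```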
Within each T-worker, the MoSKA architecture incorporates microbatch swapping: only the currently active and immediately upcoming microbatches reside in GPU memory, with the remainder staged in CPU or host RAM, accessible via demand-driven DMA. Additionally, critical state (KV-cache slices) is replicated downstream to neighbor workers in the pipeline, ensuring redundancy for robustness under node failure.
2. Prompt-token Disaggregation and Pipeline Bubble Mitigation
Traditional pipeline-parallel deployments for LLM inference suffer from severe underutilization due to the bimodal latency of prompt vs. token processing: the initial prefill (“prompt”) step exhibits substantially higher latency ($T_p$) than per-token decode steps ($T_d$). In an $S$-stage pipeline, the per-stage idle time, or “bubble fraction,” is quantified as
$$\beta \;=\; \frac{T_p - T_d}{T_p + n\,T_d},$$
where $n$ is the number of tokens generated per request. For large ratios $T_p/T_d$ and moderate $n$, idle time dominates total compute, sharply reducing hardware MFU under load.
MoSKA circumvents this by explicit prompt-token disaggregation: the $S$-stage pipeline is split into $S_p$ prompt-processing and $S_t$ token-processing machines ($S_p + S_t = S$), connected via streaming KV-cache transfer. The optimal division balances workload so that
$$\frac{S_p}{S_t} \;\approx\; \frac{T_p}{\alpha\, n\, T_d},$$
where $\alpha \ge 1$ is the effective prompt-KV-cache transfer overhead factor. This partitioning minimizes the bubble fraction and maximizes throughput by matching the rate of prompt completion to downstream token generation, a design consistent with the roughly 2× empirical throughput gains reported below on large-model benchmarks.
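As a concrete illustration, the short snippet below evaluates both formulas for assumed latencies and pipeline depth; all numeric values ($T_p$, $T_d$, $n$, $S$, $\alpha$) are illustrative placeholders, not benchmark measurements.

```python
# Worked example of the bubble-fraction and stage-balancing formulas above.
# All parameter values are illustrative assumptions, not measurements.

T_p = 2.0      # prefill latency per request (s), e.g. a long-context prompt
T_d = 0.04     # per-token decode latency (s)
n = 32         # tokens generated per request
S = 8          # total pipeline stages
alpha = 1.1    # assumed KV-cache transfer overhead factor (>= 1)

# Per-stage bubble fraction of a monolithic pipeline interleaving prefill/decode.
beta = (T_p - T_d) / (T_p + n * T_d)
print(f"bubble fraction ~ {beta:.2f}")      # ~0.60: idle time dominates

# Balanced split: S_p / S_t ~ T_p / (alpha * n * T_d), with S_p + S_t = S.
ratio = T_p / (alpha * n * T_d)
S_p = round(S * ratio / (1 + ratio))        # 5 prompt stages
S_t = S - S_p                               # 3 token stages
print(f"prompt stages: {S_p}, token stages: {S_t}")
```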
3. Microbatch-level KV-Cache Swapping for Memory Efficiency
In standard pipeline configurations, each worker statically reserves $P \cdot M$ GB of GPU memory for KV-cache, where $P$ is the pipeline degree (the number of in-flight microbatches) and $M$ the per-microbatch KV-cache footprint in GB. MoSKA introduces a dynamic scheme where only $2M$ GB is held in GPU memory (current plus one prefetch slot), while the full set of $P$ microbatches is staged in host RAM. Workers “swap in” the necessary microbatch state from host to device upon activation and “swap out” updates post-processing.
The swap overhead per microbatch is
$$t_{\text{swap}} \;=\; \frac{2M}{B_{\text{PCIe}}},$$
which, under high PCIe bandwidth ($B_{\text{PCIe}}$) and with careful compute–communication overlap, remains subdominant to per-token compute time and is generally fully hidden. This memory multiplexing supports roughly $2\times$ larger effective batch sizes and boosts sustained throughput accordingly.
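The two-slot discipline can be sketched as follows, using a single background thread as a stand-in for asynchronous host-to-device DMA; the data structures, function names, and sleep-based timings are assumptions for illustration rather than an actual MoSKA implementation.

```python
# Sketch of two-slot microbatch KV-cache swapping: only the active and the
# prefetched microbatch reside in "GPU" memory; the other P - 2 stay in host
# RAM. A single worker thread stands in for asynchronous host-to-device DMA;
# sleeps stand in for transfer (2M / B_PCIe) and per-token compute time.
from concurrent.futures import ThreadPoolExecutor
import time

P = 8                                       # pipeline degree (microbatches in flight)
host_ram = {i: f"kv_microbatch_{i}" for i in range(P)}   # staged KV-caches
gpu_slots = {}                              # at most 2 entries: active + prefetch

def swap_in(i):
    """Copy microbatch i's KV-cache host -> device (DMA stub)."""
    time.sleep(0.001)
    return host_ram[i]

def swap_out(i, kv):
    """Write the updated KV-cache back device -> host (DMA stub)."""
    host_ram[i] = kv

def decode_step(kv):
    """One decode step over the active microbatch (compute stub)."""
    time.sleep(0.002)
    return kv + "+tok"

with ThreadPoolExecutor(max_workers=1) as dma:
    gpu_slots[0] = swap_in(0)                        # prime the first slot
    prefetch = dma.submit(swap_in, 1)                # start the next transfer early
    for i in range(P):
        kv = decode_step(gpu_slots.pop(i))           # compute on the active slot
        if i + 1 < P:
            gpu_slots[i + 1] = prefetch.result()     # normally already finished
        if i + 2 < P:
            prefetch = dma.submit(swap_in, i + 2)    # overlap the next swap-in
        swap_out(i, kv)                              # write back, freeing the slot

print(host_ram[0])                                   # 'kv_microbatch_0+tok'
```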
4. Streaming State Replication and Fault Tolerance
MoSKA architectures implement rolling fault tolerance by augmenting the pipeline with per-token state replication. After each token-generation step for microbatch $m$ at pipeline stage $s$, the current KV-cache increment is asynchronously flushed to the downstream neighbor (stage $s+1$). Each pipeline worker notifies the controller of its latest safe checkpoint (microbatch and token id). Upon detection of a failure, the controller initiates the following recovery protocol, sketched in code after the list:
- Pipeline is paused, failed stage is replaced.
- The downstream neighbor supplies a fresh copy of the latest KV-cache fragment for the affected microbatch (up to the last safely replicated token $t$).
- The upstream neighbor provides its KV state for microbatch $m$, enabling ring-style reconstruction.
- Replay proceeds from the last agreed safe step across all stages.
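A controller-side sketch of these four steps is given below; the Worker interface, its method names, and the checkpoint bookkeeping are illustrative stand-ins, not a documented MoSKA API.

```python
# Controller-side sketch of the recovery protocol listed above. The Worker
# interface and all method names are illustrative stand-ins, not a real API.
from dataclasses import dataclass

@dataclass
class Checkpoint:
    microbatch: int          # last microbatch safely replicated downstream
    token: int               # last token id covered by that replica

class Worker:
    """Stub pipeline stage holding its own KV slices plus a downstream replica."""
    def __init__(self, stage):
        self.stage, self.kv, self.replica = stage, {}, {}
    def pause(self): pass
    def resume(self, from_token): print(f"stage {self.stage}: replay from token {from_token}")
    def replica_for(self, stage, mb, tok): return self.replica.get((stage, mb, tok), "kv-replica")
    def kv_for(self, mb): return self.kv.get(mb, "kv-slice")
    def load_kv(self, kv): self.kv["restored"] = kv
    def load_upstream_state(self, kv): self.kv["upstream"] = kv

class Controller:
    def __init__(self, stages):
        self.stages = stages
        self.safe = {s: Checkpoint(0, 0) for s in range(len(stages))}

    def report(self, stage, microbatch, token):
        """Workers call this after each asynchronous replica flush."""
        self.safe[stage] = Checkpoint(microbatch, token)

    def recover(self, failed, spare):
        for w in self.stages:                       # 1. pause, splice in replacement
            w.pause()
        self.stages[failed] = spare
        ckpt = self.safe[failed]
        down = self.stages[(failed + 1) % len(self.stages)]
        spare.load_kv(down.replica_for(failed, ckpt.microbatch, ckpt.token))   # 2. downstream replica
        up = self.stages[(failed - 1) % len(self.stages)]
        spare.load_upstream_state(up.kv_for(ckpt.microbatch))                  # 3. upstream KV state
        replay_from = min(c.token for c in self.safe.values())                 # 4. agreed safe step
        for w in self.stages:
            w.resume(from_token=replay_from)

# Usage: stage 2 fails after reporting (microbatch 5, token 17).
ctrl = Controller([Worker(s) for s in range(4)])
ctrl.report(2, microbatch=5, token=17)
ctrl.recover(failed=2, spare=Worker(2))
```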
The worst-case recovery time is bounded by transfer plus recomputation,
$$T_{\text{rec}} \;\le\; \frac{C_{KV}}{B_{\text{net}}} + k\,T_d,$$
with $k \le 1$ recomputed tokens for per-token checkpoints, $C_{KV}$ the size of the replicated KV-cache fragment, and $B_{\text{net}}$ the network link bandwidth. In practice, this design achieves sub-second recovery and reduces fault-induced latency amplification from a typical ~1.9× to ~1.2×.
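As a sanity check on this bound, the following computes $T_{\text{rec}}$ for assumed fragment size, link bandwidth, and decode latency; the numbers are illustrative and merely consistent with the sub-second recovery claim.

```python
# Back-of-the-envelope check of the recovery bound, with assumed sizes:
# a 0.5 GB replicated KV fragment over a 25 GB/s link, plus at most one
# decode step of recomputation under per-token checkpoints.
C_kv = 0.5       # GB, replicated KV-cache fragment (assumed)
B_net = 25.0     # GB/s, inter-worker link bandwidth (assumed)
T_d = 0.03       # s, per-token decode latency (assumed)
k = 1            # tokens to recompute with per-token checkpoints

T_rec = C_kv / B_net + k * T_d
print(f"worst-case recovery ~ {T_rec * 1000:.0f} ms")   # ~50 ms, i.e. sub-second
```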
5. Empirical Results and Comparative Benchmarks
Table: Throughput, latency, and resilience enhancements with MoSKA-like designs (extracted for representative models):
| Model | Baseline (tok/s) | MoSKA (tok/s) | Speedup | p95 Tail Latency (ms, baseline → MoSKA) |
|---|---|---|---|---|
| OPT-66B | 3,200 | 6,000 | 1.88× | 19.6 → 10.5 |
| BLOOM-176B | 1,300 | 2,600 | 2.0× | 35.0 → 17.8 |
Microbatch swapping throughput improvements:
| Model | no-swap (tok/s) | with-swap (tok/s) | Speedup |
|---|---|---|---|
| OPT-66B | 600 | 1,020 | 1.7× |
| BLOOM-176B | 240 | 420 | 1.75× |
Single-failure latency amplification:
| System | Latency amplification | Reduction |
|---|---|---|
| Baseline | 1.91× | – |
| MoSKA | 1.24× | 35% |
These data indicate that modular streaming and disaggregation architectures can double LLM serving throughput, support nearly twice the effective batch size, and expedite failure recovery, all of which are critical in hyperscale settings.
6. Architectural Trade-offs and Best Practices
Explicit prompt-token disaggregation is most beneficial when prompt latency dominates per-token latency (long contexts, large models); it is less impactful for workloads where $T_p \approx T_d$ (short prompts). The incremental KV-cache transfer overhead must be kept minimal (ideally $\alpha \approx 1$) through efficient buffering and pipelined, per-layer streaming. Microbatch swapping is gated by PCIe or host-memory bandwidth but, when carefully overlapped with compute, allows the effective batch size to be roughly doubled with minimal penalty. State replication for resilience is modular and incurs negligible average-case cost.
Optimal pipeline depth allocation can be tuned via direct application of the balancing equation above. To maximize utilization, provision prompt and token stages so that their steady-state rates match ($S_p/T_p \approx S_t/(\alpha\, n\, T_d)$), and buffer enough microbatches to keep both PCIe and compute cycles fully occupied.
7. Broader Context and Impact
MoSKA encapsulates the ongoing shift in distributed LLM inference from monolithic or naïvely pipeline-parallel paradigms to fully modular, streaming, and fault-tolerant designs. By formalizing the separation of prompt and token phases, introducing memory- and compute-aware swapping, and integrating continual streaming checkpointing, this architectural paradigm directly addresses the resource underutilization and latency vulnerabilities prevalent in hyperscale generative inference. The approach facilitates reproducible scaling to the largest public models and provides robust blueprints for emerging high-availability AI serving infrastructure.