
MoSKA Architecture: Modular Streaming for LLMs

Updated 24 November 2025
  • MoSKA Architecture is a modular disaggregated LLM serving design that separates inference into prompt-processing and token-generation stages to mitigate bottlenecks.
  • It employs microbatch KV-cache swapping and streaming state replication to optimize memory usage and enhance resilience against hardware failures.
  • Empirical results show doubled throughput and reduced latency, evidencing practical benefits for large-scale, high-availability AI serving.

MoSKA (Modular Streaming Key-Value Architecture) refers to a class of disaggregated LLM serving designs that address throughput and fault-tolerance bottlenecks in pipeline-parallel and cluster-scale inference. MoSKA architectures decompose Transformer model inference into modular prompt-processing (“prefill”) and autoregressive token-generation (“decode”) stages, enabling independent pipeline balancing, memory optimization, and resilience to hardware failures. The approach centers on explicit prompt-token disaggregation, fine-grained GPU memory management via microbatch KV-cache swapping, and failover through state streaming and replication.

1. Architectural Decomposition and Data Flow

A prototypical MoSKA system adopts a controller–worker arrangement. Clients submit prompts to a centralized Controller, which dispatches requests to Prompt workers (“P-workers”), each responsible for executing the initial portion of the Transformer—typically up to the first output token. Upon completion, the P-worker generates a partial KV-cache for the prompt and streams it (via primitives such as stream_out, scatter, flush) over high-speed network or PCIe to a Token worker (“T-worker”). The T-worker receives the prompt-state (“KV-cache”), integrates it into local accelerator memory (stream_in, gather, fetch), and resumes the inference workload by generating subsequent tokens autoregressively.
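To make the data flow concrete, the following is a minimal sketch of the Controller → P-worker → T-worker handoff. The class names and the stream_out/stream_in methods echo the primitives named above, but their signatures, the placeholder KV-cache contents, and the in-process "transfer" are illustrative assumptions rather than a documented API.

```python
class PromptWorker:
    def prefill(self, prompt_tokens: list[int]) -> dict:
        """Run the prefill pass and return the prompt's KV-cache (placeholder contents)."""
        seq_len = len(prompt_tokens)
        return {"k": [[0.0] * 128 for _ in range(seq_len)],
                "v": [[0.0] * 128 for _ in range(seq_len)]}

    def stream_out(self, kv_cache: dict, t_worker: "TokenWorker") -> None:
        """Ship the prompt KV-cache to a token worker (stands in for scatter/flush over NIC or PCIe)."""
        t_worker.stream_in(kv_cache)


class TokenWorker:
    def __init__(self) -> None:
        self.kv_cache: dict | None = None

    def stream_in(self, kv_cache: dict) -> None:
        """Integrate received prompt state into local memory (stands in for gather/fetch)."""
        self.kv_cache = kv_cache

    def decode(self, n_tokens: int) -> list[int]:
        """Autoregressively generate tokens against the staged KV-cache (placeholder ids)."""
        assert self.kv_cache is not None, "prompt state must be streamed in first"
        return [0] * n_tokens


# The dispatch that the Controller performs for one request.
p_worker, t_worker = PromptWorker(), TokenWorker()
p_worker.stream_out(p_worker.prefill(list(range(16))), t_worker)
print(len(t_worker.decode(8)), "tokens generated")
```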

Within each T-worker, the MoSKA architecture incorporates microbatch swapping: only the currently active and immediately upcoming microbatches reside in GPU memory, with the remainder staged in CPU or host RAM, accessible by demand-driven DMA. Additionally, critical state (KV-cache slices) is replicated downstream to neighbor workers in the pipeline, ensuring redundancy for robustness under node failure.
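A minimal sketch of this two-slot residency policy, assuming a fixed number of microbatches; the dict moves below merely stand in for host↔device DMA transfers, and the MicrobatchSwapper name is hypothetical.

```python
# Two-slot residency sketch: active + prefetch microbatches on "GPU", rest in "host RAM".
class MicrobatchSwapper:
    def __init__(self, num_microbatches: int) -> None:
        self.host = {i: f"kv-state-{i}" for i in range(num_microbatches)}  # staged in host RAM
        self.gpu = {}                                                      # at most 2 entries

    def activate(self, i: int) -> str:
        """Ensure microbatch i (and prefetch slot i+1) are resident; evict everything else."""
        keep = {i, i + 1}
        for j in [j for j in self.gpu if j not in keep]:
            self.host[j] = self.gpu.pop(j)          # swap out (stand-in for device->host DMA)
        for j in keep:
            if j in self.host:
                self.gpu[j] = self.host.pop(j)      # swap in (stand-in for host->device DMA)
        return self.gpu[i]


swapper = MicrobatchSwapper(num_microbatches=8)
for step in range(8):
    _ = swapper.activate(step)
    assert len(swapper.gpu) <= 2  # never more than active + prefetch on device
```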

2. Prompt-token Disaggregation and Pipeline Bubble Mitigation

Traditional pipeline-parallel deployments for LLM inference suffer from severe underutilization due to the bimodal latency of prompt vs. token processing: the initial prefill (“prompt”) step exhibits substantially higher latency (Y) than per-token decode steps (t). In a D-stage pipeline, this “bubble fraction” is quantified as

B = \frac{(D-1)(Y-t)}{D(Y+Nt)}

where N is the number of tokens generated per request. For large Y/t ratios and moderate D, idle time dominates total compute, sharply reducing hardware MFU under load.
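As a quick numeric check of the formula (all latency figures below are hypothetical), the bubble fraction can be evaluated directly:

```python
def bubble_fraction(D: int, Y: float, t: float, N: int) -> float:
    """Idle ('bubble') fraction of a D-stage pipeline, per the formula above.

    Y: prefill latency per request, t: per-token decode latency,
    N: tokens generated per request.
    """
    return (D - 1) * (Y - t) / (D * (Y + N * t))


# Hypothetical numbers: 8 stages, 400 ms prefill, 20 ms/token, 128 generated tokens.
print(f"bubble fraction: {bubble_fraction(8, 0.400, 0.020, 128):.2%}")  # ~11%
```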

MoSKA circumvents this by explicit prompt-token disaggregation: the D-stage pipeline is split into D_p prompt-processing and D_t token-processing machines, connected via streaming KV-cache transfer. The optimal division balances workload so that

D_t = \frac{D\,N\,t}{mY + Nt}, \quad D_p = D - D_t

where m is the effective prompt KV-cache transfer overhead factor. This partitioning minimizes the bubble fraction and maximizes throughput by matching the rate of prompt completion to downstream token generation, a design validated empirically by throughput roughly doubling on large-model benchmarks.
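The partitioning equations translate directly into a small sizing helper; the integer rounding and clamping policy below is an assumption, since the closed form generally yields a fractional stage count:

```python
def split_pipeline(D: int, Y: float, t: float, N: int, m: float = 1.0) -> tuple[int, int]:
    """Split a D-stage pipeline into (D_p prompt stages, D_t token stages)
    according to D_t = D*N*t / (m*Y + N*t); rounding/clamping is an assumption."""
    d_t = round(D * N * t / (m * Y + N * t))
    d_t = min(max(d_t, 1), D - 1)  # keep at least one stage on each side
    return D - d_t, d_t


# Hypothetical workload: 8 stages, 400 ms prefill, 20 ms/token, 128 tokens, m = 1.2.
D_p, D_t = split_pipeline(8, 0.400, 0.020, 128, m=1.2)
print(f"D_p = {D_p}, D_t = {D_t}")
```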

3. Microbatch-level KV-Cache Swapping for Memory Efficiency

In standard pipeline configurations, each worker statically reserves D·M GB of GPU memory for KV-cache, where D is the pipeline degree and M the per-microbatch KV-cache footprint. MoSKA introduces a dynamic scheme in which only 2M GB is held in GPU memory (the currently active microbatch plus one prefetch slot), while the full set of microbatches is staged in host RAM. Workers “swap in” the necessary microbatch state from host to device upon activation and “swap out” updates after processing.

The swap overhead per microbatch is

O_\mathrm{swap} = \frac{|\Delta\mathrm{KV}|}{\mathrm{BW}_\mathrm{PCIe}}

which, under high PCIe bandwidth and with careful compute–communication overlap, remains subdominant to the per-token compute time t and is generally fully hidden. This memory multiplexing supports up to 1.8× larger effective batch sizes and boosts sustained throughput accordingly.
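A back-of-the-envelope check of whether the swap is hidden behind decode compute, using hypothetical sizes and bandwidths:

```python
def swap_hidden(delta_kv_gb: float, pcie_gb_s: float, t_decode_s: float) -> bool:
    """True if the per-microbatch transfer (O_swap = |ΔKV| / BW_PCIe) fits within
    one decode step and can therefore be overlapped with compute."""
    o_swap = delta_kv_gb / pcie_gb_s
    return o_swap <= t_decode_s


# Hypothetical: 0.5 GB KV delta per microbatch, 25 GB/s effective PCIe, 20 ms/token.
print(swap_hidden(delta_kv_gb=0.5, pcie_gb_s=25.0, t_decode_s=0.020))  # 0.02 s vs 0.02 s -> True
```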

4. Streaming State Replication and Fault Tolerance

MoSKA architectures implement rolling fault tolerance by augmenting the pipeline with per-token state replication. After each token-generation step for microbatch j at pipeline stage x, the current KV-cache increment is asynchronously flushed to the downstream neighbor at stage x+1. Each pipeline worker notifies the controller of its latest safe checkpoint (j, k), i.e., microbatch and token id. Upon detection of a failure, the controller initiates the following recovery protocol (sketched in code after the list):

  • The pipeline is paused and the failed stage x is replaced.
  • The downstream neighbor supplies a fresh copy of the latest KV-cache fragment for the affected microbatch (up to token k').
  • The upstream neighbor provides KV state for microbatch j-1, enabling ring reconstruction.
  • Replay proceeds from the last agreed safe step across all stages.
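The following sketch plans (rather than executes) these recovery actions for an interior failed stage; the Checkpoint type, the stage-indexing convention, and the returned action names are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    microbatch: int  # j
    token: int       # k


def plan_recovery(failed_stage: int, checkpoints: list[Checkpoint]) -> dict:
    """Plan recovery for a failure at `failed_stage` (assumes an interior stage;
    boundary stages would need special-casing).

    Returns which neighbors supply replicated KV-cache and the token id from
    which replay resumes (the last step every stage has reported as safe).
    """
    ckpt = checkpoints[failed_stage]
    return {
        # Downstream neighbor holds the replica for the affected microbatch j.
        "pull_kv_for_microbatch": (failed_stage + 1, ckpt.microbatch),
        # Upstream neighbor supplies KV state for microbatch j-1 (ring reconstruction).
        "pull_kv_for_prev_microbatch": (failed_stage - 1, ckpt.microbatch - 1),
        "replay_from_token": min(c.token for c in checkpoints),
    }


# Hypothetical 4-stage pipeline in which stage 2 fails.
ckpts = [Checkpoint(5, 17), Checkpoint(5, 17), Checkpoint(5, 16), Checkpoint(4, 17)]
print(plan_recovery(2, ckpts))
```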

The worst-case recovery time is bounded by transfer and recomputation

T_\mathrm{recover} \leq \frac{S \cdot \Delta}{B_\mathrm{net}} + \Delta\,t

where S is the size of the replicated KV-cache state per checkpoint interval, Δ = 1 for per-token checkpoints, and B_net is the network link bandwidth. In practice, this design achieves sub-second recovery and reduces fault-induced latency amplification from a typical 1.9× to 1.24×.
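Plugging hypothetical numbers into the bound gives a feel for why sub-second recovery is plausible:

```python
def recovery_time_bound(state_gb: float, net_gb_s: float, t_decode_s: float, delta: int = 1) -> float:
    """Upper bound on recovery time: state transfer (S*Δ / B_net) plus replay of Δ tokens (Δ*t)."""
    return state_gb * delta / net_gb_s + delta * t_decode_s


# Hypothetical: 2 GB replicated KV state, 12.5 GB/s (100 Gb/s) link, 20 ms/token, per-token checkpoints.
print(f"{recovery_time_bound(2.0, 12.5, 0.020):.3f} s")  # ~0.18 s, i.e. sub-second recovery
```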

5. Empirical Results and Comparative Benchmarks

Table: Throughput, latency, and resilience enhancements with MoSKA-like designs (extracted for representative models):

| Model      | Baseline (tok/s) | MoSKA (tok/s) | Speedup | 95% tail latency (ms, baseline → MoSKA) |
|------------|------------------|---------------|---------|------------------------------------------|
| OPT-66B    | 3,200            | 6,000         | 1.88×   | 19.6 → 10.5                              |
| BLOOM-176B | 1,300            | 2,600         | 2.0×    | 35.0 → 17.8                              |

Microbatch swapping throughput improvements:

| Model      | No-swap (tok/s) | With-swap (tok/s) | Speedup |
|------------|-----------------|-------------------|---------|
| OPT-66B    | 600             | 1,020             | 1.7×    |
| BLOOM-176B | 240             | 420               | 1.75×   |

Single-failure latency amplification:

| System   | Latency amplification factor | Reduction |
|----------|------------------------------|-----------|
| Baseline | 1.91×                        | n/a       |
| MoSKA    | 1.24×                        | 35%       |

These data highlight that modular streaming and disaggregation architectures can double LLM serving throughput, support nearly twice the effective batch size, and expedite failure recovery—all critical at hyperscale settings.

6. Architectural Trade-offs and Best Practices

Explicit prompt-token disaggregation is most beneficial when prompt latency Y dominates the per-token latency t (long contexts, large models); it is less impactful for workloads where Y ≈ t. The incremental KV-cache transfer overhead factor m must be kept minimal (ideally < 2) through efficient buffering and pipelined, per-layer streaming. Microbatch swapping is gated by PCIe or memory bandwidth, but when swaps are carefully overlapped with compute, the effective batch size can be nearly doubled with minimal penalty. State replication for resilience is modular and incurs negligible average cost.

The optimal pipeline depth allocation (D_p, D_t) can be tuned by direct application of the balancing equations given above. To maximize utilization, balance the disaggregated prompt and token pipelines so their steady-state rates match (m·Y_dis ≈ N·t_dis), and buffer microbatches to fully utilize PCIe and compute cycles.

7. Broader Context and Impact

MoSKA encapsulates the ongoing shift in distributed LLM inference from monolithic or naïvely pipeline-parallel paradigms to fully modular, streaming, and fault-tolerant designs. By formalizing the separation of prompt and token phases, introducing memory- and compute-aware swapping, and integrating continual streaming checkpointing, this architectural paradigm directly addresses the resource underutilization and latency vulnerabilities prevalent in hyperscale generative inference. The approach facilitates reproducible scaling to the largest public models and provides robust blueprints for emerging high-availability AI serving infrastructure.
