MoSKA: Shared KV Attention for LLMs
- The paper introduces MoSKA, which rebalances unique and shared key-value processing to convert memory-bound operations into compute-bound GEMM calls, achieving up to 538.7× throughput improvement.
- MoSKA partitions shared tokens using Mixture-of-Experts routing to achieve up to 75% sparsity, significantly reducing compute and memory demands while focusing attention on semantically salient context.
- MoSKA employs a disaggregated hardware design by separating Unique-KV and Shared-KV nodes, allowing independent scaling of memory bandwidth and compute resources for efficient long-sequence inference.
Mixture of Shared KV Attention (MoSKA) is an architectural paradigm for efficient long-sequence inference in LLMs, focused on overcoming the severe performance limitations introduced by memory-bound Key-Value (KV) cache operations as context lengths and batch sizes scale. MoSKA exploits workload heterogeneity by decomposing context data into per-request unique tokens and massively reused shared sequences, introducing compute-efficient shared KV processing and sparsification mechanisms. This entry synthesizes principal methods, mathematical formulations, hardware considerations, and empirical evaluation from the foundational sources.
1. KV Cache Bottlenecks in Long-Sequence LLMs
Modern LLM inference with multi-million-token contexts exposes a critical bottleneck: the KV cache. During autoregressive decoding, queries attend across all previous keys and values. Naïve implementations, as well as widely adopted strategies such as Grouped-Query Attention (GQA), quantization, and uniform sparsity, still result in memory-bound general matrix-vector (GEMV) operations whose throughput scales poorly as batch size and sequence length grow (see Rhee et al., 8 Nov 2025).
Shared context tokens, such as fixed system prompts or legal corpora, though identical across requests, are traditionally cached per request, leading to redundant memory bandwidth consumption. As batch concurrency grows, GPUs become increasingly underutilized in compute while being throttled by memory bandwidth.
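To make the memory-bound nature concrete, the following back-of-the-envelope sketch (illustrative only; the tensor shapes, precision, and constants are assumptions, not figures from the paper) compares the arithmetic intensity of a per-request GEMV over the KV cache with a batched GEMM over a shared cache:

```python
# Illustrative arithmetic-intensity comparison; all shapes and constants are assumptions.
d, L, N = 128, 1_000_000, 256   # head dim, shared context length, concurrent queries
bytes_per_elem = 1              # FP8 storage

# GEMV: one query attends over L keys; the key matrix is streamed once per request.
gemv_flops = 2 * L * d
gemv_bytes = (L * d + d) * bytes_per_elem
print("GEMV intensity (FLOPs/byte):", gemv_flops / gemv_bytes)   # ~2

# GEMM: N stacked queries reuse the same streamed key matrix.
gemm_flops = 2 * N * L * d
gemm_bytes = (L * d + N * d) * bytes_per_elem
print("GEMM intensity (FLOPs/byte):", gemm_flops / gemm_bytes)   # ~2N when N << L
```

An intensity of roughly 2 FLOPs/byte sits far below the ridge point of modern accelerators, so the GEMV path is bandwidth-limited; batching $N$ queries over the same shared keys raises intensity roughly $N$-fold, which is the effect MoSKA exploits below.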
2. Shared KV Attention Transformation
MoSKA converts the bottlenecked KV cache processing into high-intensity compute operations by leveraging simultaneous access to shared context across requests. For $N$ concurrent requests attending to the same shared key ($K_s$) and value ($V_s$) matrices, naïve attention would instantiate $N$ separate GEMV calls, one per query $q_i$, streaming $K_s$ and $V_s$ from memory $N$ times.
MoSKA instead stacks all queries into a batched matrix $Q = [q_1; q_2; \dots; q_N] \in \mathbb{R}^{N \times d}$ and utilizes two GEMM kernels:
$$A = \frac{Q K_s^{\top}}{\sqrt{d}}, \qquad P = \operatorname{softmax}(A), \qquad Y = P V_s,$$
or compactly,
$$Y = \operatorname{softmax}\!\left(\frac{Q K_s^{\top}}{\sqrt{d}}\right) V_s.$$
This raises arithmetic intensity, shifting the operation from memory-bound to compute-bound and substantially improving GPU utilization. The transformation requires maximizing the effective query batch $N$, data contiguity for $Q$, and persistent device memory for $K_s$ and $V_s$, so that only two optimized GEMM kernels are launched.
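A minimal PyTorch sketch of this two-GEMM formulation follows; it expresses the batched structure described above and is not the paper's kernel, and the tensor names and demo shapes are assumptions:

```python
import torch

def shared_kv_attention(Q, K_s, V_s):
    """Batched attention of N queries over one shared KV cache.

    Q:   (N, d)  stacked queries from concurrent requests
    K_s: (L, d)  shared keys, resident in device memory
    V_s: (L, d)  shared values, resident in device memory
    """
    d = Q.shape[-1]
    A = Q @ K_s.T / d**0.5         # GEMM 1: (N, L) attention logits
    P = torch.softmax(A, dim=-1)   # row-wise softmax
    return P @ V_s                 # GEMM 2: (N, d) outputs

# Usage: all N requests reuse the same K_s/V_s, streamed once per kernel launch.
N, L, d = 256, 4096, 128
Q = torch.randn(N, d)
K_s, V_s = torch.randn(L, d), torch.randn(L, d)
Y = shared_kv_attention(Q, K_s, V_s)   # (N, d)
```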
3. Sparse Attention via Mixture-of-Experts Routing
Computing attention over all shared tokens remains prohibitive for extremely large contexts. MoSKA uses Mixture-of-Experts (MoE)-inspired routing to trim the active KV region. The shared cache is partitioned into $M$ “chunks” (experts), each with a representative embedding $e_j$. Each query $q_i$ scores all chunks, $s_{ij} = q_i \cdot e_j$, and selects the top-$k$ most relevant, $R_i = \operatorname{top\text{-}}k_j(s_{ij})$; attention is then computed only over the chunks in $R_i$. Pseudocode for sparse attention:
```
Input: batch queries {q_i}, chunk embeddings {e_j}
for i in 1..N:
    scores_i = q_i @ E.T          # relevance of each chunk to query i
    R_i = top_k(scores_i)         # routed chunk indices for query i
Gather routed (query, chunk) pairs into Q_r, K_r, V_r
A   = GEMM(Q_r, K_r.T) / sqrt(d)
P   = row_softmax(A)
Y_r = GEMM(P, V_r)
Scatter Y_r back to per-request outputs {y_i}
```
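A runnable sketch of the same route-then-attend flow in PyTorch; the chunking granularity, per-query gather strategy, and tensor names are illustrative assumptions rather than the paper's implementation:

```python
import torch

def moska_sparse_attention(Q, E, K_chunks, V_chunks, k):
    """Route each query to its top-k shared chunks, then attend only over those.

    Q:        (N, d)      batch queries
    E:        (M, d)      one representative embedding per chunk
    K_chunks: (M, c, d)   shared keys, partitioned into M chunks of c tokens
    V_chunks: (M, c, d)   shared values, same partitioning
    """
    N, d = Q.shape
    scores = Q @ E.T                               # (N, M) chunk relevance per query
    R = scores.topk(k, dim=-1).indices             # (N, k) routed chunk indices
    K_r = K_chunks[R].reshape(N, -1, d)            # (N, k*c, d) gathered keys
    V_r = V_chunks[R].reshape(N, -1, d)            # (N, k*c, d) gathered values
    A = torch.einsum('nd,ntd->nt', Q, K_r) / d**0.5
    P = torch.softmax(A, dim=-1)
    return torch.einsum('nt,ntd->nd', P, V_r)      # (N, d) outputs

# Example: 256 queries each routed to 4 of 64 chunks, skipping 93.75% of shared tokens.
N, M, c, d, k = 256, 64, 512, 128, 4
Q, E = torch.randn(N, d), torch.randn(M, d)
K_chunks, V_chunks = torch.randn(M, c, d), torch.randn(M, c, d)
Y = moska_sparse_attention(Q, E, K_chunks, V_chunks, k)
```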
4. Disaggregated Hardware for Unique and Shared Context
To accommodate divergent memory and compute profiles, MoSKA proposes disaggregated serving infrastructure:
- Unique-KV Nodes: Serve memory-bound per-request unique tokens, equipped with high-bandwidth memory (HBM2e/HBM3), and co-located feed-forward layers for latency hiding.
- Shared-KV Nodes: Execute compute-bound batched GEMMs for shared chunks, with abundant tensor cores and persistent caching of shared context.
A scheduler routes incoming queries to the appropriate node type, allowing independent hardware scaling: adding Shared-KV nodes increases shared context bandwidth, whereas adding Unique-KV nodes serves more concurrent requests.
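The split can be pictured with a minimal scheduler sketch; the node classes, routing rule, and field names are illustrative assumptions, not the paper's API:

```python
from dataclasses import dataclass, field

@dataclass
class Query:
    request_id: int
    shared_context_id: str | None = None   # identifier of a cached shared corpus, if any

@dataclass
class DisaggregatedScheduler:
    """Illustrative routing: each query's unique-KV work goes to a memory-bound node,
    and its shared-KV work is batched per shared context onto a compute-bound node."""
    unique_kv_queue: list[Query] = field(default_factory=list)
    shared_kv_batches: dict[str, list[Query]] = field(default_factory=dict)

    def route(self, q: Query) -> None:
        # Per-request unique tokens: always served by a Unique-KV (HBM-heavy) node.
        self.unique_kv_queue.append(q)
        # Shared tokens: batched with other requests on a Shared-KV (tensor-core-heavy)
        # node, so one persistent K_s/V_s copy serves the whole batch via GEMM.
        if q.shared_context_id is not None:
            self.shared_kv_batches.setdefault(q.shared_context_id, []).append(q)

# Usage: two requests sharing "legal_corpus_v1" form one GEMM batch on a Shared-KV node.
sched = DisaggregatedScheduler()
sched.route(Query(1, "legal_corpus_v1"))
sched.route(Query(2, "legal_corpus_v1"))
sched.route(Query(3))   # request with unique context only
```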
5. Empirical Results: Throughput and Scaling
MoSKA was benchmarked against FlashAttention, SGLang, LongHeads, and ChunkAttention using Llama 3.1 8B in FP8 (see Rhee et al., 8 Nov 2025). With a fixed unique-token budget per request, shared contexts scaling up to 16M tokens, and a 35 tok/s/request target generation rate:
| Method | Peak batch | Throughput | Speedup vs FlashAttention |
|---|---|---|---|
| FlashAttention | 16 | 1 × (Ref.) | 1 × |
| SGLang | 64 | 8 × | 8 × |
| LongHeads | 32 | 5 × | 5 × |
| ChunkAttention | 128 | 120 × | 120 × |
| MoSKA | 256 | 538.7 × | 538.7 × |
MoSKA maintains high compute utilization on Shared-KV nodes (even for 16M-token shared caches) and isolates low-MFU, memory-bound work on Unique-KV nodes. This suggests that, under high query concurrency and extensive context sharing, MoSKA can scale attention throughput nearly 540-fold relative to FlashAttention.
6. Design Trade-offs and Limitations
MoSKA’s benefits are subject to several constraints:
- Concurrency requirement: GEMM startup overhead is amortized only at moderate-to-high query concurrency (on the order of 64 concurrent queries); at low concurrency the gains shrink.
- Memory overhead: The system incurs ~5–10% extra cache to maintain chunk embeddings and MoE routing structures.
- Routing cost: Top-$k$ selection per query (cost linear in the number of chunks $M$) is manageable for moderate $M$ but requires optimized approximate nearest-neighbor search when $M$ grows very large (see the routing sketch after this list). A plausible implication is that workloads with predominantly unique tokens or low shared-context reuse see reduced gains.
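For large chunk counts, the exact top-$k$ scan over chunk embeddings can be replaced by an approximate nearest-neighbor index. The sketch below uses FAISS as one possible library; the index type, cluster count, and probe setting are assumptions, not parameters prescribed by the paper:

```python
import faiss
import numpy as np

d, M, k = 128, 100_000, 8        # embedding dim, number of chunks, chunks kept per query

# Chunk representative embeddings (assumed precomputed alongside the shared KV cache).
E = np.random.randn(M, d).astype("float32")

# IVF index: the exact O(M) scan is replaced by probing a few coarse clusters.
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(E)
index.add(E)
index.nprobe = 32                # clusters searched per query (recall/latency knob)

Q = np.random.randn(256, d).astype("float32")   # batch of query vectors
scores, routed_chunks = index.search(Q, k)      # (256, k) approximate top-k chunk ids
```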
7. Future Directions
Active research avenues include:
- Adaptive GPU scheduling across heterogeneous devices (compute-optimized, memory-optimized nodes).
- Position-independent chunking (as in EPIC) for arbitrary composability (“Universal MoSKA”).
- Hardware-level acceleration of routing/top-k and tensor slicing.
Collectively, MoSKA establishes a systematic blueprint for high-performance, resource-adaptive LLM inference under large context and high concurrency settings, explicitly leveraging context data heterogeneity to re-balance memory and compute utilization, and achieving both throughput and scaling advantages that are substantiated by analytical and experimental results (Rhee et al., 8 Nov 2025).