MoSKA: Shared KV Attention for LLMs
- The paper introduces MoSKA, which rebalances unique and shared key-value processing to convert memory-bound operations into compute-bound GEMM calls, achieving up to 538.7× throughput improvement.
- MoSKA partitions shared tokens using Mixture-of-Experts routing to achieve up to 75% sparsity, significantly reducing compute and memory demands while focusing attention on semantically salient context.
- MoSKA employs a disaggregated hardware design by separating Unique-KV and Shared-KV nodes, allowing independent scaling of memory bandwidth and compute resources for efficient long-sequence inference.
Mixture of Shared KV Attention (MoSKA) is an architectural paradigm for efficient long-sequence inference in LLMs, focused on overcoming the severe performance limitations introduced by memory-bound Key-Value (KV) cache operations as context lengths and batch sizes scale. MoSKA exploits workload heterogeneity by decomposing context data into per-request unique tokens and massively reused shared sequences, introducing compute-efficient shared KV processing and sparsification mechanisms. This entry synthesizes principal methods, mathematical formulations, hardware considerations, and empirical evaluation from the foundational sources.
1. KV Cache Bottlenecks in Long-Sequence LLMs
Modern LLM inference with multi-million-token contexts exposes a critical bottleneck: the KV cache. During autoregressive decoding, queries attend across all previous keys and values. Naïve implementations, as well as widely adopted strategies such as Grouped-Query Attention (GQA), quantization, and uniform sparsity, still result in memory-bound general matrix-vector (GEMV) operations whose throughput scales poorly as batch size and sequence length grow (see Rhee et al., 8 Nov 2025).
Shared context tokens, such as fixed system prompts or legal corpora, though identical across requests, are traditionally cached per request, leading to redundant memory bandwidth consumption. As batch concurrency grows, GPUs become increasingly underutilized in compute while being throttled by memory bandwidth.
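To make the memory-bound nature concrete, the following back-of-the-envelope sketch (illustrative only; the tensor shapes, precision, and constants are assumptions, not figures from the paper) compares the arithmetic intensity of a per-request GEMV over the KV cache with a batched GEMM over a shared cache:

```python
# Illustrative arithmetic-intensity comparison; all shapes and constants are assumptions.
d, L, N = 128, 1_000_000, 256   # head dim, shared context length, concurrent queries
bytes_per_elem = 1              # FP8 storage

# GEMV: one query attends over L keys; the key matrix is streamed once per request.
gemv_flops = 2 * L * d
gemv_bytes = (L * d + d) * bytes_per_elem
print("GEMV intensity (FLOPs/byte):", gemv_flops / gemv_bytes)   # ~2

# GEMM: N stacked queries reuse the same streamed key matrix.
gemm_flops = 2 * N * L * d
gemm_bytes = (L * d + N * d) * bytes_per_elem
print("GEMM intensity (FLOPs/byte):", gemm_flops / gemm_bytes)   # ~2N when N << L
```

An intensity of roughly 2 FLOPs/byte sits far below the ridge point of modern accelerators, so the GEMV path is bandwidth-limited; batching $N$ queries over the same shared keys raises intensity roughly $N$-fold, which is the effect MoSKA exploits below.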
2. Shared KV Attention Transformation
MoSKA converts the bottlenecked KV cache processing into high-intensity compute operations by leveraging simultaneous access to shared context across requests. For $N$ concurrent requests attending to the same shared key ($K_s$) and value ($V_s$) matrices, naïve attention would instantiate $N$ separate GEMV calls, one per query $q_i$, streaming $K_s$ and $V_s$ from memory $N$ times.
MoSKA instead stacks all queries into a batched matrix $Q = [q_1; q_2; \dots; q_N] \in \mathbb{R}^{N \times d}$ and utilizes two GEMM kernels:
$$A = \frac{Q K_s^{\top}}{\sqrt{d}}, \qquad P = \operatorname{softmax}(A), \qquad Y = P V_s,$$
or compactly,
$$Y = \operatorname{softmax}\!\left(\frac{Q K_s^{\top}}{\sqrt{d}}\right) V_s.$$
This raises arithmetic intensity, shifting the operation from memory-bound to compute-bound and substantially improving GPU utilization. The transformation requires maximizing the effective query batch $N$, data contiguity for $Q$, and persistent device memory for $K_s$ and $V_s$, so that only two optimized GEMM kernels are launched.
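A minimal PyTorch sketch of this two-GEMM formulation follows; it expresses the batched structure described above and is not the paper's kernel, and the tensor names and demo shapes are assumptions:

```python
import torch

def shared_kv_attention(Q, K_s, V_s):
    """Batched attention of N queries over one shared KV cache.

    Q:   (N, d)  stacked queries from concurrent requests
    K_s: (L, d)  shared keys, resident in device memory
    V_s: (L, d)  shared values, resident in device memory
    """
    d = Q.shape[-1]
    A = Q @ K_s.T / d**0.5         # GEMM 1: (N, L) attention logits
    P = torch.softmax(A, dim=-1)   # row-wise softmax
    return P @ V_s                 # GEMM 2: (N, d) outputs

# Usage: all N requests reuse the same K_s/V_s, streamed once per kernel launch.
N, L, d = 256, 4096, 128
Q = torch.randn(N, d)
K_s, V_s = torch.randn(L, d), torch.randn(L, d)
Y = shared_kv_attention(Q, K_s, V_s)   # (N, d)
```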
3. Sparse Attention via Mixture-of-Experts Routing
Computing attention over all shared tokens remains prohibitive for extremely large contexts. MoSKA uses Mixture-of-Experts (MoE)-inspired routing to trim the active KV region. The shared cache is partitioned into $M$ “chunks” (experts), each with a representative embedding $e_j$. Each query $q_i$ scores all chunks, $s_{ij} = q_i \cdot e_j$, and selects the top-$k$ most relevant, $R_i = \operatorname{top\text{-}}k_j(s_{ij})$; attention is then computed only over the chunks in $R_i$. Pseudocode for sparse attention:
```
Input: batch queries {q_i}, chunk embeddings {e_j}
for i in 1..N:
    scores_i = q_i @ E.T          # relevance of each chunk to query i
    R_i = top_k(scores_i)         # routed chunk indices for query i
Gather routed (query, chunk) pairs into Q_r, K_r, V_r
A   = GEMM(Q_r, K_r.T) / sqrt(d)
P   = row_softmax(A)
Y_r = GEMM(P, V_r)
Scatter Y_r back to per-request outputs {y_i}
```
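A runnable sketch of the same route-then-attend flow in PyTorch; the chunking granularity, per-query gather strategy, and tensor names are illustrative assumptions rather than the paper's implementation:

```python
import torch

def moska_sparse_attention(Q, E, K_chunks, V_chunks, k):
    """Route each query to its top-k shared chunks, then attend only over those.

    Q:        (N, d)      batch queries
    E:        (M, d)      one representative embedding per chunk
    K_chunks: (M, c, d)   shared keys, partitioned into M chunks of c tokens
    V_chunks: (M, c, d)   shared values, same partitioning
    """
    N, d = Q.shape
    scores = Q @ E.T                               # (N, M) chunk relevance per query
    R = scores.topk(k, dim=-1).indices             # (N, k) routed chunk indices
    K_r = K_chunks[R].reshape(N, -1, d)            # (N, k*c, d) gathered keys
    V_r = V_chunks[R].reshape(N, -1, d)            # (N, k*c, d) gathered values
    A = torch.einsum('nd,ntd->nt', Q, K_r) / d**0.5
    P = torch.softmax(A, dim=-1)
    return torch.einsum('nt,ntd->nd', P, V_r)      # (N, d) outputs

# Example: 256 queries each routed to 4 of 64 chunks, skipping 93.75% of shared tokens.
N, M, c, d, k = 256, 64, 512, 128, 4
Q, E = torch.randn(N, d), torch.randn(M, d)
K_chunks, V_chunks = torch.randn(M, c, d), torch.randn(M, c, d)
Y = moska_sparse_attention(Q, E, K_chunks, V_chunks, k)
```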
4. Disaggregated Hardware for Unique and Shared Context
To accommodate divergent memory and compute profiles, MoSKA proposes disaggregated serving infrastructure:
- Unique-KV Nodes: Serve memory-bound per-request unique tokens, equipped with high-bandwidth memory (HBM2e/HBM3), and co-located feed-forward layers for latency hiding.
- Shared-KV Nodes: Execute compute-bound batched GEMMs for shared chunks, with abundant tensor cores and persistent caching of shared context.
A scheduler routes incoming queries to the appropriate node type, allowing independent hardware scaling: adding Shared-KV nodes increases shared context bandwidth, whereas adding Unique-KV nodes serves more concurrent requests.
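The split can be pictured with a minimal scheduler sketch; the node classes, routing rule, and field names are illustrative assumptions, not the paper's API:

```python
from dataclasses import dataclass, field

@dataclass
class Query:
    request_id: int
    shared_context_id: str | None = None   # identifier of a cached shared corpus, if any

@dataclass
class DisaggregatedScheduler:
    """Illustrative routing: each query's unique-KV work goes to a memory-bound node,
    and its shared-KV work is batched per shared context onto a compute-bound node."""
    unique_kv_queue: list[Query] = field(default_factory=list)
    shared_kv_batches: dict[str, list[Query]] = field(default_factory=dict)

    def route(self, q: Query) -> None:
        # Per-request unique tokens: always served by a Unique-KV (HBM-heavy) node.
        self.unique_kv_queue.append(q)
        # Shared tokens: batched with other requests on a Shared-KV (tensor-core-heavy)
        # node, so one persistent K_s/V_s copy serves the whole batch via GEMM.
        if q.shared_context_id is not None:
            self.shared_kv_batches.setdefault(q.shared_context_id, []).append(q)

# Usage: two requests sharing "legal_corpus_v1" form one GEMM batch on a Shared-KV node.
sched = DisaggregatedScheduler()
sched.route(Query(1, "legal_corpus_v1"))
sched.route(Query(2, "legal_corpus_v1"))
sched.route(Query(3))   # request with unique context only
```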
5. Empirical Results: Throughput and Scaling
MoSKA was benchmarked against FlashAttention, SGLang, LongHeads, and ChunkAttention using Llama 3.1 8B in FP8 (see Rhee et al., 8 Nov 2025). With a fixed unique-token budget per request, shared contexts scaling up to 16M tokens, and a 35 tok/s/request target generation rate:
| Method | Peak batch | Throughput | Speedup vs FlashAttention |
|---|---|---|---|
| FlashAttention | 16 | 1 × (Ref.) | 1 × |
| SGLang | 64 | 8 × | 8 × |
| LongHeads | 32 | 5 × | 5 × |
| ChunkAttention | 128 | 120 × | 120 × |
| MoSKA | 256 | 538.7 × | 538.7 × |
MoSKA maintains high compute utilization on Shared-KV nodes (even for 16M-token shared caches) and isolates low-MFU, memory-bound work on Unique-KV nodes. This suggests that, under high query concurrency and extensive context sharing, MoSKA can scale attention throughput nearly 540-fold relative to FlashAttention.
6. Design Trade-offs and Limitations
MoSKA’s benefits are subject to several constraints:
- Concurrency requirement: GEMM startup overhead is amortized only at moderate-to-high query concurrency (on the order of 64 concurrent queries); at low concurrency the gains shrink.
- Memory overhead: The system incurs ~5–10% extra cache to maintain chunk embeddings and MoE routing structures.
- Routing cost: Top-$k$ selection per query (cost linear in the number of chunks $M$) is manageable for moderate $M$ but requires optimized approximate nearest-neighbor search when $M$ grows very large (see the routing sketch after this list). A plausible implication is that workloads with predominantly unique tokens or low shared-context reuse see reduced gains.
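For large chunk counts, the exact top-$k$ scan over chunk embeddings can be replaced by an approximate nearest-neighbor index. The sketch below uses FAISS as one possible library; the index type, cluster count, and probe setting are assumptions, not parameters prescribed by the paper:

```python
import faiss
import numpy as np

d, M, k = 128, 100_000, 8        # embedding dim, number of chunks, chunks kept per query

# Chunk representative embeddings (assumed precomputed alongside the shared KV cache).
E = np.random.randn(M, d).astype("float32")

# IVF index: the exact O(M) scan is replaced by probing a few coarse clusters.
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(E)
index.add(E)
index.nprobe = 32                # clusters searched per query (recall/latency knob)

Q = np.random.randn(256, d).astype("float32")   # batch of query vectors
scores, routed_chunks = index.search(Q, k)      # (256, k) approximate top-k chunk ids
```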
7. Future Directions
Active research avenues include:
- Adaptive GPU scheduling across heterogeneous devices (compute-optimized, memory-optimized nodes).
- Position-independent chunking (as in EPIC) for arbitrary composability (“Universal MoSKA”).
- Hardware-level acceleration of routing/top-k and tensor slicing.
Collectively, MoSKA establishes a systematic blueprint for high-performance, resource-adaptive LLM inference under large context and high concurrency settings, explicitly leveraging context data heterogeneity to re-balance memory and compute utilization, and achieving both throughput and scaling advantages that are substantiated by analytical and experimental results (Rhee et al., 8 Nov 2025).