
MoSKA: Shared KV Attention for LLMs

Updated 15 November 2025
  • The paper introduces MoSKA, which rebalances unique and shared key-value processing to convert memory-bound operations into compute-bound GEMM calls, achieving up to 538.7× throughput improvement.
  • MoSKA partitions shared tokens using Mixture-of-Experts routing to achieve up to 75% sparsity, significantly reducing compute and memory utilization while focusing on semantically salient contexts.
  • MoSKA employs a disaggregated hardware design by separating Unique-KV and Shared-KV nodes, allowing independent scaling of memory bandwidth and compute resources for efficient long-sequence inference.

Mixture of Shared KV Attention (MoSKA) is an architectural paradigm for efficient long-sequence inference in LLMs, focused on overcoming the severe performance limitations introduced by memory-bound Key-Value (KV) cache operations as context lengths and batch sizes scale. MoSKA exploits workload heterogeneity by decomposing context data into per-request unique tokens and massively reused shared sequences, introducing compute-efficient shared KV processing and sparsification mechanisms. This entry synthesizes the principal methods, mathematical formulations, hardware considerations, and empirical evaluation from the foundational source (Rhee et al., 8 Nov 2025).

1. KV Cache Bottlenecks in Long-Sequence LLMs

Modern LLM inference with multi-million-token contexts exposes a critical bottleneck: the KV cache. During autoregressive decoding, each query attends across all previous keys and values. Naïve implementations, and even widely adopted mitigations such as Grouped-Query Attention (GQA), quantization, and uniform sparsity, still reduce to memory-bound General Matrix-Vector (GEMV) operations whose throughput scales poorly as batch size $B$ and sequence length $L$ grow (Rhee et al., 8 Nov 2025).

Shared context tokens, such as fixed system prompts or legal corpora, though identical across requests, are traditionally cached per request, leading to redundant memory bandwidth consumption. As batch concurrency grows, GPUs become increasingly underutilized in compute while being throttled by memory bandwidth.
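
To make the memory-bound character concrete, the following back-of-the-envelope estimate (illustrative sizes only; the specific numbers are assumptions, not figures from the paper) computes the arithmetic intensity of one decode step over a long per-request KV cache.

# Illustrative arithmetic-intensity estimate for one decode step of one request.
# All sizes are assumptions for illustration, not values from the paper.
L = 1_000_000          # cached tokens
d = 128                # head dimension
H = 32                 # attention heads
bytes_per_elem = 1     # FP8 KV cache

kv_bytes = 2 * L * d * H * bytes_per_elem   # keys + values streamed from memory
flops = 4 * L * d * H                       # q·K^T and p·V, ~2 FLOPs per MAC each
print(f"KV traffic: {kv_bytes / 1e9:.1f} GB, compute: {flops / 1e9:.1f} GFLOP, "
      f"intensity: {flops / kv_bytes:.1f} FLOP/byte")
# ~2 FLOP/byte is far below the hundreds of FLOP/byte a modern GPU needs to be
# compute-bound, so per-request decode attention is limited by memory bandwidth.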

2. Shared KV Attention Transformation

MoSKA converts the bottlenecked KV cache processing into high-intensity compute operations by leveraging simultaneous access to shared context across requests. For $N$ concurrent requests attending to the same shared key ($K_s \in \mathbb{R}^{M \times d}$) and value ($V_s \in \mathbb{R}^{M \times d}$) matrices, naïve attention would instantiate $N$ separate GEMV calls, one per query $q_i$:

$$a_i = \mathrm{softmax}\left(\frac{q_i K_s^T}{\sqrt{d}}\right), \qquad y_i = a_i V_s,$$

streaming $K_s$ and $V_s$ from memory $N$ times.

MoSKA instead stacks all $N$ queries into a batched matrix $Q_b \in \mathbb{R}^{N \times d}$ and uses two GEMM kernels:

$$A = Q_b K_s^T, \qquad A' = A / \sqrt{d},$$

$$P = \mathrm{softmax}(A'), \qquad Y = P V_s,$$

or, compactly,

$$Y = \mathrm{softmax}\left(\frac{Q_b K_s^T}{\sqrt{d}}\right) V_s.$$

This keeps the arithmetic complexity at $O(NMd)$ but converts the operation from memory-bound GEMV to compute-bound GEMM, since $K_s$ and $V_s$ are read from memory once rather than $N$ times, substantially improving GPU utilization. The transformation requires maximizing $N$, keeping $Q_b$ contiguous in memory, and holding $K_s, V_s$ persistently in device memory, after which only two optimized GEMM kernels are launched.
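
The contrast can be sketched in a few lines of NumPy. This is a minimal illustration of the batched formulation, not the paper's kernel implementation; the function names and shapes are chosen here for exposition.

import numpy as np

def shared_attention_gemv(Q, K_s, V_s):
    """Naive path: one GEMV pair per request, re-reading K_s and V_s each time."""
    d = Q.shape[1]
    outs = []
    for q in Q:
        a = q @ K_s.T / np.sqrt(d)
        p = np.exp(a - a.max()); p /= p.sum()      # softmax over shared tokens
        outs.append(p @ V_s)
    return np.stack(outs)

def shared_attention_gemm(Q_b, K_s, V_s):
    """MoSKA-style path: all N queries traverse K_s and V_s once via two GEMMs."""
    d = Q_b.shape[1]
    A = Q_b @ K_s.T / np.sqrt(d)                   # GEMM 1: (N, M) scaled scores
    P = np.exp(A - A.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)              # row-wise softmax
    return P @ V_s                                 # GEMM 2: (N, d) outputs

Both functions produce identical outputs; the difference is purely how many times the shared matrices are streamed, which is what moves the workload from the memory-bound to the compute-bound regime.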

3. Sparse Attention via Mixture-of-Experts Routing

Computing attention over all shared tokens remains prohibitive for extremely large contexts. MoSKA uses Mixture-of-Experts (MoE)-inspired routing to trim the active KV region. The shared cache is partitioned into $C$ "chunks" (experts), each with a representative embedding $e_j \in \mathbb{R}^d$. Each query $q_i$ scores all chunks,

$$s_{i,j} = q_i \cdot e_j, \qquad j = 1, \ldots, C,$$

and selects the top-$k$ most relevant, $R_i = \operatorname{arg\,top}_k(s_{i,j})$. The sparse attention path, written as a runnable NumPy sketch (assuming equally sized chunks of $m$ tokens), is:

import numpy as np

def moska_sparse_attention(Q, E, K_chunks, V_chunks, k):
    """Q: (N, d) queries; E: (C, d) chunk embeddings;
    K_chunks, V_chunks: (C, m, d) shared keys/values per chunk."""
    N, d = Q.shape
    scores = Q @ E.T                                   # (N, C) routing scores s_{i,j}
    R = np.argsort(-scores, axis=1)[:, :k]             # top-k chunk ids R_i per query
    K_r = K_chunks[R].reshape(N, -1, d)                # gather selected keys (N, k*m, d)
    V_r = V_chunks[R].reshape(N, -1, d)                # gather selected values
    A = np.einsum('nd,nld->nl', Q, K_r) / np.sqrt(d)   # batched attention scores
    P = np.exp(A - A.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)                  # row-wise softmax
    return np.einsum('nl,nld->nd', P, V_r)             # (N, d) outputs y_i

The final output per query is

$$y_i = \sum_{j \in R_i} \mathrm{softmax}\left(\frac{q_i K_j^T}{\sqrt{d}}\right) V_j.$$

This mechanism achieves up to 75% sparsity in routing for workloads with $k/C \approx 25\%$, reducing compute and memory utilization while focusing on semantically salient chunks.
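
As a quick illustration of the routing sparsity, the sketch above can be exercised on random tensors; the sizes below are arbitrary assumptions chosen so that $k/C = 25\%$.

rng = np.random.default_rng(0)
N, C, m, d, k = 64, 32, 256, 128, 8       # 8 of 32 chunks selected => 75% sparsity
Q = rng.standard_normal((N, d))
E = rng.standard_normal((C, d))
K_chunks = rng.standard_normal((C, m, d))
V_chunks = rng.standard_normal((C, m, d))
Y = moska_sparse_attention(Q, E, K_chunks, V_chunks, k)
assert Y.shape == (N, d)                   # one output vector per query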

4. Disaggregated Hardware for Unique and Shared Context

To accommodate divergent memory and compute profiles, MoSKA proposes disaggregated serving infrastructure:

  • Unique-KV Nodes: Serve memory-bound attention over per-request unique tokens; equipped with high-bandwidth memory (HBM2e/HBM3) and co-located feed-forward layers for latency hiding.
  • Shared-KV Nodes: Execute compute-bound batched GEMMs for shared chunks, with abundant tensor cores and persistent caching of shared context.

A scheduler routes incoming queries to the appropriate node type, allowing independent hardware scaling: adding Shared-KV nodes increases shared context bandwidth, whereas adding Unique-KV nodes serves more concurrent requests.
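
A highly simplified scheduler sketch is shown below. The interface and grouping strategy are assumptions for illustration; the paper describes the disaggregated design at the architecture level rather than prescribing this code. The key idea it captures is that every request carries unique-KV work, while requests sharing a context are grouped so a Shared-KV node can serve them with one batched GEMM.

from collections import defaultdict

def dispatch(batch):
    """batch: list of (request_id, shared_ctx_id or None) for one decode step.
    Returns per-request work for Unique-KV nodes and per-context query groups
    for Shared-KV nodes (hypothetical scheduler interface)."""
    unique_work = [req for req, _ in batch]        # every request has private KV
    shared_batches = defaultdict(list)
    for req, ctx in batch:
        if ctx is not None:
            shared_batches[ctx].append(req)        # group queries by shared context
    return unique_work, dict(shared_batches)

# Example: three requests over the same corpus can be batched on a Shared-KV node.
unique, shared = dispatch([("r1", "corpus"), ("r2", "corpus"),
                           ("r3", "corpus"), ("r4", None)])
# shared == {"corpus": ["r1", "r2", "r3"]}; unique == ["r1", "r2", "r3", "r4"]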

5. Empirical Results: Throughput and Scaling

MoSKA was benchmarked against FlashAttention, SGLang, LongHeads, and ChunkAttention using Llama 3.1 8B in FP8 (Rhee et al., 8 Nov 2025). With $L_u = 64$K unique tokens per request, $L_s = 1$M–16M shared tokens, and a target generation rate of 35 tok/s per request:

Method          Peak batch   Throughput       Speedup vs FlashAttention
FlashAttention  16           1× (reference)   1×
SGLang          64           8×               8×
LongHeads       32           5×               5×
ChunkAttention  128          120×             120×
MoSKA           256          538.7×           538.7×

MoSKA maintains >80% compute utilization on Shared-KV nodes (even for 16M-token shared caches) and isolates low-MFU, memory-bound tasks on Unique-KV nodes. This suggests that, under high query concurrency and extensive context sharing, MoSKA can scale attention throughput nearly 540-fold relative to FlashAttention.

6. Design Trade-offs and Limitations

MoSKA’s benefits are subject to several constraints:

  • Concurrency requirement: GEMM startup overhead is amortized only under moderate-to-high query concurrency ($N \geq 32$–$64$); low concurrency yields smaller benefits.
  • Memory overhead: The system incurs roughly 5–10% extra cache to maintain chunk embeddings and MoE routing structures.
  • Routing cost: Top-$k$ selection per query (cost $O(NC)$) is manageable for moderate $C$ but requires optimized approximate nearest-neighbor search if $C \gg 1000$; a rough cost estimate follows this list. A plausible implication is that workloads with predominantly unique tokens or low shared-context reuse see reduced gains.
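
To put the $O(NC)$ routing cost in perspective, the following estimate uses assumed sizes (not figures from the paper) to compare dense chunk scoring against the sparse attention work it gates.

# Illustrative routing-cost estimate; all sizes are assumptions, not paper values.
N, C, d, m, k = 256, 4096, 128, 4096, 1024      # queries, chunks, dim, tokens/chunk, top-k
routing_macs = N * C * d                         # dense q_i . e_j scoring
attention_macs = 2 * N * k * m * d               # QK^T and PV over selected chunks only
print(f"routing: {routing_macs / 1e6:.0f} M MACs, "
      f"sparse attention: {attention_macs / 1e9:.0f} G MACs")
# Routing is orders of magnitude cheaper than the attention it prunes, but it grows
# linearly in C, motivating approximate nearest-neighbor search for very large C.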

7. Future Directions

Active research avenues include:

  • Adaptive GPU scheduling across heterogeneous devices (compute-optimized, memory-optimized nodes).
  • Position-independent chunking (as in EPIC) for arbitrary composability (“Universal MoSKA”).
  • Hardware-level acceleration of routing/top-k and tensor slicing.

Collectively, MoSKA establishes a systematic blueprint for high-performance, resource-adaptive LLM inference under large context and high concurrency settings, explicitly leveraging context data heterogeneity to re-balance memory and compute utilization, and achieving both throughput and scaling advantages that are substantiated by analytical and experimental results (Rhee et al., 8 Nov 2025).

References (1)

  • Rhee et al., 8 Nov 2025.