MoSKA: Shared KV Attention in LLMs

Updated 17 April 2026
  • MoSKA is an architecture that partitions per-request unique tokens and shared context to optimize LLM inference performance.
  • It employs batched GEMM and MoE-inspired sparse attention routing to convert memory-bound KV operations into compute-bound tasks.
  • In the reported evaluation, the design reaches up to 538.7× higher throughput than existing baselines when context sharing is high.

MoSKA (Mixture of Shared KV Attention) is an architecture for accelerating long-context LLM inference, specifically addressing the Key-Value (KV) cache bottleneck that emerges when handling massive context lengths. Traditional approaches incur severe memory-bound performance limitations as both the context length and the simultaneous batch size increase. MoSKA leverages the observation that real-world LLM workloads often contain a mixture of unique per-request sequences and highly reused shared context (such as system prompts or domain knowledge). By explicitly partitioning these categories and designing both algorithm and hardware accordingly, MoSKA achieves substantial throughput improvements, shifting shared-data KV operations from memory-bound to compute-bound regimes (Rhee et al., 8 Nov 2025).

1. Design Goals and Data Heterogeneity

MoSKA targets three primary goals:

  1. Eliminate linear scaling of memory-bound KV cache accesses with increasing batch size, which typically stifles GPU utilization under large-context regimes.
  2. Exploit the heterogeneity of context by treating per-request (unique) and multi-request (shared) sequences as separate categories—unique sequences are typically short (e.g., last 64 K tokens), while shared sequences represent very large, domain-wide corpora (1 M to 16 M tokens).
  3. Co-design algorithmic and infrastructure layers to optimally serve these categories, converting shared-sequence attention into a compute-bound operation (via batching/GEMM), while retaining low-latency memory handling for unique context.

The approach is motivated by the empirical distribution of LLM inference deployments, where common system or knowledge prompts are attended across many requests but traditional attention mechanisms fail to exploit this redundancy.

2. Shared KV Attention Mechanism

2.1 Conventional Attention (Unique Context)

Given a request $i$ with $Q_i \in \mathbb{R}^{L_i \times d}$ (query), $K_i \in \mathbb{R}^{L_i \times d}$ (key), and $V_i \in \mathbb{R}^{L_i \times d}$ (value), attention is computed by

$$A_i = \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d}}\right), \quad O_i = A_i V_i$$

Computation across $L_i$ positions is executed as $L_i$ GEMV (matrix-vector) operations, which are memory-bound.

2.2 Batched Shared Attention via GEMM

For $N$ requests attending a shared cache $(K_s, V_s) \in \mathbb{R}^{M \times d}$, stack all queries into $Q_\text{batch} \in \mathbb{R}^{N \times d}$. Shared-key attention becomes:

$$A_\text{batch} = \text{softmax}\left(\frac{Q_\text{batch} K_s^T}{\sqrt{d}}\right), \quad O_\text{batch} = A_\text{batch} V_s$$

This formulation converts $N$ memory-bound GEMVs into a single GEMM (matrix-matrix multiplication), whose arithmetic intensity grows with $N$ and enables compute-bound performance on modern hardware.
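The equivalence between the per-request and the batched formulation can be checked on a toy example. The following is a minimal sketch in plain Python, not the authors' implementation: the function names are hypothetical, each request contributes a single query row, and `score_intensity` is a rough roofline-style estimate that counts only the query, key, and score matrices as memory traffic.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Matrix product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def attend_one(q, k_s, v_s):
    """One request: a matrix-vector product (GEMV) against the shared keys."""
    d = len(q)
    scores = [sum(x * y for x, y in zip(q, key)) / math.sqrt(d) for key in k_s]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, v_s)) for j in range(len(v_s[0]))]

def attend_batched(q_batch, k_s, v_s):
    """All N requests at once: one GEMM for the scores, one for the outputs."""
    d = len(q_batch[0])
    scores = matmul(q_batch, transpose(k_s))  # N x M
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v_s)  # N x d

def score_intensity(n, m, d, bytes_per_elem=1):
    """Approximate FLOPs per byte moved for the (n x d) by (d x m) score product."""
    flops = 2 * n * m * d
    bytes_moved = bytes_per_elem * (n * d + m * d + n * m)
    return flops / bytes_moved

if __name__ == "__main__":
    q_batch = [[0.1, 0.2], [0.3, -0.4], [1.0, 0.0]]
    k_s = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.2]]
    v_s = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(attend_batched(q_batch, k_s, v_s))
    print(score_intensity(1, 1_000_000, 128), score_intensity(64, 1_000_000, 128))
```

Under this simple model, a single query performs just under 2 FLOPs per byte of FP8 keys read, while stacking 64 queries reuses each loaded key 64 times and raises the ratio well above that, which is the effect the batched formulation relies on.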

3. MoE-Inspired Sparse Attention Pruning

To scale shared attention when the shared cache length $M$ is very large, MoSKA incorporates a Mixture-of-Experts (MoE) style routing mechanism:

  • Chunk Partitioning: Partition the shared cache into $C$ disjoint chunks, each of length $M/C$. Precompute embeddings $e_c$ for each chunk.
  • Routing: For each query $q_i$, compute relevance $r_{i,c} = \text{sim}(q_i, e_c)$ for all $c \in \{1, \dots, C\}$. Select top-$k$ chunks via $\mathcal{S}_i = \text{TopK}_c(r_{i,c})$.
  • Pruned Attention: Concatenate $(K_s^{(c)}, V_s^{(c)})$ for $c \in \mathcal{S}_i$, and apply shared KV attention only within these $k$ chunks:

$$O_i = \text{softmax}\left(\frac{q_i K_{\mathcal{S}_i}^T}{\sqrt{d}}\right) V_{\mathcal{S}_i}$$

This reduces computation from $O(NMd)$ to $O(NkMd/C)$. Chunk selection is non-parametric, relying on embedding similarity rather than gradient-based routing.
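The three steps above (partition, route, attend within the selected chunks) can be sketched in a few lines of plain Python. This is an illustrative sketch only: the source does not specify how chunk embeddings are built or which similarity is used, so the mean key vector and the dot product below are assumptions, and all function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def chunk_embeddings(keys, chunk_len):
    """Assumed chunk summary: the mean key vector of each chunk."""
    chunks = [keys[i:i + chunk_len] for i in range(0, len(keys), chunk_len)]
    return [[sum(col) / len(chunk) for col in zip(*chunk)] for chunk in chunks]

def route_top_k(q, embeddings, k):
    """Indices of the k chunks whose embeddings score highest against q."""
    scores = [dot(q, e) for e in embeddings]  # assumed similarity: dot product
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return sorted(ranked[:k])

def pruned_attention(q, keys, values, chunk_len, selected):
    """Attend only over keys and values that belong to the selected chunks."""
    idx = [
        t
        for c in selected
        for t in range(c * chunk_len, min((c + 1) * chunk_len, len(keys)))
    ]
    scale = math.sqrt(len(q))
    weights = softmax([dot(q, keys[t]) / scale for t in idx])
    dim = len(values[0])
    return [sum(w * values[t][j] for w, t in zip(weights, idx)) for j in range(dim)]

if __name__ == "__main__":
    # Eight shared tokens in four chunks of two; each chunk points in one direction.
    keys = [[1, 0], [1, 0], [0, 1], [0, 1], [-1, 0], [-1, 0], [0, -1], [0, -1]]
    values = [[float(t), 0.0] for t in range(8)]
    emb = chunk_embeddings(keys, chunk_len=2)
    chosen = route_top_k([1.0, 0.5], emb, k=2)
    print(chosen, pruned_attention([1.0, 0.5], keys, values, 2, chosen))
```

With two of four chunks selected, the query touches half of the shared tokens, so the score computation shrinks by the same factor; the routing itself costs one similarity per chunk rather than one per token.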

4. Disaggregated Inference Infrastructure

MoSKA separates hardware resources into two node types:

| Node Type | Specialized For | Hardware/Operation Highlights |
|---|---|---|
| Unique-KV Node | Per-request unique KV attention | High-bandwidth HBM; co-located FFN |
| Shared-KV Node | Compute-bound shared KV attention (GEMM) | Large shared cache; GEMM cores |

Data flow:
  1. Unique-KV node receives user tokens, produces the queries $Q_i$.
  2. Routing for top-$k$ chunk indices (the selected set $\mathcal{S}_i$) is computed.
  3. $Q_i$ and $\mathcal{S}_i$ are sent to Shared-KV node.
  4. Shared-KV node performs batched GEMM, returns the shared-attention output $O_i^{\text{shared}}$.
  5. Unique-KV node applies final projection/FFN and emits tokens.
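The data flow above can be mimicked in a single process to make the division of labor concrete. The sketch below is hypothetical throughout: the class names, the mean-key chunk embeddings, the dot-product router, and the per-chunk grouping of queries are assumptions made for illustration, and the network hop and the final projection/FFN are omitted.

```python
import math
from collections import defaultdict

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class SharedKVNode:
    """Hypothetical node that holds the chunked shared cache."""

    def __init__(self, keys, values, chunk_len):
        self.chunks = [
            (keys[i:i + chunk_len], values[i:i + chunk_len])
            for i in range(0, len(keys), chunk_len)
        ]
        # Assumed chunk summary: mean key vector (not specified by the source).
        self.embeddings = [[sum(c) / len(k) for c in zip(*k)] for k, _ in self.chunks]

    def attend(self, queries, routes):
        """Step 4: group requests by chunk so each chunk is loaded once per batch."""
        by_chunk = defaultdict(list)
        for i, chunk_ids in enumerate(routes):
            for c in chunk_ids:
                by_chunk[c].append(i)
        scale = math.sqrt(len(queries[0]))
        scored = defaultdict(list)  # request index -> [(score, value row), ...]
        for c, members in by_chunk.items():
            keys, values = self.chunks[c]
            for i in members:  # stands in for one (members x chunk_len) GEMM
                for key, val in zip(keys, values):
                    scored[i].append((dot(queries[i], key) / scale, val))
        outputs = []
        for i in range(len(queries)):
            top = max(s for s, _ in scored[i])
            weights = [math.exp(s - top) for s, _ in scored[i]]
            total = sum(weights)
            dim = len(scored[i][0][1])
            outputs.append([
                sum(w * v[j] for w, (_, v) in zip(weights, scored[i])) / total
                for j in range(dim)
            ])
        return outputs

class UniqueKVNode:
    """Hypothetical node that owns per-request state and performs routing."""

    def __init__(self, shared_node, chunk_embeddings, top_k):
        self.shared = shared_node
        self.embeddings = chunk_embeddings  # local copy used only for routing
        self.top_k = top_k  # assumed to be at least 1

    def route(self, query):
        """Step 2: top-k chunks by dot-product similarity (assumed metric)."""
        scores = [dot(query, e) for e in self.embeddings]
        order = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        return sorted(order[:self.top_k])

    def step(self, queries):
        """Steps 2 to 4 for a batch of already-projected queries."""
        routes = [self.route(q) for q in queries]
        shared_out = self.shared.attend(queries, routes)  # steps 3 and 4
        return shared_out, routes  # step 5 (projection/FFN) is omitted here

if __name__ == "__main__":
    keys = [[1, 0], [1, 0], [0, 1], [0, 1], [-1, 0], [-1, 0], [0, -1], [0, -1]]
    values = [[float(t), 1.0] for t in range(8)]
    shared = SharedKVNode(keys, values, chunk_len=2)
    unique = UniqueKVNode(shared, shared.embeddings, top_k=2)
    print(unique.step([[1.0, 0.5], [-1.0, -0.5]]))
```

Only the queries and the small lists of chunk indices cross the node boundary in this sketch; the shared keys and values never leave the node that stores them.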

This disaggregation ensures low-latency unique context handling is not penalized by large-batch shared attention tasks, and allows specialization: Shared-KV nodes remain compute-bound, exceeding 80% MFU (Model FLOPs Utilization) as batch and shared-context size grow.

5. Quantitative Performance Results

Empirical evaluation uses LLaMA 3.1-8B (FP8), sparse attention with 75% sparsity (one quarter of the shared chunks retained per query), and 2× NVIDIA DGX H200 nodes.

  • Context/Batch Regime: Each request sees 64 K unique tokens, supplemented by 1–16 M shared tokens. Throughput target: 35 tokens/s per request.
  • Baselines: FlashAttention, SGLang, LongHeads, ChunkAttention.

| Metric | MoSKA Result | Baseline Results |
|---|---|---|
| Max batch size | Greatly exceeds all baselines | Limited by shared KV |
| Throughput | Up to 538.7× baseline (at high sharing) | Orders-of-magnitude lower |
| Shared-KV MFU | Above 80% as shared-context size grows (compute-bound) | Lower, memory-bound |
| Unique-KV MFU | Memory-bound, remains low-latency | Not explicitly separated |

Routing overhead is negligible when the chunk count is small relative to the shared cache length; throughput benefit scales with the degree of context sharing.
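A back-of-the-envelope calculation gives a sense of scale for the cache sizes in this regime. The sketch below is not taken from the paper: it assumes the published Llama 3.1 8B configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128), one byte per element for FP8, and binary units for "K" and "M"; the helper names are hypothetical.

```python
GIB = 2 ** 30

def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=1):
    """Bytes of keys plus values stored per token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def kv_cache_bytes(tokens, **config):
    """Total KV cache size for a context of the given token count."""
    return tokens * kv_bytes_per_token(**config)

# 64 K unique tokens per request and 16 M shared tokens (binary units assumed).
unique_per_request = kv_cache_bytes(64 * 1024)
shared_once = kv_cache_bytes(16 * 1024 * 1024)

def replicated_total(batch):
    """Total cache if every request held a private copy of the shared context."""
    return batch * (unique_per_request + shared_once)

def shared_total(batch):
    """Total cache if the shared context is stored once for the whole batch."""
    return batch * unique_per_request + shared_once

if __name__ == "__main__":
    print(kv_bytes_per_token())      # 65536 bytes, i.e. 64 KiB per token
    print(unique_per_request / GIB)  # 4.0 GiB per request
    print(shared_once / GIB)         # 1024.0 GiB for the shared context
```

Under these assumptions the shared context alone occupies about 1 TiB, so it has to be stored once and reused across requests; per-request copies would exhaust memory after a handful of requests.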

6. Limitations and Future Directions

MoSKA’s efficacy is contingent on high shared-context prevalence; workloads dominated by unique context see little gain. The non-parametric router may miss optimal shared chunks when semantic overlap is subtle, limiting pruning effectiveness. Disaggregation introduces additional system and networking complexity.

Identified potential bottlenecks are:

  • Inter-node networking for query tensors and routing indices, especially at low batch.
  • Router compute cost when the chunk count is extremely large.
  • Memory replication for large caches across multiple Shared-KV nodes.

Future work includes trainable or hybrid (learned+static) routers to improve semantic chunk selection, introducing position-independent “Universal MoSKA” for arbitrary chunk ordering, developing special-purpose interconnects to reduce data movement, and hardware primitives that fuse the routing and GEMM steps (Rhee et al., 8 Nov 2025).
