MoSKA: Shared KV Attention in LLMs
- MoSKA is an architecture that partitions per-request unique tokens and shared context to optimize LLM inference performance.
- It employs batched GEMM and MoE-inspired sparse attention routing to convert memory-bound KV operations into compute-bound tasks.
- The design reports throughput up to 538.7× that of existing baselines in workloads with heavy context sharing.
MoSKA (Mixture of Shared KV Attention) is an architecture for accelerating long-context LLM inference, specifically addressing the Key-Value (KV) cache bottleneck that emerges when handling massive context lengths. Traditional approaches incur severe memory-bound performance limitations as both the context length and the simultaneous batch size increase. MoSKA leverages the observation that real-world LLM workloads often contain a mixture of unique per-request sequences and highly reused shared context (such as system prompts or domain knowledge). By explicitly partitioning these categories and designing both algorithm and hardware accordingly, MoSKA achieves substantial throughput improvements, shifting shared-data KV operations from memory-bound to compute-bound regimes (Rhee et al., 8 Nov 2025).
1. Design Goals and Data Heterogeneity
MoSKA targets three primary goals:
- Eliminate linear scaling of memory-bound KV cache accesses with increasing batch size, which typically stifles GPU utilization under large-context regimes.
- Exploit the heterogeneity of context by treating per-request (unique) and multi-request (shared) sequences as separate categories—unique sequences are typically short (e.g., last 64 K tokens), while shared sequences represent very large, domain-wide corpora (1 M to 16 M tokens).
- Co-design algorithmic and infrastructure layers to optimally serve these categories, converting shared-sequence attention into a compute-bound operation (via batching/GEMM), while retaining low-latency memory handling for unique context.
The approach is motivated by the empirical distribution of LLM inference deployments, where common system or knowledge prompts are attended across many requests but traditional attention mechanisms fail to exploit this redundancy.
2. Shared KV Attention Mechanism
2.1 Conventional Attention (Unique Context)
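Given a request with $q \in \mathbb{R}^{1 \times d}$ (query), $K \in \mathbb{R}^{L \times d}$ (key), and $V \in \mathbb{R}^{L \times d}$ (value), attention is computed by
$$\mathrm{Attn}(q, K, V) = \mathrm{softmax}\!\left(\frac{q K^\top}{\sqrt{d}}\right) V$$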
Computation across positions is executed as GEMV (matrix-vector) operations, which are memory-bound.
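A minimal NumPy sketch of the per-request case may make the memory-bound claim concrete. The function name, the shapes, and the random data are illustrative assumptions rather than anything taken from the paper; the point is only that each cached key and value row is read once and used for a single query.

```python
import numpy as np

def single_query_attention(q, K, V):
    """Attention for one decode-step query against its own KV cache.

    q: (d,) query, K: (L, d) keys, V: (L, d) values (shapes are assumptions).
    """
    d = q.shape[0]
    scores = K @ q / np.sqrt(d)      # GEMV: every key row is read once
    scores = scores - scores.max()   # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights @ V               # second GEMV over the value cache

rng = np.random.default_rng(0)
L, d = 1024, 64                      # hypothetical cache length and head size
q = rng.standard_normal(d)
K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d))
out = single_query_attention(q, K, V)
print(out.shape)
```

Because each byte of `K` and `V` supports only a couple of floating-point operations here, runtime is dominated by how fast the cache can be streamed from memory.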
2.2 Batched Shared Attention via GEMM
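For $N$ requests attending a shared cache $(K_s, V_s)$ with $K_s, V_s \in \mathbb{R}^{L_s \times d}$, stack all queries into $Q \in \mathbb{R}^{N \times d}$. Shared-key attention becomes:
$$S = \frac{Q K_s^\top}{\sqrt{d}}$$
$$O = \mathrm{softmax}(S)\, V_s$$
This formulation converts $N$ memory-bound GEMVs into a single GEMM (matrix-matrix multiplication), whose arithmetic intensity grows with $N$ and so enables compute-bound performance on modern hardware.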
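The equivalence between the batched form and the per-request form can be checked with a short sketch. The names below (`shared_attention_gemm`, `flops_per_kv_byte`) and the cost model are assumptions for illustration: the intensity estimate counts only reads of the shared keys and values and ignores traffic for queries and outputs.

```python
import numpy as np

def softmax_rows(S):
    """Numerically stable softmax along the last axis."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def shared_attention_gemm(Q, K_s, V_s):
    """All N queries attend the same shared cache via matrix-matrix products."""
    d = Q.shape[1]
    S = Q @ K_s.T / np.sqrt(d)       # (N, L_s) scores from one GEMM
    return softmax_rows(S) @ V_s     # (N, d) outputs from a second GEMM

def shared_attention_gemv(Q, K_s, V_s):
    """Reference: the same result computed one request at a time."""
    d = Q.shape[1]
    outs = []
    for q in Q:
        s = K_s @ q / np.sqrt(d)     # one GEMV per request
        outs.append(softmax_rows(s) @ V_s)
    return np.stack(outs)

def flops_per_kv_byte(n_requests, bytes_per_elem=1.0):
    """Idealised arithmetic intensity of the shared pass.

    Per cached token, the K and V rows (2*d elements) are read once, and each
    request spends about 4*d FLOPs on them, giving 2*N/bytes_per_elem.
    """
    return 2.0 * n_requests / bytes_per_elem

rng = np.random.default_rng(0)
N, L_s, d = 8, 256, 64               # hypothetical sizes
Q = rng.standard_normal((N, d))
K_s = rng.standard_normal((L_s, d))
V_s = rng.standard_normal((L_s, d))
O_gemm = shared_attention_gemm(Q, K_s, V_s)
O_gemv = shared_attention_gemv(Q, K_s, V_s)
print(np.allclose(O_gemm, O_gemv))
print([flops_per_kv_byte(n) for n in (1, 16, 256)])
```

Under this simplified model the work done per byte of shared cache grows linearly with the number of requests stacked into the batch, which is the property the GEMM formulation relies on.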
3. MoE-Inspired Sparse Attention Pruning
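To scale shared attention when the shared context length $L_s$ is very large, MoSKA incorporates a Mixture-of-Experts (MoE) style routing mechanism:
- Chunk Partitioning: Partition the shared cache into $M$ disjoint chunks, each of length $C$. Precompute embeddings $e_j$ for each chunk.
- Routing: For each query $q_i$, compute relevance $r_{i,j} = \mathrm{sim}(q_i, e_j)$ for all $j \in \{1, \dots, M\}$. Select top-$k$ chunks via $\mathcal{I}_i = \operatorname{TopK}_j(r_{i,j})$.
- Pruned Attention: Concatenate $(K_j, V_j)$ for $j \in \mathcal{I}_i$, and apply shared KV attention only within these $k$ chunks:
$$O_i = \mathrm{softmax}\!\left(\frac{q_i K_{\mathcal{I}_i}^\top}{\sqrt{d}}\right) V_{\mathcal{I}_i}$$
This reduces computation from $O(N L_s d)$ to $O(N k C d)$ for $N$ queries of head dimension $d$, where $L_s = MC$. Chunk selection is non-parametric, relying on embedding similarity rather than gradient-based routing.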
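The three routing steps can be sketched as follows. Several choices here are assumptions made only for illustration and are not confirmed by the source: chunk embeddings are taken as the mean of each chunk's keys, relevance is a plain dot product, and all function names are hypothetical.

```python
import numpy as np

def build_chunk_embeddings(K_s, chunk_len):
    """Summarise each chunk by its mean key (mean pooling is an assumption)."""
    L_s, d = K_s.shape
    assert L_s % chunk_len == 0, "sketch assumes equally sized chunks"
    return K_s.reshape(L_s // chunk_len, chunk_len, d).mean(axis=1)  # (M, d)

def route_top_k(Q, E, k):
    """Score every chunk for every query and keep the k best chunk ids."""
    R = Q @ E.T                                       # (N, M) relevance
    idx = np.argpartition(-R, k - 1, axis=1)[:, :k]   # unordered top-k
    return np.sort(idx, axis=1)

def pruned_attention(q, K_s, V_s, chunk_ids, chunk_len):
    """Attention for one query restricted to its selected chunks."""
    d = q.shape[0]
    rows = (chunk_ids[:, None] * chunk_len + np.arange(chunk_len)).ravel()
    K_sel, V_sel = K_s[rows], V_s[rows]               # gather selected chunks
    s = K_sel @ q / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V_sel

rng = np.random.default_rng(0)
N, d, M, C, k = 4, 32, 16, 8, 4      # k / M = 0.25, i.e. 75% of chunks skipped
K_s = rng.standard_normal((M * C, d))
V_s = rng.standard_normal((M * C, d))
Q = rng.standard_normal((N, d))
E = build_chunk_embeddings(K_s, C)
top = route_top_k(Q, E, k)
O = np.stack([pruned_attention(Q[i], K_s, V_s, top[i], C) for i in range(N)])
print(top.shape, O.shape)
```

The per-query loop keeps the sketch short. One way to retain the GEMM batching described earlier would be to group the queries that selected the same chunk and process each group with a single matrix product.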
4. Disaggregated Inference Infrastructure
MoSKA separates hardware resources into two node types:
| Node Type | Specialized For | Hardware/Operation Highlights |
|---|---|---|
| Unique-KV Node | Per-request unique KV attention | High-bandwidth HBM; FFN co-locate |
| Shared-KV Node | Compute-bound shared KV attention (GEMM) | Large shared cache; GEMM cores |
- Data Flow:
  - Unique-KV node receives user tokens, produces queries $Q$.
  - Routing for top-$k$ chunk indices ($\mathcal{I}$) is computed.
  - $Q$ and $\mathcal{I}$ are sent to Shared-KV node.
  - Shared-KV node performs batched GEMM, returns the shared-attention output $O$.
  - Unique-KV node applies final projection/FFN and emits tokens.
This disaggregation ensures that low-latency unique-context handling is not penalized by large-batch shared attention, and it allows specialization: Shared-KV nodes remain compute-bound, exceeding 80% MFU (Model FLOPs Utilization) as batch and shared-context size grow.
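The text here does not spell out how the result returned by the Shared-KV node is combined with attention over the unique context. One standard option, shown purely as an assumption, is to have each side return its partial output together with the log-sum-exp of its scores and to merge the two exactly. The function names are hypothetical.

```python
import numpy as np

def partial_attention(q, K, V):
    """Attention over one KV segment, plus the log-sum-exp of its scores."""
    s = K @ q / np.sqrt(q.shape[0])
    m = s.max()
    w = np.exp(s - m)
    denom = w.sum()
    return (w @ V) / denom, m + np.log(denom)

def merge_partials(o_a, lse_a, o_b, lse_b):
    """Combine two partial results into attention over the union of segments."""
    m = max(lse_a, lse_b)
    w_a, w_b = np.exp(lse_a - m), np.exp(lse_b - m)   # relative softmax mass
    return (w_a * o_a + w_b * o_b) / (w_a + w_b)

rng = np.random.default_rng(0)
d = 32
q = rng.standard_normal(d)
K_u, V_u = rng.standard_normal((48, d)), rng.standard_normal((48, d))    # unique
K_s, V_s = rng.standard_normal((200, d)), rng.standard_normal((200, d))  # shared

o_u, lse_u = partial_attention(q, K_u, V_u)   # would run on the Unique-KV node
o_s, lse_s = partial_attention(q, K_s, V_s)   # would run on the Shared-KV node
merged = merge_partials(o_u, lse_u, o_s, lse_s)

# Reference: ordinary attention over the concatenated cache.
reference, _ = partial_attention(q, np.vstack([K_u, K_s]), np.vstack([V_u, V_s]))
print(np.allclose(merged, reference))
```

With this scheme only a `d`-sized vector and one scalar per query and head would need to travel back from the Shared-KV node, which fits the small-message data flow described above.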
5. Quantitative Performance Results
Empirical evaluation uses LLaMA 3.1-8B (FP8), sparse attention with 75% sparsity (each query attends to a quarter of the shared chunks), and 2× NVIDIA DGX H200 nodes.
- Context/Batch Regime: Each request sees 64 K unique tokens, supplemented by 1–16 M shared tokens. Throughput target: 35 tokens/s per request.
- Baselines: FlashAttention, SGLang, LongHeads, ChunkAttention.
| Metric | MoSKA Result | Baseline Results |
|---|---|---|
| Max batch size | Greatly exceeds all baselines | Limited by shared KV |
| Throughput | Up to 538.7× baseline (at high sharing) | Orders-of-magnitude lower |
| Shared-KV MFU | Above 80% at large shared-context sizes (compute-bound) | Lower, memory-bound |
| Unique-KV MFU | Memory-bound, remains low-latency | Not explicitly separated |
Routing overhead is negligible as long as the number of chunks stays moderate; the throughput benefit scales with the degree of context sharing.
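A back-of-the-envelope sizing shows why replicating the shared cache per request limits batch size. The model shape used below (32 layers, 8 KV heads, head dimension 128) is taken from the public LLaMA 3.1-8B configuration, and storing KV entries in one byte each is an assumption based on the FP8 setting. The batch size of 32 and the binary reading of "64 K" and "16 M" are hypothetical choices for the example.

```python
GIB = 1024 ** 3

def kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=1):
    """Bytes of KV cache per token: keys and values for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def total_kv_gib(batch, unique_tokens, shared_tokens, share=True):
    """Total KV footprint, with the shared cache stored once or once per request."""
    shared_copies = 1 if share else batch
    tokens = batch * unique_tokens + shared_copies * shared_tokens
    return tokens * kv_bytes_per_token() / GIB

unique_tokens = 64 * 1024            # 64 K unique tokens per request
shared_tokens = 16 * 1024 ** 2       # 16 M shared tokens
batch = 32                           # hypothetical batch size

per_token = kv_bytes_per_token()
unique_gib = unique_tokens * per_token / GIB
shared_gib = shared_tokens * per_token / GIB
print(per_token, unique_gib, shared_gib)
print(total_kv_gib(batch, unique_tokens, shared_tokens, share=True))
print(total_kv_gib(batch, unique_tokens, shared_tokens, share=False))
```

Under these assumptions the shared cache dominates the footprint, so holding a single copy that all requests attend to is what leaves room for larger batches.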
6. Limitations and Future Directions
MoSKA’s efficacy is contingent on high shared-context prevalence; workloads dominated by unique context see little gain. The non-parametric router may miss optimal shared chunks when semantic overlap is subtle, limiting pruning effectiveness. Disaggregation introduces additional system and networking complexity.
Identified potential bottlenecks are:
- Inter-node networking for queries $Q$ and routing indices, especially at low batch.
- Router compute cost when the chunk count $M$ is extremely large.
- Memory replication for large caches across multiple Shared-KV nodes.
Future work includes trainable or hybrid (learned+static) routers to improve semantic chunk selection, a position-independent “Universal MoSKA” for arbitrary chunk ordering, special-purpose interconnects to reduce data movement, and hardware primitives that fuse the routing and GEMM steps (Rhee et al., 8 Nov 2025).