MoSKA: Shared KV Attention in LLMs

Updated 17 April 2026

MoSKA is an architecture that partitions per-request unique tokens and shared context to optimize LLM inference performance.
It employs batched GEMM and MoE-inspired sparse attention routing to convert memory-bound KV operations into compute-bound tasks.
The design achieves significant throughput improvements, with empirical results showing up to 538.7× higher throughput than traditional baselines.

MoSKA (Mixture of Shared KV Attention) is an architecture for accelerating long-context LLM inference, specifically addressing the Key-Value (KV) cache bottleneck that emerges when handling massive context lengths. Traditional approaches incur severe memory-bound performance limitations as both the context length and the simultaneous batch size increase. MoSKA leverages the observation that real-world LLM workloads often contain a mixture of unique per-request sequences and highly reused shared context (such as system prompts or domain knowledge). By explicitly partitioning these categories and designing both algorithm and hardware accordingly, MoSKA achieves substantial throughput improvements, shifting shared-data KV operations from memory-bound to compute-bound regimes (Rhee et al., 8 Nov 2025).

1. Design Goals and Data Heterogeneity

MoSKA targets three primary goals:

Eliminate linear scaling of memory-bound KV cache accesses with increasing batch size, which typically stifles GPU utilization under large-context regimes.
Exploit the heterogeneity of context by treating per-request (unique) and multi-request (shared) sequences as separate categories. Unique sequences are comparatively short (e.g., the last 64 K tokens), while shared sequences represent very large, domain-wide corpora (1 M to 16 M tokens).
Co-design algorithmic and infrastructure layers to optimally serve these categories, converting shared-sequence attention into a compute-bound operation (via batching/GEMM), while retaining low-latency memory handling for unique context.

The approach is motivated by the empirical distribution of LLM inference deployments, where common system or knowledge prompts are attended across many requests but traditional attention mechanisms fail to exploit this redundancy.

2. Shared KV Attention Mechanism

2.1 Conventional Attention (Unique Context)

Given a request $i$ with $Q_i \in \mathbb{R}^{L_i \times d}$ (query), $K_i \in \mathbb{R}^{L_i \times d}$ (key), and $V_i \in \mathbb{R}^{L_i \times d}$ (value), attention is computed by

$A_i = \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d}}\right), \quad O_i = A_i V_i$

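During decoding, each new token contributes a single query vector, so attention against the cached $K_i$ and $V_i$ is executed as GEMV (matrix-vector) operations, one per position across the $L_i$ positions. Each GEMV reads the entire cache while performing little arithmetic per byte, which makes it memory-bound.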
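A back-of-the-envelope estimate makes the memory-bound claim concrete. The sketch below counts FLOPs and bytes moved for the score computation $QK^T$ alone. The function name and the cost model are illustrative assumptions, not taken from the paper: queries and keys are read once, scores are written once, and every element occupies `bytes_per_elem` bytes.

```python
def qk_arithmetic_intensity(n_queries: int, n_keys: int, d: int,
                            bytes_per_elem: float = 1.0) -> float:
    """Rough FLOPs-per-byte estimate for computing Q @ K^T.

    Assumed cost model (illustrative only):
      * FLOPs: one multiply and one add per (query, key, feature) triple.
      * Bytes: read Q and K once, write the score matrix once.
    """
    flops = 2.0 * n_queries * n_keys * d
    bytes_moved = bytes_per_elem * (
        n_queries * d            # read queries
        + n_keys * d             # read key cache
        + n_queries * n_keys     # write scores
    )
    return flops / bytes_moved


# One decoding query against a 64 K-token cache: a GEMV.
gemv = qk_arithmetic_intensity(n_queries=1, n_keys=65_536, d=128)
print(f"GEMV intensity: {gemv:.2f} FLOPs/byte")
```

Under this model, with one byte per element, a single-query GEMV stays below 2 FLOPs per byte no matter how long the cache is. That is orders of magnitude below the ratio of peak compute to memory bandwidth on current GPUs. Raising `n_queries` against the same keys is the only term that lifts the ratio.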

2.2 Batched Shared Attention via GEMM

For $N$ requests attending a shared cache with keys and values $K_s, V_s \in \mathbb{R}^{M \times d}$, stack all queries into $Q_\text{batch} \in \mathbb{R}^{N \times d}$. Shared-key attention becomes:

$A_\text{batch} = \text{softmax}\left(\frac{Q_\text{batch} K_s^T}{\sqrt{d}}\right), \quad O_\text{batch} = A_\text{batch} V_s$

This formulation converts $N$ memory-bound GEMVs into a single GEMM (matrix-matrix multiplication), whose arithmetic intensity grows with the batch size $N$ and enables compute-bound performance on modern hardware.
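The equivalence between per-request GEMVs and the stacked GEMM can be checked numerically. The following is a minimal NumPy sketch with toy sizes and single-head attention; the function names are hypothetical and the code is not the paper's kernel. The batched form issues one matrix-matrix product for the scores and one for the weighted values.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def shared_attention_gemv(queries, K_s, V_s):
    """Baseline: each request attends to the shared cache separately."""
    d = K_s.shape[1]
    outputs = []
    for q in queries:                          # q has shape (d,)
        scores = K_s @ q / np.sqrt(d)          # GEMV: (M, d) @ (d,) -> (M,)
        outputs.append(softmax(scores) @ V_s)  # (M,) @ (M, d) -> (d,)
    return np.stack(outputs)


def shared_attention_gemm(Q_batch, K_s, V_s):
    """Batched form: all N requests share the same matrix products."""
    d = K_s.shape[1]
    scores = Q_batch @ K_s.T / np.sqrt(d)      # GEMM: (N, d) @ (d, M) -> (N, M)
    return softmax(scores) @ V_s               # GEMM: (N, M) @ (M, d) -> (N, d)


rng = np.random.default_rng(0)
N, M, d = 8, 256, 16                           # toy sizes, not the paper's
Q_batch = rng.standard_normal((N, d))
K_s = rng.standard_normal((M, d))
V_s = rng.standard_normal((M, d))

out_gemv = shared_attention_gemv(Q_batch, K_s, V_s)
out_gemm = shared_attention_gemm(Q_batch, K_s, V_s)
print(np.allclose(out_gemv, out_gemm))
```

Both paths compute the same outputs. The difference is that the batched path reads `K_s` and `V_s` once for all $N$ queries instead of once per query.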

3. MoE-Inspired Sparse Attention Pruning

To scale shared attention when the shared context length $M$ is very large, MoSKA incorporates a Mixture-of-Experts (MoE) style routing mechanism:

Chunk Partitioning: Partition the shared cache into $C$ disjoint chunks, each of length $M/C$. Precompute an embedding $e_c \in \mathbb{R}^{d}$ for each chunk $c$.
Routing: For each query $q_i$, compute relevance $r_{i,c} = \text{sim}(q_i, e_c)$ for all $c \in \{1, \dots, C\}$. Select the top-$k$ chunks via $S_i = \text{TopK}_c(r_{i,c})$.
Pruned Attention: Concatenate $(K_c, V_c)$ for $c \in S_i$ into $(K_{S_i}, V_{S_i})$, and apply shared KV attention only within these $k$ chunks:

$O_i = \text{softmax}\left(\frac{q_i K_{S_i}^T}{\sqrt{d}}\right) V_{S_i}$

This reduces computation from $O(NMd)$ to $O(NkMd/C)$. Chunk selection is non-parametric, relying on embedding similarity rather than gradient-based routing.
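The three steps above can be sketched in NumPy as follows. Two choices here are assumptions made only for illustration and are not stated in the source: the chunk embedding is taken to be the mean key vector of the chunk, and relevance is a plain dot product. The function name `route_and_attend` and its parameters are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def route_and_attend(Q_batch, K_s, V_s, num_chunks, top_k):
    """Top-k chunk routing, then attention over the selected chunks only."""
    M, d = K_s.shape
    assert M % num_chunks == 0, "toy sketch assumes equal-length chunks"
    chunk_len = M // num_chunks
    K_chunks = K_s.reshape(num_chunks, chunk_len, d)
    V_chunks = V_s.reshape(num_chunks, chunk_len, d)

    # ASSUMPTION: chunk embedding = mean key vector of the chunk.
    chunk_emb = K_chunks.mean(axis=1)                   # (num_chunks, d)

    # ASSUMPTION: relevance = dot product between query and chunk embedding.
    relevance = Q_batch @ chunk_emb.T                   # (N, num_chunks)
    # Indices of the top_k highest-scoring chunks per query (unordered).
    selected = np.argpartition(-relevance, top_k - 1, axis=1)[:, :top_k]

    outputs = np.empty_like(Q_batch)
    for i, chunk_ids in enumerate(selected):
        K_sel = K_chunks[chunk_ids].reshape(-1, d)      # (top_k * chunk_len, d)
        V_sel = V_chunks[chunk_ids].reshape(-1, d)
        scores = Q_batch[i] @ K_sel.T / np.sqrt(d)
        outputs[i] = softmax(scores) @ V_sel
    return outputs, selected


rng = np.random.default_rng(0)
N, M, d = 4, 64, 8                                      # toy sizes
Q_batch = rng.standard_normal((N, d))
K_s = rng.standard_normal((M, d))
V_s = rng.standard_normal((M, d))

pruned_out, selected = route_and_attend(Q_batch, K_s, V_s, num_chunks=8, top_k=2)
print(selected.shape, pruned_out.shape)
```

With 2 of 8 chunks selected, each query attends to a quarter of the shared keys. The per-query loop favors clarity; a batched system would presumably group the queries that selected the same chunk so that each chunk is still processed with a GEMM.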

4. Disaggregated Inference Infrastructure

MoSKA separates hardware resources into two node types:

Node Type	Specialized For	Hardware/Operation Highlights
Unique-KV Node	Per-request unique KV attention	High-bandwidth HBM; co-located FFN
Shared-KV Node	Compute-bound shared KV attention (GEMM)	Large shared cache; GEMM cores

Data Flow:

Unique-KV node receives user tokens, produces the query batch $Q_\text{batch}$.
Routing for top-$k$ chunk indices (one index set $S_i$ per request $i$) is computed.
$Q_\text{batch}$ and the selected chunk indices are sent to Shared-KV node.
Shared-KV node performs batched GEMM, returns the shared attention output.
Unique-KV node applies final projection/FFN and emits tokens.
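The summary does not spell out how the Shared-KV result is combined with attention over the unique context before the final projection. One standard way to merge attention over two disjoint key sets exactly is for each side to return its softmax statistics (row maximum and normalizer) together with its partial output, as split-KV and FlashAttention-style kernels do. The sketch below illustrates that technique under this assumption; the function names and the returned tuple are hypothetical, not MoSKA's actual interface.

```python
import numpy as np


def partial_attention(Q, K, V):
    """Attention over one KV partition, plus the statistics needed to merge.

    Returns (out, row_max, row_sum): the locally normalized output, the
    per-query maximum score, and the per-query sum of exp(score - row_max).
    """
    d = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d)                   # (N, L)
    row_max = scores.max(axis=1, keepdims=True)     # (N, 1)
    weights = np.exp(scores - row_max)
    row_sum = weights.sum(axis=1, keepdims=True)    # (N, 1)
    return (weights @ V) / row_sum, row_max, row_sum


def merge_partials(part_a, part_b):
    """Combine two partial results into attention over the union of keys."""
    out_a, max_a, sum_a = part_a
    out_b, max_b, sum_b = part_b
    new_max = np.maximum(max_a, max_b)
    # Rescale each normalizer to the common maximum before mixing.
    w_a = sum_a * np.exp(max_a - new_max)
    w_b = sum_b * np.exp(max_b - new_max)
    return (w_a * out_a + w_b * out_b) / (w_a + w_b)


rng = np.random.default_rng(0)
N, L_unique, M, d = 4, 32, 128, 16                  # toy sizes
Q_batch = rng.standard_normal((N, d))
K_s, V_s = rng.standard_normal((M, d)), rng.standard_normal((M, d))
K_u = rng.standard_normal((N, L_unique, d))         # per-request unique keys
V_u = rng.standard_normal((N, L_unique, d))         # per-request unique values

# "Shared-KV node": one batched call over the shared cache.
shared_part = partial_attention(Q_batch, K_s, V_s)

# "Unique-KV node": per-request attention, merged with the shared result.
merged = np.empty_like(Q_batch)
for i in range(N):
    unique_part = partial_attention(Q_batch[i:i + 1], K_u[i], V_u[i])
    shared_i = tuple(x[i:i + 1] for x in shared_part)
    merged[i] = merge_partials(unique_part, shared_i)[0]
```

The merged rows match attention computed over the concatenation of each request's unique keys and the shared keys. The two node types can therefore work independently and exchange only queries, indices, partial outputs, and two scalars per query.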

This disaggregation ensures that low-latency unique-context handling is not penalized by large-batch shared attention tasks, and it allows each node type to specialize: Shared-KV nodes remain compute-bound, exceeding 80% MFU (Model FLOPs Utilization) as batch and shared-context size grow.

5. Quantitative Performance Results

Empirical evaluation uses LLaMA 3.1-8B (FP8), sparse attention with 75% sparsity, and 2× NVIDIA DGX H200 nodes.

Context/Batch Regime: Each request sees 64 K unique tokens, supplemented by 1–16 M shared tokens. Throughput target: 35 tokens/s per request.
Baselines: FlashAttention, SGLang, LongHeads, ChunkAttention.
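For a sense of scale in this regime, the KV footprint can be estimated from the model's attention shape. The calculation below assumes the publicly documented LLaMA 3.1-8B configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and one byte per FP8 element. These figures come from the model's public configuration rather than from the summary above, and "64 K" and "16 M" are read as binary multiples.

```python
def kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128,
                       bytes_per_elem=1):
    """Bytes of KV cache per token: keys plus values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


GIB = 1024 ** 3
per_token = kv_bytes_per_token()        # assumed LLaMA 3.1-8B shape, FP8

unique_tokens = 64 * 1024               # 64 K unique tokens per request
shared_tokens = 16 * 1024 ** 2          # 16 M shared tokens (upper end)

unique_gib = unique_tokens * per_token / GIB
shared_gib = shared_tokens * per_token / GIB
print(f"KV bytes per token:      {per_token}")
print(f"unique KV per request:   {unique_gib:.0f} GiB")
print(f"shared KV (stored once): {shared_gib:.0f} GiB")
```

Under these assumptions each request carries about 4 GiB of unique KV cache, while a 16 M-token shared cache occupies about 1 TiB when stored once. Holding or re-reading that shared cache separately for every request is what limits batch size in the baselines.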

Metric	MoSKA Result	Baseline Results
Max batch size	Greatly exceeds all baselines	Limited by shared KV
Throughput	Up to 538.7× baseline (at high sharing)	Orders-of-magnitude lower
Shared-KV MFU	$>80\%$ as batch and shared-context size grow (compute-bound)	Lower, memory-bound
Unique-KV MFU	Memory-bound, remains low-latency	Not explicitly separated

Routing overhead is negligible as long as the chunk count stays small (router cost grows with the number of chunks); throughput benefit scales with the degree of context sharing.

6. Limitations and Future Directions

MoSKA’s efficacy is contingent on high shared-context prevalence; workloads dominated by unique context see little gain. The non-parametric router may miss optimal shared chunks when semantic overlap is subtle, limiting pruning effectiveness. Disaggregation introduces additional system and networking complexity.

Identified potential bottlenecks are:

Inter-node networking for query tensors and routing indices, especially at low batch.
Router compute cost when the chunk count is extremely large.
Memory replication for large caches across multiple Shared-KV nodes.

Future work includes trainable or hybrid (learned+static) routers to improve semantic chunk selection, introducing position-independent “Universal MoSKA” for arbitrary chunk ordering, developing special-purpose interconnects to reduce data movement, and hardware primitives that fuse the routing and GEMM steps (Rhee et al., 8 Nov 2025).


References (1)

MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference (2025)
