ShadowKV: Secure & Efficient LLM Inference

Updated 27 March 2026
  • ShadowKV is a framework that addresses privacy risks from cleartext KV caches while enhancing inference throughput via low-rank factorization and selective cache reconstruction.
  • It implements the KV-Cloak defense, a reversible matrix obfuscation method that renders direct inversion and collision attacks ineffective on token representations.
  • ShadowKV optimizes memory usage by offloading values to CPU and employing sparse on-the-fly KV selection, enabling scalable, high-performance long-context inference.

ShadowKV refers both to a class of privacy and leakage risks in LLM inference stemming from the exfiltration or shadowing of KV (key-value) caches, and to a family of high-throughput, memory-optimized inference systems that use KV-cache partitioning and selection strategies. These works address distinct but complementary challenges in LLM systems: the privacy risk of cleartext KV-caches, and the efficiency bottleneck of caches that grow with sequence length and batch size.

1. KV-cache Fundamentals and ShadowKV Threat Model

In transformer-based LLMs, the KV-cache is a persistent data structure storing per-layer key ($K_\ell\in\mathbb{R}^{n\times d}$) and value ($V_\ell\in\mathbb{R}^{n\times d}$) matrices for all processed tokens, critical for accelerating autoregressive decoding by obviating redundant computation of self-attention projections. At autoregressive decoding step $i$ in layer $\ell$:

  • $q_{\ell,i} = (x_i W_q^\ell{}^\top) R_{\ell,i}$ (query)
  • $k_{\ell,i} = (x_i W_k^\ell{}^\top) R_{\ell,i}$ (key)
  • $v_{\ell,i} = x_i W_v^\ell{}^\top$ (value)

where $x_i$ is the token embedding, $W_q^\ell, W_k^\ell, W_v^\ell$ are projection matrices, $R_{\ell,i}$ encodes RoPE, and $d$ is the attention hidden size.
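
The per-step projections above can be sketched in NumPy as follows. This is a minimal single-head illustration, not code from either paper: the function names (`rope_matrix`, `decode_step`), the RoPE base of 10000, the row-vector convention, and the toy dimensions are all assumptions made for the example.

```python
import numpy as np

def rope_matrix(pos: int, d: int, base: float = 10000.0) -> np.ndarray:
    """Block-diagonal rotation matrix R for position `pos` (d must be even)."""
    R = np.zeros((d, d))
    for j in range(d // 2):
        theta = pos * base ** (-2.0 * j / d)
        c, s = np.cos(theta), np.sin(theta)
        R[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return R

def decode_step(x_i, W_q, W_k, W_v, pos, cache):
    """One decoding step: project, rotate, append to the cache, attend."""
    d = W_q.shape[0]
    R = rope_matrix(pos, d)
    q = (x_i @ W_q.T) @ R          # query
    k = (x_i @ W_k.T) @ R          # key (cached)
    v = x_i @ W_v.T                # value (cached, no rotation)
    cache["K"].append(k)
    cache["V"].append(v)
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                   # attention output for this step

# Toy usage: three decoding steps with random inputs (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, d = 16, 8
W_q, W_k, W_v = (rng.normal(size=(d, d_model)) for _ in range(3))
cache = {"K": [], "V": []}
for pos in range(3):
    out = decode_step(rng.normal(size=d_model), W_q, W_k, W_v, pos, cache)
```

Each step appends one row to the cached keys and values, which is why the cache grows linearly with sequence length and why earlier projections never need to be recomputed.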

The ShadowKV threat model arises when an adversary gains direct, plaintext access to all or part of $K_\ell$ and $V_\ell$ during or after inference. Such attacks assume content-level exfiltration (e.g., insecure memory sharing, unencrypted networking), giving an adversary latent representations of all processed tokens and potentially the public model weights and embeddings. This is fundamentally distinct from timing or cache-side channels: the leakage is complete at the granularity of decoded cache entries (Luo et al., 13 Aug 2025).

2. Attack Vectors Exploiting Shadowed KV-caches

Three principled attack vectors have been demonstrated against shadowed KV-caches:

2.1 Direct Inversion Attack

When $W_k^\ell$, $W_v^\ell$ are square and invertible (early MHA architectures), the attacker can reconstruct the original input embedding $x_i$:

  • From keys: $x_i = k_{\ell,i}\, R_{\ell,i}^{-1}\, (W_k^\ell{}^\top)^{-1}$
  • From values: $x_i = v_{\ell,i}\, (W_v^\ell{}^\top)^{-1}$

This attack is highly effective on the first decoder layer.
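
A minimal NumPy sketch of the inversion follows, assuming square random projection matrices (invertible with probability one) and a standard RoPE rotation, whose inverse is its transpose. The helper `rope_matrix`, the sizes, and the position index are hypothetical choices for illustration.

```python
import numpy as np

def rope_matrix(pos: int, d: int, base: float = 10000.0) -> np.ndarray:
    """Block-diagonal RoPE rotation for position `pos` (orthogonal matrix)."""
    R = np.zeros((d, d))
    for j in range(d // 2):
        theta = pos * base ** (-2.0 * j / d)
        c, s = np.cos(theta), np.sin(theta)
        R[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return R

rng = np.random.default_rng(0)
d, pos = 8, 3
W_k = rng.normal(size=(d, d))      # square projection (assumed invertible)
W_v = rng.normal(size=(d, d))
x = rng.normal(size=d)             # the victim's input embedding
R = rope_matrix(pos, d)

# What ends up in the cache for this token.
k = (x @ W_k.T) @ R
v = x @ W_v.T

# Attacker side: undo the rotation, then solve W_k x = (k R^T) for x.
x_from_k = np.linalg.solve(W_k, k @ R.T)
# Values carry no rotation, so a single linear solve suffices.
x_from_v = np.linalg.solve(W_v, v)

print(np.allclose(x_from_k, x), np.allclose(x_from_v, x))
```

The sketch uses `np.linalg.solve` rather than forming explicit inverses; both are equivalent here, but solving is the more numerically stable choice.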

2.2 Collision Attack

In general, input recovery can be framed as a matching/search problem: for each position $i$, the attacker looks for the candidate token whose simulated cache entry best matches the observed one,

$\hat{t}_i = \arg\min_{t}\ \mathrm{dist}\big(k_{\ell,i},\ \tilde{k}_{\ell,i}(t)\big)$

where $\tilde{k}_{\ell,i}(t)$ results from simulating the candidate prefix $(\hat{t}_1,\dots,\hat{t}_{i-1},t)$ through the model. Batch sampling and outlier detection yield efficient, >90% per-token recovery rates for typical LLMs.
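
The matching idea can be illustrated with a deliberately simplified sketch. It covers only the first layer, where a key depends on a single token and its position, so no prefix simulation is needed; it uses exhaustive search with Euclidean distance in place of the batch sampling and outlier detection mentioned above. The toy vocabulary, dimensions, and function names are all assumptions.

```python
import numpy as np

def rope_matrix(pos: int, d: int, base: float = 10000.0) -> np.ndarray:
    """Block-diagonal RoPE rotation for position `pos`."""
    R = np.zeros((d, d))
    for j in range(d // 2):
        theta = pos * base ** (-2.0 * j / d)
        c, s = np.cos(theta), np.sin(theta)
        R[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return R

rng = np.random.default_rng(0)
vocab, d_model, d = 50, 16, 4            # toy sizes (hypothetical)
E = rng.normal(size=(vocab, d_model))    # public embedding table
W_k = rng.normal(size=(d, d_model))      # non-square: direct inversion unavailable

# Victim side: first-layer key cache for a secret prompt.
secret_tokens = rng.integers(0, vocab, size=6)
K_leaked = np.stack([(E[t] @ W_k.T) @ rope_matrix(i, d)
                     for i, t in enumerate(secret_tokens)])

def collision_attack(K_leaked, E, W_k):
    """For each position, pick the vocabulary token whose simulated key is nearest."""
    d = W_k.shape[0]
    proj = E @ W_k.T                          # candidate keys before rotation
    recovered = []
    for i, k_obs in enumerate(K_leaked):
        cand = proj @ rope_matrix(i, d)       # candidate keys at position i
        dists = np.linalg.norm(cand - k_obs, axis=1)
        recovered.append(int(np.argmin(dists)))
    return np.array(recovered)

recovered = collision_attack(K_leaked, E, W_k)
```

In deeper layers the key at a given position depends on every earlier token, which is why the general attack has to simulate candidate prefixes rather than score tokens independently.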

2.3 Injection Attack

By leveraging LLM autoregression, an attacker can append an instruction (e.g., "Repeat the previous content.") in a new prompt using the shadowed cache; the LLM echoes the user's prompt in a single forward pass, with moderate reconstruction fidelity (BERTScore ≈ 0.58, ROUGE-L ≈ 0.42) (Luo et al., 13 Aug 2025).

3. Privacy Mitigation: The KV-Cloak Defense

KV-Cloak is a reversible, matrix-based obfuscation scheme designed to make shadowed KV-caches cryptographically useless:

  • For each cache block, three invertible transformations are applied alongside a one-time random permutation:
    • a block-level invertible map, a sparse additive mask for positional encoding, and a feature-space obfuscator.
  • De-obfuscation applies the corresponding inverse operations and reverses the permutation.
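
To make the reversibility concrete, here is a toy round trip built from the four components named above (block-level map, permutation, additive mask, feature-space obfuscator). The composition order `S @ P @ (K + A) @ M`, the symbol names, the use of random orthogonal matrices, and the 10% mask density are assumptions chosen for illustration; they are not taken from the KV-Cloak paper and carry no security guarantee.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Random orthogonal matrix via QR (well conditioned, trivially invertible)."""
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

def make_secrets(b, d, rng):
    S = random_orthogonal(b, rng)                 # block-level invertible map (b x b)
    M = random_orthogonal(d, rng)                 # feature-space obfuscator (d x d)
    P = np.eye(b)[rng.permutation(b)]             # one-time row permutation (b x b)
    A = np.zeros((b, d))                          # sparse additive mask (b x d)
    mask = rng.random((b, d)) < 0.1
    A[mask] = rng.normal(size=int(mask.sum()))
    return S, P, A, M

def obfuscate(K, S, P, A, M):
    # Hypothetical composition: mask, permute rows, mix rows, mix features.
    return S @ P @ (K + A) @ M

def deobfuscate(K_obf, S, P, A, M):
    # Undo each step in reverse order; P is a permutation, so P^{-1} = P^T.
    return P.T @ np.linalg.inv(S) @ K_obf @ np.linalg.inv(M) - A

rng = np.random.default_rng(0)
b, d = 16, 8                                      # toy block size and head dim
K_block = rng.normal(size=(b, d))
secrets = make_secrets(b, d, rng)
K_obf = obfuscate(K_block, *secrets)
K_back = deobfuscate(K_obf, *secrets)
```

Without the secret matrices, `K_obf` no longer has rows that correspond to individual token keys, while the holder of the secrets recovers the block exactly up to floating-point error.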

Operator fusion reduces runtime overhead by folding two of the transformation matrices into the attention weights offline, leaving only a small per-block overhead online.

In both theory and practice, KV-Cloak renders inversion and collision attacks ineffective: under KV-Cloak, both BERTScore and ROUGE-L for reconstructed content fall to random chance (≤0.10 and ≈0.00, respectively), with no measurable loss in accuracy (MMLU/SQuAD) and ≤10% inference latency overhead (Luo et al., 13 Aug 2025).

4. High-Throughput Inference: ShadowKV's Memory-Optimized Cache Strategy

The "ShadowKV" system for efficient long-context inference addresses the memory and throughput bottleneck due to large KV-caches:

  • The standard dense cache ($K_\ell$, $V_\ell$) is split:
    • Keys are represented in a low-rank factorization ($K_\ell \approx A_\ell B_\ell$, with $A_\ell\in\mathbb{R}^{n\times r}$ and $B_\ell\in\mathbb{R}^{r\times d}$) and stored on GPU
    • Values are offloaded to CPU memory (Sun et al., 2024)

GPU memory usage is substantially reduced compared to dense storage.
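
The memory effect of a low-rank key representation can be illustrated with a truncated SVD. This is a sketch under stated assumptions: the key matrix is synthetic (low intrinsic rank plus small noise), the sizes and rank are arbitrary, and the factor names `A` and `B` are hypothetical; value offloading to CPU is not modeled.

```python
import numpy as np

def low_rank_factor(K: np.ndarray, r: int):
    """Rank-r factors (A, B) with K ≈ A @ B, via truncated SVD."""
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    A = U[:, :r] * s[:r]       # (n, r): left factor scaled by singular values
    B = Vt[:r]                 # (r, d): shared right factor
    return A, B

rng = np.random.default_rng(0)
n, d, r = 1024, 64, 8          # toy sequence length, head dim, rank

# Synthetic keys: rank-r signal plus small noise (an assumption, not real model data).
K = rng.normal(size=(n, r)) @ rng.normal(size=(r, d)) + 1e-3 * rng.normal(size=(n, d))

A, B = low_rank_factor(K, r)
rel_err = np.linalg.norm(K - A @ B) / np.linalg.norm(K)
ratio = (A.size + B.size) / K.size   # fraction of dense storage kept

print(f"relative error: {rel_err:.2e}, stored fraction: {ratio:.3f}")
```

Storage falls from $n \cdot d$ entries to $r(n + d)$, so the saving grows as the rank shrinks, at the cost of approximation error when the keys are not close to low rank.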

Sparse on-the-fly KV selection reconstructs only a minimal subset of the cache (using chunk-level landmarks and a small outlier set), enabling batch sizes up to 6× larger and throughput up to 3.04× higher than full attention with bounded GPU memory, without accuracy degradation. Benchmarks across Llama, GLM, Yi, Phi, and Qwen2 models confirm that ShadowKV achieves full-attention accuracy at <2% of the dense KV budget for 128K-token contexts (Sun et al., 2024).
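
A simplified sketch of landmark-based selection follows. It assumes each chunk's landmark is the mean of its keys and that chunks are ranked by the dot product between the query and the landmark; the outlier set, the low-rank key reconstruction, and the CPU value fetch are omitted. Function names, the chunk size, and the budget are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def full_attention(q, K, V):
    return softmax(K @ q / np.sqrt(q.size)) @ V

def sparse_attention(q, K, V, chunk: int, budget: int):
    """Attend only to the `budget` chunks whose landmarks score highest for q."""
    n, d = K.shape
    n_chunks = n // chunk                                     # assumes n % chunk == 0
    landmarks = K.reshape(n_chunks, chunk, d).mean(axis=1)    # one mean key per chunk
    top = np.sort(np.argsort(landmarks @ q)[-budget:])        # selected chunk ids
    idx = (top[:, None] * chunk + np.arange(chunk)).ravel()   # token ids in those chunks
    return full_attention(q, K[idx], V[idx]), idx

rng = np.random.default_rng(0)
n, d, chunk = 256, 16, 8
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
q = rng.normal(size=d)

out_sparse, idx = sparse_attention(q, K, V, chunk, budget=4)           # 32 of 256 tokens
out_all, idx_all = sparse_attention(q, K, V, chunk, budget=n // chunk)
out_full = full_attention(q, K, V)
```

With the budget set to every chunk, the sparse path reproduces full attention exactly; with a smaller budget, only `budget * chunk` keys and values are touched per step, which is the source of the memory and bandwidth savings.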

5. Empirical Results

The privacy mitigation suite (Luo et al., 13 Aug 2025) demonstrates:

  • Attack Performance: Plaintext collision attack BERTScore = 0.77, ROUGE-L = 0.56 (LLaMA-7B). Under KV-Cloak, BERTScore drops to 0.07, ROUGE-L to 0.00.
  • Accuracy Preservation: MMLU and SQuAD scores unchanged under KV-Cloak (e.g., LLaMA-7B, MMLU = 30.4% → 30.4%).
  • Latency Overhead: KV-Cloak incurs 2–10% overhead on real workloads (e.g., LLaMA-7B, +4.5%; LLaMA-3.2-1B, +10.2%).

The high-throughput ShadowKV implementation (Sun et al., 2024):

  • Accuracy: On RULER@128K, full attention = 85.5%, ShadowKV = 83.6% (Llama-3.1-8B); on LongBench, full attention = 48.96%, ShadowKV = 48.13%.
  • Throughput: On A100@122K context, batch size 24, ShadowKV reaches 245.9 tokens/s (3.04× gain).
  • Scalability: ShadowKV supports up to 6× larger batch sizes than the full-attention configuration.

6. Implementation and Practical Considerations

KV-Cloak integrates at the cache block level and supports a range of block sizes with negligible variation in accuracy or latency. Operator fusion substantially reduces runtime cost compared to naïve transformations. The system is compatible with multi-head attention (MHA), grouped-query attention (GQA), and multi-head latent attention (MLA), and can be deployed with both PagedAttention and standard inference frameworks (Luo et al., 13 Aug 2025).

ShadowKV leverages advanced memory management (e.g., pinned host memory and multiple CUDA streams for overlapping computation and transfer), and its selection/reconstruction is compatible with available high-performance kernels (e.g., CUTLASS and FlashAttention libraries), as well as batch processing for multi-model deployments (Sun et al., 2024).

7. Trade-offs, Limitations, and Future Directions

Parameter Sensitivity: In ShadowKV, the choice of the low-rank dimension $r$ balances accuracy and memory efficiency; the chunk size tunes selection cost versus granularity. A smaller $r$ increases memory savings but may reduce fidelity for certain tasks.

Limitations: CPU value offload, even with overlap, introduces PCIe bottlenecks not entirely eliminated. SVD for key projection is currently offline or asynchronous; full real-time updates remain an open area.

Extensibility: Both frameworks anticipate improvements via mixed-precision factorization, per-head dynamic sparsity budgets, and adaptive decoding-time updates.

Security Scope: A plausible implication is that deployment best practices should combine privacy-protective KV-cache obfuscation with efficient cache partitioning for scalable, trustworthy, long-context LLM inference.

References: (Sun et al., 2024, Luo et al., 13 Aug 2025)
