CollectiveKV: Efficient Cache Management
- CollectiveKV is a set of techniques for efficient key-value cache management in deep sequence models, using shared and fused caches to cut memory usage.
- It decomposes KV matrices into user-specific and global components with methods like SVD and routing networks, achieving extreme compression with minimal performance loss.
- The framework extends to cross-layer fusion and cooperative budget allocation, enhancing throughput in both sequential recommendation systems and LLM inference.
CollectiveKV denotes a family of methods and mechanisms for efficient key-value (KV) cache management in deep sequence models, particularly Transformers, via collaborative sharing, joint encoding, or cooperative allocation. CollectiveKV approaches aim to mitigate the severe memory and latency bottlenecks of KV-caching in large-scale inference by leveraging inter-user or inter-session similarities, cross-layer redundancies, and the collaborative utility of cache elements. The principle unifying these techniques is the identification, decomposition, or fusion of "collective" (shared, reusable, or synergistically important) KV information, thereby radically compressing the cache footprint or improving cross-session/model efficiency while maintaining or enhancing prediction fidelity.
1. Motivations and Background
Sequential recommendation systems and LLMs relying on deep attention architectures encounter prohibitive memory and latency costs due to the need to cache full K and V matrices per user or per session for rapid inference. Formally, the per-user cache size in such settings is $2 \cdot L \cdot d$ entries (one $L \times d$ matrix each for keys and values), where $L$ is the user interaction sequence length and $d$ is the attention head dimension. With hundreds of millions of users or lengthy session contexts, aggregate cache requirements commonly exceed available accelerator memory, necessitating suboptimal offloading. Empirical analysis reveals that substantial fractions of this KV structure are either highly redundant across users or only weakly contribute to task-level performance, motivating both inter-user sharing and budget-aware allocation (Li et al., 27 Jan 2026).
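As a rough illustration of the scale involved, the sketch below computes aggregate KV-cache footprint under the $2 \cdot L \cdot d$ per-user formula; the user count, sequence length, and model dimensions are illustrative assumptions, not figures from the cited work.

```python
# Back-of-the-envelope KV-cache sizing for a sequential recommender.
# All concrete numbers below are illustrative assumptions.

def kv_cache_bytes(seq_len: int, head_dim: int, n_heads: int,
                   n_layers: int, dtype_bytes: int = 2) -> int:
    """Per-user cache: 2 (K and V) * L * d per head, per layer, in fp16 bytes."""
    return 2 * seq_len * head_dim * n_heads * n_layers * dtype_bytes

per_user = kv_cache_bytes(seq_len=1024, head_dim=64, n_heads=8, n_layers=4)
n_users = 100_000_000  # "hundreds of millions of users" (illustrative)

total_tb = per_user * n_users / 1e12
print(f"per-user cache: {per_user / 1e6:.2f} MB, aggregate: {total_tb:.1f} TB")
```

Even these modest per-user sizes aggregate to hundreds of terabytes, far beyond accelerator memory, which is the gap the sharing and allocation schemes below target.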
Conventional cache compression or pruning approaches, such as rank-based head masking or quantization, neglect either cross-user redundancy or the cooperative influence of KV partitions. The CollectiveKV paradigm addresses these gaps via mechanisms that systematically extract, share, or prioritize only the most collectively valuable KV components, leading to extreme compression ratios or latency reductions without performance trade-offs (Sun et al., 21 Feb 2025).
2. Decoupling and Pool-Based KV Sharing
The core method introduced in "CollectiveKV: Decoupling and Sharing Collaborative Information in Sequential Recommendation" proposes a cross-user KV sharing scheme based on a decomposition of KV spaces into shareable (collective) and user-specific components. SVD analysis of user-wise key matrices reveals that the leading low-rank principal subspace exhibits strong cross-user similarity, whereas the residual subspace encodes largely user-specific detail. Thus, storing only a small user-specific KV component and routing most information through a global shared KV pool of learnable slots suffices (Li et al., 27 Jan 2026).
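A minimal sketch of the kind of SVD analysis described above, assuming stacked per-user key matrices are available as NumPy arrays; the principal-angle similarity measure and the synthetic data are illustrative, not the paper's exact procedure.

```python
import numpy as np

def leading_subspace(K: np.ndarray, r: int) -> np.ndarray:
    """Orthonormal basis of the top-r right singular subspace of a user's key matrix (L x d)."""
    _, _, Vt = np.linalg.svd(K, full_matrices=False)
    return Vt[:r].T  # d x r

def subspace_similarity(B1: np.ndarray, B2: np.ndarray) -> float:
    """Mean squared cosine of principal angles between two subspaces (1.0 = identical)."""
    s = np.linalg.svd(B1.T @ B2, compute_uv=False)
    return float(np.mean(s ** 2))

# Illustrative: two users' key matrices (L x d) built from a shared latent basis plus noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 64))
K_u1 = rng.normal(size=(128, 8)) @ shared + 0.1 * rng.normal(size=(128, 64))
K_u2 = rng.normal(size=(128, 8)) @ shared + 0.1 * rng.normal(size=(128, 64))

B1, B2 = leading_subspace(K_u1, r=8), leading_subspace(K_u2, r=8)
print(f"cross-user leading-subspace similarity: {subspace_similarity(B1, B2):.3f}")
```

High similarity of the leading subspaces is exactly the signal that justifies moving the corresponding KV mass into a shared, globally held pool.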
Key architectural components:
- Global KV pool: Learnable tensors storing collective patterns. All users share this pool.
- Router network: For each user token, a small projection followed by an argmax over global pool indices selects the most appropriate shared slot per position.
- User-specific KV: Small per-user projections whose dimension is typically set to a small fraction of the attention head dimension $d$ for accuracy-compression optimality.
- Concatenation and attention: For each token, the full key/value is the concatenation of the user-specific part and the shared part retrieved from the pool via the router, i.e., $k_t = [\,k_t^{\mathrm{user}} ; k_{r(t)}^{\mathrm{pool}}\,]$ and $v_t = [\,v_t^{\mathrm{user}} ; v_{r(t)}^{\mathrm{pool}}\,]$, where $r(t)$ is the router-selected slot (see the sketch after this list).
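A minimal PyTorch-style sketch of the pool-plus-router construction described in the list above; the dimension names, hard argmax routing, and concatenation layout are assumptions chosen to match the prose, not the reference implementation.

```python
import torch
import torch.nn as nn

class CollectiveKVLayer(nn.Module):
    """Per-token keys/values = [small user-specific part ; slot retrieved from a shared pool]."""

    def __init__(self, d_model: int, d_user: int, d_shared: int, n_slots: int):
        super().__init__()
        self.k_user = nn.Linear(d_model, d_user)   # user-specific key projection
        self.v_user = nn.Linear(d_model, d_user)   # user-specific value projection
        self.router = nn.Linear(d_model, n_slots)  # scores over global pool slots
        self.pool_k = nn.Parameter(torch.randn(n_slots, d_shared))  # shared key pool
        self.pool_v = nn.Parameter(torch.randn(n_slots, d_shared))  # shared value pool

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model)
        slot = self.router(x).argmax(dim=-1)  # (batch, seq_len) hard routing per position
        k = torch.cat([self.k_user(x), self.pool_k[slot]], dim=-1)
        v = torch.cat([self.v_user(x), self.pool_v[slot]], dim=-1)
        return k, v  # only the user-specific parts and slot indices need per-user caching

layer = CollectiveKVLayer(d_model=64, d_user=8, d_shared=56, n_slots=256)
k, v = layer(torch.randn(2, 16, 64))
print(k.shape, v.shape)  # torch.Size([2, 16, 64]) for both
```

Note that the hard argmax shown here is not differentiable; end-to-end training as described in the paper would require a straight-through estimator, soft routing, or the router regularizers mentioned below.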
This factored representation is trained end-to-end using standard objectives (e.g., logloss for CTR) and specialized router regularizers for utilization and balance. During inference, the per-user cache shrinks from the full $2 \cdot L \cdot d$ footprint to the small user-specific component plus per-token slot indices, with a single global pool shared by all users held in GPU memory. For appropriate hyperparameters, large compression ratios have been demonstrated, with AUC/GAUC and logloss matching or slightly exceeding dense-cached baselines across state-of-the-art recommendation models (Li et al., 27 Jan 2026).
3. Cross-Layer and Cross-Session KV Fusion
Several recent advances extend CollectiveKV concepts to fusion across model layers or requests:
- Cross-layer fusion (FusedKV / FusedKV-Lite): In LLMs, the upper-layer KV caches are reconstructed as learnable fusions of KV from a small subset of lower "storage" layers (typically the bottom and a middle layer). FusedKV introduces channel-wise learnable weights for each reconstructed layer $\ell$, so that $K^{(\ell)} = \sum_{s \in \mathcal{S}} w_s^{(\ell)} \odot K^{(s)}$ over the storage-layer set $\mathcal{S}$, and similarly for $V^{(\ell)}$, enforcing 2D-diagonal constraints to preserve RoPE compatibility. FusedKV-Lite replaces all upper-layer KV with KV from the storage layers directly, with no learnable weights, yielding even greater I/O efficiency. Both halve memory costs and achieve better perplexity/accuracy than classical cross-layer approaches, with FusedKV offering additional quality benefits (Lin et al., 3 Dec 2025); a minimal sketch of the fusion follows this list.
- Joint session/request encoding: Tree-based or batch-based fusion methods jointly encode blocks of KV caches (across requests or input chunks), using a cosine-similarity threshold to identify merge candidates. A balanced FastFusion tree recursively fuses blocks exceeding the threshold, with a Poisson point process analysis predicting the achievable compression rate versus the allowed attention distortion. Compression rates starting around $2\times$ have been reported with sub-0.5-point drops in F1 or task score, along with throughput improvements on vLLM serving (Kampeas et al., 6 Jan 2026).
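A minimal sketch of the channel-wise cross-layer fusion idea from FusedKV, assuming two retained storage layers; the weight parameterization is an illustrative simplification that omits the 2D-diagonal RoPE-compatibility constraints mentioned above.

```python
import torch
import torch.nn as nn

class FusedLayerKV(nn.Module):
    """Reconstruct an upper layer's K (or V) as a channel-wise weighted mix of stored layers."""

    def __init__(self, head_dim: int, n_storage_layers: int = 2):
        super().__init__()
        # One learnable weight per channel per storage layer (simplified; no RoPE constraint).
        self.w = nn.Parameter(torch.full((n_storage_layers, head_dim),
                                         1.0 / n_storage_layers))

    def forward(self, stored_kv: torch.Tensor) -> torch.Tensor:
        # stored_kv: (n_storage_layers, batch, seq_len, head_dim), e.g. bottom and middle layers
        return (self.w[:, None, None, :] * stored_kv).sum(dim=0)

fuse = FusedLayerKV(head_dim=64)
stored = torch.randn(2, 1, 128, 64)   # KV from the two storage layers
print(fuse(stored).shape)             # torch.Size([1, 128, 64]): reconstructed upper-layer KV
```

In these terms, FusedKV-Lite corresponds to dropping the learnable weights and copying a storage layer's KV directly, which removes the fusion parameters and compute at the cost of some reconstruction quality.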
4. Cooperative and Task-Aware KV Budget Allocation
In "CoKV: Optimizing KV Cache Allocation via Cooperative Game," cache allocation across attention heads or GQA groups is modeled as a cooperative game, where each cache unit’s true utility is measured not in isolation but via marginal contributions to downstream performance in combination with others (Shapley value framework). Exact computation is infeasible due to combinatorial complexity, so CoKV employs:
- Complementary-contribution SSV (Sliced Shapley Value): Fast estimation of collaborative importance by evaluating a limited set of coalition sizes and using performance differentials between held-out cache subsets.
- Budget allocation: Heads/groups are assigned cache sizes proportional to their normalized SSV, with low-importance groups pruned. Each head/group retains a minimum local window and only the most important tokens (by softmax-attention scoring) in its cache.
This approach achieves near-lossless cache compression at extreme ratios: retaining only a small fraction of the full cache recovers nearly all of the baseline accuracy, and at higher budgets slight improvements can appear because negatively contributing groups are removed. Task-adaptivity is attained by recomputing SSVs per downstream domain, enabling robust deployment in diverse LLM inference workloads (Sun et al., 21 Feb 2025).
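A minimal sketch of SSV-proportional budget allocation as described above, assuming per-group importance scores have already been estimated; the pruning quantile, minimum local window, and proportional rule are illustrative simplifications of CoKV.

```python
import numpy as np

def allocate_budgets(ssv: np.ndarray, total_budget: int,
                     min_window: int = 32, prune_quantile: float = 0.1) -> np.ndarray:
    """Assign per-group cache sizes proportional to normalized SSV scores."""
    ssv = np.asarray(ssv, dtype=float)
    keep = ssv > np.quantile(ssv, prune_quantile)       # prune lowest-importance groups
    weights = np.where(keep, np.clip(ssv, 0.0, None), 0.0)
    weights = weights / weights.sum()
    budgets = np.maximum((weights * total_budget).astype(int), min_window * keep)
    return budgets  # each retained group then keeps its top-scoring tokens up to this size

ssv_scores = np.array([0.9, 0.4, 0.05, 0.7, -0.1, 0.3])  # illustrative per-group SSVs
print(allocate_budgets(ssv_scores, total_budget=4096))
```

Groups with negative estimated contribution (like the fifth score above) end up with a zero budget, which is the mechanism behind the slight accuracy gains observed at higher budgets.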
5. Communication and Sharing in Multi-Agent and Multi-Model Systems
CollectiveKV ideas also underpin recent methods for efficient cache utilization in multi-agent LLM systems and inter-model communication:
- KVCOMM frameworks: KV caches are reused across agents under different context prefixes, with anchor-based alignment accounting for RoPE-induced misalignments and context-dependent offsets. By maintaining online-updated anchor pools, KVCOMM predicts, retrieves, and linearly interpolates offsets for shared segments. This enables substantial TTFT (time-to-first-token) speedups and high cache-reuse rates without significant accuracy loss in deeply pipelined multi-agent tasks (Ye et al., 14 Oct 2025).
- Layerwise selective sharing (KVComm): For inter-LLM communication, only a fraction of layers' KV pairs (selected by normalized attention importance combined with a Gaussian prior centered at the middle layers) are serialized and shared, substantially reducing bandwidth and FLOPs. This recovers most of the maximal (skyline) task performance even when only a small subset of layers' KV pairs is transmitted, and receiver models seamlessly attend to both local and shared foreign cache contexts (Shi et al., 2 Oct 2025). A minimal sketch of the layer-selection rule follows this list.
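A minimal sketch of the selection rule from the second item, combining per-layer attention importance with a Gaussian prior over middle layers; the scoring details, prior width, and selected fraction are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def select_layers(attn_importance: np.ndarray, frac: float = 0.25,
                  sigma_frac: float = 0.25) -> np.ndarray:
    """Pick which layers' KV pairs get serialized and shared with the receiving model."""
    n = len(attn_importance)
    layers = np.arange(n)
    prior = np.exp(-0.5 * ((layers - n / 2) / (sigma_frac * n)) ** 2)  # mid-layer Gaussian prior
    imp = attn_importance / attn_importance.sum()                      # normalized importance
    score = imp * prior
    k = max(1, int(frac * n))
    return np.sort(np.argsort(score)[-k:])                             # indices of shared layers

importance = np.random.default_rng(1).random(32)  # illustrative per-layer importance scores
print(select_layers(importance, frac=0.25))       # roughly 8 of 32 layers are shared
```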
6. Experimental Evaluations and Comparative Analysis
CollectiveKV variants have undergone extensive empirical comparison on diverse architectures and datasets, including sequential recommenders (SIM, SDIM, ETA, TWIN, HSTU on MicroVideo, KuaiVideo, EBNeRD-Small) and LLMs (Llama-3, Qwen, Mistral on LongBench, MMLU, GSM8K). Across these, the dominant findings are:
| Method | Core Mechanism | Peak Compression | Accuracy Impact |
|---|---|---|---|
| Pool-based (CollectiveKV) | Cross-user pool sharing | Extreme per-user cache reduction | Matches or slightly improves on dense baselines (Li et al., 27 Jan 2026) |
| Cross-layer (FusedKV / FusedKV-Lite) | Learnable cross-layer fusion | ~2× (KV memory halved) | Lower perplexity than prior cross-layer methods (Lin et al., 3 Dec 2025) |
| Joint encoding | Tree/batch block fusion | $2\times$ and above | <0.5 pt F1/task-score drop (Kampeas et al., 6 Jan 2026) |
| CoKV | Cooperative-game allocation | Small fraction of full cache retained | Near-lossless accuracy recovery (Sun et al., 21 Feb 2025) |
| KVCOMM | Anchor-based reuse and alignment | Substantial TTFT speedup, high reuse | Negligible accuracy drop (Ye et al., 14 Oct 2025) |
| Layer sharing (KVComm) | Attention + Gaussian-prior layer selection | Fraction of layers transmitted | Near-skyline task performance (Shi et al., 2 Oct 2025) |
Key technical observations:
- Highly redundant, collectively reusable KV information exists in both user-space and model-space, with strong empirical and theoretical support.
- Fine-tuned or adaptive selection of pooling/routing dimensions, anchor update criteria, and cooperative coalition sets is crucial for optimal tradeoffs.
- Proper regularization or normalization (e.g., balancing collective pool utilization, pruning negatively contributing heads) is essential to prevent degenerate solutions or performance collapse.
7. Limitations, Open Problems, and Future Directions
Although the CollectiveKV paradigm has demonstrated robust scalability, several challenges and open research directions remain:
- Automated or adaptive hyperparameter selection for pool sizes, routing dimensions, and fusion dimensions.
- Long-tail adaptation: Users or requests with short or idiosyncratic histories often benefit less from sharing-based methods; dynamic pool expansion or per-session adaptation is underexplored.
- Cross-architecture and cross-domain extension: Alignment of KV caches between disparate model families or under mixed-modal input is not fully resolved.
- Anchor/offset compression: Addressing memory inflation associated with anchor pools in KVCOMM remains an active area.
- Privacy and security considerations: Sharing or communicating KV caches exposes model-internal representations, presenting potential information leakage risks.
- Integration with quantization and external-memory paging: Compound cache-saving strategies require careful co-design to avoid compounding errors.
CollectiveKV research establishes theoretical and empirical foundations for principled, scalable, and cooperative cache management—enabling deployment of high-throughput inference systems and collaborative multi-agent or recommendation platforms at previously inaccessible scale. Continuing work seeks to unify joint encoding, cooperative budgeting, and cross-session adaptation into a general and efficient framework for context-aware memory management in deep sequence models (Li et al., 27 Jan 2026, Sun et al., 21 Feb 2025, Lin et al., 3 Dec 2025, Kampeas et al., 6 Jan 2026, Ye et al., 14 Oct 2025, Shi et al., 2 Oct 2025).